Drift-Aware Monocular Localization Based on a Pre-Constructed Dense 3 D Map in Indoor Environments

Recently, monocular localization has attracted increased attention due to its application to indoor navigation and augmented reality. In this paper, a drift-aware monocular localization system that performs global and local localization is presented based on a pre-constructed dense three-dimensional (3D) map. In global localization, a pixel-distance weighted least squares algorithm is investigated for calculating the absolute scale for the epipolar constraint. To reduce the accumulative errors that are caused by the relative position estimation, a map interaction-based drift detection method is introduced in local localization, and the drift distance is computed by the proposed line model-based maximum likelihood estimation sample consensus (MLESAC) algorithm. The line model contains a fitted line segment and some visual feature points, which are used to seek inliers of the estimated feature points for drift detection. Taking advantage of the drift detection method, the monocular localization system switches between the global and local localization modes, which effectively keeps the position errors within an expected range. The performance of the proposed monocular localization system is evaluated on typical indoor scenes, and experimental results show that compared with the existing localization methods, the accuracy improvement rates of the absolute position estimation and the relative position estimation are at least 30.09% and 65.59%, respectively.


Introduction
The emergence of wireless communication and the global positioning system (GPS) has ignited the idea of personal navigation systems (PNSs).PNSs have localization and navigation functions that provide users with positional information via a mobile terminal, such as a smartphone or a tablet personal computer (PC).Owing to the global navigation satellite system (GNSS) [1,2], position information is easy to obtain in outdoor environments, but in indoor environments, because of the shielding effect of structures, GNSS is incapable of providing reliable position services to a user.Therefore, the development of an accurate and stable indoor localization method that does not depend on satellite signals has become a research hotspot in recent years.
A variety of techniques are being investigated for indoor localization, and a typical category of these is radio signal-based methods, including WiFi-based [3,4], radio frequency identification (RFID)-based [5,6], and ultra wideband (UWB)-based methods [7,8].However, an obvious drawback of signal-based localization systems is that moving objects or pedestrians affect signal strength or flight time, which makes localization results unreliable and even leads to localization failure.Moreover, some anchor points need to be installed and calibrated before localization is established, in radio signal-based systems.Once the system is enabled, the anchor points cannot be moved.The installation and maintenance of the infrastructure at the service of these localization systems are costly.In consideration of these drawbacks, a stable and cost-efficient indoor localization method is desired for providing users with quality location-based services.
With the popularization of smart mobile terminals, cameras as vision sensors become standard equipment embedded into almost all smartphones and tablet PCs, which provides an opportunity to develop image-based localization, i.e., visual localization.For a visual localization system, the user merely needs to capture an image or an image sequence by a mobile terminal, and uploads the images to the server via a wireless network.Then, the positions of the user can be estimated by localization algorithms at the server.Finally, positional information is sent to the user's mobile terminal.The process of visual localization is realized without any infrastructure deployment.Before the implementation of visual localization, an off-line database that contains pose-tagged database images should be created, which has been studied in existing research [9,10].
Visual localization can be divided into two categories [11]: global localization and local localization.Typically, a visual localization system first estimates initial camera positions with global localization methods.Then, local localization methods are used to track the camera iteratively, namely, to perform subsequent localization.
In a monocular localization system, the key step of global localization is to compute the rotation matrix and translation vector between the query camera and the database camera.However, the absolute scale of the translation vector cannot be calculated based only on a query image and a database image, which is called the scale ambiguity problem [12].There are many existing studies on solving this problem, which use for example, planar motion constraints and camera heights [13][14][15].By taking advantage of database images, the absolute scale can be estimated by epipolar geometry or solving the Perspective-n-Points (PnP) problem.The epipolar-based method [16,17] can avoid computing the absolute scale by introducing multiple epipolar constraints between the query and database cameras.A prerequisite of this method is that, for a query image, more than one matched database images must be found to obtain multiple epipolar constraints.An alternative approach is based on solving the PnP problem that implicitly estimates the absolute scale by triangulation of the points in the real world [18].However, this approach suffer from the errors that are caused by triangulation.
The development and popularization of RGB-D sensors makes it possible to conveniently measure the 3D positions of visual feature points.Then, the PnP-based method can be directly used for localization without triangulation [19,20].In addition, a closed-form solution [21] for the scale ambiguity is studied based on the least squares approach, in which four available methods for calculating the absolute scale are involved.According to the moving direction of the query camera, one of the available methods should be chosen, which is a promising idea for absolute scale estimation.However, two deficiencies exist in this closed-form solution: the first is that there is no mention of how to obtain the 3D position of visual feature points, and the second is that the recognition method for the camera moving direction is not specified in [21].
For a monocular localization system, local localization is executed based on initial localization.Once the first two positions of the query camera have been acquired, subsequent query camera positions can be estimated by performing triangulation and solving the PnP problem.The EPnP [22] as a non-iterative solution to the PnP problem is commonly employed in local localization.However, the incremental frame-to-frame position estimation inherently accumulates errors, i.e., localization drifts.Because the user trajectory in monocular localization usually does not form a closed loop in most cases, the global optimization techniques, such as g2o (general graph optimization) [23] and global bundle adjustment [24], cannot be applied in drift correction.Thus, local optimization methods such as local bundle adjustment, are used to refine camera poses and 3D points that incorporate the information extracted from multiple query images [25,26].However, the accumulative errors caused by local localization cannot be effectively detected and reduced by the existing approaches.With the increase of the trajectory length, the drifts caused by position iteration significantly increase, which will ultimately lead to the result that accumulative errors exceed the maximum permissible value.
In consideration of the shortcomings of the existing approaches, an improved monocular localization system is proposed, which focuses on the absolute scale estimation and drift detection.The proposed system contains two modes: the global localization (i.e., the absolute position estimation) mode, and the local localization mode that contains the relative position estimation and drift detection.The main contributions of this paper are stated as follows: (1) A pixel-distance weighted least squares (PWLS) algorithm is investigated to calculate the absolute scale of the translation vector between query and database cameras in global localization.
The proposed PWLS algorithm fully considers the effect of the camera's direction of movement in scale estimation, and employs pixel-distance weights to control the contributions to the results of the absolute scale estimation in different moving directions.(2) To effectively detect and reduce the drifts, a map interaction-based drift detection method is introduced in local localization.For the introduced drift detection method, the estimated feature points obtained by triangulation for relative position estimation are also used for drift detection, but unreliable estimated feature points are removed.With the aid of the linear characteristics of visual feature points that are stored in the pre-constructed dense 3D map, some line models are established.Taking advantage of line models and the estimator MLESAC [27], a line model-based MLESAC (LM-MLESAC) algorithm is proposed for the drift estimation.Because some estimated feature points contain significant errors caused by triangulation, the LM-MLESAC algorithm is executed to reject these estimated feature points to improve the drift estimation performance.(3) By combining the global and local localization methods, a novel integrated monocular localization system is realized with the drift detection method.With the aid of the drift detection method, the average errors of the system are limited within an expected range.
The remainder of this paper is organized as follows: Section 2 discusses the global localization (i.e., the absolute position estimation).Section 3 studies the map interaction-based drift detection for local localization, including the relative position estimation, the line model establishment, and the line model-based MLESAC algorithm.In Section 4, the performances of the global and local localization methods are investigated.Finally, conclusions are presented in Section 5.

Absolute Position Estimation Based on the Dense 3D Map
The framework of the proposed monocular localization system is shown in Figure 1.The proposed system contains two modes: global localization and local localization.Local localization consisting of the relative position estimation and drift detection will be discussed in Section 3. The absolute position estimation, i.e., global localization, aims at acquiring query camera positions in the indoor coordinate system on the basis of pose-tagged database images which are stored in the pre-constructed 3D map.Global localization can be divided into two parts: scale-free localization using the epipolar constraint, and absolute scale estimation based on the proposed PWLS algorithm.The global localization method is utilized for both the initial position estimation and position correction, once the drift distance exceeds the threshold in local localization.
ISPRS Int.J. Geo-Inf.2018, 7, x FOR PEER REVIEW 3 of 28 In consideration of the shortcomings of the existing approaches, an improved monocular localization system is proposed, which focuses on the absolute scale estimation and drift detection.The proposed system contains two modes: the global localization (i.e., the absolute position estimation) mode, and the local localization mode that contains the relative position estimation and drift detection.The main contributions of this paper are stated as follows: (1) A pixel-distance weighted least squares (PWLS) algorithm is investigated to calculate the absolute scale of the translation vector between query and database cameras in global localization.The proposed PWLS algorithm fully considers the effect of the camera's direction of movement in scale estimation, and employs pixel-distance weights to control the contributions to the results of the absolute scale estimation in different moving directions.(2) To effectively detect and reduce the drifts, a map interaction-based drift detection method is introduced in local localization.For the introduced drift detection method, the estimated feature points obtained by triangulation for relative position estimation are also used for drift detection, but unreliable estimated feature points are removed.With the aid of the linear characteristics of visual feature points that are stored in the pre-constructed dense 3D map, some line models are established.Taking advantage of line models and the estimator MLESAC [27], a line modelbased MLESAC (LM-MLESAC) algorithm is proposed for the drift estimation.Because some estimated feature points contain significant errors caused by triangulation, the LM-MLESAC algorithm is executed to reject these estimated feature points to improve the drift estimation performance.
(3) By combining the global and local localization methods, a novel integrated monocular localization system is realized with the drift detection method.With the aid of the drift detection method, the average errors of the system are limited within an expected range.
The remainder of this paper is organized as follows: Section 2 discusses the global localization (i.e., the absolute position estimation).Section 3 studies the map interaction-based drift detection for local localization, including the relative position estimation, the line model establishment, and the line model-based MLESAC algorithm.In Section 4, the performances of the global and local localization methods are investigated.Finally, conclusions are presented in Section 5.

Absolute Position Estimation Based on the Dense 3D Map
The framework of the proposed monocular localization system is shown in Figure 1

Scale-Free Localization Using the Epipolar Constraint
Before presenting the monocular localization approach, it is necessary to review the dense 3D map, which is called the visual map in our previous works [10].The visual map constructed by RGB-D (RGB-depth) sensors, e.g., the Microsoft Kinect, is an extended dense 3D map that contains essential elements for vision-based localization, such as 3D point clouds, database images, and the corresponding poses of the database camera.In this paper, the camera on the RGB-D sensor that is used to construct the visual map is called the database camera, and the camera on the user's smart terminal is called the query camera.In consideration of the computational complexity, only a subset of the images captured by the query camera are used for localization which are called query images.The keyframe selection method proposed in [25] is employed to select query images for localization.
In the visual map, each visual feature on the database image associates with a 3D point in the global coordinate system, i.e., the indoor coordinate system, which makes it realizable to obtain the 3D position of the visual feature.As shown in Figure 2, the indoor scene can be fully reconstructed by the visual map, and at the same time of the visual map construction, the poses of the RGB-D sensor (i.e., the database camera) are recorded and stored in the map.In the process of the visual map construction, the RGB-D sensor captures an RGB image and a depth disparity image, simultaneously.Then, the 3D point clouds are aligned using the visual map construction algorithm [10].When the visual map construction has been completed, the 3D position of each visual feature point in indoor scenes is determined in the indoor coordinate system.

Scale-Free Localization Using the Epipolar Constraint
Before presenting the monocular localization approach, it is necessary to review the dense 3D map, which is called the visual map in our previous works [10].The visual map constructed by RGB-D (RGB-depth) sensors, e.g., the Microsoft Kinect, is an extended dense 3D map that contains essential elements for vision-based localization, such as 3D point clouds, database images, and the corresponding poses of the database camera.In this paper, the camera on the RGB-D sensor that is used to construct the visual map is called the database camera, and the camera on the user's smart terminal is called the query camera.In consideration of the computational complexity, only a subset of the images captured by the query camera are used for localization which are called query images.The keyframe selection method proposed in [25] is employed to select query images for localization.
In the visual map, each visual feature on the database image associates with a 3D point in the global coordinate system, i.e., the indoor coordinate system, which makes it realizable to obtain the 3D position of the visual feature.As shown in Figure 2, the indoor scene can be fully reconstructed by the visual map, and at the same time of the visual map construction, the poses of the RGB-D sensor (i.e., the database camera) are recorded and stored in the map.In the process of the visual map construction, the RGB-D sensor captures an RGB image and a depth disparity image, simultaneously.Then, the 3D point clouds are aligned using the visual map construction algorithm [10].When the visual map construction has been completed, the 3D position of each visual feature point in indoor scenes is determined in the indoor coordinate system.For global localization, the best-matched database image with the query image should firstly be retrieved in the pre-constructed visual map.During the process of visual map construction, the visual features on database images are described by ORB (oriented FAST and rotated BRIEF) descriptors.As an alternative to the SIFT (scale invariant feature transform) or the SURF (speeded up robust feature) descriptors, the ORB descriptor is tested to be rotation invariant and resistant to noise, and it is effective and stable for visual localization [28].The visual features on the query image are also described by ORB descriptors.The Bag-Of-Features (BOF) [29] approach is used in the visual map construction to determine a closed loop of the database camera trajectory [10].Therefore, the BOF approach can be utilized in this paper to retrieve the matched database images with the query image.Moreover, the best-matched database image can be determined by the 1-precision criterion [30] based on the ORB descriptors.The best-matched database image contains sufficiently matched visual features with the query image.
The epipolar geometry is a typical method for recovering the rigid transformation between two cameras by two-dimensional correspondences, i.e., feature correspondences.Let  For global localization, the best-matched database image with the query image should firstly be retrieved in the pre-constructed visual map.During the process of visual map construction, the visual features on database images are described by ORB (oriented FAST and rotated BRIEF) descriptors.As an alternative to the SIFT (scale invariant feature transform) or the SURF (speeded up robust feature) descriptors, the ORB descriptor is tested to be rotation invariant and resistant to noise, and it is effective and stable for visual localization [28].The visual features on the query image are also described by ORB descriptors.The Bag-Of-Features (BOF) [29] approach is used in the visual map construction to determine a closed loop of the database camera trajectory [10].Therefore, the BOF approach can be utilized in this paper to retrieve the matched database images with the query image.Moreover, the best-matched database image can be determined by the 1-precision criterion [30] based on the ORB descriptors.The best-matched database image contains sufficiently matched visual features with the query image.
The epipolar geometry is a typical method for recovering the rigid transformation between two cameras by two-dimensional correspondences, i.e., feature correspondences.Let K D and K Q represent the calibration matrices of the database and query cameras, respectively.The position coordinates X D and X Q of the matched descriptors on the database image and the query image can be normalized by: The relationship between the normalized coordinates of matched descriptors in the query image and the database image can be described as: where E is the essential matrix.The essential matrix is used to present the camera transformation parameters up to an unknown scale between the database and query cameras in the following form: where t E represents the translation vector and R E denotes the rotation matrix.
[t E ] × is the skew-symmetric matrix of t E = [t x , t y , t z ] T .The essential matrix E is determined by the normalized position coordinates of the matched ORB descriptors on the query and database images.The essential matrix can be computed by the five-point algorithm [31].The translation vector t E and the rotation matrix R E can be extracted from the essential matrix E using singular value decomposition.
To clearly express spatial relationships, four coordinate systems are defined which are the indoor coordinate system OXYZ, the query camera coordinate system O Q X Q Y Q Z Q , the database camera coordinate system O D X D Y D Z D , and the query image coordinate system O I X I Y I .As shown in Figure 3, the epipolar constraint reflects the relative transformation between the query and database cameras.The scale-free position relationship between the query camera and the database camera can be described as: where P D and P R Q denote the 3D positions of database and query cameras, respectively.
ISPRS Int.J. Geo-Inf.2018, 7, x FOR PEER REVIEW 5 of 28 The relationship between the normalized coordinates of matched descriptors in the query image and the database image can be described as: (2) where E is the essential matrix.The essential matrix is used to present the camera transformation parameters up to an unknown scale between the database and query cameras in the following form: where E t represents the translation vector and E R denotes the rotation matrix.
The essential matrix E is determined by the normalized position coordinates of the matched ORB descriptors on the query and database images.The essential matrix can be computed by the five-point algorithm [31].The translation vector E t and the rotation matrix E R can be extracted from the essential matrix E using singular value decomposition.
To clearly express spatial relationships, four coordinate systems are defined which are the indoor coordinate system OXYZ , the query camera coordinate system , and the query image coordinate system I I I O X Y .As shown in Figure 3, the epipolar constraint reflects the relative transformation between the query and database cameras.The scale-free position relationship between the query camera and the database camera can be described as: where D P and R Q P denote the 3D positions of database and query cameras, respectively.

Absolute Scale Estimation Based on the PWLS Algorithm
Because the translation vector E t that is exacted from the essential matrix is always a unit vector, the query camera position can only be recovered from the database camera position up to an absolute scale.This means that without the absolute scale, the definite position of the query camera cannot be obtained.As shown in Figure 4, a smile logo on the wall is captured by the query camera and database camera.Based on the epipolar constraint, the rotation matrix E R and translation vector E t between the query and database cameras can be computed from the correspondences of image features.Because the database camera positions are known and stored in the visual map, due to the projective principle, the relative positions of the query camera can be estimated, but only up to an unknown scale, which is the scale ambiguity problem.The number of possible query camera positions, such as the positions Q  p , Q  p and Q  p , is infinite.However, the translation vectors that correspond to these possible positions are satisfied with:

Absolute Scale Estimation Based on the PWLS Algorithm
Because the translation vector t E that is exacted from the essential matrix is always a unit vector, the query camera position can only be recovered from the database camera position up to an absolute scale.This means that without the absolute scale, the definite position of the query camera cannot be obtained.As shown in Figure 4, a smile logo on the wall is captured by the query camera and database camera.Based on the epipolar constraint, the rotation matrix R E and translation vector t E between the query and database cameras can be computed from the correspondences of image features.Because the database camera positions are known and stored in the visual map, due to the projective principle, the relative positions of the query camera can be estimated, but only up to an unknown scale, which is the scale ambiguity problem.The number of possible query camera positions, such as the positions p Q , p Q and p Q , is infinite.However, the translation vectors that correspond to these possible positions are satisfied with: where s 1 , s 2 and s 3 are different scales of the translation vector.
ISPRS Int.J. Geo-Inf.2018, 7, x FOR PEER REVIEW 6 of 28 where s 1 , s 2 and s 3 are different scales of the translation vector.According to the above analysis, only if the absolute scale of the translation vector is determined can the definite position of the query camera be obtained.Therefore, in this paper, an absolute scale estimation method is proposed, which is based on the pre-constructed visual map, and more specifically, it is based on the best-matched database image and the corresponding point cloud.In the pre-constructed visual map, each pixel on the database image corresponds to a 3D point in the indoor coordinate system.Thus, the 3D positions of visual features in an indoor scene are known.As shown in Figure 5, a query image is captured by the user in an office room, and then the best-matched database image is retrieved in the visual map.In this way, the point cloud that corresponds to the best-matched database image can be found, and the 3D positions of the matched visual features on the query image are obtained.After finding the best-matched database image and the corresponding point cloud, it is convenient to obtain the image point position matrix of n matched features on the According to the above analysis, only if the absolute scale of the translation vector is determined can the definite position of the query camera be obtained.Therefore, in this paper, an absolute scale estimation method is proposed, which is based on the pre-constructed visual map, and more specifically, it is based on the best-matched database image and the corresponding point cloud.In the pre-constructed visual map, each pixel on the database image corresponds to a 3D point in the indoor coordinate system.Thus, the 3D positions of visual features in an indoor scene are known.As shown in Figure 5, a query image is captured by the user in an office room, and then the best-matched database image is retrieved in the visual map.In this way, the point cloud that corresponds to the best-matched database image can be found, and the 3D positions of the matched visual features on the query image are obtained.
ISPRS Int.J. Geo-Inf.2018, 7, x FOR PEER REVIEW 6 of 28 where s 1 , s 2 and s 3 are different scales of the translation vector.According to the above analysis, only if the absolute scale of the translation vector is determined can the definite position of the query camera be obtained.Therefore, in this paper, an absolute scale estimation method is proposed, which is based on the pre-constructed visual map, and more specifically, it is based on the best-matched database image and the corresponding point cloud.In the pre-constructed visual map, each pixel on the database image corresponds to a 3D point in the indoor coordinate system.Thus, the 3D positions of visual features in an indoor scene are known.As shown in Figure 5, a query image is captured by the user in an office room, and then the best-matched database image is retrieved in the visual map.In this way, the point cloud that corresponds to the best-matched database image can be found, and the 3D positions of the matched visual features on the query image are obtained.After finding the best-matched database image and the corresponding point cloud, it is convenient to obtain the image point position matrix of n matched features on the After finding the best-matched database image and the corresponding point cloud, it is convenient to obtain the image point position matrix X Q = u Q , v Q of n matched features on the query image and the corresponding 3D point position matrix X P = [x, y, z], where u The position of a certain feature can be represented by a homogeneous coordinate X i Q = [u i , v i , 1] T , and the corresponding 3D point can be represented by a homogeneous coordinate X i P = [x i , y i , z i , 1] T .According to the camera model [21], the relationship between the image point and the corresponding 3D point is defined as: where n d is the projective depth factor and s is the unknown absolute scale.The rotation matrix R E and the translation vector t E which is extracted from the essential matrix can be described as: ) Equation ( 6) also can be rewritten as: Based on the 2D-to-3D correspondence implied in Equation ( 9), three equality relationships can be obtained: Substitute Equation (12) into Equation (10) and Equation (11) to remove the depth factor n d : According to Equation ( 13), it can be found that the absolute scale s depends on four aspects: the position u i in the X I -direction on the query image, the rotation matrix R E , the translation vector t E , and the coordinate X i P .However, different from Equation ( 13), Equation ( 14) demonstrates that the absolute scale s is determined by the position v i instead of the position u i .It indicates that the absolute scale depends on the image positions of visual features not only in the X I -direction but also in the Y I -direction.For a pair of the query image and the best-matched database image, in most cases, the motion of matched visual features occurs both in the X I -direction and the Y I -direction in the query image coordinate system O I X I Y I .However, the direction in which the motion extends further depends on how the user moves and holds the smartphone.
When there is only one matched visual feature between the query image and the best-matched database image, the absolute scale can be determined by Equation ( 13) or (14).However, in practice, there is more than one matched visual feature.
Thus, the absolute scale estimation is an over-determined problem.The most common method to solve this problem is the least squares method [32], which is a traditional method in regression analysis for approximating the solution of an over-determined problem.
The least squares method assumes that all observations make an equal contribution to estimation results.However, for the absolute scale estimation, the moving distances of the matched visual features in different image directions make different contributions to the result of the scale estimation.Specifically, for one visual feature on the image, if the moving distance d X in the X I -direction is greater than the moving distance d Y in the Y I -direction, d X should make more contributions to the scale estimation, and vice versa.This is because a greater moving distance is more robust to the noise caused by feature detection and matching, which is beneficial to the absolute scale estimation.Therefore, fully considering of camera moving directions and pixel moving distances, a pixel-distance weighted least squares method is proposed for estimating the absolute scale.
For solving the estimation problem with the least squares method, the general measurement model for the least squares method can be written as: where y represents a measurement vector that contains noise, H is a known observation matrix, v is some additional noise, and x is the state variable to be estimated.Given an estimated variable x of x, the error ε can be expressed as: For the absolute scale estimation, according to Equations ( 13) and ( 14), the image position errors of a certain matched visual feature can be calculated by: where ), (r 12 − u i r 32 ), (r 13 − u i r 33 )], and M 2 = [(r 21 −v i r 31 ), (r 22 − v i r 32 ), (r 23 − v i r 33 )].M 1 X i P and M 2 X i P are treated as the measurement values, and s is the state variable that needed to be estimated by the measurement values.For the total number of n matched visual features between the query image and the best-matched database image, the error vector ε can be calculated by: where the measurement vector is: and the observation vector is: For a pair of the query image and the best-matched database image, the pixel-distance weight in the image direction X I or Y I is the ratio of the moving distance in this direction to the whole moving distance of all matched visual features in the image plane.As there are n matched features between the query image and the database image, and their positions on the query and database images are the pixel-distance weights for a matched visual feature can be calculated by: where w i 1 and w i 2 are the weights for the scale estimation in the image directions X I and Y I , respectively.For all the n matched visual features, the weighted matrix can be represented by: Then, the scale estimation problem is transformed into a weighted least squares form [33] in which the sum of the squared errors is minimized with the matrix W using the measurement vector y and the observation vector H according to the cost function: where ŝ is the estimated value of the absolute scale s.To minimize J with respect to ŝ, it is necessary to compute its partial derivative and set it to zero: Since W is a diagonal matrix, its transpose can be obtained by: Then, the estimated value ŝ of the absolute scale can be calculated by: Having performed the PWLM algorithm, the absolute position p A Q of the query camera can be calculated with the database camera position p D by: The absolute position of the query camera is determined by three factors: (1) the position p D of the database camera, which is stored in the visual map, (2) the rotation matrix R E and translation vector t E extracted from the essential matrix, and (3) the estimated value ŝ of the absolute scale.
In the initial localization stage, the camera positions corresponding to the first two query images are estimated by global localization, and the two absolute positions are used for the relative position estimation.In addition, the absolute position estimation is implemented once the drifts caused by the relative position estimation exceed the given drift threshold, which will be discussed in Section 3.

Map Interaction-Based Drift Detection for Local Localization
In this section, a map interaction-based drift detection method is presented for local localization.The relative position estimation is achieved based on the triangulation and solving the PnP problem.During local localization, drift detection is applied to the keyframes to estimate the drift distance, which will be particularly discussed in this section.

Relative Position Estimation Based on Triangulation and PnP
After the initial positions of the query camera has been estimated by the first two query images, the subsequent camera positions can be acquired by the relative position estimation.The relative position estimation is implemented based on two technologies: (1) the position estimation of common features by triangulation, and (2) the relative position estimation of the query camera by solving the PnP problem.The process of the relative position estimation does not depend on the pre-built visual map, but it iteratively acquires the query camera positions step by step.Since no image retrieval process is involved in the relative position estimation, the time consumption is lower than that of the absolute position estimation.
The main advantage of global localization is that there are no accumulative errors because the current camera position does not depend on the previous camera positions.However, it is necessary to retrieve the best-matched database image to the query image, which increases the computing time on image retrieval.Therefore, in the proposed monocular localization system, only the first two positions of the query camera are estimated by global localization.From the third query image, the relative position estimation method is executed until the drift exceeds the given threshold.
The relative position estimation method is illustrated in Figure 6.The first two positions p 1 Q and p 2 Q of the query camera are calculated by the global position estimation method using the two query images, the best-matched database images, and the positions p i D and p j D of the database camera.From the third query image, the 3D positions of visual features in the query image cannot be directly obtained from point clouds.Therefore, these 3D positions must be estimated based on the previous two query images and the corresponding query camera positions.First, the common features, namely the matched visual features, are found in the first two query images.Then these features are matched with the visual features in the third query image.In this way, the common features that are contained within the first three query images are selected for triangulation.As shown in Figure 6, the database image interval l in is defined as the distance between the two successive positions of the database camera.
ISPRS Int.J. Geo-Inf.2018, 7, x FOR PEER REVIEW 10 of 28 estimation.In addition, the absolute position estimation is implemented once the drifts caused by the relative position estimation exceed the given drift threshold, which will be discussed in Section 3.

Map Interaction-Based Drift Detection for Local Localization
In this section, a map interaction-based drift detection method is presented for local localization.The relative position estimation is achieved based on the triangulation and solving the PnP problem.During local localization, drift detection is applied to the keyframes to estimate the drift distance, which will be particularly discussed in this section.

Relative Position Estimation Based on Triangulation and PnP
After the initial positions of the query camera has been estimated by the first two query images, the subsequent camera positions can be acquired by the relative position estimation.The relative position estimation is implemented based on two technologies: (1) the position estimation of common features by triangulation, and (2) the relative position estimation of the query camera by solving the PnP problem.The process of the relative position estimation does not depend on the pre-built visual map, but it iteratively acquires the query camera positions step by step.Since no image retrieval process is involved in the relative position estimation, the time consumption is lower than that of the absolute position estimation.
The main advantage of global localization is that there are no accumulative errors because the current camera position does not depend on the previous camera positions.However, it is necessary to retrieve the best-matched database image to the query image, which increases the computing time on image retrieval.Therefore, in the proposed monocular localization system, only the first two positions of the query camera are estimated by global localization.From the third query image, the relative position estimation method is executed until the drift exceeds the given threshold.
The relative position estimation method is illustrated in Figure 6.The first two positions Q 1 p and Q 2 p of the query camera are calculated by the global position estimation method using the two query images, the best-matched database images, and the positions i D p and j D p of the database camera.From the third query image, the 3D positions of visual features in the query image cannot be directly obtained from point clouds.Therefore, these 3D positions must be estimated based on the previous two query images and the corresponding query camera positions.First, the common features, namely the matched visual features, are found in the first two query images.Then these features are matched with the visual features in the third query image.In this way, the common features that are contained within the first three query images are selected for triangulation.As shown in Figure 6, the database image interval in l is defined as the distance between the two successive positions of the database camera.The estimated feature point is defined as a 3D point with the estimated position of the common feature, and the database feature point is a 3D point with the true position of the common feature that is stored in the visual map.Triangulation is a common method for finding the positions of The estimated feature point is defined as a 3D point with the estimated position of the common feature, and the database feature point is a 3D point with the true position of the common feature that is stored in the visual map.Triangulation is a common method for finding the positions of common features in 3D space, which is selected as an appropriate approach for computing the 3D positions of estimated feature points.In this paper, the ray that starts from the camera lens center and passes through to the corresponding image point of the visual feature is called the line of sight.If the configuration of the cameras and the camera intrinsic parameters are known, the intersection of two lines of sight that correspond to the same visual feature theoretically can be computed, which is the 3D position of the visual feature in space.However, due to the noise that is caused by correspondence detection of visual features, the two lines of sight may not intersect at one point.Therefore, an optimization algorithm for estimating the 3D positions of visual features in triangulation is needed.
Suppose a pair of image points T in the first two query images constitute a point correspondence, and the two image points are projected from the same visual feature in space.The homogeneous coordinate vectors m 1 and m 2 of the point correspondence can be calculated by: where f 0 is a scale constant, which is approximately equal to the image size [34,35].The point correspondence relationship can be expressed in terms of the homogeneous coordinate vectors m 1 and m 2 : where F is the fundamental matrix that can be computed based on matched visual features, i.e., the common features, in the first two query images [24].Note that m 1 and m 2 represent the true positions of the matched features in the first and second query images, respectively.However, in practice, the measured correspondence point positions m 1 and m 2 , which are obtained by visual feature matching, may not coincide with m 1 and m 2 due to the noises caused by feature detection.Therefore, the measured correspondence points cannot strictly satisfy Equation (31).The optimal correction method, which was proposed in [35], is employed to estimate the 3D positions of common features between the first two query images.The optimal correction is an extended method of the first-order correction [36] for triangulation from two views.
In the relative position estimation, the 3D positions of common features for solving the PnP problem on the current query image are always triangulated based on the previous two query images and the corresponding positions of the query camera.The 3D positions of common features, i.e., the positions of estimated feature points, obtained by triangulation are used both to solve the PnP problem and detect drifts.
According to the result of the initial localization, the query camera positions p 1 Q and p 2 Q that correspond to the two query images are known.As shown in Figure 7, the 3D positions of common features, namely, the intersection of the two lines of sight that correspond to the same visual feature, such as p 1 , p 2 , p 3 , and p t , can be triangulated based on the query camera positions p 1  Q and p 2 Q , and the image positions of the common features in the first two query images.With the image positions and the 3D positions of common features, the query camera position corresponding to the third query image can be computed by means of solving the PnP problem.In practice, the number of common features among the first three query images is always greater than three, which can satisfy the requirement for solving for the camera pose relationship by utilizing 3D-to-2D correspondences.In this paper, an effective method, namely, the EPnP method [22], is employed to acquire the query camera positions.As shown in Figure 7, a set of three ordered query images are selected as an example to illustrate the relative position estimation method.For the first two query images, query camera positions are estimated by global localization, as discussed earlier.Then, the common features between the first two query images and the third query images are found, and the 3D positions of the common features are acquired by triangulation, where k is the index of the common features and c n is the total number of the common features.For the EPnP method, four virtual control points are required to express the 3D points in the form of a weighted sum.One of the virtual points is chosen as the centroid of k p , and the rest are chosen as the principal directions of the 3D points k p .The solution of the EPnP method is the 3D coordinates l C z of the virtual control points in the camera coordinate system.The transformation relationship, which is represented by the rotation matrix T R and translation vector T t , between the indoor coordinate system and the camera coordinate system, can be solved by [37]: Taking advantage of T R and T t , the camera position that corresponds to the third query image can be obtained.From the third query image, the camera positions are estimated by triangulation and solving the PnP problem until the drifts are detected and exceed the given threshold.Considering the cumulative errors that are generated in the process of position iterations, the local bundle adjustment [38,39] is utilized to optimize the poses of the camera in the relative position estimation.

Line Model Establishment for Drift Detection
Since the relative position estimation inevitably leads to cumulative errors due to position iterations, for the most visual SLAM systems, cumulative errors are commonly reduced by global optimization methods on the basis of closed loops.However, the global optimization methods cannot be employed in indoor localization scenarios, because user trajectories usually cannot form closed loops.Therefore, a drift detection method is introduced based on indoor map interaction for reducing cumulative errors (i.e., localization drifts).The main advantages of the introduced method are that there is no need to form closed loops, and that the drift correction is automatically triggered once the drift distance exceeds the given threshold.Based on the drift detection method, a switching strategy is achieved for monitoring cumulative errors.In the process of localization, when the estimated drift value is greater than the given threshold, the local localization mode will be interrupted and switch to the global localization mode to reduce accumulative errors.As shown in Figure 7, a set of three ordered query images are selected as an example to illustrate the relative position estimation method.For the first two query images, query camera positions are estimated by global localization, as discussed earlier.Then, the common features between the first two query images and the third query images are found, and the 3D positions of the common features are acquired by triangulation, where k is the index of the common features and n c is the total number of the common features.For the EPnP method, four virtual control points z l W : l = 1, 2, 3, 4 are required to express the 3D points in the form of a weighted sum.One of the virtual points is chosen as the centroid of p k , and the rest are chosen as the principal directions of the 3D points p k .The solution of the EPnP method is the 3D coordinates z l C of the virtual control points in the camera coordinate system.The transformation relationship, which is represented by the rotation matrix R T and translation vector t T , between the indoor coordinate system and the camera coordinate system, can be solved by [37]: Taking advantage of R T and t T , the camera position that corresponds to the third query image can be obtained.From the third query image, the camera positions are estimated by triangulation and solving the PnP problem until the drifts are detected and exceed the given threshold.Considering the cumulative errors that are generated in the process of position iterations, the local bundle adjustment [38,39] is utilized to optimize the poses of the camera in the relative position estimation.

Line Model Establishment for Drift Detection
Since the relative position estimation inevitably leads to cumulative errors due to position iterations, for the most visual SLAM systems, cumulative errors are commonly reduced by global optimization methods on the basis of closed loops.However, the global optimization methods cannot be employed in indoor localization scenarios, because user trajectories usually cannot form closed loops.Therefore, a drift detection method is introduced based on indoor map interaction for reducing cumulative errors (i.e., localization drifts).The main advantages of the introduced method are that there is no need to form closed loops, and that the drift correction is automatically triggered once the drift distance exceeds the given threshold.Based on the drift detection method, a switching strategy is achieved for monitoring cumulative errors.In the process of localization, when the estimated drift value is greater than the given threshold, the local localization mode will be interrupted and switch to the global localization mode to reduce accumulative errors.
There are three steps in the map interaction-based drift detection method: (1) establishing line models by database feature points; (2) removing unreliable features from the common feature set, and (3) computing the drift distance by the proposed ML-MLESAC algorithm.Before introducing the drift detection method, some line models should be established to find the inliers of the estimated feature points for the drift distance estimation.Because the line segments in the line models are fitted by the database feature points that are with well linear characteristics, the line segments are applicable to exclude the outliers of the estimated feature points.
To reduce the time consumption, a subset of the query images are selected as the keyframes for drift detection.The keyframes are employed to determine whether the relative position estimation should be interrupted and the absolute position estimation should be activated.This decision is based on the interaction between the pre-constructed dense 3D map and the estimated feature points.The strategy for keyframe selection is to choose the frames from every t key query images.The drift detection interval t key depends on the accuracy and efficiency requirements of localization.
As shown in Figure 8, if the t th query image is selected as a keyframe, the t th position of the query camera is solved by the previous two camera positions.Specifically, the common features are selected among the current t th query image, the previous (t − 1) th and (t − 2) th query images.In the relative position estimation, the estimated feature points, which have been used to solve the PnP problems, are triangulated from the previous two query images.That is, the positions of the estimated feature points are determined by the two factors: the image positions of the common features in the previous two query images, and the corresponding positions of the query camera.
ISPRS Int.J. Geo-Inf.2018, 7, x FOR PEER REVIEW 13 of 28 There are three steps in the map interaction-based drift detection method: (1) establishing line models by database feature points; (2) removing unreliable features from the common feature set, and (3) computing the drift distance by the proposed ML-MLESAC algorithm.Before introducing the drift detection method, some line models should be established to find the inliers of the estimated feature points for the drift distance estimation.Because the line segments in the line models are fitted by the database feature points that are with well linear characteristics, the line segments are applicable to exclude the outliers of the estimated feature points.
To reduce the time consumption, a subset of the query images are selected as the keyframes for drift detection.The keyframes are employed to determine whether the relative position estimation should be interrupted and the absolute position estimation should be activated.This decision is based on the interaction between the pre-constructed dense 3D map and the estimated feature points.The strategy for keyframe selection is to choose the frames from every key t query images.The drift detection interval key t depends on the accuracy and efficiency requirements of localization.
As shown in Figure 8, if the t th query image is selected as a keyframe, the t th position of the query camera is solved by the previous two camera positions.Specifically, the common features are selected among the current t th query image, the previous    In practice, the pedestrian walks on the ground, and at the same time, positions of the pedestrian are estimated by the visual localization method.The drifts caused by the relative position estimation mainly appear and accumulate in the X -and Y -directions.Therefore, in drift detection, only the X -and Y -coordinates of the 3D feature points are considered, which means that the 3D feature points are represented by their projections (i.e.2D points) on the XOY plane.Specifically, in this section, the estimated feature point is defined as a 2D point on the XOY plane, whose coordinates are same as the X -and Y -coordinates of the estimated 3D position of the common feature.In addition, the database feature points are defined as the 2D points with the same X -and Y - coordinates of the true 3D position of the common feature.The true positions of the common features can be found in the visual map by the same procedure as query image retrieval in the database (as described in Section 2.1).However, the search area is limited within a determined range near the keyframe.Each estimate feature point (shown as a purple dot in Figure 8) is associated with a database feature point (shown as a green dot in Figure 8) in the visual map.In practice, the pedestrian walks on the ground, and at the same time, positions of the pedestrian are estimated by the visual localization method.The drifts caused by the relative position estimation mainly appear and accumulate in the Xand Y-directions.Therefore, in drift detection, only the Xand Y-coordinates of the 3D feature points are considered, which means that the 3D feature points are represented by their projections (i.e.2D points) on the XOY plane.Specifically, in this section, the estimated feature point is defined as a 2D point on the XOY plane, whose coordinates are same as the Xand Y-coordinates of the estimated 3D position of the common feature.In addition, the database feature points are defined as the 2D points with the same Xand Y-coordinates of the true 3D position of the common feature.The true positions of the common features can be found in the visual map by the same procedure as query image retrieval in the database (as described in Section 2.1).However, the search area is limited within a determined range near the keyframe.Each estimate feature point (shown as a purple dot in Figure 8) is associated with a database feature point (shown as a green dot in Figure 8) in the visual map.
As stated, the indoor scene is reconstructed by the visual map that contains a mass of 3D points in the OXYZ coordinate system.If only the Xand Y-coordinates of the 3D points are considered, each 3D point stored in the visual map corresponds to a 2D point on the ground, i.e., the XOY plane.For all 3D points stored in the visual map, the corresponding 2D points form a floor plan of the indoor scene.
As shown in Figure 8, the points in the database feature point set S D , as a subset of the 2D points used to form the floor plan, are associated with the common features.Considering all the points in the set S D , breakpoints of the database feature points are detected by the covariance propagation method [40].When the breakpoints are obtained, fitted line segments can be acquired by the least squares method using the database feature points in S D between the breakpoints.Then, the line models are established based on the database feature points.A line model is composed of two elements: (1) a fitted line segment, and (2) a set of database feature points that are used for fitting the line segment.The direction of the line model is defined as the direction of the fitted line segment.
Before discussing the drift detection, it is necessary to analyze the effect of the user motion direction on the triangulation performance.For indoor localization scenarios, it often appears in a typical situation that a user who is holding a smartphone moves down a corridor and looks straight ahead.At the same time, user positions are estimated using the query images that are captured by the smartphone.As shown in Figure 9, two sequential query images are chosen to illustrate how to remove the unreliable features from the common feature set S C .The database feature points that are stored in the visual map are shown as pentacles with marks 1, 2, and 3 in Figure 9.The line l jcc joins the centers of the query camera, thereby indicating the moving direction of the user on the XOY plane.Each common feature in the set S C relates to two projection points in the (t − 2) th and the (t − 1) th query images.For a common feature, when the corresponding database feature point is closer to the line l jcc , the distance d dis , as shown in Figure 9, between the projection points is shorter.This phenomenon is also mentioned in [41], whose author emphasized that when the visual feature are close to the line l jcc , the depths of these features are difficult to triangulate.More seriously, if the features lie on the line l jcc , their depths cannot be calculated by triangulation.Therefore, to improve the performance of drift detection, a strategy is proposed for removing the unreliable common features with inaccurately estimated positions.
ISPRS Int.J. Geo-Inf.2018, 7, x FOR PEER REVIEW 14 of 28 As stated, the indoor scene is reconstructed by the visual map that contains a mass of 3D points in the OXYZ coordinate system.If only the X -and Y -coordinates of the 3D points are considered, each 3D point stored in the visual map corresponds to a 2D point on the ground, i.e., the XOY plane.For all 3D points stored in the visual map, the corresponding 2D points form a floor plan of the indoor scene.
As shown in Figure 8, the points in the database feature point set D S , as a subset of the 2D points used to form the floor plan, are associated with the common features.Considering all the points in the set D S , breakpoints of the database feature points are detected by the covariance propagation method [40].When the breakpoints are obtained, fitted line segments can be acquired by the least squares method using the database feature points in D S between the breakpoints.Then, the line models are established based on the database feature points.A line model is composed of two elements: (1) a fitted line segment, and (2) a set of database feature points that are used for fitting the line segment.The direction of the line model is defined as the direction of the fitted line segment.Before discussing the drift detection, it is necessary to analyze the effect of the user motion direction on the triangulation performance.For indoor localization scenarios, it often appears in a typical situation that a user who is holding a smartphone moves down a corridor and looks straight ahead.At the same time, user positions are estimated using the query images that are captured by the smartphone.As shown in Figure 9, two sequential query images are chosen to illustrate how to remove the unreliable features from the common feature set This phenomenon is also mentioned in [41], whose author emphasized that when the visual feature are close to the line jcc l , the depths of these features are difficult to triangulate.More seriously, if the features lie on the line jcc l , their depths cannot be calculated by triangulation.Therefore, to improve the performance of drift detection, a strategy is proposed for removing the unreliable common features with inaccurately estimated positions.Supposing that there is a common feature contained in the keyframe (the t th query image) and the previous two query images, the position of the database feature point that corresponds to the common feature is on the XOY plane.The 2D positions , p and , p of the query camera are associated with the previous   Supposing that there is a common feature contained in the keyframe (the t th query image) and the previous two query images, the position of the database feature point that corresponds to the common feature is p D = [x D , y D ] on the XOY plane.The 2D positions p t−2 C = x t−2 C , y t−2 C and of the query camera are associated with the previous (t − 2) th and (t − 1) th query images.Based on the camera positions, the normal vector n ⊥icc of the line l jcc can be computed by . Then, the distance d ⊥jcc between the database feature point and the line l jcc can be calculated by: The position relationship between the database feature point and line l jcc can be represented by the angle α es : Setting an angle threshold α ex for excluding unreliable features from the common feature set S C , if α es ≤ α ex , the feature that corresponds to α es is removed from S C .In this paper, the angle threshold α ex is set to be π/12.According to the angle threshold, an excluded area can be obtained on the XOY plane, and for the database feature points in this area, such as the database feature point with mark 2 in Figure 9, the corresponding common features are removed from the set S C .
In addition, if the database feature points are far from the query camera, the corresponding common features are also treated as unreliable features and excluded from the set S C .The triangulation that is achieved by two query cameras can be treated as a stereo triangulation.Thus, the line between the query cameras can be regarded as the baseline.As mentioned in [26,42], only the nearby visual features whose associated depths are less than 40 times the baseline can be safely triangulated.Therefore, in this paper, this threshold (i.e., 40 times the distance between the query cameras) is chosen for defining distant features.If the distance between the database feature point the query camera exceeds the threshold, the corresponding common feature is removed from S C .After removing these two types of common features, namely, the features that are close to the line l jcc or far from the query camera, the remaining features in S C are used for the drift distance estimation.

Drift Distance Estimation by the Line Model-Based MLESAC Algorithm
Each keyframe associates with one common feature set S C and two point sets: the database feature point set S D and the estimated feature point set S E .For a common feature in the set S C , there must be a corresponding database feature point in the set S D whose position is stored in the visual map, and a corresponding estimated feature point in the set S E whose position is acquired by triangulation.Based on different line models, the estimated feature point set S E can be divided into several subsets which are called estimated feature point subsets.Specifically, a subset S E contains the estimated feature points that are associated with the database feature points that belong to the same line model.Therefore, there must be a line model that corresponds to an estimated feature point subset.
As shown in Figure 10, there are some estimated feature points (shown as purple dots) in the subset S E that are associated with the database feature points (shown as green dots) belonging to the same line model.The database feature points stored in the visual map are captured by the RGB-D sensor, and therefore the database feature points in the same line model have well linear characteristics.However, due to the errors caused by triangulation, the linear characteristics of estimated feature points are weakened.Therefore, it is feasible to find the inliers (i.e., the estimated feature points with less errors) by taking advantage of the well linear characteristics of database feature points.ISPRS Int.J. Geo-Inf.2018, 7, x FOR PEER REVIEW 16 of 28 The aim of the proposed LM-MLESAC algorithm is to fit the points in each estimated feature point subset to a line, and then based on the fitted line, to find the inliers of the estimated feature points.The advantages of the proposed LM-MLESAC algorithm are (1) that a distance function is introduced for fitting the estimated feature points to a line, and (2) that a drift estimation algorithm is investigated based on the fitted line.

Suppose the subset E
S contains e n estimated feature points, i.e.,   represents the position of the estimated feature point in the form of . For an estimated feature point, e.g., the point i E X , it should be determined whether it is an inlier when two points X can be expressed as: (35) where f        0, 2 is the intersection angle between the directions of the fitted line and the line model.
Following the idea of MLESAC [27], for an estimated feature point, the probability distribution functions of errors by an inlier and an outlier are expressed as an unbiased Gaussian distribution and a uniform distribution which can be described as follows: where  is the standard deviation, and v is the search area of the given environment.All estimated feature points in E S are distributed in the search area v , which can be calculated by: max( ) min max( ) min (37) where  max( ) and  min( ) are the functions for computing the maximum value and the minimum value of a given set of values, respectively.
Combining the Gaussian distribution in i p d ( ) and the uniform distribution out i p d ( ) , a mixture model of the error probability distribution can be described as: The aim of the proposed LM-MLESAC algorithm is to fit the points in each estimated feature point subset to a line, and then based on the fitted line, to find the inliers of the estimated feature points.The advantages of the proposed LM-MLESAC algorithm are (1) that a distance function is introduced for fitting the estimated feature points to a line, and (2) that a drift estimation algorithm is investigated based on the fitted line.
Suppose the subset S E contains n e estimated feature points, i.e., represents the position of the estimated feature point in the form of For an estimated feature point, e.g., the point X i E , it should be determined whether it is an inlier when two points the subset S E are randomly sampled to fit a line.
A function d i is introduced based on the line model as a measurement for calculating the distance between the estimated feature point and the fitted line.As shown in Figure 10, the distance function d i depends on the length of the line segment between the estimated feature point and the fitted line, and the direction of the segment is perpendicular to the direction of the line model.The distance function d i for the estimated feature point X i E can be expressed as: where θ f ∈ [0, π/2] is the intersection angle between the directions of the fitted line and the line model.Following the idea of MLESAC [27], for an estimated feature point, the probability distribution functions of errors by an inlier and an outlier are expressed as an unbiased Gaussian distribution and a uniform distribution which can be described as follows: where σ is the standard deviation, and v is the search area of the given environment.All estimated feature points in S E are distributed in the search area v, which can be calculated by: where max(•) and min(•) are the functions for computing the maximum value and the minimum value of a given set of values, respectively.Combining the Gaussian distribution p in (d i ) and the uniform distribution p out (d i ), a mixture model of the error probability distribution can be described as: where γ is the mixing parameter and its initial value is set to be 0.5.When two estimated feature points in S E are randomly sampled to fit a line, the remaining (n e − 2) points in S E form a new set S E .The specific steps for computing the mixing parameter γ and the standard deviation σ are summarized in Algorithm 1.Since the (n e − 2) points in the set S E are independent, the likelihood of these points satisfies: where 2 is the likelihood of an estimated feature if it is an inlier, and p i outlier = (1 − γ)/v is the likelihood if it is an outlier.Because p S E is usually a small real value, for computational feasibility, the negative log likelihood is used to represent the likelihood of the points as follows: The likelihood L nll corresponds to the randomly sampled feature pair (e.g., X p E and X q E ) which is used to fit a line.To find the best-fitted line l * , repeated random trials of feature pair sampling in S E are implemented until L nll reaches L * nll , where: Taking advantage of the best-fitted line, inliers in S E can be selected by comparing the distance d i of each estimated feature point with the threshold d in = 1.96σ [27].An estimated feature point is an inlier if it satisfies d i ≥ d in .
Based on different line models, every estimated feature point in the set S E is checked for whether it is an inlier.All inliers are used to calculate the drift distance d dri f t .The drift distance d dri f t indicates accumulative errors that are caused by the relative position estimation, which is defined as: where n in is the total number of the inliers, X i E is the 2D position of an estimated feature point, and X i D is the 2D position of the corresponding database feature point.The specific processes of the drift detection method are shown in Algorithm 2. Select a keyframe (e.g., the t th query image) from the query image sequence; 2.
Find common features between the keyframe, the (t − 1) th and the (t − 2) th query images, and put them in the set S C ; 3.
Remove unreliable common features from the set S C ; 4.
Calculate the best-fitted lines by the line model-based MLESAC algorithm using the estimated feature points, and then select the inliers of the estimated feature points; 5.
Compute the drift distance d drift by the inliers.
A threshold d th is given to monitor the localization drifts, where d dri f t ≥ d th indicates that the estimated drift distance exceeds the threshold, in which case the localization system must switch to the global localization mode from the local localization mode.
As a summary, Figure 11 shows an overall process of the proposed monocular localization system.The system starts by performing the absolute position estimation, which yields the initial camera positions.Then, the subsequent positions of the camera are acquired by the relative position estimation, in which process the keyframes are chosen from the query image sequence and used to compute the drift distance d dri f t .If d dri f t is greater than the threshold d th , the absolute position estimation is activated to reduce the drifts, and after that the relative position estimation is resumed.As a switching strategy, drift detection is executed on the keyframes to determine whether the system should switch to the absolute position estimation to correct user positions or continue the relative position estimation.As a result, the cumulative errors of the localization system are limited within an expected range.
where in n is the total number of the inliers, As a summary, Figure 11 shows an overall process of the proposed monocular localization system.The system starts by performing the absolute position estimation, which yields the initial camera positions.Then, the subsequent positions of the camera are acquired by the relative position estimation, in which process the keyframes are chosen from the query image sequence and used to compute the drift distance drift d .If drift d is greater than the threshold th d , the absolute position estimation is activated to reduce the drifts, and after that the relative position estimation is resumed.As a switching strategy, drift detection is executed on the keyframes to determine whether the system should switch to the absolute position estimation to correct user positions or continue the relative position estimation.As a result, the cumulative errors of the localization system are limited within an expected range.

Simulation Results
To evaluate the performance of the proposed monocular localization system, an extensive experiment was conducted in different indoor scenes, such as an office, a corridor, and a gymnasium.The experiment was divided into two parts: the performance evaluation of global localization, and the performance evaluation of local localization.In addition, the performance of the drift detection method was evaluated in the local localization part.
For each experimental scene, the dense 3D map (namely, the visual map) was constructed using the method studied in [10].RGB-D sensors, such as the Kinect, were applied to construct the map.The RGB-D sensor, whose poses were stored in the visual map, was used as the database camera for localization.In monocular localization, the RGB images that were captured by the RGB-D sensor served as database images, and the corresponding poses of the database camera were devoted to

Simulation Results
To evaluate the performance of the proposed monocular localization system, an extensive experiment was conducted in different indoor scenes, such as an office, a corridor, and a gymnasium.The experiment was divided into two parts: the performance evaluation of global localization, and the performance evaluation of local localization.In addition, the performance of the drift detection method was evaluated in the local localization part.
For each experimental scene, the dense 3D map (namely, the visual map) was constructed using the method studied in [10].RGB-D sensors, such as the Kinect, were applied to construct the map.The RGB-D sensor, whose poses were stored in the visual map, was used as the database camera for localization.In monocular localization, the RGB images that were captured by the RGB-D sensor served as database images, and the corresponding poses of the database camera were devoted to estimating the query camera positions.To precisely evaluate the accuracy of the proposed monocular localization system, a smartphone was placed on a tripod to capture query images at each test point, and the manually measured positions of the test points were used as ground truth to calculate localization errors.
The selected scenes for experiments were typical indoor scenes with different visual and structural complexities, which could fully evaluate the performance of the proposed monocular localization system.
All data processing was performed in MATLAB 2017A (Mathwork, Natick, MA, USA) with an Core i5 CPU (Intel, Santa Clara, CA, USA) and an 8 GB RAM (Kingston, Fountain Valley, CA, USA).

Performance Evaluation of Global Localization
Estimating the absolute query camera positions requires database images and the corresponding database camera positions, which are provided by the visual map.However, the position accuracy of the database camera has an effect on the global localization performance, i.e., if there are some errors in the database camera positions, these errors will be incorporated into the query camera positions.To avoid this effect, the true positions of the database camera were utilized to estimate the query camera positions.The true positions of the database camera were manually measured in the process of the visual map construction with a step distance of 10 centimeters.In the practical implementation of monocular localization, there is a phenomenon that the localization accuracy is related to the density of the database images.A more dense distribution of database images is conducive to an improvement of the localization accuracy.Therefore, in the global localization experiment, three cases, whose the database image intervals (l in ) were set to be 30 cm, 50 cm, and 100 cm, were selected for testing the localization accuracy.In each cases, fifty test points were selected in different environments (i.e., an office, a gymnasium, and a corridor) for evaluating the performances of the local localization methods.Some examples of database images in different experimental environments are shown in Figure 12.
ISPRS Int.J. Geo-Inf.2018, 7, x FOR PEER REVIEW 19 of 28 estimating the query camera positions.To precisely evaluate the accuracy of the proposed monocular localization system, a smartphone was placed on a tripod to capture query images at each test point, and the manually measured positions of the test points were used as ground truth to calculate localization errors.The selected scenes for experiments were typical indoor scenes with different visual and structural complexities, which could fully evaluate the performance of the proposed monocular localization system.All data processing was performed in MATLAB 2017A (Mathwork, Natick, MA, USA) with an Core i5 CPU (Intel, Santa Clara, CA, USA) and an 8 GB RAM (Kingston, Fountain Valley, CA, USA).

Performance Evaluation of Global Localization
Estimating the absolute query camera positions requires database images and the corresponding database camera positions, which are provided by the visual map.However, the position accuracy of the database camera has an effect on the global localization performance, i.e., if there are some errors in the database camera positions, these errors will be incorporated into the query camera positions.To avoid this effect, the true positions of the database camera were utilized to estimate the query camera positions.The true positions of the database camera were manually measured in the process of the visual map construction with a step distance of 10 centimeters.In the practical implementation of monocular localization, there is a phenomenon that the localization accuracy is related to the density of the database images.A more dense distribution of database images is conducive to an improvement of the localization accuracy.Therefore, in the global localization experiment, three cases, whose the database image intervals ( in l ) were set to be 30 cm, 50 cm, and 100 cm, were selected for testing the localization accuracy.In each cases, fifty test points were selected in different environments (i.e., an office, a gymnasium, and a corridor) for evaluating the performances of the local localization methods.Some examples of database images in different experimental environments are shown in Figure 12.Before estimating the query camera positions, the most similar database images must be retrieved according to the query image.However, the correctness of the database image retrieval affects the localization accuracy.To eliminate the localization errors that are caused by mismatches in image retrieval, the most similar database images were selected by using the nearest ground-truth database images, which is a typical method for localization experiments [18].In addition, taking advantage of this method, the running time of image retrieval was excluded from the processing time, and moreover, the scale of the indoor scene did not affect the performance of localization methods.
To effectively evaluate the performance of the proposed global localization method, some representative localization methods were also implemented under the same conditions, such as the PnP method [19,20], the least squares method [21], the 2DTriPnP (2D version of triangulation and perspective n-point) method [18], and the epipolar geometry method [16,17].Each localization method was implemented in the three cases with different database image densities.By measuring the Euclidean distances between the true and estimated positions of the query camera on the XOY plane, the position errors by various localization methods were obtained.To demonstrate the performance improvement of the proposed method, an accuracy improvement rate im i was introduced, which had the following form: Before estimating the query camera positions, the most similar database images must be retrieved according to the query image.However, the correctness of the database image retrieval affects the localization accuracy.To eliminate the localization errors that are caused by mismatches in image retrieval, the most similar database images were selected by using the nearest ground-truth database images, which is a typical method for localization experiments [18].In addition, taking advantage of this method, the running time of image retrieval was excluded from the processing time, and moreover, the scale of the indoor scene did not affect the performance of localization methods.
To effectively evaluate the performance of the proposed global localization method, some representative localization methods were also implemented under the same conditions, such as the PnP method [19,20], the least squares method [21], the 2DTriPnP (2D version of triangulation and perspective n-point) method [18], and the epipolar geometry method [16,17].Each localization method was implemented in the three cases with different database image densities.By measuring the Euclidean distances between the true and estimated positions of the query camera on the XOY plane, the position errors by various localization methods were obtained.To demonstrate the performance improvement of the proposed method, an accuracy improvement rate i im was introduced, which had the following form: where e p and e c denote the average errors by the proposed method and the comparative method, respectively.The average errors, the maximum errors and the improvement rates of global localization are shown in Table 1.In this paper, the abbreviations, i.e., Avg., Max., and Impro., were used to denote the average errors, the maximum errors and the improvement rates, respectively.As shown in Table 1, in terms of the average localization errors by different methods, the performance of the proposed method, the PnP method, and the least squares method evidently outperformed the other two methods in the three cases.The reason was that the three methods took advantage of both visual and depth information to estimate query camera positions, whereas only visual information, namely, the database images without depth information, was used in the 2DTriPnP and epipolar geometry methods.Compared with the PnP method, the least squares method and the proposed method achieved better accuracy, especially in the condition of a less dense distribution of database images, as in Case 3, which demonstrates that the two direct scale estimation methods have the advantage in global localization.Because the camera moving direction was fully considered in the absolute scale estimation, the accuracy improvements of the proposed method reached at least 30.09% in all experimental cases, compared with the least squares method.
Figure 13 shows the cumulative distribution functions (CDFs) of the localization errors by the PnP, least squares, 2DTriPnP, and epipolar geometry, and the proposed methods.According to the results shown in Figure 13 and Table 1, the proposed method outperformed the other methods in terms of maximum and average localization errors.Specifically, the maximum localization errors by the proposed method was limited within 0.4 meters under the condition of different database image densities.The accuracy improvement rates in the three cases were at least 39.03%, 39.21%, and 30.09%, compared with other localization methods.Moreover, in Case 1 (database images with an interval of 30 cm), the proposed method achieved the best performance, and the average localization error reached 8.31 cm, which satisfied the requirements of most location-based services.
  where p e and c e denote the average errors by the proposed method and the comparative method, respectively.The average errors, the maximum errors and the improvement rates of global localization are shown in Table 1.In this paper, the abbreviations, i.e., Avg., Max., and Impro., were used to denote the average errors, the maximum errors and the improvement rates, respectively.As shown in Table 1, in terms of the average localization errors by different methods, the performance of the proposed method, the PnP method, and the least squares method evidently outperformed the other two methods in the three cases.The reason was that the three methods took advantage of both visual and depth information to estimate query camera positions, whereas only visual information, namely, the database images without depth information, was used in the 2DTriPnP and epipolar geometry methods.Compared with the PnP method, the least squares method and the proposed method achieved better accuracy, especially in the condition of a less dense distribution of database images, as in Case 3, which demonstrates that the two direct scale estimation methods have the advantage in global localization.Because the camera moving direction was fully considered in the absolute scale estimation, the accuracy improvements of the proposed method reached at least 30.09% in all experimental cases, compared with the least squares method.
Figure 13 shows the cumulative distribution functions (CDFs) of the localization errors by the PnP, least squares, 2DTriPnP, and epipolar geometry, and the proposed methods.According to the results shown in Figure 13 and Table 1, the proposed method outperformed the other methods in terms of maximum and average localization errors.Specifically, the maximum localization errors by the proposed method was limited within 0.4 meters under the condition of different database image densities.The accuracy improvement rates in the three cases were at least 39.03%, 39.21%, and 30.09%, compared with other localization methods.Moreover, in Case 1 (database images with an interval of 30 cm), the proposed method achieved the best performance, and the average localization error reached 8.31 cm, which satisfied the requirements of most location-based services.To comprehensively evaluate the performance of the proposed global localization method, position errors with respect to multiple confidence levels (25%, 50%, 78%, and 95%) are shown in Table 2.The position error with respect to the 50% confidence level was defined as a median error [18].The median errors by the proposed global localization method were 7.40 cm, 11.10 cm, and 14.88 cm, which meant, in the three cases, that 50% of the position errors were less than 7.40 cm, 11.10 cm, and 14.88 cm, respectively.For the 95% confidence level, the position errors by the proposed method were limited within 30 cm in all cases.In practical localization applications, an user usually captures query images facing the wall and keeps a distance of more than 30 cm from the wall.When position errors are less than 30 cm, the user position will not be located in the other side of the wall.This is an advantage of the proposed method in that the user is able to identify which room he/she is in.
For the proposed global localization method, since the absolute positions of the query camera were estimated based on the dense 3D map, the localization performance was affected by the accuracy of 3D point clouds acquired by the Kinect (Microsoft, Redmond, USA).The accuracy of 3D point clouds depended on two main aspects: (1) the inherent errors of the sensors equipped on the Kinect, and (2) the errors of calibration between visual and depth cameras.According to existing research [43], the random errors of the depth sensor on the Kinect increase with measuring distances, and the maximum error of the depth sensor is 4cm.As mentioned previously, the study of visual map construction has been achieved in our previous works [10].Before the map construction, the visual camera and the depth camera were calibrated by the method proposed in [44], and in this method, the accuracy of camera calibration was also investigated.To comprehensively evaluate the performance of the proposed global localization method, position errors with respect to multiple confidence levels (25%, 50%, 78%, and 95%) are shown in Table 2.The position error with respect to the 50% confidence level was defined as a median error [18].The median errors by the proposed global localization method were 7.40 cm, 11.10 cm, and 14.88 cm, which meant, in the three cases, that 50% of the position errors were less than 7.40 cm, 11.10 cm, and 14.88 cm, respectively.For the 95% confidence level, the position errors by the proposed method were limited within 30 cm in all cases.In practical localization applications, an user usually captures query images facing the wall and keeps a distance of more than 30 cm from the wall.When position errors are less than 30 cm, the user position will not be located in the other side of the wall.This is an advantage of the proposed method in that the user is able to identify which room he/she is in.
For the proposed global localization method, since the absolute positions of the query camera were estimated based on the dense 3D map, the localization performance was affected by the accuracy of 3D point clouds acquired by the Kinect (Microsoft, Redmond, USA).The accuracy of 3D point clouds depended on two main aspects: (1) the inherent errors of the sensors equipped on the Kinect, and (2) the errors of calibration between visual and depth cameras.According to existing research [43], the random errors of the depth sensor on the Kinect increase with measuring distances, and the maximum error of the depth sensor is 4cm.As mentioned previously, the study of visual map construction has been achieved in our previous works [10].Before the map construction, the visual camera and the depth camera were calibrated by the method proposed in [44], and in this method, the accuracy of camera calibration was also investigated.

Performance Evaluation of Local Localization
The local localization experiments were executed in the following experimental scenes: an office, a gymnasium, and a corridor.According to the room areas, the user trajectories (l u ) were selected with different lengths of: 10 m, 15 m, and 20 m.For each trajectory, the test points were uniformly selected with a step distance of 10 centimeters.Thus, there were 100, 150, and 200 test points selected in the office, the gymnasium, and the corridor, respectively.The true positions of test points were manually measured for estimating distance errors by drift detection and position errors by local localization.
In existing works, two methods are typically used to perform local localization: the EPnP method [22] and the EPnP in combination with local bundle adjustment (BA) method [25,26].In this paper, the two typical methods and the proposed method were tested under the same conditions to evaluate the performance of the proposed method.In addition to the position errors by local localization, the distance error e d of the drift estimation was calculated at each test point by the following equation: where d dri f t is the estimated drift distance, and d cam is the distance between the estimated and true positions of the query camera.For the drift estimation, d dri f t is computed by the LM-MLESAC algorithm.A general idea of the drift distance estimation is that, without removing the unreliable common features and outliers of the estimated feature points, all features in the set S C are employed to calculate localization drifts.Based on this idea, a general drift estimation method was implemented as a comparative method.The distance errors by the two drift estimation methods in different scenes are shown in Table 3.In the three scenes, the average and maximum distance errors by the proposed method were significantly less than those of the general method.The reason is that the proposed method removes the unreliable common features and the outliers of the estimated feature points.Therefore, the inliers yield a better performance of the drift estimation.A greater distance error in the drift estimation will falsely trigger global localization, especially when the drift threshold is set to be a small value, which will lead to an increase in the time consumption and poor user experience.
The cumulative distribution functions of distance errors by the proposed drift estimation algorithm are shown in Figure 14.In different experimental scenes, the performance of the proposed drift estimation algorithm was less affected by the lengths of the user trajectories.In the gymnasium scene, the drift estimation performance was better than in the other two scenes, because the gymnasium had a simple structure, and distinctive visual features were distributed on the wall.Based on the experiment, we conclude that, for the drift estimation, a simple building structure will yield a better result.To further evaluate the performance of the proposed drift estimation algorithm, the distance errors with respect to different confidence levels (25%, 50%, 78%, and 95%) are shown in Table 4. Since the distance errors by the drift estimation were limited within 30 cm in most situations, the recommended threshold th d should be equal to or greater than 30 cm.If the threshold th d was less than 30 cm, global localization would be frequently triggered due to the drift estimation errors.The local localization experiment of the proposed method was carried out based on the initial (global) localization.In the process of local localization, drift detection was implemented on the keyframes.Once the drift distance exceeded the given threshold, global localization was triggered on the condition that the interval of database images was set to be 30 cm (i.e., in l =30 cm).For the office scene, the gymnasium scene, and the corridor scene, the drift thresholds ( th d ) were set to be 30 cm, 40 cm, and 60 cm, respectively.In each experimental scene, the same initial positions obtained by the proposed global localization method were offered to various local localization methods.The drift detection interval key t was set to be 10 frames.
The performances of various local localization methods in different experimental scenes are shown in Table 5.The local localization experiment was also a comprehensive experiment on the proposed monocular localization system, because global (initial) localization and drift detection were included in the local localization experiment.With the aid of drift detection and global localization, the position errors by the local localization method were significantly reduced and the accumulative errors did not increase along with the lengths of the trajectories.For local localization, the query camera positions were iteratively calculated, which inevitably led to accumulative errors.For To further evaluate the performance of the proposed drift estimation algorithm, the distance errors with respect to different confidence levels (25%, 50%, 78%, and 95%) are shown in Table 4. Since the distance errors by the drift estimation were limited within 30 cm in most situations, the recommended threshold d th should be equal to or greater than 30 cm.If the threshold d th was less than 30 cm, global localization would be frequently triggered due to the drift estimation errors.The local localization experiment of the proposed method was carried out based on the initial (global) localization.In the process of local localization, drift detection was implemented on the keyframes.Once the drift distance exceeded the given threshold, global localization was triggered on the condition that the interval of database images was set to be 30 cm (i.e., l in = 30 cm).For the office scene, the gymnasium scene, and the corridor scene, the drift thresholds (d th ) were set to be 30 cm, 40 cm, and 60 cm, respectively.In each experimental scene, the same initial positions obtained by the proposed global localization method were offered to various local localization methods.The drift detection interval t key was set to be 10 frames.
The performances of various local localization methods in different experimental scenes are shown in Table 5.The local localization experiment was also a comprehensive experiment on the proposed monocular localization system, because global (initial) localization and drift detection were included in the local localization experiment.With the aid of drift detection and global localization, the position errors by the local localization method were significantly reduced and the accumulative errors did not increase along with the lengths of the trajectories.For local localization, the query camera positions were iteratively calculated, which inevitably led to accumulative errors.For example, in terms of the average errors by the PnP method, there was an increase from 58.86 cm in the office scene to 122.80 cm in the corridor scene.If the length of the trajectory was further extended, the position errors would continue to increase, which would ultimately leads to an unacceptable position error for indoor localization.Owing to the drift detection method, in the three experimental scenes, the maximum position errors were limited within 35 cm, 46 cm, and 68 cm, which obviously outperformed the comparative methods.The position errors and the CDFs of the position errors by various local localization methods are shown in Figure 15.With the drift thresholds of 30 cm, 40 cm, and 60 cm in the office scene, the gymnasium scene, and the corridor scene, the maximum errors by the proposed method were limited within 34.73 cm, 45.22 cm, and 67.69 cm, respectively.In these scenes, the maximum errors slightly exceeded the drift threshold.There are two main factors that led to this problem.First, not every query image is used for drift detection.Therefore, the drifts cannot be strictly limited within the given threshold.Second, the estimated drift distance contains errors, which results in the unsuccessful activation of global localization when the true drift distance is larger than the threshold.However, this problem did not seriously affect the performance of local localization, as demonstrated by the average errors by the proposed method.In addition, the experimental results also indicated that the local bundle adjustment algorithm contributed to the increase in the localization accuracy to an extent.
ISPRS Int.J. Geo-Inf.2018, 7, x FOR PEER REVIEW 24 of 28 example, in terms of the average errors by the PnP method, there was an increase from 58.86 cm in the office scene to 122.80 cm in the corridor scene.If the length of the trajectory was further extended, the position errors would continue to increase, which would ultimately leads to an unacceptable position error for indoor localization.Owing to the drift detection method, in the three experimental scenes, the maximum position errors were limited within 35 cm, 46 cm, and 68 cm, which obviously outperformed the comparative methods.The position errors and the CDFs of the position errors by various local localization methods are shown in Figure 15.With the drift thresholds of 30 cm, 40 cm, and 60 cm in the office scene, the gymnasium scene, and the corridor scene, the maximum errors by the proposed method were limited within 34.73 cm, 45.22 cm, and 67.69 cm, respectively.In these scenes, the maximum errors slightly exceeded the drift threshold.There are two main factors that led to this problem.First, not every query image is used for drift detection.Therefore, the drifts cannot be strictly limited within the given threshold.Second, the estimated drift distance contains errors, which results in the unsuccessful of global localization when the true drift distance is larger than the threshold.However, this problem did not seriously affect the performance of local localization, as demonstrated by the average errors by the proposed method.In addition, the experimental results also indicated that the local bundle adjustment algorithm contributed to the increase in the localization accuracy to an extent.To clearly illustrate the performance of the proposed local localization method, position errors with respect to multiple confidence levels (25%, 50%, 78%, and 95%) are shown in Table 6.According to the experimental results, the median errors by the proposed method were 17.46 cm, 19.99 cm, and 30.19 cm, corresponding to the 30-cm, 40-cm, and 60-cm drift thresholds, respectively.The experimental results demonstrated that the accumulative errors could be effectively reduced by the proposed localization method.Moreover, in all experimental scenes, the average and maximum errors by the proposed method were limited within an acceptable range that satisfied the requirements of most indoor localization and navigation applications.For the proposed localization system, the average computation time of the absolute position estimation, drift detection, and the relative position estimation were 0.647 s, 0.234 s, and 0.217 s, as shown in Table 7.As mentioned previously, database image retrieval was not applied in the experiments, so there was no computation time for image retrieval included in the system.It was obvious that the absolute position estimation needed more time to compute the epipolar relationship and estimate the absolute scale.Therefore, for the keyframes in local localization, drift detection was first employed to decide whether the absolute position estimation should be activated, which was a time-saving strategy for the localization system.To clearly illustrate the performance of the proposed local localization method, position errors with respect to multiple confidence levels (25%, 50%, 78%, and 95%) are shown in Table 6.According to the experimental results, the median errors by the proposed method were 17.46 cm, 19.99 cm, and 30.19 cm, corresponding to the 30-cm, 40-cm, and 60-cm drift thresholds, respectively.The experimental results demonstrated that the accumulative errors could be effectively reduced by the proposed localization method.Moreover, in all experimental scenes, the average and maximum errors by the proposed method were limited within an acceptable range that satisfied the requirements of most indoor localization and navigation applications.For the proposed localization system, the average computation time of the absolute position estimation, drift detection, and the relative position estimation were 0.647 s, 0.234 s, and 0.217 s, as shown in Table 7.As mentioned previously, database image retrieval was not applied in the experiments, so there was no computation time for image retrieval included in the system.It was obvious that the absolute position estimation needed more time to compute the epipolar relationship and estimate the absolute scale.Therefore, for the keyframes in local localization, drift detection was first employed to decide whether the absolute position estimation should be activated, which was a time-saving strategy for the localization system.

Conclusions
In this paper, a drift-aware monocular localization system is proposed, which is based on a pre-constructed dense 3D map for indoor environments.The proposed system can be divided into two modes: the global localization mode and the local localization mode.Considering the impact of the camera moving direction on the scale estimation, a pixel-distance weighted least squares algorithm is investigated in global localization for computing the absolute scale, which is used to acquire the absolute positions of the query camera.Because the accumulative errors caused by the relative position estimation are inevitable, a drift detection method is introduced, and the drift distance is estimated by the proposed line model-based MLESAC algorithm.In the process of localization, with the aid of drift detection, the system switches from the local localization mode to the global localization mode when the estimated drift distance is greater than the given threshold, which effectively reduces the accumulative errors that are caused by position iteration.According to the experimental results, under the condition of different database image densities, the average and maximum position errors by the proposed global localization method are limited within 16 cm and 36 cm, respectively, which outperforms the comparative localization methods.In the three experimental scenes, taking advantage of the proposed drift detection method, the maximum position errors by local localization are limited within 35 cm, 46 cm, and 68 cm, corresponding to the 30-cm, 40-cm, and 60-cm thresholds, respectively.Compared with the comparative methods, the position accuracies of global and local localization in the proposed monocular localization system are improved by at least 30.09% and 65.59%, respectively.The experimental results also show that the introduced drift detection method achieves a higher accuracy of the drift estimation, compared with the general method.In the future, how to manage a mass of pre-constructed 3D dense maps will be studied, and efficient image retrieval methods for indoor visual localization will be further investigated.
. The proposed system contains two modes: global localization and local localization.Local localization consisting of the relative position estimation and drift detection will be discussed in Section 3. The absolute position estimation, i.e., global localization, aims at acquiring query camera positions in the indoor coordinate system on the basis of pose-tagged database images which are stored in the preconstructed 3D map.Global localization can be divided into two parts: scale-free localization using the epipolar constraint, and absolute scale estimation based on the proposed PWLS algorithm.The global localization method is utilized for both the initial position estimation and position correction, once the drift distance exceeds the threshold in local localization.

Figure 1 .
Figure 1.Framework of proposed monocular localization system.

Figure 1 .
Figure 1.Framework of proposed monocular localization system.

Figure 2 .
Figure 2. Example of pre-constructed visual map in corridor scene.(a) Visual feature points in visual map; (b) Aligned point clouds in visual map.

K
represent the calibration matrices of the database and query cameras, respectively.The position coordinates D X and Q X of the matched descriptors on the database image and the query image can be normalized by:

Figure 2 .
Figure 2. Example of pre-constructed visual map in corridor scene.(a) Visual feature points in visual map; (b) Aligned point clouds in visual map.

Figure 3 .
Figure 3. Epipolar constraint between query camera and database camera.

Figure 3 .
Figure 3. Epipolar constraint between query camera and database camera.

Figure 4 .
Figure 4. Illustration of scale ambiguity problem in monocular localization.

Figure 5 .
Figure 5. Matched visual features and corresponding 3D points.(a) Query image; (b) Best-matched database image in visual map; (c) Matched visual features between query and database images; (d) Matched visual features on a portion of visual map.

Figure 4 .
Figure 4. Illustration of scale ambiguity problem in monocular localization.

Figure 4 .
Figure 4. Illustration of scale ambiguity problem in monocular localization.

Figure 5 .
Figure 5. Matched visual features and corresponding 3D points.(a) Query image; (b) Best-matched database image in visual map; (c) Matched visual features between query and database images; (d) Matched visual features on a portion of visual map.

Figure 5 .
Figure 5. Matched visual features and corresponding 3D points.(a) Query image; (b) Best-matched database image in visual map; (c) Matched visual features between query and database images; (d) Matched visual features on a portion of visual map.

Figure 6 .
Figure 6.Illustration of relative position estimation method.

Figure 6 .
Figure 6.Illustration of relative position estimation method.

Figure 7 .
Figure 7. Schematic diagram of relative position estimation by solving the PnP problem.

Figure 7 .
Figure 7. Schematic diagram of relative position estimation by solving the PnP problem.
the relative position estimation, the estimated feature points, which have been used to solve the PnP problems, are triangulated from the previous two query images.That is, the positions of the estimated feature points are determined by the two factors: the image positions of the common features in the previous two query images, and the corresponding positions of the query camera.

Figure 8 .
Figure 8. Schematic diagram of drift detection method based on dense 3D map interaction.

Figure 8 .
Figure 8. Schematic diagram of drift detection method based on dense 3D map interaction.
C S .The database feature points that are stored in the visual map are shown as pentacles with marks 1, 2, and 3 in Figure 9.The line jcc l joins the centers of the query camera, thereby indicating the moving direction of the user on the XOY plane.Each common feature in the set CS relates to two projection points in the   For a common feature, when the corresponding database feature point is closer to the line jcc l , the distance dis d , as shown in Figure9, between the projection points is shorter.

Figure 9 .
Figure 9. Illustration of removing unreliable key features for drift detection.

Figure 9 .
Figure 9. Illustration of removing unreliable key features for drift detection.
the subset E S are randomly sampled to fit a line.A function i d is introduced based on the line model as a measurement for calculating the distance between the estimated feature point and the fitted line.As shown in Figure 10, the distance function i d depends on the length of the line segment between the estimated feature point and the fitted line, and the direction of the segment is perpendicular to the direction of the line model.The distance function i d for the estimated feature point i E

Algorithm 1 : 1 n e − 2 n
Iterative solution for the parameters γ and σ1.Compute the likelihood p i inlier and p i outlier of each feature point in the set S E by p i inlier = γp in (d i ) andp i outlier = (1 − γ)p out (d i )with the parameters γ and σ (for initialization, γ 0 = 0.5 and σ 0 = parameter γ and the standard deviation σ by γ = step 1 to step 3 until convergence.

Algorithm 2 :
Drift detection based on map interaction 1.

XAlgorithm 2 : 2 . 3 . 4 .
is the 2D position of an estimated feature point, and i D X is the 2D position of the corresponding database feature point.The specific processes of the drift detection method are shown in Algorithm 2. Drift detection based on map interaction 1. Select a keyframe (e.g., the th t query image) from the query image sequence; Find common features between the keyframe, the   th and put them in the set C S ; Remove unreliable common features from the set C S ; Calculate the best-fitted lines by the line model-based MLESAC algorithm using the estimated feature points, and then select the inliers of the estimated feature points; 5. Compute the drift distance drift d by the inliers.A threshold th d is given to monitor the localization drifts, where drift th d d  indicates that the estimated drift distance exceeds the threshold, in which case the localization system must switch to the global localization mode from the local localization mode.

Figure 11 .
Figure 11.Overall process of proposed monocular localization system.

Figure 11 .
Figure 11.Overall process of proposed monocular localization system.

Figure 12 .
Figure 12.Examples of database images in different experimental environments (i.e., an office, a gymnasium and a corridor).

Figure 12 .
Figure 12.Examples of database images in different experimental environments (i.e., an office, a gymnasium and a corridor).

Figure 13 .
Figure 13.CDFs of average errors by various global localization methods.(a) CDFs of average errors in Case 1; (b) CDFs of average errors in Case 2; (c) CDFs of average errors in Case 3.

Figure 14 .
Figure 14.CDFs of distance errors by proposed drift estimation algorithm in different indoor scenes.

Figure 14 .
Figure 14.CDFs of distance errors by proposed drift estimation algorithm in different indoor scenes.

Figure 15 .
Figure 15.Position errors and CDFs of position errors by various local localization methods.(a) Position errors in office scene; (b) CDFs of position errors in the office scene; (c) Position errors in the gymnasium scene; (d) CDFs of position errors in the gymnasium scene; (e) Position errors in the corridor scene; (f) CDFs of position errors in the corridor scene.

Table 1 .
Position errors by global localization with various localization methods.

Table 1 .
Position errors by global localization with various localization methods.

Table 2 .
Performance of various global localization methods.

Table 3 .
Distance errors by drift estimation in different scenes.

Table 3 .
Distance errors by drift estimation in different scenes.

Table 4 .
Performance of proposed drift estimation algorithm.

Table 4 .
Performance of proposed drift estimation algorithm.

Table 5 .
Position errors by local localization with various localization methods.

Table 5 .
Position errors by local localization with various localization methods.

Table 6 .
Performance of local localization with various methods.

Table 7 .
Computation time of the proposed localization system.

Table 6 .
Performance of local localization with various methods.

Table 7 .
Computation time of the proposed localization system.