FloorVLoc: A Modular Approach to Floorplan Monocular Localization

Abstract: Intelligent vehicles for search and rescue, whose mission is assisting emergency personnel by visually exploring an unfamiliar building, require accurate localization. With GPS not available, and approaches relying on new infrastructure installation, artificial landmarks, or pre-constructed dense 3D maps not feasible, the question is whether there is an approach which can combine ubiquitous prior map information with a monocular camera for accurate positioning. Enter FloorVLoc: Floorplan Vision Vehicle Localization. We provide a means to integrate a monocular camera with a floorplan in a unified and modular fashion so that any monocular visual Simultaneous Localization and Mapping (SLAM) system can be seamlessly incorporated for global positioning. Using a floorplan is especially beneficial since walls are geometrically stable, the memory footprint is low, and prior map information is kept at a minimum. Furthermore, our theoretical analysis of the visual features associated with the walls shows how drift is corrected. To see this approach in action, we developed two full global positioning systems based on the core methodology introduced, operating in both Monte Carlo Localization and linear optimization frameworks. Experimental evaluation of the systems in simulation and a challenging real-world environment demonstrates that FloorVLoc performs with an average error of 0.06 m across a mission covering 80 m. Furthermore, we show that the FloorVLoc computational time is 0.1% of the update time of the underlying monocular SLAM algorithm, thus operating in real-time. Based on these results, we conclude that FloorVLoc can be effectively deployed on an intelligent vehicle for the indoor search and rescue application of interest.


Introduction
Indoor spaces are all around us (airports, malls, office buildings, museums, manufacturing factories, homes) and constitute the environment in which the vast majority of our productive life is spent. With the continued, rapid progress of integrating intelligent systems into the workflow of search and rescue, security, and manufacturing, a prerequisite for all such applications is the ability to autonomously navigate. While GPS/GNSS systems offer effective solutions in outdoor environments when unobstructed line-of-sight to satellites is available, they are inadequate in indoor settings. This is true even if high-sensitivity receivers are used, due to the relatively large errors caused by multi-path. Indoor localization continues to receive considerable attention and, due to the scope of the known challenges associated with it, remains an open problem.
Indoor global localization comes in many flavors due to the assumptions which are made about the environment and the type of platform and sensors that are used. However, the goals to achieve global localization are clear: distances from the platform to landmarks in the scene must be determined so that the resulting localization can be of a correct global scale, and proper data associations must be established between the observations and the prior map. To meet these goals in the context of indoor search and rescue, an approach is sought which:

• does not require the environment to be previously explored
• incorporates prior information which is readily available and easy to obtain
• effectively resolves the metric scale ambiguity
• provides a means to handle and correct drift in all degrees of freedom
• utilizes geometric map information of the environment structure which is stable and stationary (without photometric reliance) and
• keeps the prior map information required at a minimum.

The main contributions of this paper are:

1. A modular core methodology which can integrate any monocular-based local scene reconstruction with a floorplan to perform global localization, including automatic scale calibration.

2. A theoretical and experimental analysis of the conditions on the visual features associated with the walls necessary for the localization to be uniquely recovered.

3. Two full global positioning systems based on this methodology which perform continuous positioning at real-time computation rates. One approach utilizes visual odometry in between Floorplan Vision Vehicle Localization (FloorVLoc) linear optimization localization updates; the other formulates the system in a Monte Carlo Localization (MCL) framework with a FloorVLoc measurement model, correcting motion based on whatever wall information is present, as shown in Figure 2.

4. Experimental evaluation of the global positioning systems for indoor search and rescue applications in a challenging real-world environment, as well as real and simulation testing for a focused study of various aspects of the methodology.

Literature Review
One rather immediate indoor alternative to GPS/GNSS would be based on radio frequency, such as WiFi, Bluetooth, or Ultra-wideband (UWB), yet the requirement of new infrastructure installation or deployable equipment prevents it from being a feasible solution in the context of the present study [4][5][6][7]. Furthermore, due to the challenges associated with constructing and maintaining the map and the accuracy issues associated therewith, there has been an effort to integrate other sensors with radio frequency signals [8].
Image-Based Localization (IBL) [9][10][11][12] has broadly been a popular approach to localization due to the amount of information that is provided by images. State-of-the-art algorithms for IBL have continued to follow a 3D structure-based approach [13][14][15][16][17][18][19][20], where 2D-3D correspondences between image points and 3D world points are established for a query image, and then the camera pose is solved using an n-point-pose (PnP) solver. Understandably, due to outliers in the data associations, the whole algorithm is often placed inside a robust framework (e.g., RANSAC [21]). Alternatively, Deep Convolutional Neural Networks have also been deployed in order to achieve Absolute Pose Regression (APR) where, rather than relying on machine learning for extracting local features or handling outliers, the camera pose is regressed directly from the network [22][23][24][25][26][27][28]. However, 3D structure-based approaches have continued to significantly outperform the APR ones, since APR approaches are not guaranteed to generalize beyond the training data, and it has been shown that such approaches cannot outperform an image retrieval algorithm [29]. Furthermore, there are some issues of brittleness, as updating the map requires retraining the network. Therefore, there is quite a bit of room for further research in this area.
Similar IBL approaches have been used for indoors as well [18,30,31]. For example, fiducial markers have been utilized in an indoor localization framework; by specifying the 3D locations of such markers with respect to the indoor environment, the transformation between the marker and the camera can be obtained when recognizing it in the image [32][33][34]. Scale is resolved through specifying the size of the marker, and the orientation is computed based on the unique orientation of the marker pattern with respect to the camera. Another type of map that has been used is a multi-channel raster image of the building [35], where in addition to lines of the intersections between walls and the floor (like a floorplan), information about the current status of the environment is inserted, such as the contours of static objects and an occupancy grid map specifying information of available areas for the robot to navigate. In a similar fashion, another approach has been to create a distance function based map offline thereby providing information about both region occupancy and distances to the occupied areas [36]. One of the main drawbacks to approaching localization in such a way is the requirement to either install infrastructure in the environment or explore it before performing the localization. In fact, constructing and extending a large-scale 3D model of the world prior to online positioning often requires using a sensor of the same modality. Torii et al. investigated whether such models are required in contrast to 2D image retrieval-based methods relying on a database of geo-tagged images. Their experimentation shows that 3D large-scale models are not strictly required, and approaches which can combine image-based methods with local reconstructions produce accurate positioning solutions [37]. 
The focus on utilizing local reconstructions is the approach taken in this paper, but rather than relying on any type of large-scale 3D model, using a floorplan significantly alleviates the issue of map construction and maintenance, and further, it does not require environments to be explored beforehand. The approach in this paper utilizes local reconstructions and relates them to global planar information from the floorplan in order to achieve global positioning. While a monocular camera could be coupled with another type of map (e.g., a 3D point cloud lidar map) for localization by computing the similarity transformation between the local reconstruction and the global map [38], there are several advantages in choosing a floorplan to perform the indoor localization in this paper: First, from an application-context perspective, our method does not require the scene to be previously explored; second, using a floorplan provides more flexibility for data association, which is more critical in indoor settings where texture and features are not as prevalent as in outdoor settings; third, our method uses the minimal amount of prior map information required for global localization, an extra challenge which this paper solves.
While surprisingly somewhat rare, floorplans have been used as maps with localization systems before, and when they are, they are primarily coupled with sensors which provide depth information (lidar or RGBD cameras). With regard to approaches using RGBD cameras, Ito et al. [39] approached localization in a Bayesian state estimation framework using a particle filter (i.e., Monte Carlo Localization) with an RGBD camera and IMU for visual-inertial odometry, with the depth information provided by the depth camera utilized in the measurement process using a beam sensor model. WiFi signals were also integrated in order to help reduce multi-modality of the particle filter and increase the convergence speed. Winterhalter et al. [40] similarly utilized an RGBD camera and IMU to guide the proposal distribution, using the Google Tango tablet platform, where the depth data formed the basis of the measurement model and localization operated in an MCL framework. Ma et al. [41] developed an RGBD Simultaneous Localization and Mapping (SLAM) system which combines direct frame-to-keyframe image alignment with frame-to-plane alignment via a global graph optimization. Coupling the local motion estimation with the global estimation in the same optimization has the drawback of not being modular with respect to the underlying visual SLAM algorithm. Chu et al. [42] approached global localization using a monocular camera via full point cloud, line, and conservative free-space matching by using a scaled version of the reconstruction (therefore similar output to what one would get with an RGBD camera) from a semi-dense visual odometry algorithm. Computing the scale offline and then using it for matching means that such an approach cannot be performed online, as it avoids automatic scale calibration, a critical issue which is solved in this paper.
Mendez et al. [43,44] approached the problem from a different perspective, where semantic information is extracted from the images and then matched to the floorplan to localize using the angular distribution of the semantic labels. Similarly, Wang et al. focused on utilizing text from the names of stores and facade segmentation in order to localize within a shopping mall [45]. With their approach, the text of store names was determined via a Markov Random Field (MRF) and accuracy was limited to around 1-5 m. Further, in situations when text was missing, the system suffered severely. Depth information of the scene has also been acquired via lidar, which can be integrated into a pose-graph optimization rather than a filtering scheme. The data acquired by the lidar sensor can be utilized for odometry via scan-matching, and then an ICP algorithm can be deployed in order to use the floorplan and provide an a priori estimate of the pose, which is subsequently optimized according to the underlying pose graph [46,47].
Monocular SLAM/Visual Odometry (VO) has been researched extensively, including approaches such as ORB-SLAM2 [48], SOFT-SLAM [49], LDSO [50], NID-SLAM [51], ROCC [52], Fast-SeqSLAM [53], SVO [54], and PL-SVO [55]. Furthermore, the Bundle Adjustment algorithm has consistently been a state-of-the-art approach for jointly inferring 3D structure, typically sparse scene structure, and camera pose information from a series of images by minimizing the geometric reprojection error, and many SLAM formulations have relied on it [56][57][58][59][60]. Due to the known scale ambiguity of monocular vision, there has been a considerable amount of work on acquiring metric depth information either from Deep Convolutional Neural Networks [61][62][63][64][65] or by relating visual cues to structural information in the scene for which metric scale can be known. One approach is to take advantage of the objects present in the scene, known as Object-SLAM. The goal is to recognize objects in the environment and compute the scale by utilizing priors on the sizes of such objects [66][67][68][69][70]. In general, such a problem deals with both robust object detection and object measurement [71][72][73]. Other geometric structures (e.g., the ground plane) have also been exploited to help resolve the scale [74][75][76][77][78]. Lastly, the visual information has also often been integrated with other sensors such as an IMU [79][80][81][82] or lidar [83][84][85].
In order to continuously provide global positioning so that local reconstructions can be properly matched to the global structure, perceptual information should be integrated into a system which fuses multiple sources of information effectively. Traditionally, Extended Kalman Filter approaches were utilized for localization [86][87][88][89][90][91], but more recently, approaches based on particle filters, also known as Monte Carlo Localization (MCL), have been deployed, especially for mobile robot localization [92][93][94][95][96]. More rarely, however, has Bundle Adjustment been utilized to integrate various sources of external scene information with visual information into a formulation which can solve global positioning. One way is to augment the cost function to include these external constraints, and thereby perform a constrained Bundle Adjustment (BA) optimization [97][98][99]. Finally, one of the well-known challenges to overcome with vision-based navigation is drift. In contrast to sensors which can measure depth, monocular vision systems add another degree of freedom to the drift over time, namely scale. Thus, drift correction becomes of vital importance, and it is addressed in a variety of ways [100][101][102]. For example, in [103], per-frame global optimizations are performed over all current and past camera poses, while in other works techniques of loop closure or derivatives thereof have been required to resolve it (e.g., [104]). This paper addresses the drift issue by deriving the relationship between the vision-based data and the floorplan and explicitly providing mathematical criteria showing when global localization computations can be drift-free based on the geometric structure in view. While a floorplan typically offers more information, utilizing simply the planar information is sufficient to localize effectively and accurately.

Global Localization: The FloorVLoc Core
To give an overview of the global localization: a series of images taken by a moving monocular camera is processed by an underlying monocular SLAM algorithm, which outputs local motion and scene structure. This paper provides two methods for performing the global positioning: one in a Monte Carlo Localization framework and the other via an optimization framework. The MCL framework combines the camera egomotion with wheel encoders to form a motion model and relies on a FloorVLoc Core measurement model to refine the localization. The optimization framework incorporates the local camera motion with a floorplan-based scale estimate to provide visual odometry and solves for the refined poses via a FloorVLoc Core optimization.
This section presents the core methodology for integrating a monocular camera with a floorplan to provide global localization. An emphasis is placed on automatic scale calibration in an online fashion, a feature which is necessary in order to be able to use any underlying monocular SLAM system for online global positioning. In addition, an analysis is presented of the visual features associated with the walls to ensure unique global localization, which intrinsically corrects drift. Notably, the entire methodology can be derived from first principles, which yields precise conditions under which the localization is uniquely recoverable and drift-free.
Suppose that the platform is moving and an image is taken at time t by the camera located at position p ∈ R^3 with orientation denoted by R ∈ SO(3). Further, suppose features are extracted from the image such that the i-th feature point has normalized homogeneous image coordinates q̄_i ∈ R^3. Then the true location Q_i of the corresponding feature is

Q_i = p + λ_i R q̄_i,    (1)

where λ_i is referred to as the depth of the i-th feature point. Consider the estimated position p̂ and orientation R̂ produced by the localization algorithm at the same time instant. Then the 3D location Q̂_i can be expressed as:

Q̂_i = p̂ + λ̂_i R̂ q̄_i.    (2)

Assuming that q̄_i is a feature on the j-th wall of the floorplan, then both Q̂_i and Q_i are 3D points which should lie on that wall. Let π_j = (N_j, b_j) denote the planar properties of such a wall, where N_j is the normal vector and b_j is the distance from the plane to the world origin. While the true location Q_i is unknown, Q̂_i can be estimated, and consequently the depth λ̂_i, by using a ray-tracing algorithm. Under this assumption, it directly follows that:

N_j^T Q̂_i = N_j^T Q_i.    (3)

Applying this to (2) and left-multiplying by N_j^T yields

N_j^T p̂ + λ̂_i N_j^T R̂ q̄_i = N_j^T Q_i.    (4)

Since b_j = N_j^T Q_i is the distance from the origin of the world coordinate system to the j-th wall, (4) can be simplified to

λ̂_i = (b_j − N_j^T p̂) / (N_j^T R̂ q̄_i).    (5)

Consider next a sequence of images captured using the monocular camera and then processed using a monocular vision Simultaneous Localization and Mapping (SLAM) algorithm where the local motion and scene structure are recovered. Because the depth computation is up to an unknown scalar s which is common to all of the points, λ_i = (1/s) λ̂_i, and thus:

N_j^T p + (1/s) λ̂_i N_j^T R q̄_i = b_j.    (6)

Equation (6) can be used in either an optimization framework or integrated as a measurement process in a filtering framework. Both approaches will be presented, along with a study of certain planar criteria which must be met to uniquely compute the pose.
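To make the core constraint concrete, the following minimal sketch (NumPy; the wall, ray, and scale values are made up for illustration) ray-traces a depth as in (5) and evaluates the planar residual implied by (6):

```python
import numpy as np

def raytrace_depth(N, b, p_hat, R_hat, q):
    """Eq. (5): depth at which the ray p_hat + lam * (R_hat @ q) meets
    the plane (N, b); returns None if the ray is parallel to the plane."""
    denom = N @ (R_hat @ q)
    if abs(denom) < 1e-9:
        return None
    return (b - N @ p_hat) / denom

def planar_residual(N, b, p, R, q, lam_hat, s):
    """Eq. (6) rearranged: zero when the point at SLAM depth lam_hat,
    rescaled by s, lands exactly on the plane (N, b)."""
    return N @ (p + (lam_hat / s) * (R @ q)) - b

# Hypothetical setup: ceiling plane z = 3 m, camera at the origin with
# identity orientation, one feature ray, SLAM scale s = 2.
N, b = np.array([0.0, 0.0, 1.0]), 3.0
p, R = np.zeros(3), np.eye(3)
q = np.array([0.1, 0.2, 1.0])                # normalized homogeneous image ray
s = 2.0
lam_metric = raytrace_depth(N, b, p, R, q)   # metric depth along the ray
lam_hat = s * lam_metric                     # the same depth in SLAM units
```

With the correct scale the residual vanishes, while a wrong scale leaves a nonzero residual, which is precisely what the optimization and filtering frameworks penalize.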

Optimization Framework (FloorVLoc-OPT)
Suppose that R̂ is a good approximation to the true orientation R so that

R = ∆Ψ R̂,    (7)

where ∆Ψ is a small rotation matrix (approximately equal to the identity matrix I). Using the wedge operator (·)^∧, which converts 3D vectors into 3 × 3 skew-symmetric matrices,

a^∧ = [ 0  −a_z  a_y ; a_z  0  −a_x ; −a_y  a_x  0 ],    (8)

the small angle approximation gives

∆Ψ ≈ I + ∆ψ^∧,  ∆ψ = (δφ, δθ, δψ),    (9)

with δφ, δθ, and δψ being small angles. Using this approximation, (6) can be rewritten as

N_j^T p + (1/s) λ̂_i N_j^T (I + ∆ψ^∧) R̂ q̄_i = b_j.    (10)

The estimate for the pose is obtained by scaling the camera egomotion from the underlying monocular SLAM algorithm and transforming it based on a world reference pose. The world reference pose is chosen to be the last one for which a global localization solution was computed or the initial world pose. Defining p̄ so that p = (1/s) p̄, multiplying (10) through by s, and using ∆ψ^∧ R̂ q̄_i = −(R̂ q̄_i)^∧ ∆ψ, after some algebra, (10) becomes

N_j^T p̄ − λ̂_i N_j^T (R̂ q̄_i)^∧ ∆ψ − b_j s = −λ̂_i N_j^T R̂ q̄_i,    (11)

which in matrix form looks like:

[ N_j^T   −λ̂_i N_j^T (R̂ q̄_i)^∧   −b_j ] [ p̄ ; ∆ψ ; s ] = −λ̂_i N_j^T R̂ q̄_i.    (12)

Note that (12) provides one constraint on the variables of the problem. Assuming that a sufficiently large number of such equations are collected so that the associated matrices have full rank (more on this later), the resulting linear system of equations can be solved efficiently. If required, the small angle approximation can be relaxed by replacing R̂ ← ∆ΨR̂ after finding the angular perturbation and performing a new iteration of the above algorithm to refine the solution. Here ∆Ψ should be formed as the rotation matrix associated with ∆ψ and not its small-angle approximation.
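As an illustration of how the stacked system can be solved, here is a hedged sketch (NumPy; synthetic walls and random rays, helper names are our own). The data are generated under the linearized rotation so that the system is exactly consistent, and the seven unknowns [p̄, ∆ψ, s] are recovered by least squares:

```python
import numpy as np

def wedge(v):
    """The (.)^ operator: 3-vector -> 3x3 skew-symmetric matrix."""
    return np.array([[0., -v[2], v[1]],
                     [v[2], 0., -v[0]],
                     [-v[1], v[0], 0.]])

def floorvloc_row(N, b, lam_hat, R_hat, q):
    """One row of the linear system (12); unknowns x = [p_bar(3), dpsi(3), s]."""
    m = R_hat @ q
    row = np.concatenate([N, -lam_hat * (N @ wedge(m)), [-b]])
    return row, -lam_hat * (N @ m)

# Synthetic scene: four hypothetical planes (x=5, y=4, ceiling z=3, x=-2),
# chosen so the plane matrix A_pi has rank 4 as Lemma 1 requires.
walls = [(np.array([1., 0., 0.]), 5.0), (np.array([0., 1., 0.]), 4.0),
         (np.array([0., 0., 1.]), 3.0), (np.array([1., 0., 0.]), -2.0)]
p_true = np.array([1.0, 2.0, 0.2])
s_true = 2.5
dpsi_true = np.array([0.01, -0.02, 0.015])
R_hat = np.eye(3)
R_lin = (np.eye(3) + wedge(dpsi_true)) @ R_hat   # linearized true rotation

rng = np.random.default_rng(0)
rows, rhs = [], []
for N, b in walls:
    for _ in range(3):                            # 12 points >= 7 required
        q = rng.uniform(0.2, 1.0, size=3)
        q /= q[2]                                 # normalized image coordinates
        lam = (b - N @ p_true) / (N @ (R_lin @ q))  # metric depth, cf. Eq. (5)
        row, r = floorvloc_row(N, b, s_true * lam, R_hat, q)
        rows.append(row); rhs.append(r)

x, *_ = np.linalg.lstsq(np.asarray(rows), np.asarray(rhs), rcond=None)
p_est, dpsi_est, s_est = x[:3] / x[6], x[3:6], x[6]
```

The position is recovered as p = p̄ / s, mirroring the definition used in the derivation above.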

Planar Motion
The constraints developed in the previous section can be used as the basis for global localization for six degree of freedom platforms (e.g., quadcopters, mobile phones). However, because the platform of interest for search and rescue is a small robotic vehicle, in this section we elaborate on the specialization to planar motion. Suppose then that a ground vehicle is used to carry the camera, which is furthermore assumed to be rigidly mounted. Using the assumption that changes in height and in the roll and pitch angles are negligible, the orientation perturbation reduces to ∆ψ = (0, 0, δψ), and by defining c_ψ = cos(ψ) and s_ψ = sin(ψ) so that

R̂ = [ c_ψ  −s_ψ  0 ; s_ψ  c_ψ  0 ; 0  0  1 ],    (13)

where (·)_x, (·)_y, and (·)_z denote the x, y, and z components of the vector, respectively, (11) becomes (writing m_i = R̂ q̄_i and using (N_j)_z = 0 for the vertical walls):

(N_j)_x p̄_x + (N_j)_y p̄_y + λ̂_i ((N_j)_y (m_i)_x − (N_j)_x (m_i)_y) δψ − b_j s = −λ̂_i N_j^T m_i.    (14)

Again, in matrix form:

[ (N_j)_x   (N_j)_y   λ̂_i ((N_j)_y (m_i)_x − (N_j)_x (m_i)_y)   −b_j ] [ p̄_x ; p̄_y ; δψ ; s ] = −λ̂_i N_j^T m_i.    (15)
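A corresponding sketch for the planar case (NumPy; fixed hypothetical vertical walls and rays, data again generated under the linearized rotation so the system is exactly consistent) stacks rows of the form (15) and recovers the four unknowns:

```python
import numpy as np

def planar_row(N, b, lam_hat, m):
    """One row of (15); unknowns x = [p_bar_x, p_bar_y, dpsi_z, s],
    with m = R_hat @ q the rotated feature ray."""
    row = np.array([N[0], N[1],
                    lam_hat * (N[1] * m[0] - N[0] * m[1]), -b])
    return row, -lam_hat * (N @ m)

def rot_z(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0.], [s, c, 0.], [0., 0., 1.]])

# Hypothetical vertical walls x=5, y=4, x=-2: A_pi^v has rank 3 (Lemma 2).
walls = [(np.array([1., 0., 0.]), 5.0), (np.array([0., 1., 0.]), 4.0),
         (np.array([1., 0., 0.]), -2.0)]
p_true, s_true, dpsi_true = np.array([1.0, 2.0, 0.15]), 2.5, 0.02
R_hat = rot_z(0.3)                                   # a priori heading estimate
Z = np.array([[0., -1., 0.], [1., 0., 0.], [0., 0., 0.]])
R_lin = (np.eye(3) + dpsi_true * Z) @ R_hat          # linearized true rotation

qs = [np.array([0.8, 0.3, 1.0]), np.array([0.5, 0.9, 1.0])]
rows, rhs = [], []
for N, b in walls:
    for q in qs:                                     # 6 points >= 4 required
        lam = (b - N @ p_true) / (N @ (R_lin @ q))   # metric depth
        row, r = planar_row(N, b, s_true * lam, R_hat @ q)
        rows.append(row); rhs.append(r)

x, *_ = np.linalg.lstsq(np.asarray(rows), np.asarray(rhs), rcond=None)
(px, py), dpsi_z, s_est = x[:2] / x[3], x[2], x[3]
```

Note that the vehicle height p_true[2] never enters the system, since the vertical walls have zero z-component in their normals, consistent with the planar specialization above.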

Data Association
One of the key components to effectively perform the global positioning is the data association. Planar associations are found between the map points from the underlying SLAM algorithm across some designated time horizon and the global planes π_j provided by the building floorplan. The associations are found by performing ray tracing. During the initial stage of the algorithm, before any scale factor has been computed, all planar associations are kept; however, in subsequent stages, once a scale factor has been computed, the associations undergo an outlier filtering process. The planar fitting error ε_i^j derived from (6) is computed for every point-plane association (q̄_i, π_j):

ε_i^j = N_j^T (p + (1/s) λ̂_i R q̄_i) − b_j.    (16)

Note that the q̄_i are all the points across some time horizon (meaning points from the past are propagated to the current pose coordinate frame). Associations are only kept when |ε_i^j| < τ, where τ is some threshold. For the experimentation in this paper, τ was chosen to be 30 cm. Because some points have more error than others, either due to triangulation error or due to obstacles in the environment, the planar fitting error can be used to formulate a weight for each planar association, and a weighted least squares optimization can be performed using (15). The weights are defined to be:

w_i^j = exp( −(ε_i^j − µ_j)^2 / (2 σ_j^2) ),    (17)

where µ_j and σ_j are the mean and standard deviation, respectively, of the planar fitting error for all points on plane j. If there are fewer than 10 points associated with some plane, those points are discarded to mitigate outliers. In addition, due to the fact that a robotic vehicle has planar motion, only vertical planes constrain the localization, and thus only points on vertical planes are utilized for the optimization. Lastly, in order to be more robust to incorrect solutions, pose computations are only accepted if their corresponding positions and orientations are within a weighted threshold of the a priori estimated poses, with weight proportional to the time since the last computed localization solution.
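A minimal sketch of this gating and weighting step (NumPy; the Gaussian form of the weight is one natural reading of the per-plane statistics described above, and the helper name is hypothetical):

```python
import numpy as np

def association_weights(errors, plane_ids, tau=0.30, min_pts=10):
    """Outlier filtering and weighting for point-plane associations.

    errors: planar fitting errors (meters), one per association
    plane_ids: id of the associated wall for each point
    Returns (keep_mask, weights); weights follow a per-plane Gaussian
    score built from that plane's error statistics (mu_j, sigma_j).
    """
    errors = np.asarray(errors, dtype=float)
    plane_ids = np.asarray(plane_ids)
    keep = np.abs(errors) < tau                  # the |eps| < tau gate
    weights = np.zeros_like(errors)
    for j in np.unique(plane_ids):
        on_j = keep & (plane_ids == j)
        if on_j.sum() < min_pts:                 # too few survivors: drop plane
            keep[plane_ids == j] = False
            continue
        mu, sigma = errors[on_j].mean(), errors[on_j].std() + 1e-9
        weights[on_j] = np.exp(-(errors[on_j] - mu) ** 2 / (2 * sigma ** 2))
    return keep, weights

# Hypothetical batch: 12 inliers and 1 gross outlier on plane 0, plus a
# plane 1 with too few points, which gets discarded entirely.
errs = [0.01] * 12 + [0.5] + [0.02] * 3
ids = [0] * 13 + [1] * 3
keep, w = association_weights(errs, ids)
```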

Initialization
The vehicle starts from a known world pose, but an estimate for the scale needs to be computed. This is done by finding planar associations as described in Section 3.2.1 and obtaining a scale factor for each association (q̄_i, π_j) from (6):

ŝ_i^j = λ̂_i N_j^T R q̄_i / (b_j − N_j^T p).    (18)

The overall estimate is computed by taking the median of all scale factors:

ŝ = median_{i,j} { ŝ_i^j }.    (19)

Using the median is advantageous for handling potential outliers. In addition, here we consider points on all planes, including horizontal planes such as the ground or ceiling. While such points do not affect the position or orientation of the vehicle, they can be used to compute the scale, which is necessary for initializing the system using the floorplan. So in this context, such points are useful to include. This scale factor is then refined when the linear optimization is performed.
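A small sketch of the initialization (NumPy; hypothetical ceiling plane and rays, with one deliberately bad association to show the robustness of the median):

```python
import numpy as np

def association_scale(N, b, p, R, q, lam_hat):
    """Eq. (18): per-association scale estimate from one point-plane pair."""
    return lam_hat * (N @ (R @ q)) / (b - N @ p)

def initial_scale(assocs, p, R):
    """Eq. (19): median of the per-association scale estimates."""
    return float(np.median([association_scale(N, b, p, R, q, lam)
                            for N, b, q, lam in assocs]))

# Hypothetical example: ceiling z = 3 m seen from the origin; the true
# SLAM scale is 2, so SLAM depths are twice the metric depths.
N, b = np.array([0.0, 0.0, 1.0]), 3.0
p, R = np.zeros(3), np.eye(3)
rays = [np.array([0.1, 0.2, 1.0]), np.array([-0.3, 0.1, 1.0]),
        np.array([0.25, -0.15, 1.0])]
assocs = [(N, b, q, 2.0 * 3.0) for q in rays]   # lam_hat = s * lam, lam = 3
assocs.append((N, b, rays[0], 50.0))            # one bad association
s_hat = initial_scale(assocs, p, R)
```

The gross outlier barely perturbs the median, whereas it would dominate a mean.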

Uniqueness Criteria
In order to uniquely compute the pose of the camera using the planar associations, certain planar criteria must be met, which are manifested in the rank of matrix (12) for the general case and matrix (15) for a vehicle. For the general case, (12) is a linear system of equations with seven unknowns, and hence at least seven distinct visual features must be used. To determine the requirements for computing a unique camera pose based on planar criteria, let A_π be the matrix of the k world planes that are in view across the time horizon, defined as follows:

A_π = [ N_1^T  b_1 ; N_2^T  b_2 ; … ; N_k^T  b_k ] ∈ R^{k×4}.    (20)

Then the following result holds:

Lemma 1. A necessary and sufficient condition for computing a unique pose is that seven distinct feature points are visible and the matrix A_π has rank at least four.

Proof. See Appendix A.
In considering the uniqueness criterion for a robotic vehicle platform where planar motion occurs, define the matrix A_π^v to be:

A_π^v = [ (N_1)_x  (N_1)_y  b_1 ; … ; (N_k)_x  (N_k)_y  b_k ] ∈ R^{k×3}.    (21)

In this special case, the conditions in Lemma 1 can be reduced:

Lemma 2. A necessary and sufficient condition for computing a unique pose for a camera rigidly mounted onto a robotic vehicle is that four distinct feature points are visible and the matrix A_π^v has rank at least three.
Proof. See Appendix B.

Monte Carlo Localization Framework (FloorVLoc-MCL)
The second approach presented in this paper for monocular floorplan localization which can utilize any underlying monocular SLAM algorithm is based on a Monte Carlo Localization (MCL) [105] framework. MCL is achieved via the Bayesian recursive update

bel(x_t) = η p(z_t | x_t) ∫ p(x_t | x_{t−1}, u_t) bel(x_{t−1}) dx_{t−1},    (22)

where each particle x_t represents a pose hypothesis, z_t denotes the measurement, and u_t refers to the control, which in this context corresponds to odometry. The particles are propagated according to a motion model p(x_t | x_{t−1}, u_t) and then re-weighted based on a measurement model p(z_t | x_t) in order to estimate the posterior belief of the pose x_t. The particles are initialized uniformly using the a priori pose estimate and its uncertainty, and the scale component of each particle is initialized using the method given in Section 3.2.2.

Motion Model
More specifically, because the particles are elements of the Lie Algebra sim(2) (position, orientation, and scale), the scale of the scene must also be taken into account in the motion model so that each particle can properly and effectively propagate the scale forward. The position and orientation use a standard odometry motion model with Gaussian error terms in a three-vector parameterized form. The odometry is obtained using the wheel encoders for distance and the camera egomotion from the underlying SLAM algorithm for relative orientation. Let x̄_t denote the estimated current pose updated from the odometry motion model, x_{t−1} refer to the pose at the previous time step, and the corresponding poses from the camera egomotion be ξ_{t−1}, ξ_t. The estimated scale for each particle is computed by

s_t = ‖(ξ_t)_tr − (ξ_{t−1})_tr‖ / ‖(x̄_t)_tr − (x_{t−1})_tr‖,    (23)

where (·)_tr refers to the translation component of the pose vector. This way, the scale can be propagated in a way that is coupled to the propagation of the position of each particle.
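A sketch of the per-particle scale propagation (NumPy; poses are taken as (x, y, θ) vectors, a parameterization assumed here for illustration):

```python
import numpy as np

def propagate_particle_scale(x_prev, x_bar, xi_prev, xi_curr):
    """Eq. (23): new scale = SLAM egomotion distance over the particle's
    metric distance, both over the translation (x, y) components.
    x_prev/x_bar are metric particle poses; xi_* are unscaled camera poses."""
    d_cam = np.linalg.norm(np.asarray(xi_curr)[:2] - np.asarray(xi_prev)[:2])
    d_particle = np.linalg.norm(np.asarray(x_bar)[:2] - np.asarray(x_prev)[:2])
    return d_cam / max(d_particle, 1e-9)

# Hypothetical step: the particle moved 1.0 m while the unscaled camera
# egomotion covered 2.5 SLAM units, so this particle's scale becomes 2.5.
s = propagate_particle_scale([0.0, 0.0, 0.0], [1.0, 0.0, 0.0],
                             [0.0, 0.0, 0.0], [2.5, 0.0, 0.0])
```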

Measurement Model
The measurement model of MCL is used to integrate the global planar information provided by the floorplan and the reconstructed local scene structure in order to re-weight the propagated pose hypotheses. The general idea is to compute the planar fitting error for all of the points, placing them on their respective walls, according to each specific pose hypothesis. This formulates the log-likelihood of the measurement. In order to accomplish this, associations are established according to the method described in Section 3.2.1, except that no outlier filtering is performed here. The measurement log-likelihood is then defined as

log p(z_t | x_t) = −Σ_{(i,j)} ρ( (N_j^T (p + (1/s) λ̂_i R q̄_i) − b_j) / σ_z ),    (24)

where R is the rotation matrix corresponding to the orientation component of x_t, p is the translation component of x_t, and s is the scale component of x_t. Furthermore, ρ(·) is a robust kernel (e.g., Huber) to help mitigate outliers and σ_z is a saturation term. In the experimentation presented in this paper, the odometry between the previous and current poses, d_{t1→t2}, along with a constant κ, were used to define σ_z as:

σ_z = κ / d_{t1→t2},    (25)

which effectively places less weight on measurements with small baselines. The justification for this is that the accuracy of Bundle Adjustment (from multi-view geometry) increases with longer baselines. The tuning parameter κ was chosen to be 100 for the experimentation presented later in the paper. Each particle is then updated by the following weight policy, which experimentally was verified to work well:

w_t ∝ exp( log p(z_t | x_t) ),    (26)

with the weights normalized over all particles. After re-weighting, all of the particles are re-sampled using stochastic universal resampling, and the process repeats recursively according to (22).
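A hedged sketch of the measurement weighting (NumPy; the Huber kernel and the normalized-exponential weight are plausible instantiations of ρ(·) and the weight update, and the distance units assumed for κ = 100 are our own):

```python
import numpy as np

def huber(r, k=1.345):
    """Huber robust kernel rho(.): quadratic near zero, linear in the tails."""
    a = np.abs(r)
    return np.where(a <= k, 0.5 * a**2, k * (a - 0.5 * k))

def log_likelihood(errors, baseline, kappa=100.0):
    """Eqs. (24)-(25): robustified planar fitting errors with a
    baseline-dependent saturation sigma_z = kappa / baseline."""
    sigma_z = kappa / max(baseline, 1e-9)
    return -float(np.sum(huber(np.asarray(errors) / sigma_z)))

def particle_weights(log_likes):
    """Eq. (26) in normalized form, with max-subtraction for stability."""
    ll = np.asarray(log_likes, dtype=float)
    w = np.exp(ll - ll.max())
    return w / w.sum()

# Two hypothetical pose hypotheses: one fits the walls well, one does not.
w = particle_weights([log_likelihood([0.05] * 20, baseline=50.0),
                      log_likelihood([0.4] * 20, baseline=50.0)])
```

A short baseline inflates σ_z, washing out the scaled errors, so such measurements discriminate less between hypotheses, matching the intent stated above.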

Experimental Evaluation
In order to evaluate the performance of our proposed approach, we collected data using a robotic vehicle equipped with a uEye IDS monocular camera (IDS Imaging Development Systems GmbH); the camera was forward facing and mounted at a height of about 15 cm off the ground. The external odometry was based on wheel encoders, and the vehicle sensor array also included an Odroid XU4 (Ameridroid High-Performance Electronics) onboard computer and an Arduino Nano. The odometry from the wheel encoders was somewhat poor, yielding an error of about 5%. The onboard camera provided up to 57 fps at a full resolution of 3.2 MP with a field of view of 65.6 × 51.6 degrees. The length of the vehicle was 53.5 cm and its width was 28.1 cm, which is important to note in light of the accuracy of the systems. For the tests presented, the vehicle was used for data acquisition, and the positioning systems were run offline on a laptop computer. Two image resolutions were used when running the local monocular SLAM system [48] employed for all of the experimentation: the optimization framework used 2056 × 1542, and the MCL framework used 640 × 480. For both systems, the initial ground-truth pose in the floorplan was given. Note that the floorplan, which was obtained from building managers, contained the ceiling height. Details on how ground truth was obtained are given in Appendix C.

Comparison to Related Methods
The positioning algorithms presented in this paper, FloorVLoc-OPT and FloorVLoc-MCL, which correspond to the optimization and MCL frameworks, respectively, were compared to each other as well as to a state-of-the-art SLAM algorithm, ORB-SLAM. Because ORB-SLAM is a monocular SLAM algorithm, the map built is with respect to the initial camera pose and the trajectory is without scale. We align ORB-SLAM's initial pose with the ground-truth initial world pose and scale the trajectory in two different ways: The first way computes the ratio of the distance between two known ground-truth positions and the corresponding unscaled distance between two ORB-SLAM positions,

s = ‖p_gt(t_2) − p_gt(t_1)‖ / ‖p̂(t_2) − p̂(t_1)‖,    (27)

where p_gt(t) and p̂(t) denote the ground-truth and unscaled ORB-SLAM positions at time t, respectively.
The second way uses the distance obtained from the external odometry provided by the wheel encoders rather than the translation between two ground truth poses:

s = d_{t_1→t_2} / ‖p̂(t_2) − p̂(t_1)‖,    (28)
where d_{t_1→t_2} is the wheel encoder odometry between the two selected poses used for the scale computation. Note that the latter can be implemented in real-time, while the former is useful for experimentation. The results for the FloorVLoc-MCL and FloorVLoc-OPT global positioning systems, as well as the comparison approaches, are given in Table 1.

Table 1. Results for the FloorVLoc global positioning systems across a trajectory of about 8000 cm. The errors are found by taking the error vector between each pose and its corresponding ground truth and computing the mean error vector (ē) and the covariance matrix, from which the standard deviation σ is obtained.
[Table 1 columns: Method | ē_pos (cm) | σ_pos (cm) | ē_ori (rad) | σ_ori (rad)]

Trajectories for both FloorVLoc algorithms are shown in Figure 3a,b. The vehicle started in a more open area and then proceeded down a long corridor, making a loop around an island of offices containing rather narrow corridors and very tight turns, and then returning to the open area. During the execution, there were dynamic obstacles (e.g., people walking in front of the vehicle at multiple instances, sometimes right in front) as well as static obstacles (e.g., cabinets) and other challenging aspects such as significant illuminance variation due to sunlight entering windows. Handling dynamic obstacles is usually a front-end task and one which has been directly tackled in the literature [106,107]; its adverse effect typically manifests as tracking difficulty or as reconstructed points which need to be treated as outliers for floorplan-based localization. Due to the modular design of the global positioning systems in this paper, the focus on dealing with dynamic obstacles was placed on properly handling outliers in the reconstructed scene, as discussed in Section 3.2.1. First of all, it can be seen that the FloorVLoc algorithms significantly outperformed the ORB-SLAM transformed trajectories. Whereas both ORB-SLAM variants expectedly suffered from significant drift, the FloorVLoc approaches properly handled the drift as shown in Figure 4, which was one of the main benefits of incorporating the floorplan into the global positioning. It can also be seen that the main source of error was in the X direction, which is related to the fact that this was the direction of the corridor, and consequently, the majority of the motion was in the X direction. Furthermore, it can also be seen that the motion in the Y direction was properly constrained, as the error was quite small and relatively consistent throughout, despite the trajectory containing a loop around the island of offices.
Even though people walked around the moving vehicle at three separate instances during the execution, the error remained low, with a standard deviation even less than the length of the vehicle itself.

One of the interesting places in the experimentation occurred around frame 1534, where, as Figure 5 shows, the uncertainty of the underlying monocular SLAM algorithm rose. The vehicle was in the process of turning out of the island of offices and encountered a situation where primarily a white wall and the floor were in view, a classic place for tracking failure, and indeed tracking often did fail at this point. Figure 5 also shows a rise in the uncertainty of the Bundle Adjustment optimization during that time period: it plots the norm of the pose inverse Hessian of the local Bundle Adjustment optimization at the last time a particular frame was part of the optimization. Understandably, this uncertainty cannot be directly related to the global pose uncertainty, both because of the correlations which exist between camera poses and because the uncertainties for the poses in the local BA are expressed with respect to the keyframes, which were fixed to handle the gauge freedom. Nevertheless, it is indicative of places in the trajectory where the underlying local SLAM had more difficulty. In fact, errors in the local reconstruction from the windowed BA optimization propagate to the global positioning system, similar to how sensor errors propagate into the error of any localization system. This is inherent to the design of the system, chosen to preserve modularity so that any underlying SLAM algorithm can be used and computational cost can be kept at a minimum. The important point is that while SLAM systems must deal with long-term drift, the FloorVLoc methods provide the means to correct drift and localize with respect to the building. Accordingly, the orientation error shown in Figure 6, particularly for FloorVLoc-MCL, increased during that time; however, the error subsequently decreased as the walls of the floorplan guided its correction. For that area, the orientation error for FloorVLoc-OPT increased, but not as much.
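The uncertainty indicator plotted in Figure 5 can be illustrated with a small sketch; here J is assumed to be the stacked Jacobian of the local BA residuals with respect to one frame's pose, and the indicator is the norm of the inverse Gauss-Newton Hessian (our reading of the plotted quantity, not the authors' exact code).

```python
import numpy as np

def pose_uncertainty_indicator(J):
    """Norm of the inverse Gauss-Newton Hessian H = J^T J for one frame's
    pose.  Larger values flag frames where the local reconstruction was
    less constrained (illustrative sketch)."""
    H = J.T @ J  # Gauss-Newton approximation of the Hessian
    return np.linalg.norm(np.linalg.inv(H))
```

A weakly constrained frame (small Jacobian entries) yields a larger indicator, matching the rise seen around frame 1534.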
Note that the OPT approach uses a larger horizon when computing the pose, whereas the MCL approach uses only the points currently viewed by the camera to update the motion in the measurement model. One benefit of a larger horizon is that challenging tracking situations can be handled more effectively, as the results show. However, as the horizon increases, the influence of drift naturally increases as well, since drift is strictly eliminated only when just the points in the current BA optimization are used. FloorVLoc-OPT used a horizon of 15 keyframes, which provided a good balance. Overall, while both FloorVLoc approaches performed very well on this challenging dataset, the optimization-based approach tended to outperform the MCL approach. For the MCL approach, the initial pose was assumed to be known with an uncertainty modeled by zero-mean Gaussian noise with a standard deviation of 10 cm in position and 10 degrees in orientation. Ten runs of the algorithm were performed in order to account for its stochastic nature, and the results were averaged. Figures 4 and 6 plot the averaged results for the position and orientation errors. In Figure 7, the average error as well as the standard deviation of the error across all runs are plotted.

FloorVLoc-MCL
In this section, we perform an ablation study of the localization algorithms under more controlled simulation environments. In order to further test specifically the MCL approach, which is probabilistic in nature, we constructed an indoor building in simulation using Unreal Engine 4 and Microsoft Airsim. Refer to Figure 8 for a view of the indoor simulation environment. The environment was built to emulate specific characteristics of indoor buildings. Obstacles such as a cabinet and a bookshelf were placed to serve as planar obstacles, and other obstacles such as a sofa and a chair were added to provide exposure to non-planar obstacles. The simulation building also contained windows with sunlight glare, a door, a lamp hanging from the ceiling, and other features, so that the environment provided a reasonably realistic scenario for additional testing. Odometry was simulated by taking the ground-truth positions, computing the distances between them, and accumulating a 5% error on each distance, similar to what happens with the wheel encoders in the real-data testing. The vehicle's trajectory was designed so that the vast majority of the environment would be in view at some point. The camera parameters of the simulation camera were chosen to be similar to those of the real IDS camera, and the camera was placed at a similar height (15 cm). Ten runs of the algorithm were performed in order to account for the stochastic nature of the MCL framework, and the positioning results were subsequently averaged. The initial pose was perturbed from the ground truth with a position error of 7.07 cm and an orientation error of 0.02 rad. The orientation error rose to 14° mid-trajectory due to underlying uncertainty in the Bundle Adjustment, where tracking was challenged by a tight turn with white walls. Despite this, the FloorVLoc method successfully reduced the orientation error, using the floorplan to guide the correction, down to 1° by the end of the trajectory.
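The odometry simulation described above can be sketched as follows; the uniform ±5% per-step error model is our interpretation of "accumulating a 5% error on the distance", and the function name is ours.

```python
import math
import random

def simulate_odometry(gt_positions, err_frac=0.05, seed=0):
    """Simulate wheel-encoder odometry from ground-truth 2D positions:
    each step's true distance is perturbed by up to +/- err_frac
    (an interpretation of the paper's setup; the exact error model
    is an assumption)."""
    rng = random.Random(seed)
    noisy = []
    for a, b in zip(gt_positions, gt_positions[1:]):
        d = math.dist(a, b)  # true step length
        noisy.append(d * (1.0 + rng.uniform(-err_frac, err_frac)))
    return noisy
```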
The trajectory, along with the floorplan of the environment, is shown in Figure 9. The average position error in the X and Y directions was 13.94 cm and 12.59 cm, respectively. The average standard deviation of the errors in the X and Y directions was 10.29 cm and 18.57 cm, respectively. The average orientation error was 0.086 rad, and the standard deviation of the orientation error was 0.080 rad. The trajectory length was about 1570 cm. As the results demonstrate, the magnitude of the error is consistent between the real-data results and the simulation results.

Figure 9. Trajectory for FloorVLoc-MCL in Tech-S-2. The trajectory was specified so that the vast majority of the environment would be in view at some point. The positioning accuracy is similar to that of the real-world testing, which demonstrates consistency between performance in simulation and in the real building.

FloorVLoc-OPT
In this section, we perform controlled experiments for single camera localization based on the planarity in the scene and study the effect of various sources of error including noise due to triangulation of points, data associations, and estimated prior camera pose.
The synthetic dataset used in this section, Tech-S-1, was generated with various environments and artificial floorplans (refer to Figure 10), according to a number of varying parameters. More specifically, all of the points were perturbed with artificial pixel noise, uniformly distributed within ± half a pixel. For the various synthetic tests, parameters varied across the initial orientation error, the number or percentage of incorrect correspondences, the number of planes in the scene (which sometimes included a planar obstacle such as a cabinet), the number of points per plane, the structure of the synthetic environment, the magnitude of the error resulting from the incorrect correspondences, artificial scale-factor noise, and whether a robust estimation framework was in place to handle the outliers. All of these details, along with the localization results, are summarized in Table 2. Note that the position error refers to the magnitude of the position error vector, and the symbols "✓" and "–" correspond to a specific parameter being present or absent, respectively.

Figure 10. The Tech-S-1 synthetic environments used for the focused testing of the localization algorithm presented in FloorVLoc Core. Variations in the initial orientation error, number of incorrect correspondences, magnitude of error from incorrect correspondences, number of planes present in the scene, number of points per plane, use of a robust estimation framework, pixel noise, and scale-factor noise were set forth to study the resulting effect on the localization accuracy.
Test 1 serves as an initial benchmark to present the accuracy under somewhat ideal conditions. Here, the noise introduced into the system resulted only from the artificial pixel noise. In tests 2 and 3, a single data misassociation was made to investigate its effect on the localization. The results show the correlation between the position accuracy and the error of the misassociation projected along the plane's normal. In test 2, the error resulting from the misassociation was relatively low (3.6 cm) and the position accuracy matched that, whereas in test 3, the error from the incorrect association was quite large (193.2 cm), and this caused the position accuracy to degrade somewhat.
Test 4 examined what happened when the misassociation was on a wall whose normal vector was orthogonal to that of the wall the point should have been on. The results indicated that the orthogonal normal vector did not have a significant negative effect on the positioning accuracy as long as the distance from the misassociated point to its correct wall was sufficiently small, which in this case was less than 10 cm. In tests 5-9, the localization algorithm from FloorVLoc Core was placed into a Random Sample Consensus (RANSAC) [21] framework in order to observe the effect of a significant increase in outliers. In test 5, 53.3% of the data had incorrect correspondences; in test 6, the same percentage of incorrect correspondences was coupled with initial orientation error; and in test 7, the outliers lay on the walls shown in Figure 10d. Test 8 introduced perturbations in the scale for 40% of the data, where the error was multiplicative Gaussian: more specifically, the depth of each point λ_i was perturbed to λ̃_i = (1 + δ_λ)λ_i, where δ_λ ∼ N(0, 0.1²). Lastly, in test 9, a planar obstacle (simulating a cabinet, for example) was introduced into the environment. Note that, according to the localization criteria from Section 3.2.3, a positioning solution was still viable even with the planar obstacle, due to the geometrical properties of the walls, provided that the outliers from the planar obstacle could be handled (as they were in this test).
In addition to performing these localization tests using synthetic data, real data from our Tech-R-1 dataset were used to further investigate and understand the algorithms that can result from FloorVLoc Core. The reconstructed unscaled point cloud is shown in Figure 11, and the corresponding floorplan in Figure 12. The real data were collected by our robotic platform and subsequently processed by a monocular SLAM algorithm [48]. For the tests presented below, in order to handle the outliers in the real data, the localization was performed inside an MSAC robust estimator [108] augmented with a regularization term. This was motivated by the fact that sometimes an obstacle in view contained a high number of features compared with the surrounding walls. In such cases, an estimation solution treating the obstacle points as belonging to nearby walls would have a comparatively lower cost due to the large support from the high number of features. The effect was that the position solution would be incorrect while the orientation solution could still potentially be correct. To handle this, a regularization term was incorporated utilizing a weighting parameter γ, augmenting the robust cost with the term γ‖p − p̂‖², where p̂ is the estimated a priori position and ρ(·) refers to the MSAC robust estimator kernel. In the real testing above, γ was chosen to be 0.5. Figure 11 corresponds to the reconstruction of the environment viewed from a camera facing in the negative Y direction.
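The regularized robust objective described above can be sketched as follows; the residuals are left abstract, and `msac_rho`, `thresh`, and the cost-function name are our own simplifications of the MSAC estimator with the γ-weighted position regularizer.

```python
import numpy as np

def msac_rho(r, thresh):
    """MSAC kernel: squared residual, capped at thresh^2 for outliers."""
    return np.minimum(np.asarray(r, float) ** 2, thresh ** 2)

def regularized_cost(residuals, p, p_prior, gamma=0.5, thresh=1.0):
    """Robust localization cost plus a term pulling the position toward the
    a priori estimate p_prior, weighted by gamma (illustrative sketch of the
    objective described in the text, not the paper's implementation)."""
    data_term = np.sum(msac_rho(residuals, thresh))
    reg_term = gamma * np.sum((np.asarray(p, float) - np.asarray(p_prior, float)) ** 2)
    return data_term + reg_term
```

The regularizer keeps a feature-rich obstacle from dragging the position toward a wrong but well-supported solution, while leaving the orientation estimation largely untouched.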

Test 1-Scale Perturbation
In this test, the scale factor was perturbed with Gaussian multiplicative noise, while the initial orientation and position were left at their correct values. More specifically, s̃ = (1 + δ_s)s, where δ_s ∼ N(0, 0.1²). The localization algorithm was run 30 times with a different scale factor each time, and the position and orientation solutions were averaged across all of the runs before computing the error. For all of the tests in this section, the position error refers to the magnitude of the error vector. The average position error was 0.26 cm and the average orientation error was 0.013 rad.
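The Monte Carlo protocol of this test can be sketched as follows; `localize` is a hypothetical stand-in for the FloorVLoc solver, and only the perturb-run-average structure mirrors the text.

```python
import random
import statistics

def monte_carlo_scale_perturbation(localize, s_true, n_runs=30, seed=1):
    """Perturb the scale multiplicatively with Gaussian noise (sigma = 0.1),
    run the localizer each time, and average the solutions before computing
    the error, as in the test protocol (illustrative sketch)."""
    rng = random.Random(seed)
    solutions = []
    for _ in range(n_runs):
        s_pert = (1.0 + rng.gauss(0.0, 0.1)) * s_true
        solutions.append(localize(s_pert))
    return statistics.mean(solutions)
```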

Test 2-Initial Orientation Perturbation
For this test, the initial orientation was perturbed with Gaussian noise using a standard deviation of about 7 degrees, while the initial position and scale were left at their correct values. In other words, ψ̃_0 ∼ N(ψ_0, 0.1222²). Thirty iterations were run with a different initial orientation each time. The average position error was 1.03 cm and the average orientation error was 0.030 rad.

Test 3-Initial Position Perturbation
In this test, the initial position was perturbed with Gaussian noise using a standard deviation of 30 cm in both the X and Y directions while the initial orientation and scale were unperturbed at their correct values.

Test 4-All Perturbation
In this test, the initial position, initial orientation, and scale were all perturbed in each run of the algorithm, and fifty runs were performed. The noise distribution for each variable had the same parameters as specified in the three tests above. The average position error was 2.71 cm and the average orientation error was 0.027 rad.

Runtime
For all experimentation, a 6-core 2.8 GHz (4.7 GHz turbo boost) Intel Xeon CPU was used. The average runtime for the MCL update was (40 ± 15) ms (i.e., approximately 25 Hz), and the average runtime for the optimization framework update was (0.14 ± 0.09) ms. In addition, the windowed Bundle Adjustment optimizations of the underlying SLAM system ran on average for (129 ± 86) ms. These times were measured at the 640 × 480 image resolution. As mentioned above and shown in Figure 13, the additional processing time for providing global positioning was very low, especially with the optimization framework, which required 0.1% of the computational time of the BA algorithm. This shows that these global positioning modules run in real time to provide online global positioning.

Conclusions
In this work, we address vision-based indoor exploration by presenting a core methodology which combines a monocular camera with a floorplan to perform global positioning for a robotic vehicle. The main advantage of our approach is its modular nature: our method integrates any monocular-based local scene reconstruction with a floorplan to perform global localization, incorporating a means to automatically calibrate the scale. This means that any state-of-the-art monocular visual SLAM system can be directly utilized for the global positioning. A detailed theoretical and experimental analysis is presented for the conditions on the visual features associated with the walls necessary for the localization to be uniquely recovered. Furthermore, we developed two full global positioning systems based on the core methodology which perform continuous positioning. One approach utilizes visual odometry in between FloorVLoc linear optimization localization updates; the other formulates the system in an MCL framework with a FloorVLoc measurement model, correcting motion based on whatever wall information is present. We demonstrate the effectiveness of FloorVLoc with experimental evaluation in the real world as well as in simulation, using Unreal Engine and Microsoft Airsim. Results for a challenging real-world dataset demonstrate robust, reliable, and accurate positioning with an average error of 0.06 m across a mission covering 80 m. Furthermore, we show that the FloorVLoc computational time is 0.1% of the update time of the underlying monocular SLAM algorithm, thus operating in real time. Based on these results, we conclude that FloorVLoc can be effectively deployed on an intelligent vehicle for the indoor search and rescue application of interest.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A. Proof of Lemma 1
Proof. Let q_{i_1}, q_{i_2}, …, q_{i_k} each correspond to a subset of the m total feature points, where each subset is associated with one of the k unique world planes π_1, π_2, …, π_k, respectively. Let matrix A be defined by re-organizing the data in (12) according to the world planes for all of the points: Rearranging the rows of the matrix so that the first k rows contain k feature points viewed on the k unique planes, denoted q_1, q_2, …, q_k, yields: Given that the combined position and scale contain four degrees of freedom and the upper-right submatrix, which is A_π from (20), left-multiplies the vector (s⁻¹, p̃)ᵀ containing the inverse scale and unscaled position, it follows that in order to uniquely compute the position and scale, A_π must have rank at least four.
To address uniquely computing the orientation, consider the situation of only a single wall being in view: Subtracting the first row from the remaining ones: It can be seen from the matrix A_1 that the bottom-left submatrix A_ψ has rank at most two. This is due to the fact that right-multiplying it by N_1 gives 0, which implies that a single plane allows the orientation to be recovered only up to a rotation about the N_1 vector. Therefore, a single plane cannot uniquely recover the orientation, but two planes with linearly independent normal vectors are sufficient.
Because computing the position and scale requires that matrix A_π have rank at least four, which corresponds to having four distinct planes with at least three of them having linearly independent normal vectors, the orientation can be uniquely computed under this same criterion.
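The rank criterion of Lemma 1 can be checked numerically; the row layout [b_j, N_jᵀ] per viewed plane is our illustrative assumption, consistent with the 4-DoF unknown (inverse scale plus 3D position) that A_π multiplies.

```python
import numpy as np

def position_scale_recoverable(planes):
    """Numerically test the Lemma 1 criterion: the matrix multiplying
    (s^-1, p)^T must have rank >= 4.  Each plane is (N, b) with normal N
    and offset b; the [b, N^T] row layout is an illustrative assumption."""
    A = np.array([[b, *N] for N, b in planes], dtype=float)
    return np.linalg.matrix_rank(A) >= 4

# Three planes with linearly independent normals plus a parallel plane at a
# different offset satisfy the criterion; three planes alone do not.
planes_in_view = [((1, 0, 0), 0.0), ((0, 1, 0), 0.0),
                  ((0, 0, 1), 0.0), ((1, 0, 0), 2.5)]
```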

Appendix B. Proof of Lemma 2
Proof. As before, let q_{i_1}, q_{i_2}, …, q_{i_k} each correspond to a subset of the m total feature points, where each subset is associated with one of the k unique planes π_1, π_2, …, π_k, respectively. Defining α_{ij} as given in (14), construct A^v as follows: Rearranging the rows of the matrix so that the first k rows contain k feature points viewed on the k unique planes yields: Given that the position and scale contain three degrees of freedom and the upper-right submatrix, which is A^v_π from (21), left-multiplies the vector (s⁻¹, p̃_x, p̃_y)ᵀ containing the inverse scale and unscaled position, it follows that in order to uniquely compute the position and scale, A^v_π must have rank at least three. Note that if (N_{j,x}, N_{j,y}, b_j) = (0, 0, b_j), corresponding to a horizontal plane, then even though the corresponding row in matrix A^v does not constrain the position or orientation, it does constrain the inverse scale.
With regard to the orientation, consider a single wall in view: which is equivalent to: This means that the orientation can be uniquely found as long as matrix A^v_ψ has rank at least one, which occurs when the α_{i1} are not all zero. Geometrically, this happens when the points in view lie on any non-horizontal plane. Since the rank of A^v_ψ is at most one, the stricter constraint for the position and scale supersedes it, and the criterion for computing a unique solution remains that A^v_π have rank at least three.

Appendix C. Process for Obtaining Ground Truth for the Tech-R-2 Dataset
For a moving robotic vehicle in a large indoor environment, acquiring ground truth at a relatively high level of accuracy is a well-known challenge. One of the interesting and helpful features of the environment where the above positioning systems were tested is the tiling on the floor. Two tile sizes are used consistently across the floor: a large tile of 40 cm × 40 cm and a small tile of 20 cm × 20 cm (refer to Figure A1). These can be utilized for two purposes: as a natural coordinate system and as a means to provide ground truth. Directions for the positive X and Y axes can be set according to the orthogonal directions given by the tiles. The ground-truth labeling can be obtained manually by forming associations between 3D points on the ground and 2D image points, where the 3D points are known from the grid formed by the tiles. Upon forming these 2D-3D associations, the camera pose can be computed using a Perspective-n-Point (PnP) algorithm. Specific checkpoints along the vehicle trajectory were selected for ground-truth computation. In situations where the keyframes used in the global positioning systems fell between ground-truth frames, then, as long as the discrepancy was within two frames, the error was computed via a weighted average of the ground-truth pose against the keyframe poses before and after. Two frames correspond to a fairly small displacement, as the vehicle traveled at about 0.3 m/s and the frames were acquired at about 16 Hz. Overall, for the results presented in this section, a total of 15 ground-truth checkpoints were used.
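Since the ground-truth 3D points all lie on the floor plane, the pose computation can be illustrated with a floor-plane homography decomposition; this is a minimal numpy sketch under our own naming (in practice a standard PnP solver such as OpenCV's solvePnP would be used).

```python
import numpy as np

def homography_dlt(pts_plane, pts_img):
    """Estimate the 3x3 homography mapping floor coordinates (X, Y) to
    pixel coordinates via the DLT (>= 4 correspondences required;
    illustrative sketch of the 2D-3D association step)."""
    A = []
    for (X, Y), (u, v) in zip(pts_plane, pts_img):
        A.append([X, Y, 1, 0, 0, 0, -u * X, -u * Y, -u])
        A.append([0, 0, 0, X, Y, 1, -v * X, -v * Y, -v])
    _, _, Vt = np.linalg.svd(np.array(A, dtype=float))
    return Vt[-1].reshape(3, 3)  # null-space vector holds the homography entries

def pose_from_floor_homography(H, K):
    """Recover the camera rotation and translation from a floor-plane
    homography using the standard planar decomposition (a stand-in for
    the PnP computation described above)."""
    M = np.linalg.inv(K) @ H
    s = np.linalg.norm(M[:, 0])  # shared scale of the columns
    r1, r2, t = M[:, 0] / s, M[:, 1] / s, M[:, 2] / s
    if t[2] < 0:  # resolve the homography's sign ambiguity
        r1, r2, t = -r1, -r2, -t
    R = np.column_stack([r1, r2, np.cross(r1, r2)])
    return R, t
```

With noise-free correspondences this recovers the pose exactly; with hand-labeled tile corners, more than four associations would be used to average out clicking error.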