Solving Monocular Visual Odometry Scale Factor with Adaptive Step Length Estimates for Pedestrians Using Handheld Devices

Urban environments are challenging for handheld device pose estimation (i.e., 3D position and 3D orientation) over large displacements. The task is even harder with the low-cost sensors and limited computational resources available in pedestrian mobile devices (i.e., a monocular camera and an Inertial Measurement Unit). To address these challenges, we propose a continuous pose estimation based on monocular Visual Odometry. To solve the scale ambiguity and suppress the scale drift, an adaptive pedestrian step length estimation is used for the displacements on the horizontal plane. To complete the estimation, a handheld equipment height model, defined with respect to the Digital Terrain Model contained in Geographical Information Systems, is used for the displacement on the vertical axis. In addition, an accurate pose estimation based on the recognition of known objects is occasionally used to correct the pose estimate and reset the monocular Visual Odometry. To validate the benefit of our framework, experimental data were collected on a 0.7 km pedestrian path in an urban environment with several people. The proposed solution achieves a positioning error of 1.6–7.5% of the walked distance and confirms the benefit of an adaptive step length over a fixed step length.


Introduction
In the context of pedestrian navigation and handheld device pose estimation for urban mobility, localization and orientation would benefit from an accurate global pose estimation using Global Navigation Satellite Systems (GNSS), an Inertial Measurement Unit (IMU) and cameras [1][2][3]. Such systems generally operate well in open outdoor environments. However, urban environments comprising closely spaced buildings (i.e., urban canyons) remain challenging for GNSS, which suffers from attenuation, reflection, and blockage effects [4][5][6]. The task is even harder with low-cost Micro Electro Mechanical System (MEMS) sensors and limited computational resources, which are typical of pedestrian mobile devices [7]. Moreover, the hand can move freely even when the pedestrian's centre of mass is static, which makes it difficult to estimate an accurate pose.
In order to assist pedestrian navigation during large displacements in urban environments, e.g., through on-site augmented reality applications on a handheld mobile device, our aim is to provide a continuous and accurate pose estimate. In this context, we propose to integrate human motion estimation, i.e., step length estimation using a handheld IMU as proposed in the field of Pedestrian Dead Reckoning (PDR), to scale the monocular Visual Odometry (VO). To complete the pose estimation, we propose to use an estimate of the height at which the mobile device is held, coupled with data contained in a Geographical Information System (GIS). Although the technologies embedded in consumer devices are becoming more efficient, we chose to propose a solution that can be fully embedded with a minimal hardware setup and a low memory requirement, and that is fully autonomous, i.e., without any network connection or deployment of new infrastructure.
Thus, our first contribution is a continuous pose estimation with a mobile device held in hand during large displacements in urban environments. Our second contribution is an adaptive method, based on an IMU and a monocular camera, that takes advantage of human motion estimation and data contained in Geographical Information Systems to dynamically solve the scale ambiguity and suppress the scale drift in monocular Visual Odometry. Finally, our third contribution is a solution that does not require pedestrians to make specific and unnatural movements or to revisit the same place, as is the case with Simultaneous Localization And Mapping (SLAM) techniques [8][9][10].
Section 2 reviews the state of the art on pose estimation and Visual Odometry scale estimation. Section 3 presents the coupling process between PDR and VO, i.e., the scale estimation based on step length estimation, and the position estimation on the vertical axis using a handheld height model of the equipment and GIS data. Section 4 details the hardware setup used in our experiments, the acquisition scenario and the establishment of reference waypoints and tracks to evaluate our method. Section 5 presents the assessment, which compares a foot-mounted INS aided by GNSS phase measurements with the proposed method using a handheld device. Finally, we conclude that human motion analysis is informative as an additional cue and constraint to scale the monocular Visual Odometry and to improve the pose estimation with consumer handheld devices.

Related Work
In this section, we briefly summarize the solutions for handheld device pose estimation in urban environments using a camera and/or an IMU. In the literature on urban localization, one approach, namely appearance-based localization, uses a camera and a database of images associated with their global poses [3,11,12]. The pose of the camera is determined by searching the database for an image containing a part of the input image. This is typically performed with Content-Based Image Retrieval techniques. However, the accuracy of this approach degrades strongly under appearance changes. In addition, such a large database is not always available, is not easy to prepare, and may not be usable on pedestrian mobile devices because of their limited computational resources.
In the field of pedestrian navigation, Pedestrian Dead-Reckoning approaches have been investigated for urban localization [13,14]. Once the initial global pose has been acquired with another approach, such as appearance-based localization, the pose is determined incrementally by computing the displacement between successive poses. Sensor fusion is an alternative approach that has also been investigated to compensate for the drawbacks of each sensor [15][16][17][18]. Fusion was initially introduced to compensate for the failure of appearance-based approaches under fast rotational motions (i.e., motion blur and large displacements in the images) [15,19,20]. The gyroscope in the IMU has the advantage of accurately tracking angular rates at high frequency over short time intervals. The accelerometer in the IMU is useful during static phases to correct the global orientation estimate using the gravity measurement.
Monocular Visual Odometry is a frame-to-frame relative pose tracking approach that uses a camera for incremental pose estimation [21][22][23]. In unknown environments, the camera pose is determined incrementally by first computing feature correspondences between two consecutive images and then computing the relative pose from these correspondences. However, one problem is the scale ambiguity in the estimated poses [23]. Relative poses can fundamentally only be estimated up to scale; the metric scale can be recovered if known landmarks are used in the VO [8] or if an initial translation of the camera is assumed to be known [24]. These approaches solve the initial scale estimation but still face the problem of scale drift, which occurs over large displacements and degrades the localization accuracy [25]. As reported in [26], neither direct nor indirect VO approaches can avoid scale drift when only a monocular camera is available, and tracking errors accumulate quickly over time. The drift can be corrected occasionally by using appearance-based localization, known landmarks [3,27] or knowledge of existing 3D city models [28]. Ref. [29] presents a method for resolving the scale ambiguity and drift by using the slant distance obtained from skyline matching between the camera images and images synthesized from a 3D building model. Correcting the scale factor yields a 90% improvement of the positioning solution compared to a solution that does not correct the scale drift. These methods require the user to be located in an area where the 3D model is available and known with precision, which is generally not the case. Loop closure, when revisiting the same place, is also used to correct the drift [25]. However, this approach is not applicable to pedestrians navigating from point A to point B without passing through the same place again.
Since the accelerometer in the IMU provides measurements in metric units, it is also useful for estimating the scale in the VO [30][31][32]. To estimate the scale, one approach is to double-integrate the acceleration after removing the gravity component. However, double integration is sensitive to noise because the error accumulates quickly, even if it is small over a short time interval. To further improve the accuracy and stability of metric scale estimation, additional methodologies need to be investigated. An alternative approach is to estimate the scale with an Extended Kalman Filter that includes the scale factor in the state vector. As reported in [31], this method is also sensitive to a dynamic bias that is difficult to estimate.
Other scale estimation approaches exist, but they impose the hard constraint that the sensors be rigidly fixed to the pedestrian. For example, Refs. [33,34] attach the sensors to a helmet. This constraint is not acceptable in the context of handheld AR.
As a recent approach dedicated to pedestrians, the pedestrian's face is used as a known object to estimate the scale [32]. A mobile device (e.g., a smartphone) usually has two cameras: a user-facing camera and a world-facing camera. In the context of handheld AR, the user-facing camera captures the user's face while the VO is performed with the world-facing camera. When the relative pose of the two cameras is calibrated, the metric scale for the VO can be computed from the face. This setup is reasonable for handheld devices; however, the approach requires that the pedestrian be static while the hand is moving. It should also be noted that VO using a stereo camera does not have these scale issues, because the scale can be uniquely determined from the disparity between the two cameras [26].
Other approaches, close to ours, use a pedometer and an average step length for scale estimation in monocular VO [35,36]. Because the pedestrian has to avoid other pedestrians, cyclists, cars and elements of the scene, and has to wait at pedestrian crossings, step lengths are not constant during the walk, and an average step length does not reflect pedestrian movement in urban environments. This is why we propose to use a dynamic step length estimation to scale monocular VO.

Scaled Monocular Visual Odometry
In this paper, we propose a novel approach for pose estimation with sensors held in hand, based on monocular Visual Odometry and Pedestrian Dead-Reckoning. A human motion analysis from inertial data, i.e., a step length estimation, is used to dynamically solve the scale ambiguity and suppress the scale drift. An overview of the proposed method is given in Figure 1. The method can be divided into three main processes: the "PDR" part, the "VO" part and the "Coupling" part. Since the "PDR" part and the "VO" part are independent, our approach can be described as loosely coupled. The variables needed to develop the proposed method are detailed in the corresponding parts below.

[Figure 1: overview of the proposed method. Recoverable labels: VO, loose coupling, $R^n_c(t)$, $h_{ped}$, 3D GIS.]
The "PDR" part is dedicated to the processing of IMU measurements. The IMU sensors share the same origin, and the IMU measurements are given in the Body frame, labeled $b$. We estimate pedestrian step lengths with sensors held in hand while the pedestrian walks naturally [37,38]. The process outputs a step length $step_k$, in meters, at every step instant $t_{step_k}$. The details are given in Section 3.1.
The "VO" part is dedicated to monocular VO, i.e., tracking based on relative pose estimation. The origin of the Camera frame is the optical center of the camera, and the visual measurements are given in the Camera frame, labeled $c$. In this part, because the scale estimation is independent of the image processing, any VO or vSLAM method can be applied as long as it outputs a complete pose estimate [23,39]. The implemented Visual Odometry framework is detailed in Section 3.2.
The "Coupling" part is dedicated to the scale determination, which is obtained by fitting the trajectories from the "PDR" part and the "VO" part. This part solves the scale ambiguity in monocular VO. The scaled pose from monocular VO is the final output of the proposed solution. The final pose estimate is given in the Navigation frame (i.e., the North-East-Down (NED) frame), labeled $n$. In the proposed solution, we need to consider the frequency difference between the sensors and the timing of the scale estimation. The details are given in Section 3.4.

Step Length Estimation Process
Classically, inertial signals are integrated according to a strap-down mechanization to compute the successive positions of an IMU. This is possible only if the error accumulated by low-cost inertial sensors is frequently calibrated or reset, for instance using Zero velocity UPdaTes (ZUPT) [14]. As a moment of zero velocity does not always occur in handheld device-based pedestrian navigation, ZUPT is not easily achievable. Also, the location of the pedestrian is normally represented as the pedestrian's center of mass on a 2D map. This differs from the location of the device held in hand, because the hand moves freely even when the pedestrian's center of mass is static. Therefore, instead of double-integrating the measured accelerations, a step length model is adopted to derive step lengths as the pedestrian's displacements for handheld devices.
We summarize the pedestrian step length estimation procedure proposed in [37,38]. This step length estimation is reported to have an error of 2.5% to 5% of the walked distance. The different steps of the procedure are presented in Figure 2 and the notations are detailed hereafter. To estimate step lengths, IMU signals acquired with sensors held in hand are analyzed as follows. A first motion classification is performed to distinguish the walking phases from the static phases of the pedestrian. Peak detection and energy thresholding are applied to the gyroscope signal $\omega_{gyro}$ and the accelerometer signal $f_{acc}$ to determine the instants $t_{step_k}$ at which the pedestrian's foot comes into contact with the ground at the $k$-th step. The step frequency $f_k$ is also computed. A second motion classification is performed by analyzing the variance of the IMU measurements to determine the device's carrying mode (Static, or Walking in Texting or Swinging mode) according to [38]. Then, a generic model is used to compute the step lengths $step_k$ according to [37]. It is based on the user's height $h_{ped}$ and on a set of three parameters $\{a, b, c\}$ trained on 12 subjects.
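As a rough illustration of the last stage of this procedure, the sketch below computes step frequencies from step instants and feeds them to a step length model. The text above only states that the model depends on $h_{ped}$ and $\{a, b, c\}$; the affine-in-frequency form used here and the coefficient values are assumptions made for illustration, not the trained values from [37].

```python
def step_frequencies(step_times_s):
    """Step frequency f_k (Hz) for each step after the first one."""
    return [1.0 / (t1 - t0) for t0, t1 in zip(step_times_s, step_times_s[1:])]


def step_length(height_m, step_freq_hz, a, b, c):
    """Step length in metres.

    Assumed model form (not given in the text above):
    height * (a * frequency + b) + c.
    """
    return height_m * (a * step_freq_hz + b) + c


# Hypothetical coefficients, chosen only to give plausible magnitudes.
A, B, C = 0.2, 0.2, 0.1

# Hypothetical step instants (seconds) as a step detector might output them.
times = [0.0, 0.5, 1.0, 1.6]
freqs = step_frequencies(times)
lengths = [step_length(1.75, f, A, B, C) for f in freqs]
print(lengths)
```

With this model form, a slower cadence yields a shorter step, which is the behaviour an adaptive estimate is meant to capture when the pedestrian slows down.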

Monocular Visual Odometry
Because step length estimation is independent of monocular Visual Odometry, any existing VO method can be used. Here, we introduce our implementation, which follows a standard procedure [22]. As a pre-processing step, a standard camera calibration, using a checkerboard [40] and modeling the camera as a pinhole camera [41], is performed for a fixed resolution to express the camera's coordinates in a normalized space [42]. This makes it possible to correct image distortions and to determine the intrinsic parameter matrix $K$ of the camera.
In the VO, the known poses at times $(t-1)$ and $(t-2)$ are used to determine the unknown pose at time $(t)$. The different steps of the vision procedure are shown in Figure 3 and the notations are detailed hereafter.
Knowing the intrinsic parameter matrix $K$ of the camera and the correspondences between the feature points $x_i$ and the 3D points $X_n$, expressed in homogeneous coordinates denoted with a tilde ($\tilde{x}_i$, $\tilde{X}_n$), the pose of the camera's optical center can be computed. It is expressed as a rotation matrix $R^c_n$, giving the rotation from the global coordinate system to the camera coordinate system, and a translation vector $t^c_n$, giving the translation from the origin of the global coordinate system to the origin of the camera coordinate system $C$. The outputs of the "Vision" part in Figure 3 are $R^n_c(t)$, the rotation from the Camera frame to the Navigation frame at time $(t)$, and $t^n_c(t)$, the translation from the Camera frame to the Navigation frame at time $(t)$, known up to a dimensionless scale factor $s(t)$.
For the relative pose estimation between images, feature points $x_i$ are extracted with the SURF detector [43]. A sparse feature point extraction is used to improve reliability by suppressing redundantly extracted points. The feature points extracted in consecutive images are then matched using the Sum of Squared Differences (SSD) distance to select unique correspondences. A filtering stage, using a geometric constraint such as the epipolar constraint, is then applied with the M-estimator SAmple Consensus (MSAC) algorithm [44] to exclude outliers. A triangulation stage is performed only on inliers that are visible in the two images at times $(t-1)$ and $(t-2)$ and in the current image at time $(t)$, in order to estimate the 3D points $X_n$ in the Navigation frame. To perform pose estimation with a calibrated camera, PnP algorithms are among the most suitable solutions [45]. The camera pose is computed using the following formulation:

$$\tilde{x}_i \simeq K \, [\, R^c_n \mid t^c_n \,] \, \tilde{X}_n, \qquad t^c_n = -R^c_n \, C_n$$

where $C_n$ is the position of the camera's optical center in the Navigation frame. The successive positions of the camera's optical center are used for the scale estimation. It should be noted that we did not use any map optimization, such as the pose graph optimization and bundle adjustment typically used in visual SLAM, because they are computationally expensive. We simply implemented monocular Visual Odometry and, as in Dead-Reckoning, did not keep the map in memory.
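The relation between the estimated pose and the optical center can be made concrete with a small sketch. It assumes the usual pinhole convention in which the rotation and translation map Navigation-frame coordinates into the Camera frame, so that the optical center is $-R^T t$. The matrices and the point below are made-up values, and the helper names are illustrative only.

```python
def transpose(m):
    """Transpose of a 3x3 matrix stored as nested lists."""
    return [list(row) for row in zip(*m)]


def mat_vec(m, v):
    """Matrix-vector product for nested-list matrices."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def camera_center(r_cn, t_cn):
    """Optical center in the world frame: C = -R^T t (assumed convention)."""
    return [-x for x in mat_vec(transpose(r_cn), t_cn)]


def project(k, r_cn, t_cn, x_n):
    """Pinhole projection of a 3D point to pixel coordinates."""
    x_cam = [p + t for p, t in zip(mat_vec(r_cn, x_n), t_cn)]
    u, v, w = mat_vec(k, x_cam)
    return [u / w, v / w]


# Hypothetical intrinsics for a 1920 x 1080 image.
K = [[1000.0, 0.0, 960.0], [0.0, 1000.0, 540.0], [0.0, 0.0, 1.0]]
# Hypothetical pose: 90 degree rotation about the optical axis, center at (1, 2, 0).
R_CN = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
T_CN = [2.0, -1.0, 0.0]

center = camera_center(R_CN, T_CN)
pixel = project(K, R_CN, T_CN, [1.0, 2.0, 5.0])  # point 5 units ahead of the center
print(center, pixel)
```

A point lying on the optical axis projects to the principal point, which gives a quick sanity check on the chosen sign convention.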

Digital Terrain Model and Handheld Height
To complete the position estimation on the horizontal plane, the position on the vertical axis is determined in urban environments using data contained in a 3D Geographical Information System (GIS), i.e., the Digital Terrain Model (DTM), together with an estimate of the height of the hand holding the mobile device relative to the pedestrian's height. The DTM is a set of points referenced in planimetry (X, Y) and altimetry (Z). With an interpolation method, it provides the elevation of the ground relief in digital form [46]. When an augmented reality application is used for mobility assistance, the screen of the handheld equipment is assumed to be held at a height $h_{hand}$ that allows the user to see the displayed information. In order to estimate the handheld device position on the vertical axis, the heights $h_{ped}$ of several pedestrians, as well as the height $h_{hand}$ of the center of the screen on which an augmented reality display would be shown, were measured. These measurements are presented in Table 1. Thus, knowing the distance between the center of the screen and the optical center of the camera, the height of the equipment held in the hand can be experimentally defined as a ratio of the pedestrian height $h_{ped}$. It should be noted that the height variations $\Delta h_{hand}(t)$, i.e., the variations of the DTM, are used to estimate the scale factor of the displacement along the vertical axis in the following section. In addition, since the variations in the DTM are very small, displacements on the vertical axis could be ignored to simplify the scale-factor calculation.
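The vertical position described above can be sketched as a ground elevation interpolated from the DTM plus a hand height proportional to the pedestrian height. The sketch assumes a regular elevation grid and bilinear interpolation, which is one possible interpolation method; the grid values and the ratio `hand_ratio` are hypothetical and are not the values measured in Table 1. The result is an up-positive altitude, so a sign change would be needed for the Down axis of a NED frame.

```python
import math


def dtm_elevation(grid, x0, y0, cell, x, y):
    """Bilinear interpolation on a regular elevation grid.

    grid[j][i] is the elevation at (x0 + i * cell, y0 + j * cell);
    the grid must be at least 2 x 2.
    """
    gx = (x - x0) / cell
    gy = (y - y0) / cell
    # Clamp the cell index so that border points reuse the last cell.
    i = min(max(int(math.floor(gx)), 0), len(grid[0]) - 2)
    j = min(max(int(math.floor(gy)), 0), len(grid) - 2)
    fx = gx - i
    fy = gy - j
    bottom = grid[j][i] * (1 - fx) + grid[j][i + 1] * fx
    top = grid[j + 1][i] * (1 - fx) + grid[j + 1][i + 1] * fx
    return bottom * (1 - fy) + top * fy


def device_altitude(grid, x0, y0, cell, x, y, h_ped, hand_ratio):
    """Ground elevation plus hand height modelled as a ratio of h_ped."""
    return dtm_elevation(grid, x0, y0, cell, x, y) + hand_ratio * h_ped


# Hypothetical 1 m grid tile (elevations in metres).
GRID = [[10.0, 10.2], [10.4, 10.6]]
ground = dtm_elevation(GRID, 0.0, 0.0, 1.0, 0.5, 0.5)
altitude = device_altitude(GRID, 0.0, 0.0, 1.0, 0.5, 0.5, h_ped=1.75, hand_ratio=0.8)
print(ground, altitude)
```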

Scale Determination
Because steps and images are not sampled at the same times, an interpolation is needed. Using the pedestrian step lengths and the frequencies of both the visual measurements and the pedestrian's steps, a linear interpolation is performed to determine the pedestrian displacement between the instants of two consecutive images, as illustrated in Figure 4. The scale of the VO during the $(k+1)$-th step is computed using the $k$-th step length, based on this interpolation. The step length estimate $step_k$ provides the magnitude of the pedestrian displacement on the horizontal plane. Under the assumption that the displacements of the handheld device are mainly on the horizontal plane, displacements on the vertical axis are not taken into account. Therefore, to estimate the scale $s(t)$, a comparison is made between $D_{VO}(t)$, i.e., the magnitude of the displacement of the camera's optical center $C_n$ on the horizontal plane estimated by the VO between times $(t-1)$ and $(t)$, and $D_{step}(t)$, i.e., the interpolation between times $(t-1)$ and $(t)$ of the estimated pedestrian step length $step_k$ on the horizontal plane between the instants of steps $(k-1)$ and $(k)$. The scale factor $s(t)$ is then defined as the ratio $D_{step}(t) / D_{VO}(t)$. Thus, for each relative pose estimated by the VO, the scale is computed and used to correct the displacement of the handheld device on the horizontal plane in the Navigation frame. The final outputs are $R^n_c(t)$, the rotation matrix from the Camera frame to the Navigation frame at time $(t)$, and $t^n_c(t)$, the translation vector from the Camera frame to the Navigation frame at time $(t)$.
It should be noted that several steps of the whole process can fail. In the "PDR" part, some steps may be missed or falsely detected. In the "VO" part, there may be false correspondences between features, even though the matching process is designed to limit them.
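The scale determination described in this section can be sketched in a few functions. This is a minimal sketch that assumes the pedestrian moves at constant speed within a step, so that the share of the step length covered between two images is proportional to elapsed time; which step length is paired with which image interval is left to the caller. All function names and numeric values are illustrative.

```python
import math


def step_displacement(step_len_m, t_step_prev, t_step_curr, t_img_prev, t_img_curr):
    """D_step: part of the step length covered between two image instants."""
    step_duration = t_step_curr - t_step_prev
    return step_len_m * (t_img_curr - t_img_prev) / step_duration


def horizontal_distance(c_prev, c_curr):
    """D_VO: horizontal displacement of the optical center (unscaled VO units)."""
    return math.hypot(c_curr[0] - c_prev[0], c_curr[1] - c_prev[1])


def scale_factor(d_step, d_vo, eps=1e-9):
    """s = D_step / D_VO, or None when the VO reports no horizontal motion."""
    if d_vo < eps:
        return None
    return d_step / d_vo


def scaled_position(p_prev, c_prev, c_curr, s):
    """Advance the metric horizontal position by the scaled VO displacement."""
    return [p_prev[0] + s * (c_curr[0] - c_prev[0]),
            p_prev[1] + s * (c_curr[1] - c_prev[1])]


# Hypothetical values: a 1.2 m step lasting 0.5 s, images 0.1 s apart.
d_step = step_displacement(1.2, 0.0, 0.5, 0.1, 0.2)
c_prev, c_curr = [0.0, 0.0, 0.0], [0.03, 0.04, 0.0]
d_vo = horizontal_distance(c_prev, c_curr)
s = scale_factor(d_step, d_vo)
position = scaled_position([0.0, 0.0], c_prev, c_curr, s)
print(d_step, d_vo, s, position)
```

Returning `None` when the VO displacement is close to zero avoids a division by zero during static phases; a caller could then keep the previous scale or skip the update.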

Known Object Recognition-Based Pose Estimation
To occasionally suppress the drift and correct the relative pose estimate from the scaled monocular Visual Odometry, detailed knowledge of the sparse known objects contained in the 3D GIS is used. When a known object is detected in video frames, an absolute pose can be estimated [47]. The mean positioning accuracy is reported to be 25 cm on the horizontal plane when a known object is detected in video frames. It should be noted that the track estimated with the monocular VO is highly dependent on its initialization and re-initializations. Therefore, an inaccuracy in the known object-based pose estimation results in a bias in the orientation estimate of the trajectory. The last view before a known object is lost is a grazing view, which is the most degraded and least accurate case for known object-based pose estimation. For this reason, the absolute pose estimated when the pedestrian has been static in front of a known object for a few seconds is preferred for re-initializing the monocular VO. This corresponds to the most accurate case for known object-based pose estimation. Figure 5 illustrates our global approach for continuous pose estimation with a handheld device in urban environments. It comprises three main stages.

Hardware Setup
In line with the hardware setup required by the proposed approach, the handheld device used in the experiments is composed of a monocular camera and an IMU rigidly attached together, as illustrated in Figure 6. This hardware configuration gives access to raw data without the filters commonly applied to mobile device signals. To synchronize the IMU signals and the monocular camera recordings, timestamps from the GPS receivers embedded in both devices were used. The details of the devices are as follows:
A Garmin camera "VIRB 30 Ultra" (https://virb.garmin.com), set up with a fixed focal length, was used for image acquisition. The resolution of the camera was 1920 × 1080 pixels, chosen to match a standard smartphone acquisition resolution. The video was acquired at a frame rate of 60 Hz. Given the computational resources of the device, the video was down-sampled to 10 Hz to reduce the computation time. The images could also be resized to a lower resolution, but this was not done in the implementation of our solution.
A dedicated platform named ULISS [48] was used for measurements. It comprises a tri-axis Inertial Measurement Unit and a tri-axis magnetometer sampled at 200 Hz, a barometer, a High Sensitivity GPS receiver and an antenna. They are all low-cost sensors similar to those embedded in mobile devices owned by pedestrians.

Digital Terrain Model
The DTM used in the experiments was computed by the French National Geographical Institute (IGN) (professionnels.ign.fr/mnt). The resolution of the mesh is 1 meter, with a decimetric accuracy for the altitude [49]. Other DTMs, available from public data sources (e.g., Open Street Map, Google Earth, etc.), could also be used. In our implementation, DTM data are processed in the OBJ format.
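Since the DTM is handled in the OBJ format, its elevation samples can be read from the vertex records of the file. The sketch below parses the `v x y z` records of Wavefront OBJ text and ignores everything else. Which of the three coordinates carries the elevation depends on how the file was exported, so the axis order is an assumption to check against the data; the sample lines are made up.

```python
def read_obj_vertices(lines):
    """Return (x, y, z) tuples from the 'v' records of Wavefront OBJ text."""
    vertices = []
    for line in lines:
        parts = line.split()
        # Vertex records start with the token 'v'; 'vn', 'vt', 'f' are skipped.
        if len(parts) >= 4 and parts[0] == "v":
            vertices.append(tuple(float(p) for p in parts[1:4]))
    return vertices


# Hypothetical excerpt of a DTM tile.
SAMPLE = [
    "# hypothetical DTM tile",
    "v 0.0 0.0 10.0",
    "v 1.0 0.0 10.2",
    "vn 0 0 1",
    "f 1 2 3",
]
vertices = read_obj_vertices(SAMPLE)
print(vertices)
```

To read from disk, the same function can be called on an open file object, since iterating over a text file yields its lines.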

Scenario
A 0.7 km walk, which includes passages with an open view and buildings with strong specular reflections, was performed by three different people in urban environments with the acquisition device held in hand. Details of the different pedestrians are given in Table 2. The pedestrian activity was in Texting mode at all times, i.e., the pedestrian walked while looking at the screen of the handheld device.

Estimation of Reference Waypoints and Tracks
During acquisitions, pedestrians walked over absolute reference waypoints marked on the ground (i.e., with a crossed-out white circle) and stopped for several seconds on each of them, as illustrated in Figure 7. Their locations were measured with centimetric accuracy using a geodetic dual-frequency GNSS receiver (http://www.septentrio.com/) in differential mode. The starting and finishing positions of the acquisition were also determined using a differential GNSS solution.
To assess pedestrian track estimates, reference tracks need to be established during acquisitions. As described in Table 3, in some urban spaces, GNSS-based solutions were neither accurate nor available at all times. When the pedestrian walked in an open environment, the standard deviation of the positioning error was less than one metre. When the pedestrian walked in urban canyons, the standard deviation of the positioning error was up to 21 m for differential GNSS and up to 30 m for standalone GPS. These results illustrate the difficulty of urban localization with such sensors and are far from the accuracy required for an assessment. Thus, to estimate an accurate and continuous reference track, the PERSY (PEdestrian Reference SYstem) platform [50] was mounted on the foot, on the same side as the hand holding the device during the data acquisition, as illustrated in Figure 7. The PERSY platform outputs a location relative to a starting position and is reported to have a mean positioning error of 0.3% of the distance traveled. Quasi-static phases of the acceleration and the magnetic field are used to mitigate inertial sensor errors. GNSS phase measurements are added to the strap-down EKF to improve the positioning accuracy.

Activities Classification and Step Length Estimation
Figure 9a presents the results of the activity classification for a walk in urban environments. It can be observed that the walking phases with the device held in hand in Texting mode (in green) and the moments when the pedestrian is static on each of the reference waypoints (in pink) are correctly classified. Figure 9b presents the step lengths estimated from the walking frequency analysis and the knowledge of the pedestrian's height. It can also be observed that the step frequency and the step lengths are not constant during the walk. This is because the pedestrian has to avoid other pedestrians, cyclists or cars, and has to wait at pedestrian crossings. When the pedestrian walks normally, the estimated step lengths vary over a range of 40 cm, i.e., mainly between 1 m and 1.4 m. This confirms that an average step length is not appropriate for an accurate scale estimation.
[Figure 9: (a) activity classification; (b) step length estimation, with step length (in m) on the vertical axis.]
Table 4 presents the median step length of each pedestrian, as well as the standard deviation of their step lengths.

Pedestrian                        W1       M1       W2
Median step length                1.06 m   1.18 m   0.99 m
Step lengths standard deviation   0.20 m   0.14 m   0.29 m

Estimated Trajectory
To perform the proposed scaled monocular VO, the initial pose in the Navigation frame must be estimated.For this process, position from GNSS and orientation from the IMU [51] were used.Then, to punctually suppress the drift and reinitialize the pose in the Navigation frame, the known object-based pose estimation proposed in [52] was used.
Figure 10 presents the relative PERSY reference track (in green) and the track estimated with the scaled monocular VO (in blue).From these results, it is observed that the estimated track is close to the relative PERSY reference track.This means that the proposed approach, using adaptive pedestrian step lengths estimates, allows to correctly scaled the monocular VO and to estimate the localization of a device held in hand by pedestrian while walking in urban environments.A visualization of the position estimation is also available on Youtube (https://youtu.be/3bSFrtF2lwU).To assess the performance of the proposed approach, comparisons are made between the positions estimated with the scaled monocular VO and the relative PERSY reference tracks.Figure 11 presents the "Horizontal Positioning Error" and Figure 12 presents the "Cumulative Distribution Function" of the positioning errors for the three different pedestrians.in the orientation.This results in an error in the orientation estimate, which increases the positioning error at the end of the walk.This could be solved by using more sophisticated VO frameworks such as [53].
It should be noted that when the pedestrian makes small displacements, the proposed approach estimates a smooth trajectory, whereas the use of an average step length would have distorted the scale and the position estimate. Figure 13 focuses on an area where the pedestrian makes small displacements. The result is shown both on the horizontal plane and in a 3D environment containing 3D models of the known object and the buildings around it.

Conclusions
In the context of pedestrian navigation, urban environments constitute challenging areas for both localization and handheld device pose estimation. Accurate position and orientation estimation is even more challenging when using only the low-cost sensors available in consumer devices, i.e., a monocular camera and an Inertial Measurement Unit.
To address these challenges, we propose a general approach, based on monocular Visual Odometry, to continuously estimate the pose of a handheld device. Our approach does not require any connection or any deployment of new infrastructure. To solve the scale ambiguity and suppress the scale drift in monocular Visual Odometry, an adaptive pedestrian step length estimation is used for the displacement on the horizontal plane. To complete the estimation, a handheld equipment height model, with respect to the Digital Terrain Model contained in Geographical Information Systems, is used for the displacement on the vertical axis. In addition, when a known object is detected in the video frames, it is used to correct the pose estimate and reset the monocular Visual Odometry.
Three different pedestrians walked a path of about 0.7 km in urban environments with sparse known objects, holding in hand a camera combined with an IMU that integrates low-cost sensors. The assessment uses absolute reference waypoints, whose coordinates were precisely determined with a Differential GNSS solution in an off-line phase. It is completed by comparing the results to a relative reference track obtained with a foot-mounted INS aided by GNSS phase measurements. The proposed approach makes it possible to estimate the pose of a handheld device in urban environments, which is needed for augmented reality applications. Furthermore, it accurately estimates the pedestrian displacements without any use of GNSS positioning, which strongly deteriorates in urban and indoor environments. A comparison is also made between the use of an adaptive step length and the use of a fixed step length to scale the monocular Visual Odometry. The proposed solution achieves a positioning error between 1.6% and 7.5% of the walked distance, and confirms the benefit of the adaptive step length compared to a fixed step length.
Part of our future work will be dedicated to estimating the pose more accurately by fusing the presented global approach with PDR and barometric height in a tightly coupled process, mainly to improve the pose estimate when the handheld device undergoes fast rotational motions. The proposed global approach will also be extended to other known objects in order to reduce the distance between two absolute pose estimates.

Figure 1. Block diagram of the proposed approach.

Figure 5. Global approach for continuous pose estimation in urban environments.

Figure 8 presents the relative PERSY reference track (in green) and the absolute reference waypoints (in white). Environmental specificities that may introduce difficulties in the monocular VO are also identified (in red).

Figure 8. Absolute reference waypoints and relative PERSY reference track of the 0.7 km walk in urban environments.

Figure 10. Scaled monocular Visual Odometry track, absolute reference waypoints and relative PERSY reference track of the 0.7 km walk in urban environments.

Figure 11. Scaled monocular Visual Odometry Horizontal Positioning Error compared to the relative PERSY reference track.

Figure 13. Example of a situation where the pedestrian makes small displacements: (a) vicinity of the station 44, (b) vicinity of the station 77.

Table 1. Measurements of pedestrian height and handheld equipment height.
The Scaled Monocular Visual Odometry (relative pose estimate). (b) The Known Object-Based Pose Estimation (absolute pose estimate). 3. The AR visualization.

Table 4. Median step lengths and step lengths standard deviation.