Probabilistic Multi-Sensor Fusion Based Indoor Positioning System on a Mobile Device

Nowadays, smart mobile devices include more and more sensors on board, such as motion sensors (accelerometer, gyroscope, magnetometer), wireless signal strength indicators (WiFi, Bluetooth, Zigbee), and visual sensors (LiDAR, camera). People have developed various indoor positioning techniques based on these sensors. In this paper, the probabilistic fusion of multiple sensors is investigated in a hidden Markov model (HMM) framework for mobile-device user-positioning. We propose a graph structure to store the model constructed by multiple sensors during the offline training phase, and a multimodal particle filter to seamlessly fuse the information during the online tracking phase. Based on our algorithm, we develop an indoor positioning system on the iOS platform. The experiments carried out in a typical indoor environment have shown promising results for our proposed algorithm and system design.


Introduction
In recent years, researchers have developed various approaches for mobile-device (smartphone, tablet) user-positioning in GPS-denied indoor environment. To name a few, radio frequency (RF) fingerprinting techniques, motion-sensor-based pedestrian dead reckoning (PDR) techniques, and visual-sensor-based feature matching techniques, are some of the most popular approaches in indoor positioning. However, all of them have their own limitations. RF fingerprinting techniques (Bluetooth [1], RFID [2], Zigbee [3] and WiFi [4][5][6],) have the problem of signal fluctuation due to the multipath fading effect in indoor environment. The motion-sensor-based PDR approach [7,8] suffers from the fact that the motion sensors equipped in the mobile device are low cost Micro Electromechanical System (MEMS) sensors, which have relatively low accuracy. Thus, the integration drift will cause the positioning error to accumulate over time. The visual-sensor-based positioning techniques [9][10][11][12][13] extract features (SIFT [14], SURF [15]) from captured images and compare them with an image database. These techniques are limited by their costly feature-matching algorithm and restricted computation resources on a mobile platform. In ASSIST [16], acoustic signals emitted from smartphone speakers are adopted to locate the user using the time difference of arrival (TDoA) multilateration method. ASSIST can locate the user within 30 cm. However, this method requires sound receivers to be preinstalled in ceilings or walls, which adds extra infrastructure to the indoor environment.
To overcome the drawback of each sensor, people have come up with fusion approaches to combine different sensors to achieve a better positioning result. However, since the sensors are measuring different physical phenomena, it is not an easy task to effectively fuse the information Specifically, we make the following contributions: Under the HMM framework, we propose a graph structure to store the model constructed by multiple sensors in the offline training phase, and a multimodal particle filter to efficiently fuse the information during the online tracking phase. The particle filter is able to handle the motion sensor drift problem during the resampling step. The WiFi signal strength fluctuation problem is mitigated using the motion sensor information to guide the particle propagation towards the higher likelihood field. Based on our algorithm, we develop an indoor positioning system on the iOS platform. To the best of our knowledge, our iOS application is the first one to achieve accurate, robust, and highly integrated indoor positioning by seamlessly fusing the information from the multiple sensors on board.
This paper is organized as follows: In the next section, we describe, in detail, the offline training phase. Then, in Section 3, a HMM framework is introduced to describe the probabilistic multi-sensor fusion. In Section 4, we talk about particle filter steps for HMM state estimation. The implementation on the iOS platform, and experimental results, are presented in Section 5. Finally, we conclude our research in Section 6 with a discussion of future work.

Offline Training Phase
In this section, we will discuss every detail of our preparation for online tracking.

3D Modeling of Indoor Environments
A detailed 3D model of the indoor environment is constructed by fusing the data from a camera and a 2D line-scan LiDAR. Both devices are mounted rigidly on a robotic servo, which sweeps vertically to cover the third dimension ( Figure 1). Fiducial target-based extrinsic calibration [33][34][35] is applied to acquire transformation matrices between LiDAR and the camera. Based on the transformation matrix, we perform registration to fuse the color images from the camera with the 3D point cloud from the LiDAR.
Sensors 2015, 15, page-page 3 particle filter is able to handle the motion sensor drift problem during the resampling step. The WiFi signal strength fluctuation problem is mitigated using the motion sensor information to guide the particle propagation towards the higher likelihood field. Based on our algorithm, we develop an indoor positioning system on the iOS platform. To the best of our knowledge, our iOS application is the first one to achieve accurate, robust, and highly integrated indoor positioning by seamlessly fusing the information from the multiple sensors on board.
This paper is organized as follows: In the next section, we describe, in detail, the offline training phase. Then, in Section 3, a HMM framework is introduced to describe the probabilistic multi-sensor fusion. In Section 4, we talk about particle filter steps for HMM state estimation. The implementation on the iOS platform, and experimental results, are presented in Section 5. Finally, we conclude our research in Section 6 with a discussion of future work.

Offline Training Phase
In this section, we will discuss every detail of our preparation for online tracking.

3D Modeling of Indoor Environments
A detailed 3D model of the indoor environment is constructed by fusing the data from a camera and a 2D line-scan LiDAR. Both devices are mounted rigidly on a robotic servo, which sweeps vertically to cover the third dimension ( Figure 1). Fiducial target-based extrinsic calibration [33][34][35] is applied to acquire transformation matrices between LiDAR and the camera. Based on the transformation matrix, we perform registration to fuse the color images from the camera with the 3D point cloud from the LiDAR. x y z = and its related pixel in the camera image plane is described as ; where f is the focal length of the camera. In order to fuse the information from LiDAR and the camera, we need to look for the relationship to match P and c P . As shown in Figure 2, a 3D point in the LiDAR calibration plane is represented as P l " rx, y, zs T and its related pixel in the camera image plane is described as P c " rX, Y, 1s T . The 3D point P l with intensity information is projected to a calibration plane under a pinhole camera model. The calibration plane is defined at z " f and the projected point in the calibration plane is shown as P " ru, v, 1s T . Based on similar triangle rules, we have the following relationship: where f is the focal length of the camera. In order to fuse the information from LiDAR and the camera, we need to look for the relationship to match P and P c .  Figure 3 gives a workflow of the extrinsic calibration procedure. After projecting the 3D points to the calibration plane, we get a 2D point cloud. These 2D points are interpolated to generate a LiDAR intensity image. The problem of extrinsic calibration has become how to find the geometric constraints between a LiDAR intensity image and a camera image using the checkerboard pattern. The transformation of the checkerboard pattern from the LiDAR calibration coordinate frame to the camera coordinate frame is represented by a rigid 3 x 3 transformation matrix T .  As shown in Figure 4, to obtain the features of a checkerboard accurately, we select a Region of Interest (ROI) from the LiDAR intensity image and the camera panorama for the checkerboard pattern. Next, we take advantage of Random Sample Consensus (RANSAC) algorithm to find the correspondences between the LiDAR intensity images and the camera panorama images. In RANSAC, a pair of points is selected as an inlier only when the distance between them falls within the specified threshold. The distance metric used in RANSAC is as follows: where P is a point in the LiDAR intensity image, c P is a point in the camera panorama image, ( ) T P is the projection of a point on the intensity image based on the transformation matrix T , d is the distance between a pair of points, ξ is the threshold, and N is the number of points.  Figure 3 gives a workflow of the extrinsic calibration procedure. After projecting the 3D points to the calibration plane, we get a 2D point cloud. These 2D points are interpolated to generate a LiDAR intensity image. The problem of extrinsic calibration has become how to find the geometric constraints between a LiDAR intensity image and a camera image using the checkerboard pattern. The transformation of the checkerboard pattern from the LiDAR calibration coordinate frame to the camera coordinate frame is represented by a rigid 3 x 3 transformation matrix T.  Figure 3 gives a workflow of the extrinsic calibration procedure. After projecting the 3D points to the calibration plane, we get a 2D point cloud. These 2D points are interpolated to generate a LiDAR intensity image. The problem of extrinsic calibration has become how to find the geometric constraints between a LiDAR intensity image and a camera image using the checkerboard pattern. The transformation of the checkerboard pattern from the LiDAR calibration coordinate frame to the camera coordinate frame is represented by a rigid 3 x 3 transformation matrix T .  As shown in Figure 4, to obtain the features of a checkerboard accurately, we select a Region of Interest (ROI) from the LiDAR intensity image and the camera panorama for the checkerboard pattern. Next, we take advantage of Random Sample Consensus (RANSAC) algorithm to find the correspondences between the LiDAR intensity images and the camera panorama images. In RANSAC, a pair of points is selected as an inlier only when the distance between them falls within the specified threshold. The distance metric used in RANSAC is as follows: where P is a point in the LiDAR intensity image, c P is a point in the camera panorama image, ( ) T P is the projection of a point on the intensity image based on the transformation matrix T , d is the distance between a pair of points, ξ is the threshold, and N is the number of points. As shown in Figure 4, to obtain the features of a checkerboard accurately, we select a Region of Interest (ROI) from the LiDAR intensity image and the camera panorama for the checkerboard pattern. Next, we take advantage of Random Sample Consensus (RANSAC) algorithm to find the correspondences between the LiDAR intensity images and the camera panorama images. In RANSAC, a pair of points is selected as an inlier only when the distance between them falls within the specified threshold. The distance metric used in RANSAC is as follows: where P is a point in the LiDAR intensity image, P c is a point in the camera panorama image, TpPq is the projection of a point on the intensity image based on the transformation matrix T, d is the distance between a pair of points, ξ is the threshold, and N is the number of points. The algorithm for generating the transformation matrix is summarized below: (1) Find the inliers for the corners of checkerboard based on RANSAC algorithm.The RANSAC algorithm follows these steps: (a) Randomly select three pairs of points from the LiDAR intensity image and camera image to estimate a fitting model. (b) Calculate the transformation matrix T from the selected points. (c) Change the T value, if the distance matrix of a new T is less than the original one.
(2) Choose the transformation matrix T, which has the maximum inliers.
(3) Use all inlier point pairs to compute a refined transformation matrix T.
After generating the transformation matrices, we are able to stitch the camera panoramas together and fuse them with one LiDAR intensity image by applying the transformation matrices. Then, we back project the textured 2D points to a 3D color point cloud. The algorithm for generating the transformation matrix is summarized below: (1) Find the inliers for the corners of checkerboard based on RANSAC algorithm.
The RANSAC algorithm follows these steps: (a) Randomly select three pairs of points from the LiDAR intensity image and camera image to estimate a fitting model.
(b) Calculate the transformation matrix T from the selected points.
(c) Change the T value, if the distance matrix of a new T is less than the original one. (2) Choose the transformation matrix T , which has the maximum inliers.
(3) Use all inlier point pairs to compute a refined transformation matrix T .
After generating the transformation matrices, we are able to stitch the camera panoramas together and fuse them with one LiDAR intensity image by applying the transformation matrices. Then, we back project the textured 2D points to a 3D color point cloud.  The extrinsic calibration result is applied to a large indoor environment. The LiDAR-camera scan system is mounted on a pushcart in order to record the data in stop-and-go mode. By manually aligning the data in each survey point, we can get a detailed 3D model of the indoor environment. At the same time, the 3D model is partitioned, based on the survey point locations. Figure 5 shows a 2D map of a large corridor area and survey point locations. The corresponding 3D model is shown in Figure 6. In order to construct a 3D model of the corridor, with an area of 630,000 square feet, we have collected 1,029,974 data points; each point with an XYZ and RGB value. Based on the high accuracy of the laser beam, this model is much more accurate than the 3D model generated from a RGB-D sensor. Admittedly, it is computational heavy to process the data in the offline training phase to build a detailed 3D color point cloud. However, during the online tracking phase, we only need to match the captured image with a local model instead of the entire one, which significantly reduces the cost.  The extrinsic calibration result is applied to a large indoor environment. The LiDAR-camera scan system is mounted on a pushcart in order to record the data in stop-and-go mode. By manually aligning the data in each survey point, we can get a detailed 3D model of the indoor environment. At the same time, the 3D model is partitioned, based on the survey point locations. Figure 5 shows a 2D map of a large corridor area and survey point locations. The corresponding 3D model is shown in Figure 6. In order to construct a 3D model of the corridor, with an area of 630,000 square feet, we have collected 1,029,974 data points; each point with an XYZ and RGB value. Based on the high accuracy of the laser beam, this model is much more accurate than the 3D model generated from a RGB-D sensor. Admittedly, it is computational heavy to process the data in the offline training phase to build a detailed 3D color point cloud. However, during the online tracking phase, we only need to match the captured image with a local model instead of the entire one, which significantly reduces the cost.
(1) Find the inliers for the corners of checkerboard based on RANSAC algorithm.
The RANSAC algorithm follows these steps: (a) Randomly select three pairs of points from the LiDAR intensity image and camera image to estimate a fitting model.
(b) Calculate the transformation matrix T from the selected points.
(c) Change the T value, if the distance matrix of a new T is less than the original one. (2) Choose the transformation matrix T , which has the maximum inliers.
(3) Use all inlier point pairs to compute a refined transformation matrix T .
After generating the transformation matrices, we are able to stitch the camera panoramas together and fuse them with one LiDAR intensity image by applying the transformation matrices. Then, we back project the textured 2D points to a 3D color point cloud.  The extrinsic calibration result is applied to a large indoor environment. The LiDAR-camera scan system is mounted on a pushcart in order to record the data in stop-and-go mode. By manually aligning the data in each survey point, we can get a detailed 3D model of the indoor environment. At the same time, the 3D model is partitioned, based on the survey point locations. Figure 5 shows a 2D map of a large corridor area and survey point locations. The corresponding 3D model is shown in Figure 6. In order to construct a 3D model of the corridor, with an area of 630,000 square feet, we have collected 1,029,974 data points; each point with an XYZ and RGB value. Based on the high accuracy of the laser beam, this model is much more accurate than the 3D model generated from a RGB-D sensor. Admittedly, it is computational heavy to process the data in the offline training phase to build a detailed 3D color point cloud. However, during the online tracking phase, we only need to match the captured image with a local model instead of the entire one, which significantly reduces the cost.

. Graph Structure Construction
The key idea of constructing a graph structure is to represent the indoor environment using where V are vertices defined at each survey point during 3D modeling indoor environment, and E are edges that connect different segments of the 3D model if there is ect access from one segment to another. The corresponded graph structure for a corridor area own in Figure 7 on the left. For a small room, we scan the entire room by standing in the midd d rotating the scanning system 360°. Thus, the room will be represented by a single vertex in t ph structure. For a large open space, the graph structure is shown in Figure 7 on the right.
By introducing a graph as the data structure, we are able to restrict the user movement an

Graph Structure Construction
The key idea of constructing a graph structure is to represent the indoor environment using a graph tG " pV, Equ, where V are vertices defined at each survey point during 3D modeling of the indoor environment, and E are edges that connect different segments of the 3D model if there is a direct access from one segment to another. The corresponded graph structure for a corridor area is shown in Figure 7 on the left. For a small room, we scan the entire room by standing in the middle and rotating the scanning system 360˝. Thus, the room will be represented by a single vertex in the graph structure. For a large open space, the graph structure is shown in Figure 7 on the right.

Graph Structure Construction
The key idea of constructing a graph structure is to represent the indoor environment using a where V are vertices defined at each survey point during 3D modeling of the indoor environment, and E are edges that connect different segments of the 3D model if there is a direct access from one segment to another. The corresponded graph structure for a corridor area is shown in Figure 7 on the left. For a small room, we scan the entire room by standing in the middle and rotating the scanning system 360°. Thus, the room will be represented by a single vertex in the graph structure. For a large open space, the graph structure is shown in Figure 7 on the right.
By introducing a graph as the data structure, we are able to restrict the user movement and predict the user location at the next moment. Since the user can only move around connected vertices, we will only have limited amount of candidate vertices for the next move. If the space involved in a vertex is small enough, finding the user's location is approximated as locating the vertex. This is similar to grid-based localization. However, the smaller the grid is, the higher the computational cost will be. In practice, we choose the grid size as the survey area covered during each 3D scan. Thus, locating the vertex gives a coarse estimation of the user's location. A finer positioning within the vertex is achieved using particle filtering. The use of the graph structure also increases the system's robustness. In the case of crowded environments, where the sensor signal fidelity may not be reliable, the constraints in the graph can help detect a sensor failure if the prediction based on the sensor's measurement violates the constraints. A vertex V i encodes the 3D color point cloud PT V i , WiFi received signal strength RSS V i , and position POS V i . These attributes are stored in an object array,

Gaussian Process Modeling of WiFi Received Signal Strength
The edges act as constraints for the motion choices, since the user in vertex V i can only directly access vertex V j if there is an edge E ij " tV i , V j u between them.
By introducing a graph as the data structure, we are able to restrict the user movement and predict the user location at the next moment. Since the user can only move around connected vertices, we will only have limited amount of candidate vertices for the next move. If the space involved in a vertex is small enough, finding the user's location is approximated as locating the vertex. This is similar to grid-based localization. However, the smaller the grid is, the higher the computational cost will be. In practice, we choose the grid size as the survey area covered during each 3D scan. Thus, locating the vertex gives a coarse estimation of the user's location. A finer positioning within the vertex is achieved using particle filtering. The use of the graph structure also increases the system's robustness. In the case of crowded environments, where the sensor signal fidelity may not be reliable, the constraints in the graph can help detect a sensor failure if the prediction based on the sensor's measurement violates the constraints.

Gaussian Process Modeling of WiFi Received Signal Strength
A Gaussian process (GP) essentially estimates a posterior probability distribution over functions from training data (details can be found in [36]). We will give a brief introduction here.
Let us first define a function, f px˚q, be the posterior distribution that makes predictions for all possible inputs x˚. Additionally, we have D " tpx i , y i q|i " 1, ..., nu, which is a set of training samples consisting of n observations drawing from a noisy process, y i " f px i`ε q, where each x i is an input sample in d , and each y i is a target value in . ε is additive Gaussian noise with zero mean and variance σ 2 n . For notational convenience, the inputs of the training set are grouped into a dˆmatrix X, and the observations y i are grouped into a vector y.
To estimate the posterior distribution over function f px˚q from training dataset D, GP depends on a covariance function kernel kpx p , x q q, which specifies how the values at different points are correlated to each other. This kernel can be specified as any arbitrary covariance function, and we have chosen the widely-used squared exponential kernel Here, the parameters σ 2 f and l are the signal variance and the length scale, which determine the strength of the correlation between different points.
Since we only have access to the noisy observations y instead of the true function value f pxq, we must add a term to account for observation noise in the covariance function: Here, σ 2 n is the Gaussian observation noise and δ pq is one if p " q or zero otherwise. For an entire set of input values X, the covariance over the corresponding observations y can be written as: where K is the nˆn covariance matrix of the input values, defined as Krp, qs " kpx p , x q q.
Note that the covariance between the observations is written as a function of the inputs, emphasizing the non-parametric nature of Gaussian process regression. Now we can generate the posterior distribution over functions pp f px˚q|x˚, X, yq " Npµ x˚, σ 2 x˚q to predict the function value for any arbitrary points x˚, given the training data X and y: The predicted mean and variance are: σ 2 x˚" kpx˚, x˚q´k T pK`σ 2 n Iq´1k˚ (8) The parameters σ 2 f , σ 2 n and l control the smoothness of the functions, estimated by a GP, and can be learned from training data, by maximizing the log marginal likelihood of the observations. This learning process is completed offline, immediately after the training dataset is built.
To apply GP in WiFi signal strength modeling, the input values X correspond to positions, and the observations y correspond to signal strength measurements gathered at these positions. The GP posterior is estimated from a collection of signal strength measurements corresponding to their positions. Assuming independence between different access points, we estimate a GP for each access point separately.

Motion Dynamic Model
The motion dynamic model is defined as the user's position changes over time, which are represented by the distance traveled and the heading movement. The built-in motion sensors in the mobile device, including an accelerometer, gyroscope and magnetometer, are used to track the user's movement. We first take a look at the accelerometer measurement. If the user is standing still, it is expected that their mobile device will register little acceleration. Therefore, the standard deviation in the magnitude of acceleration is selected to detect the walking/stopped transition. If σ |a| ă 0.01, it is very likely that the user is stopped. If σ |a| ě 0.01, however, it is not sufficient to ascertain that the user is walking. For example, the sudden movement of the user's hands could result in a larger acceleration. Thus, we exploit the repetitive nature of walking [37]. Figure 8 shows the acceleration data recorded by a walking user. We can see that the acceleration data exhibits a highly repetitive pattern. This pattern arises due to the rhythmic nature of walking.
8 The motion dynamic model is defined as the user's position changes over time, which are represented by the distance traveled and the heading movement. The built-in motion sensors in the mobile device, including an accelerometer, gyroscope and magnetometer, are used to track the user's movement. We first take a look at the accelerometer measurement. If the user is standing still, it is expected that their mobile device will register little acceleration. Therefore, the standard deviation in the magnitude of acceleration is selected to detect the walking/stopped transition. If , it is very likely that the user is stopped. If | | 0.01 a σ ≥ , however, it is not sufficient to ascertain that the user is walking. For example, the sudden movement of the user's hands could result in a larger acceleration. Thus, we exploit the repetitive nature of walking [37]. Figure 8 shows the acceleration data recorded by a walking user. We can see that the acceleration data exhibits a highly repetitive pattern. This pattern arises due to the rhythmic nature of walking. In order to determine whether the user actually enters the walking mode, we calculate the auto-correlation of the acceleration signal ( ) a n for lag τ at the th m , as follow: user is very likely to be walking. Otherwise, there is no change in the motion transition state. This double threshold is able to prevent some irregular movements, for example, the movement of the user's hand, to change the motion state.
Once we have determined that the state is walking, step counting and stride estimation are performed to calculate the walking distance of the user. As shown in Figure 9, step counting is realized by dividing the duration of a sample when the maximum auto-correlation is ( ) 0.  In order to determine whether the user actually enters the walking mode, we calculate the auto-correlation of the acceleration signal apnq for lag τ at the m th , as follow: χpm, τq " k"τ´1 ř k"0 rapm`kq´µpm, τq˚papm`k`τq´µpm`τ, τqqs τ˚σpm, τq˚σpm`τ, τq (9) where µpk, τq and σpk, τq are the mean and standard deviation of the sequence of samples from apkq to apk`τ´1q. If the user is walking, then the auto-correlation will spike the periodicity of the walker. We define ψpmq as the maximum of the auto-correlation. If σ |a| ě 0.01, and ψpmq ě 0.8, then the user is very likely to be walking. Otherwise, there is no change in the motion transition state. This double threshold is able to prevent some irregular movements, for example, the movement of the user's hand, to change the motion state.
Once we have determined that the state is walking, step counting and stride estimation are performed to calculate the walking distance of the user. As shown in Figure 9, step counting is realized by dividing the duration of a sample when the maximum auto-correlation is ψpmq ě 0.8 by τ opt , and round up to an integer value. The τ opt is determined by simply finding the most frequently occurring τ in the duration when ψpmq ě 0.8. Because the human stride is not constant during walking, the stride size is determined by dynamically checking the acceleration sequence. We apply an empirical equation, based on [38], to estimate the stride size.
where k a represents the measured acceleration and N represents the number of samples in one period of walking. The relationship between stride, period of one step, and acceleration is established through a walking test, where the tester walks with a fixed stride using ground marks. In order to detect the user's orientation, we apply a gyroscope and magnetometer sensor to detect angle change. According to reference [39], the rapid fluctuation in the sensor signals is modeled heuristically with a zero mean, white Gaussian noise. The initial orientation of the user is provided by the magnetometer sensor init  Figure 10 shows the device yaw attitude changing while a user, holding the mobile device horizontally, is walking in a corridor. Because the human stride is not constant during walking, the stride size is determined by dynamically checking the acceleration sequence. We apply an empirical equation, based on [38], to estimate the stride size.
where a k represents the measured acceleration and N represents the number of samples in one period of walking. The relationship between stride, period of one step, and acceleration is established through a walking test, where the tester walks with a fixed stride using ground marks. In order to detect the user's orientation, we apply a gyroscope and magnetometer sensor to detect angle change. According to reference [39], the rapid fluctuation in the sensor signals is modeled heuristically with a zero mean, white Gaussian noise. The initial orientation of the user is provided by the magnetometer sensor θ init " θ magn`nmagn , then the orientation is updated using the gyroscope θ t " θ t´1`θgyro`ngyro . Here, n magn and n gyro are Gaussian noise of the magnetometer and gyroscope measurements. Figure 10 shows the device yaw attitude changing while a user, holding the mobile device horizontally, is walking in a corridor.  Because the human stride is not constant during walking, the stride size is determined by dynamically checking the acceleration sequence. We apply an empirical equation, based on [38], to estimate the stride size. 3 1 where k a represents the measured acceleration and N represents the number of samples in one period of walking. The relationship between stride, period of one step, and acceleration is established through a walking test, where the tester walks with a fixed stride using ground marks. In order to detect the user's orientation, we apply a gyroscope and magnetometer sensor to detect angle change. According to reference [39], the rapid fluctuation in the sensor signals is modeled heuristically with a zero mean, white Gaussian noise. The initial orientation of the user is provided by the magnetometer sensor init  Figure 10 shows the device yaw attitude changing while a user, holding the mobile device horizontally, is walking in a corridor. To determine the new position, we first need to calculate the step number and stride length, and then estimate the movement as follows:

31472
Sensors 2015, 15,[31464][31465][31466][31467][31468][31469][31470][31471][31472][31473][31474][31475][31476][31477][31478][31479][31480][31481] To determine the new position, we first need to calculate the step number and stride length, and then estimate the movement as follows: xpt`1q " xptq`pplptq`δlptqq˚cospθptq`δθptqq (11) ypt`1q " yptq`pplptq`δlptqq˚sinpθptq`δθptqq (12) where lptq and θptq are the estimated step length and heading direction, while δlptq and δθptq are the zero mean Gaussian noises on the length and direction, respectively. The motion dynamic model fuses the information provided by different motion sensors, and indicates a higher likelihood field in particle filtering. There are four scenarios depending on the access of different kinds of motion sensors. When an accelerometer is available, the distance traveled can be estimated, based on step counting and stride length estimation. Otherwise, the distance is estimated with an empirical maximum speed, for example, 1 m/s. If a gyroscope and a magnetometer are available, the user heading is detected, if not, the heading remains unknown. Assuming an open space, the calculation of the likelihood field is shown in Figure 11. The grid points denote all possible state candidates for the next epoch, and the black dot is a state candidate of the current epoch. When only the distance traveled is measured, the likelihood field (shaded area) is located within a ring zone around the triangle, as shown in Figure 11a. The radius and width of the ring are determined, based on the measured distance and its uncertainty, respectively. If only the user's heading is detected, and an empirical maximum speed is used to calculate a maximum walking range within a time interval, the likelihood field is shown in Figure 11b. The angle of the shaded zone is determined based on the heading and its uncertainty. If both the distance and heading are measured, the likelihood field is shown in Figure 11c. This is the case for our system, which greatly reduces the amount of particles needed for precise positioning, thus decreasing the computational complexity in particle filtering. Finally, if we have no access to any motion sensors, assuming a maximum walking range, the likelihood field is located within a whole circular area, as shown in Figure 11d. The motion dynamic model fuses the information provided by different motion sensors, and indicates a higher likelihood field in particle filtering. There are four scenarios depending on the access of different kinds of motion sensors. When an accelerometer is available, the distance traveled can be estimated, based on step counting and stride length estimation. Otherwise, the distance is estimated with an empirical maximum speed, for example, 1 m/s. If a gyroscope and a magnetometer are available, the user heading is detected, if not, the heading remains unknown. Assuming an open space, the calculation of the likelihood field is shown in Figure 11. The grid points denote all possible state candidates for the next epoch, and the black dot is a state candidate of the current epoch. When only the distance traveled is measured, the likelihood field (shaded area) is located within a ring zone around the triangle, as shown in Figure 11a. The radius and width of the ring are determined, based on the measured distance and its uncertainty, respectively. If only the user's heading is detected, and an empirical maximum speed is used to calculate a maximum walking range within a time interval, the likelihood field is shown in Figure 11b. The angle of the shaded zone is determined based on the heading and its uncertainty. If both the distance and heading are measured, the likelihood field is shown in Figure 11c. This is the case for our system, which greatly reduces the amount of particles needed for precise positioning, thus decreasing the computational complexity in particle filtering. Finally, if we have no access to any motion sensors, assuming a maximum walking range, the likelihood field is located within a whole circular area, as shown in Figure 11d.

HMM Model
A general HMM characterizes a physical system with a state space model. In the problem of position tracking, the HMM model represents the temporal correlation of a user's position and orientation. Figure 12 shows a HMM factor graph, the state

HMM Model
A general HMM characterizes a physical system with a state space model. In the problem of position tracking, the HMM model represents the temporal correlation of a user's position and orientation. Figure 12 shows a HMM factor graph, the state Xptq " txptq, yptq, θptq, mptq, Vptqu, where xptq, yptq, θptq represents the user's position and orientation, mptq P pwalk, stopq, Vptq indicates the current vertex where the user is located in the graph structure. The state transition model ppX t |X t´1 q constructed of ppx t , y t |x t´1 , y t´1 q, ppθ t |θ t´1 q, ppm t |m t´1 q and ppV t |V t´1 q serve as the motion model in our algorithm. The observation model is based on the WiFi signal strength measurement, motion sensor readings, and the captured image during online tracking.  It is defined as a 2 by 2 matrix, which models the preference of staying in the previous state, avoiding too-rapid changes in motion states. Moreover, we apply different matrices for different environments. This enables the system to model the fact that the user is more likely to stop in a room than in a corridor.

( | )
t t p θ θ − represents the probability of the current orientation t θ given the previous orientation 1 t θ − . It depends on whether the user is walking or stopped, and whether he is in a hallway or an open space. As the choice is limited to two, we use binary code to represent the situation. We define Walking = 1, Stopped = 0 for the leftmost binary digit, and Hallway = 1, Open space = 0 for the rightmost binary digit in the binary representation of the situation. In total we have four situations (00, 01, 10, 11). For example, if the user is walking in a hallway, his situation code is 11. If the next motion state has been determined to be stop, the difference t θ Δ is sampled uniformly with a constant probability π . Otherwise, if the next motion state is to be walking, we make a simple assumption that the user prefers to walk in a straight line, so that Otherwise, we calculate the distance that the user has travelled. For this distance, we determine whether the movement along the corridor results in a transition to ppm t |m t´1 q represents the probability of the motion state being walking or stopped given the previous motion state. It is defined as a 2 by 2 matrix, which models the preference of staying in the previous state, avoiding too-rapid changes in motion states. Moreover, we apply different matrices for different environments. This enables the system to model the fact that the user is more likely to stop in a room than in a corridor. ppθ t |θ t´1 q represents the probability of the current orientation θ t given the previous orientation θ t´1 . It depends on whether the user is walking or stopped, and whether he is in a hallway or an open space. As the choice is limited to two, we use binary code to represent the situation. We define Walking = 1, Stopped = 0 for the leftmost binary digit, and Hallway = 1, Open space = 0 for the rightmost binary digit in the binary representation of the situation. In total we have four situations (00, 01, 10,11). For example, if the user is walking in a hallway, his situation code is 11. If the next motion state has been determined to be stop, the difference ∆θ t is sampled uniformly with a constant probability α " r0, 2πs. Otherwise, if the next motion state is to be walking, we make a simple assumption that the user prefers to walk in a straight line, so that ∆θ t is sampled from a zero mean Gaussian distribution. If the user is in an open space, ∆θ t follows a unimodal Gaussian distribution.
On the other hand, if the user is in a hallway, ∆θ t follows a bimodal Gaussian distribution, since the user has a higher chance of choosing from two opposite orientations. ppx t , y t |x t´1 , y t´1 q represents the probability of the current position px t , y t q given the previous position px t´1 , y t´1 q. If the previous motion state m t has been determined as stopped, the current position is equal to the previous one. Otherwise, the current position is updated by sampling the moving distance d t from a Gaussian Npµ, σ 2 q. Based on a simple straight motion assumption, the new position is calculated as follows: ppV t |V t´1 q represents the probability of the current vertex V t given the previous vertex V t´1 . If V t is in a corridor, we first calculate the walking/stopped probability. If m t " stopped, then V t " V t´1 . Otherwise, we calculate the distance that the user has travelled. For this distance, we determine whether the movement along the corridor results in a transition to another vertex, or remains within the same vertex area. The vertex transition is constrained to only two adjacent vertices. If V t is in an open space, it may have up to nine candidate vertices. We first sample the walking/stopped motion transition state and corresponding motion distance. Then, the vertex V t is determined based on a simple straight line movement.
The where z WiFi t is the received signal strength at time t, µ x˚a nd σ 2 x˚a re the mean and variance at position x˚, predicted using Equations (7) and (8).
The camera measurement likelihood model is computed by pairwise pixel comparison between the current view and the view of a particle [11]. In the particle filter, a given particle's view of the environment can be projected from the 3D textured model #pixels in similar color #total pixels (16) A predefined threshold determines the maximal color difference for two pixels, and a normalized color space is adopted to alleviate the effect of illumination.
The motion sensor measurement likelihood model compares the motion model ppX t |X t´1 q with the motion dynamic model derived from motion sensor readings.
where z Motion t is the pose calculated from the motion-sensor-based motion dynamic model, x˚is the pose estimated from the motion model defined in HMM.

Online Tracking with Particle Filter
Bayesian filtering is used to estimate the posterior state X t given all sensor measurements Z 0:t . Under the Markov assumption, we have the following recursive equation, which is updated whenever new sensor data become available: where ppX t |X t´1 q represents the motion model, and ppZ t |X t q is the observation model. We implement Bayesian filtering using a particle filter, which represents posterior over the state X t by sets S t of M weighted samples: S t " tă X is an importance weight of the state. The particle filter applies the recursive Bayesian filter update to estimate posteriors over the state space. An online tracking algorithm using a particle filter is performed according to the following steps: (1) Particle Initialization The initial position is calculated through the weighted K nearest neighbor (W-KNN) method. It searches for K closest matches of known positions in the WiFi received signal space from the offline-built dataset. By averaging these K position candidates with adopting the distances in signal space as weights, the initial estimated position is acquired. This initial position estimation is used as the starting point for particles.
(2) Particle propagation Next, we apply the state transition model to guide particle propagation. During the state transition process, we observe that the particle propagation plays an important role, which represents phase, WiFi received signal strength, color images, point cloud, and motion signals, including user acceleration, device attitude, and rotation rate, are recorded along the entire scenario. A detailed 3D model of the indoor environment is generated by fusing color images from the camera and point cloud from LiDAR. Then, we apply Gaussian process modeling to generate a signal strength map of the WiFi RF fingerprints. The WiFi signal strength map is overlapped with the 3D model and we divide the 3D model into different segments according to our survey points. Each segment is encoded in the vertices of a graph structure. We store the graph structure in the mobile device for the online tracking phase. Once we press the "Locate" button, the mobile device starts scanning the WiFi received signal strength from all the access points it can detect. Particles are initialized through weighted the K nearest neighbor method. After the initial distribution, particles start to propagate under the guidance of the HMM motion model. Every 5 s, the particles are resampled using the HMM observation model. Through particle filtering, we are able to locate the user and track the user's movement in real time.
The training and testing are conducted in an office building corridor area and a library's open space. In total, we have 24 survey points in the corridor and 21 survey points in the library. Positioning tests are conducted on a predefined path. The average length of the path is about 90 m in the library's open space and 100 m in the corridor area. The position update is performed every 5 s, which means the particle filtering step can be completed in 5 s. The memory usage is under 40 MB RAMsince the iOS app will be forced to shut down if it exceeds this limit. During the traverse on the path, we measured the error distance between the estimated position and the ground truth position. The ground truth position is based on manual annotation of waypoints. Whenever the tester reaches a waypoint, the timestamp is recorded and the estimated position with the actual one are compared. Figure 14 shows screenshots of the real time test results in the corridor and the library.

Implementation on iOS Platform and Experimental Analysis
We realize the indoor positioning algorithm on the iOS platform and build an app to test the system performance. The system workflow is shown in Figure 13. During the offline training phase, WiFi received signal strength, color images, point cloud, and motion signals, including user acceleration, device attitude, and rotation rate, are recorded along the entire scenario. A detailed 3D model of the indoor environment is generated by fusing color images from the camera and point cloud from LiDAR. Then, we apply Gaussian process modeling to generate a signal strength map of the WiFi RF fingerprints. The WiFi signal strength map is overlapped with the 3D model and we divide the 3D model into different segments according to our survey points. Each segment is encoded in the vertices of a graph structure. We store the graph structure in the mobile device for the online tracking phase. Once we press the "Locate" button, the mobile device starts scanning the WiFi received signal strength from all the access points it can detect. Particles are initialized through weighted the K nearest neighbor method. After the initial distribution, particles start to propagate under the guidance of the HMM motion model. Every 5 s, the particles are resampled using the HMM observation model. Through particle filtering, we are able to locate the user and track the user's movement in real time.    To gain insight into the positioning error distribution, Figure 16 presents the cumulative probabilities of the positioning errors of the different cases. The comparison further shows that the motion dynamic model greatly increases positioning accuracy.    To gain insight into the positioning error distribution, Figure 16 presents the cumulative probabilities of the positioning errors of the different cases. The comparison further shows that the motion dynamic model greatly increases positioning accuracy. To gain insight into the positioning error distribution, Figure 16 presents the cumulative probabilities of the positioning errors of the different cases. The comparison further shows that the motion dynamic model greatly increases positioning accuracy.  Table 2 compares the error mean, root mean square (RMS) error, and maximum error in different scenarios.   Table 2 compares the error mean, root mean square (RMS) error, and maximum error in different scenarios. As we can see from Table 2, the incorporation of motion sensors with WiFi RSSI has greatly reduced the positioning error. Due to the implementation issue on the iOS device, we have not included the visual sensors in the real time test. The visual sensors are applied separately. After the real time test is done, we capture the images at all the waypoints and correct the estimated position. Table 3 shows the correction results.  Table 3 shows that the visual sensors could further improve the positioning accuracy. Overall, the positioning performance has proved that our iOS application is a robust, accurate, highly-integrated indoor positioning system. However, we have not implemented various state-of-the-art indoor positioning methods in the literature, due to the difficulty in realizing them on a mobile platform. The state-of-the-art method of a WiFi signal strength based method using Gaussian process is discussed in [4,5], we implemented this on the iOS platform. Moreover, we introduce a framework based on HMM to effectively fuse the WiFi information with motion sensor and visual sensor information on the mobile platform in order to improve system performance.

Conclusions
In this paper, we have demonstrated an indoor localization system based on a graph structure and multimodal particle filtering technique. The implementation on the iOS platform, and the test in a real world situation proved that our application is a reliable indoor localization system. To the best of our knowledge, this is the first iOS app delivering such accurate, highly integrated indoor localization system on a small mobile device. Based on our system, many position-aware applications will be able to function properly indoors, providing more convenient service to people's daily lives.
In the future, we will focus on fusing the visual sensors on board into our real time localization system. We believe the key technology for future localization systems lies in how to efficiently fuse the information provided by various sensors. We are also looking to integrate our system into other platforms, for example, a vehicular infotainment platform, so that we have access to more sensor information and can function both indoors and outdoors.