Integrated IMU with Faster R-CNN Aided Visual Measurements from IP Cameras for Indoor Positioning

Radio-based indoor positioning systems suffer from signal degradation caused by environmental factors, while IP (Internet Protocol) cameras are increasingly common in cities. Motivated by both, this paper presents a novel fusion of inertial measurement units (IMUs) with external IP cameras to determine the positions of moving users in indoor environments. The approach uses a fine-tuned Faster R-CNN (Region Convolutional Neural Network) to detect users in images captured by the cameras, and acquires visual measurements, namely the ranges and angles of users with respect to the cameras, using the proposed monocular vision relatively measuring (MVRM) method. The positions are determined by integrating the positions predicted by each user's IMU with the visual measurements using an EKF (Extended Kalman Filter). The experimental results show that the ranging accuracy is affected by both the height errors of the bounding box detected by Faster R-CNN and the measuring distance, whereas the heading accuracy is affected only by the bounding box's horizontal biases. The indoor obstacles in our tests, including stationary obstacles and a pedestrian, degrade the ranging accuracy more than the heading accuracy, and a pedestrian has a greater effect on the heading errors than stationary obstacles do. We implemented a positioning test for a single user and an external camera in five indoor scenarios to evaluate the performance. The robust fused IMU/MVRM solution significantly decreases the positioning errors and shows better performance in dense multipath scenarios compared with the pure MVRM solution and the ultra-wideband (UWB) solution.


Introduction
Indoor positioning technologies [1][2][3] are necessary technical prerequisites for various industrial and consumer applications in location-based services. Such a location solution typically provides a user with a reliable and accurate pose estimation of a device or a person in public and private areas [4][5][6][7], such as an airport, hotel, mall, or home. However, because GNSS (Global Navigation Satellite System) signals are not continually available indoors, and because complex indoor areas present unavoidable issues such as multipath and non-line-of-sight (NLOS) propagation, high-precision and reliable indoor positioning is difficult to achieve in realistic conditions. Given these constraints, developing a low-cost, reliable, and infrastructure-free or infrastructure-less positioning solution for consumers remains an open challenge.
To address this problem, much of the recent research focuses on sensor-based indoor positioning technologies. The representative solutions determining locations in indoor environments comprise Wi-Fi [8], and accelerometer to infer movements of a pedestrian based on measurements and previous locations. However, inertial sensors suffer from biases and drift errors that accumulate over time. Therefore, within the limits of cost, size, performance, and production, hybrid positioning technologies with inertial sensors, such as PDR/Bluetooth [25] and INS/Wi-Fi [26], have been introduced to reduce the drift errors and further improve positioning accuracy. The optimization-based INS/UWB approach in reference [27] achieved an RMSE of approximately 3 cm in position and less than 1° in orientation using an IMU and a UWB transmitter placed on a body and 10 UWB receivers deployed in a test room. However, this fusion system requires a more accurate distribution model to remove the large number of time of arrival (TOA) measurement outliers caused by multipath and NLOS conditions, and it combines the corrected UWB measurements with inertial measurements to determine the six degrees-of-freedom (DOF) pose of the moving body.
Compared with the conventional radio-based positioning methods, vision-based positioning [28,29] has proved to be a highly promising navigation and positioning technique in various applications. Visual simultaneous localization and mapping (SLAM) [30,31] has been extensively applied in fields such as virtual reality/augmented reality, robotic mapping and navigation, and autonomous vehicles. Feature detection and description, graph optimization, and loop closure are the key techniques for implementing SLAM algorithms. In contrast with binocular or stereo vision, monocular vision methods still struggle to provide robust and accurate pose estimations because they lack scale and depth information. Integration with inertial sensors can overcome the limitations of monocular vision-only and IMU-only positioning by exploiting their complementary properties: fast movements over a short period can be predicted by the IMU, and the IMU's drift errors can be corrected effectively by vision measurements. A popular representative is visual inertial odometry (VIO) [32,33], which has great potential for resolving the estimation problems in the above applications. Popular VIO algorithms include the filter-based MSCKF [34] and the optimization-based Okvis [35]. Recently, a robust and versatile monocular visual-inertial state estimator, VINS-Mono, was presented in reference [36]; it comprises optimization-based VIO, online loop detection, tightly-coupled re-localization, and four-DOF pose graph optimization. Experiments running on mobile devices show superior performance compared with Okvis. In [37], with the use of an urban 3D model, the position and orientation of a camera relative to a known street object are estimated using an efficient PnP algorithm. The final average position error for the fusion of IMU with the camera data is 0.22 m in the experiment conditions. This method relies heavily on a rich and reliable 3D model to determine the absolute position, especially in a low-visibility area.
In practical applications, indoor tracking and locating of a moving human target at a low cost with good performance remains an open issue. The popular vision-based methods for identifying a pedestrian in images in various environments can be categorized into traditional feature detectors and deep learning-based methods. In [38], pedestrians are recognized using algorithms based on edge density and symmetry maps. However, the position error depends on the category of image sequence, with a mean location error of 0.98 m for backwards running. In [39], a Bayesian vision tracking system providing the user's position estimation improved the position RMSE to 20 cm by integrating with inertial sensors through an EKF. In recent years, deep learning [40] has made significant breakthroughs in visual recognition, speech recognition, and natural language processing. In addition, positioning assisted by deep learning is becoming an active research area. Paper [41] presented a novel indoor positioning and tracking system that fuses magnetic signals and visual images, using a deep convolutional neural network (CNN) to extract deep features for measurements. The main contribution of paper [42] is to leverage a CNN to build a proper radio propagation model that is applied in crowded scenarios. Reference [43] gave a state-of-the-art survey of pedestrian detection and tracking methods using computer vision and deep learning techniques. In [44], the authors used a CNN to classify pedestrians in images and showed higher accuracy compared with the traditional SVM approach with Haar features. The work above provides a new perspective for resolving indoor positioning and tracking issues.
The rise and rapid progress of 5G and internet of things technologies [45][46][47] allow diverse wireless devices to be connected by larger-scale wireless sensor networks [48] for exchanging and sharing information, and they have been widely used in medical treatment, smart homes, higher education, intelligent transportation [49][50][51], etc. This technology commonly relies on measurements between every pair of nodes, such as ranges and angles, to realize relative or absolute localization based on wireless sensors. Reference [52] proposed an approach combining IMU and UWB ranging measurements for relative positioning among multiple users with a particle filter. These range-based and angle-based localization methods, which obtain peer-to-peer measurements, can also be implemented using cameras instead of wireless sensors.
Consequently, we propose a novel indoor positioning approach combining IMU and Faster R-CNN-aided relative measurements from IP cameras to determine the pose estimations of users, inspired by the rising popularity of IP cameras in cities and the complementary properties of IMUs and cameras. This solution leverages an extended Kalman filter to tightly fuse IMU data with the relative ranges and angles with respect to cameras obtained by the proposed monocular vision relatively measuring (MVRM) method. We tested this approach in indoor environments and evaluated its performance in stationary-object and pedestrian blockage scenarios. The experimental results show that the proposed approach can significantly reduce positioning errors and enhance reliability during blockage periods.
The rest of this paper is structured as follows. Section 2 presents the concept of the proposed integrated system. Section 3 briefly analyzes Faster R-CNN based object detection and models the proposed MVRM method; it also formulates the integrated IMU/MVRM system for a group of users and cameras. Section 4 introduces the experiment setup and the results of the proposed approach in real indoor environments. Section 5 summarizes the conclusions and further work.

System Overview
The concept of the proposed IMU/MVRM integrated system for indoor positioning, which comprises two phases (offline training and online location), is shown in Figure 1. In the offline phase, IP cameras capture images of users and send them to a server. These uploaded images are used to create a training dataset for training a multi-user detection model based on a deep neural network. In the online phase, when a user sends a positioning request, the cameras take images of this user in real time and send them to the server, which detects the user in each image with the trained model. Meanwhile, the user end sends the pose estimations predicted by its own IMU to the server. The fusion filter fuses the IMU's predictions with the ranges and angles of this user with respect to the cameras, obtained by the proposed MVRM method. Finally, the corrected estimations and sensor biases are sent back to the user end.

Faster R-CNN Based Object Detection
In the last few years, object detection using deep learning has attracted a great deal of attention, in particular the region-based CNN (R-CNN) [53]. This technique uses CNNs to extract features from all candidate regions, which makes it time-consuming because of the high computational cost during training and testing. To speed it up, Fast R-CNN [54] and Faster R-CNN [55] were introduced in succession. Faster R-CNN, with a region proposal network (RPN) specialized in generating proposals, merges region proposal, feature extraction, classification, and bounding box regression into a single deep neural network, which makes it roughly 250 times faster than R-CNN. In this work, a Zeiler and Fergus net (ZF) [56] based Faster R-CNN is used to detect users in images.
The performance of a deep learning model partially depends on the size of the dataset. However, it is hard to build a large-scale dataset for a particular detection task such as our multi-user detection. Therefore, we adopt the pre-trained Faster R-CNN, a detection model for 20 general object classes trained on the PASCAL VOC 2007 detection benchmark [57] as the source task, and fine-tune it on our training dataset for the target task, which efficiently improves the training performance and enhances generalization by reducing overfitting. The flowchart of fine-tuning Faster R-CNN is depicted in Figure 2. As shown, the model parameters are optimized by fine-tuning on the training data, and the output of the target task is the optimized multi-user detection model, which can accurately identify and locate users (the pedestrians trained in the offline phase) in images.

Generally, the performance of object detection can be evaluated in two aspects: mean average precision (mAP) and intersection over union (IoU). In this paper, mAP as defined in PASCAL VOC 2012 [58] is used to evaluate the fine-tuned Faster R-CNN.
AP is the area under the precision-recall curve:

$$AP = \int_0^1 P(R)\,dR$$

where P and R indicate the precision and recall rate of detection, written as

$$P = \frac{T_p}{T_p + F_p}, \qquad R = \frac{T_p}{T_p + F_n}$$

where T_p and F_p are the numbers of true and false positive samples, and F_n is the number of false negative samples.
IoU is used to evaluate the localization performance; it measures the overlap ratio between the predicted and ground-truth bounding boxes:

$$IoU = \frac{F \cap G}{F \cup G}$$

where F denotes the area of the bounding box predicted by the fine-tuned Faster R-CNN, and G denotes the actual area of the bounding box.
Sensors 2018, 18, 3134
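The detection metrics used in this section can be sketched in a few lines of Python. This is a minimal illustration, not the evaluation code used in the paper: the function names and the `(x_min, y_min, x_max, y_max)` box layout are assumptions made for the example.

```python
def iou(box_f, box_g):
    """Intersection over union of two axis-aligned boxes.

    Boxes are (x_min, y_min, x_max, y_max); this layout is an assumption.
    box_f plays the role of the predicted box F, box_g the ground truth G.
    """
    # Corners of the overlap rectangle (may be empty).
    ix_min = max(box_f[0], box_g[0])
    iy_min = max(box_f[1], box_g[1])
    ix_max = min(box_f[2], box_g[2])
    iy_max = min(box_f[3], box_g[3])
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)

    area_f = (box_f[2] - box_f[0]) * (box_f[3] - box_f[1])
    area_g = (box_g[2] - box_g[0]) * (box_g[3] - box_g[1])
    union = area_f + area_g - inter
    return inter / union if union > 0 else 0.0


def precision_recall(tp, fp, fn):
    """Precision and recall from true positive, false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall
```

For two 2 × 2 boxes offset by one unit in each direction, the overlap is 1 and the union is 7, so `iou((0, 0, 2, 2), (1, 1, 3, 3))` evaluates to 1/7.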

Monocular Vision-Based Relatively Measuring Method
As mentioned previously, the proposed monocular vision-based relatively measuring method uses the pixel locations of users detected in images by the fine-tuned Faster R-CNN, together with the real height of the users in meters in the world coordinate system, to estimate the ranges and angles of users with respect to cameras, based on the ranging and angulation models presented in this section. Unlike radio-based TOA, angle of arrival (AOA), received signal strength indicator (RSSI), etc., the measuring accuracy of this vision-based method does not decrease due to multipath interference.

Ranging Model
In general, users being detected are not located on the optical axis of a camera. The relative range of a user with respect to a camera is the distance between the user and the optical center of the lens. As shown in Figure 3, r is the relative range, o is the optical center of the lens, u and v are the object distance and image distance respectively, l is the distance from a user to the optical axis, l_c is the distance from the image of the user to the center of the film measured on the CMOS sensor, h_r denotes the real height of the user, and h_c denotes the image height of the user measured on the CMOS sensor. The relationship of h_r and h_c is:

$$\frac{h_r}{h_c} = \frac{u}{v}$$

where h_c is expressed as:

$$h_c = \frac{n_v}{re}, \qquad re = \frac{\sqrt{h_p^2 + v_p^2}}{diag_{35mm}} \cdot \frac{f_{35mm}}{f}$$

where n_v is the user height in pixels, re is the number of pixels per unit length, f_{35mm} and f are the 35 mm equivalent focal length and the focal length in world units respectively, h_p and v_p are the image dimensions, and diag_{35mm} is the length of the diagonal of 35 mm film.
Similarly, the relationship of l and l_c is expressed as:

$$\frac{l}{l_c} = \frac{u}{v}, \qquad l_c = \frac{\sqrt{d_h^2 + d_v^2}}{re}$$

where d_h and d_v are the distances from the image of a user to the center of the film measured in pixels in the horizontal and vertical directions, respectively. The estimated relative range r is given by:

$$r = \sqrt{u^2 + l^2} = \frac{h_r}{h_c}\sqrt{v^2 + l_c^2}$$

To acquire v, an initialization process must be run in advance, and the focus must be kept fixed during the whole test. The initialization parameter v is defined as

$$v = \sum_{i=1}^{N} W_i \sqrt{\left(\frac{r_{i0}\, n_{v_{i0}}}{h_{r_i}\, re}\right)^2 - l_{c_{i0}}^2}$$

where N is the number of detected users in an image, W_i is the weight of user i, r_{i0} denotes the known initial relative range between user i and the camera, n_{v_{i0}} and l_{c_{i0}} are the initial measurements of n_v and l_c of user i respectively, and h_{r_i} is the height of user i. In this initialization process for determining the image distance of a camera, we can commonly obtain an image of multiple users captured by the camera. With knowledge of the initial range from each user to the camera, we obtain a group of image distance values. Using the weights W_i, the weighted mean of this group is taken as the final image distance for this camera.
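The ranging model can be illustrated with a short Python sketch. It assumes the pinhole similar-triangle relations h_r/h_c = u/v and l/l_c = u/v, a sensor diagonal obtained by scaling the 35 mm film diagonal (about 43.27 mm) by f/f_35mm, and a single-user initialization. The function names and the numeric values in the usage note are hypothetical and are not taken from the experiments.

```python
import math

DIAG_35MM = 43.27  # diagonal of 36 mm x 24 mm film, in millimetres


def pixels_per_unit(h_p, v_p, f, f_35mm, diag_35mm=DIAG_35MM):
    """Pixels per millimetre on the sensor (the quantity called `re` in the text)."""
    sensor_diag = diag_35mm * f / f_35mm  # physical sensor diagonal in mm
    return math.hypot(h_p, v_p) / sensor_diag


def relative_range(n_v, d_h, d_v, h_r, v, re):
    """Range from camera to user in the units of h_r (e.g. metres).

    n_v      -- bounding-box height in pixels
    d_h, d_v -- pixel offsets of the user's image from the image centre
    h_r      -- real height of the user
    v        -- image distance in mm (from initialization)
    re       -- pixels per mm on the sensor
    """
    h_c = n_v / re                     # image height on the sensor, mm
    l_c = math.hypot(d_h, d_v) / re    # image offset from the centre, mm
    return (h_r / h_c) * math.hypot(v, l_c)


def image_distance(r0, n_v0, d_h0, d_v0, h_r, re):
    """Invert the range equation for v, given one known initial range r0."""
    h_c0 = n_v0 / re
    l_c0 = math.hypot(d_h0, d_v0) / re
    return math.sqrt((r0 * h_c0 / h_r) ** 2 - l_c0 ** 2)
```

As a consistency check, initializing with a user of height 1.7 m at a known range of 5 m and then calling `relative_range` with the same pixel measurements returns 5 m again. With several users, the per-user values of `image_distance` would be combined with the weights W_i as described above.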
In addition, a proper camera calibration is needed before the above pin-hole camera model can be used for ranging and angulation. Removing lens distortion from the images and correcting the principal point are the main tasks in camera calibration. Here, we use the standard camera calibration method by Zhengyou Zhang [59] to obtain the intrinsic and lens distortion parameters of the test camera beforehand.

Angulation Model
The proposed angulation model is illustrated in Figure 4. The projections of a point A onto the horizontal plane passing through the optical axis and onto its perpendicular plane are points C and P, respectively, and their projections onto the image plane are points C′ and P′.
The azimuth and elevation of A in the observer's local system are defined from these projections, where the azimuth angle is taken as positive when turning clockwise, and the elevation angle is taken as positive when the target lies above the horizontal plane.

IMU/MVRM Integrated System
Consider an IMU/MVRM integrated system comprising a group of users U = {u_i | i ∈ K} and a group of cameras C = {C_{u_i}(t) | i ∈ K}, where K = {1, ..., k} denotes the set of user IDs and C_{u_i}(t) is the set of cameras that can observe user u_i at the current time t. Each camera is considered to be an anchor, and every user is equipped with a 6DOF IMU combining a 3-axis gyroscope and a 3-axis accelerometer. The focus of this section is to create the dynamical and observation models of the integrated system in accordance with the framework of the extended Kalman filter.

Dynamical Model
The commonly used IMU sensor error models are written in terms of δρ, δv and δϕ, which refer to the position, velocity and attitude error vectors expressed in the east-north-up (ENU) coordinate system, respectively. ω_en indicates the angular rate of the navigation frame relative to the earth, ω_ie is the earth's angular rate, and f is the specific force. δω^b_ib and δf^b are the gyro drifts and accelerometer biases, respectively. For any single user u_i, a 15-dimension state vector is defined as:

$$x_{u_i} = \begin{bmatrix} \delta\rho & \delta v & \delta\phi & \delta\omega_{ib}^{b} & \delta f^{b} \end{bmatrix}^T$$

The dynamical model is expressed as:

$$\dot{x}_{u_i} = F_{u_i} x_{u_i} + G_{u_i} w_{u_i}$$

where F_{u_i} is the state transition model, G_{u_i} is the control model, and w_{u_i} is the process noise vector, which is assumed to be drawn from a zero-mean Gaussian distribution.
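For use in a discrete-time EKF, a continuous-time model driven by a transition model F and a control model G is usually discretized over the IMU sampling interval. The following first-order sketch shows one common way to do this. The sampling interval Δt, the process noise spectral density Q, the covariance P and the discrete matrices Φ_k and Q_k are symbols introduced here for illustration; the text does not specify the discretization actually used.

$$\Phi_k \approx I + F_{u_i}\,\Delta t, \qquad Q_k \approx G_{u_i}\, Q\, G_{u_i}^T\,\Delta t$$

$$\hat{x}_{k+1|k} = \Phi_k\, \hat{x}_{k|k}, \qquad P_{k+1|k} = \Phi_k\, P_{k|k}\, \Phi_k^T + Q_k$$

The approximation holds when Δt is short relative to the dynamics of F, which is typically the case at IMU output rates.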

Observation Model
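The measurements of the integrated system comprise the ranges and angles of users with respect to their visible anchors. First, a simple model involving a single user and a single anchor is created.
The ranging function d_{u_i c_j} and angulation function a_{u_i c_j} of user u_i relative to anchor c_j are:

$$d_{u_i c_j} = \sqrt{\Delta e_{ij}^2 + \Delta n_{ij}^2 + \Delta u_{ij}^2} + v_{u_i c_j}$$

$$a_{u_i c_j} = \begin{bmatrix} \arctan\left(\dfrac{\Delta e_{ij}}{\Delta n_{ij}}\right) + n_{u_i c_j} \\ \arctan\left(\dfrac{\Delta u_{ij}}{\sqrt{\Delta e_{ij}^2 + \Delta n_{ij}^2}}\right) + m_{u_i c_j} \end{bmatrix}$$

where v_{u_i c_j}, n_{u_i c_j} and m_{u_i c_j} are observation noises, which are assumed to be drawn from a zero-mean Gaussian distribution, and Δp = [Δe_{ij}, Δn_{ij}, Δu_{ij}]^T is the position difference of user u_i with respect to anchor c_j in ENU coordinates. The observation model of u_i with c_j is formed as:

$$z_{u_i c_j} = \begin{bmatrix} d^{IMU}_{u_i c_j} - d^{MVRM}_{u_i c_j} \\ a^{IMU}_{u_i c_j} - a^{MVRM}_{u_i c_j} \end{bmatrix}$$

where d^{MVRM}_{u_i c_j} and a^{MVRM}_{u_i c_j} are the ranges and angles measured using the MVRM method, and d^{IMU}_{u_i c_j} and a^{IMU}_{u_i c_j} are the ranges and angles predicted by the user's IMU.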
The measuring function of a user with a single anchor is then extended to multiple anchors. Assume that the set of N cameras which can observe user u_i simultaneously is C_{u_i} = {c_j | j ∈ {1, ..., N}}; the observation model is extended by stacking the single-anchor observations of all N cameras. Finally, the dynamical and observation models are extended to an IMU/MVRM integrated system comprising a group of users U = {u_i | i ∈ K} and a group of cameras C = {C_{u_i}(t) | i ∈ K}.
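The observation side of the filter can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the azimuth is taken clockwise from north in the ENU frame, the residual is formed as IMU-predicted minus MVRM-measured, angle differences are wrapped to [-π, π), and all function names are hypothetical.

```python
import math


def predict_measurements(user_enu, anchor_enu):
    """Range, azimuth and elevation of a user as seen from an anchor.

    Positions are (east, north, up) tuples. The azimuth is measured
    clockwise from north and the elevation upward from the horizontal
    plane; both conventions are assumptions for this sketch.
    """
    de = user_enu[0] - anchor_enu[0]
    dn = user_enu[1] - anchor_enu[1]
    du = user_enu[2] - anchor_enu[2]
    rng = math.sqrt(de * de + dn * dn + du * du)
    azimuth = math.atan2(de, dn)
    elevation = math.atan2(du, math.hypot(de, dn))
    return rng, azimuth, elevation


def wrap_angle(a):
    """Wrap an angle in radians to the interval [-pi, pi)."""
    return (a + math.pi) % (2.0 * math.pi) - math.pi


def residual(imu_pred, mvrm_meas):
    """Single-anchor residual: IMU prediction minus MVRM measurement."""
    d_i, az_i, el_i = imu_pred
    d_m, az_m, el_m = mvrm_meas
    return [d_i - d_m, wrap_angle(az_i - az_m), wrap_angle(el_i - el_m)]


def stacked_residual(user_enu, anchors_enu, mvrm_meas_list):
    """Stack the single-anchor residuals of every camera that sees the user."""
    z = []
    for anchor, meas in zip(anchors_enu, mvrm_meas_list):
        z.extend(residual(predict_measurements(user_enu, anchor), meas))
    return z
```

With N visible anchors the stacked vector has 3N entries (one range and two angles per anchor), which is the measurement fed to the EKF update.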

Experiments Preparations
To test the performance of the proposed approach, several tests were conducted at Northwestern Polytechnical University. Before the tests, we carried out several preparation tasks. The test area, a rectangular path with 23 markers in an office, measures 4.2 m × 2.4 m.
First, as discussed in Section 2, offline training was conducted to build a Faster R-CNN using the Caffe library. We randomly took and labeled 200 images of a test person in the test field to create a dataset in accordance with PASCAL VOC 2007. This new dataset consisted of a test dataset with 80 images, training dataset with 72 images and validation dataset with 80 images. Next, we performed the camera calibration to extract the intrinsic and lens distortion parameters of the camera with a 90 cm × 115 cm checkerboard pattern, using the Camera Calibration Toolbox for MATLAB. Finally, we performed the initialization for each camera to determine an initial image distance.

The Fine-Tuned Faster R-CNN
As discussed in Section 3.1, the precision, recall rate and IoU are used to evaluate the accuracy of detection and localization. To test the performance of the fine-tuned Faster R-CNN, we chose 80 images to create a dataset which included 45 positive samples and 35 negative samples. The precision-recall curve is shown in Figure 5a. The precision reaches approximately 90% at a recall rate of 80%. The mAP of the fine-tuned Faster R-CNN increases to 93.5%, compared with 58.7% for the source Faster R-CNN on PASCAL VOC 2007. Figure 5b shows the cumulative distribution functions of the IoU. The average IoU values for the fine-tuned and source Faster R-CNN are approximately 0.823 and 0.819, respectively. In summary, the fine-tuned Faster R-CNN is better suited to detection on our target task and has an improved detection accuracy.

Analysis of Ranging and Angulation Model in Obstacle-Free Environments
Highly precise measurements of ranges and angles are vital for correcting the IMU's drifts and determining the positions and headings of moving users. In this section, we test the performance of the proposed ranging and angulation models, and investigate how the location deviation of the bounding box detected by Faster R-CNN affects the measuring accuracy. We took 40 images of a test person standing at each of the 23 markers with a stationary camera, and categorized the markers into four phases according to their angles and distances with respect to the camera. The four phases are detailed in Figure 6. As Figure 7a shows, the ranging accuracy of the proposed ranging model is affected by both the height errors of the detected bounding box and the measuring distance, and bounding box height errors degrade the ranging accuracy dramatically when the target is measured at a long distance. For example, for a bounding box height error of 25 pixels, the ranging error is 0.3 m when the target is less than 4 m away, but it rapidly increases to 1.2 m when the target is more than 7 m away. Figure 7b shows that the heading errors depend primarily on the bounding box's horizontal biases. Unlike the ranging accuracy, the heading accuracy does not depend on the measuring distance. For a maximum horizontal bias of the image center of 14 pixels, the heading error reaches approximately 1.1°.
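The growth of the ranging error with distance can be made explicit with a first-order sensitivity sketch. It assumes the pinhole relation r = (h_r · re / n_v) · sqrt(v² + l_c²) implied by the ranging model of Section 3 and a small bounding-box height error δn_v. The symbols δr and δn_v are introduced here for illustration only.

$$\frac{\partial r}{\partial n_v} = -\frac{r}{n_v} \quad\Rightarrow\quad \delta r \approx -\frac{r}{n_v}\,\delta n_v = -\frac{r^2}{h_r\, re\, \sqrt{v^2 + l_c^2}}\,\delta n_v$$

Under these assumptions, a fixed pixel error yields a range error that grows roughly with the square of the distance, because the bounding box itself shrinks in proportion to 1/r. This is qualitatively consistent with the trend reported for Figure 7a, where the same 25-pixel error gives 0.3 m below 4 m and 1.2 m beyond 7 m.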
Figure 8 shows the cumulative distribution functions of the errors for the four phases. Table 1 summarizes the errors in the four phases. Phase 2 (P2) shows the best ranging performance because it has the shortest measuring distances compared with the other phases; however, there are no obvious differences in heading errors among the four phases, which is consistent with the analysis of the results in Figure 7. Because of the symmetric locations of the markers in P1 and P3, both the CDFs in Figure 8 and the estimated position error data of P1 and P3 in Table 1 show similar results. Table 1 shows the estimated position errors caused by the ranging error and by the heading error in the four phases. The RMSEs of the estimated position caused by ranging and by heading differ by one order of magnitude, and the effect of heading on the position errors is smaller than that of ranging.
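The relative size of the two error contributions can be illustrated with a simple geometric sketch. It assumes that a pure range error displaces the estimated position along the line of sight by the same amount, and that a pure heading error displaces it along a chord of length 2 r sin(δθ/2) at range r. The function names and the numbers in the usage note are illustrative and do not reproduce Table 1.

```python
import math


def position_error_from_range(range_err):
    """Position displacement caused by a pure range error (same units as input)."""
    return abs(range_err)


def position_error_from_heading(r, heading_err_deg):
    """Position displacement caused by a pure heading error at range r.

    The two position estimates lie on a circle of radius r around the
    camera, so their separation is the chord 2 * r * sin(dtheta / 2).
    """
    dtheta = math.radians(abs(heading_err_deg))
    return 2.0 * r * math.sin(dtheta / 2.0)
```

For example, at a range of 5 m a heading error of 1.1° moves the position estimate by a little under 0.1 m, whereas the ranging errors discussed above are several tenths of a metre or more.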
From the above tests, we conclude that the ranging accuracy is affected by both the bounding box's height errors and the measuring distance, whereas the heading accuracy is affected only by the bounding box's horizontal biases. In addition, inaccuracy in the vision part may come from the human walking posture. For example, a pedestrian's spine may curve too far forwards or backwards, and the stride may vary considerably during walking, which introduces errors in the pedestrian's height in images. The difference between the detected height and the true height of the pedestrian caused by walking posture therefore influences the distance values to some extent. In this paper, we neglect this inaccuracy.

Analysis of Ranging and Angulation Model in Obstacle Environments
In general, indoor objects, such as furniture and pedestrians, occasionally block cameras from viewing the moving users. In this section, we investigate the measuring performance of the proposed MVRM method when stationary obstacles or pedestrians block the camera from observing the moving users in an indoor environment. Two cases are considered below, which analyze the effect of obstruction by various stationary obstacles and by a pedestrian on the bounding box's height and horizontal biases at the same measuring distance. Figure 9a shows the common obstacles used in the test field, which primarily cause bounding box height errors by blocking the lower half of the user's body. Figure 9b shows a scenario in which a pedestrian partially blocks the camera's view of the user. This scenario causes both bounding box height errors and horizontal biases. In this test, the camera is placed perpendicularly at a height of 1.2 m and a distance of 5.1 m from the target. Every obstacle is located at the same position in front of the target.
Figure 10a shows that greater errors in the bounding box's height tend to induce greater ranging errors at the same measuring distance, which is consistent with the analysis of the results in Figure 7a. Generally, the overlap area of the target with an obstacle depends on the position and attitude of the camera relative to the target and the obstacle. In our test conditions, obstruction by the armchair yields the least overlap with the lower half of the body, which leads to the minimum ranging RMSE of 0.1 m. The garbage can blockage scenario is second, followed by the stool blockage scenario, and obstruction by the cabinet leads to the greatest RMSE of 1.5 m.
Figure 10b shows a similar result: greater horizontal biases of the bounding box most likely induce greater heading errors, which is consistent with Figure 7b. The minimum heading RMSE, 0.1°, is caused by occlusion by the narrowest obstacle, the garbage can, whereas the maximum, 0.47°, is caused by occlusion by the widest obstacle, the armchair.
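A short sketch shows why a horizontal bias of the bounding box maps directly to a heading error. It assumes the heading is the angle of the box center off the optical axis under a pinhole model; the focal length, principal point, and pixel biases are hypothetical, and the function name is illustrative rather than part of the MVRM method.

```python
import math


def bearing_from_pixel_offset(focal_px, bbox_center_x_px, principal_x_px):
    """Horizontal angle of the target off the optical axis, in radians."""
    return math.atan2(bbox_center_x_px - principal_x_px, focal_px)


# Hypothetical values: f = 1000 px, principal point at x = 640 px.
f_px, cx_px = 1000.0, 640.0
true_center_px = 700.0
reference = bearing_from_pixel_offset(f_px, true_center_px, cx_px)

# A wider occluder shifts the detected box center further sideways.
for bias_px in (0.0, 5.0, 10.0, 20.0):
    est = bearing_from_pixel_offset(f_px, true_center_px + bias_px, cx_px)
    print(f"bias {bias_px:4.0f} px: heading error "
          f"{math.degrees(est - reference):.2f} deg")
```

In this assumed model the angle depends only on the horizontal pixel offset and the focal length, not on the box height, which matches the observation that heading accuracy is governed by the horizontal bias.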
In view of the above tests, the stationary obstacles used in these tests have a greater influence on ranging accuracy than on heading accuracy. The detection failure rates caused by the cabinet and the armchair in the blockage scenarios are 90% and 72%, respectively, as shown in Table 2.

Case 2: Pedestrian Blockage
In this section, we investigate the potential effect on ranging and heading accuracy of obstruction by a pedestrian unidentified by our vision system. The test results under five overlap ratios of the target and the pedestrian are presented in Figure 11. As shown in Figure 11a,b, greater overlap ratios generally create greater bounding box height errors and horizontal biases in the image, both of which decrease the accuracy of ranging and heading. In general, when the overlap ratio approaches 50%, the ranging errors are greater than 1 m, while the heading errors are slightly greater than 1°. In terms of detection failure rates, unlike the results in Case 1, the detection ability is insensitive to blocking by a pedestrian.
From the results in Table 2 and the above tests, the blockage cases decrease the accuracy of ranging more significantly than that of heading in our tests. Specifically, the effect of a pedestrian on the heading errors is greater than that of stationary obstacles. However, the possibility of detection failure caused by stationary obstacles is much higher than in the pedestrian case.

Positioning Results and Analysis
To further evaluate the positioning ability of the proposed integrated IMU/MVRM approach, a test with a single user and one external camera was carried out in an environment with obstacles. The setup for this experiment is shown in Figure 12. The tag includes one IMU (MPU9250) and one UWB module (DWM1000), and a UWB positioning system with three anchors was deployed in the test field for comparison. The methods run on a laptop with an Intel Core i5-3230M CPU operating at 2.4 GHz and 4 GB of RAM. To time-synchronize the IMU readings with the camera frames, we processed each image frame of the video file individually and tagged it with a millisecond-level time stamp using OpenCV. The IMU data are logged with millisecond-level precision, and the IMU readings are interpolated in time to match the camera frame times.
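The interpolation step used for time synchronization can be sketched as follows. This is a minimal illustration assuming linear interpolation of a scalar IMU channel on strictly increasing millisecond time stamps; the function name and the sample values are hypothetical, and the actual processing chain may differ.

```python
from bisect import bisect_left


def interpolate_at(timestamps_ms, values, query_ms):
    """Linearly interpolate a scalar series at query_ms, clamped at both ends.

    Assumes timestamps_ms is strictly increasing and as long as values.
    """
    if query_ms <= timestamps_ms[0]:
        return values[0]
    if query_ms >= timestamps_ms[-1]:
        return values[-1]
    hi = bisect_left(timestamps_ms, query_ms)  # first sample at or after query
    lo = hi - 1
    t0, t1 = timestamps_ms[lo], timestamps_ms[hi]
    w = (query_ms - t0) / (t1 - t0)
    return values[lo] + w * (values[hi] - values[lo])


# Hypothetical IMU log (time in ms, one gyroscope channel) and frame times.
imu_t_ms = [0, 10, 20, 30]
gyro_z = [0.0, 0.1, 0.3, 0.2]
frame_t_ms = [5, 25]

synced = [interpolate_at(imu_t_ms, gyro_z, t) for t in frame_t_ms]
print(synced)
```

Angular quantities that wrap around, such as an absolute heading, would need unwrapping before this kind of interpolation; rates and accelerations do not.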
The test person carried the tag and walked along the rectangular path at normal speed. A two-dimensional local coordinate system was built in the test field. The initial position of the test person is (6.6, 1.2), the external camera is at (−0.6, 2.4), and every obstacle is located at (2.2, 2.4), near the 10th marker. Figures 13-17 show the horizontal positioning trajectories obtained with MVRM, IMU/MVRM, and UWB in the five scenarios. Table 3 summarizes the positioning errors.
As observed in Figure 13, the position estimates of the MVRM method show a number of strong outliers occurring continually in blockage phase (a) because visual measurements (ranges and headings) are missing. In blockage phase (b), the high armchair again blocks the camera from acquiring visual measurements, which causes a large number of strong outliers in the positioning trajectory. For the fused IMU/MVRM solution, the position estimates predicted by the IMU during blockage phases (a) and (b) effectively reduce or remove the errors in the positioning trajectory. For UWB, the positioning results are unsatisfactory in dense multipath environments, such as corners.
The positioning results of the MVRM method in Figure 14a are similar to those in scenario 1(a); however, the slightly narrower cabinet causes only a small number of visual measurements to be missing over a shorter period, so the outliers in the positioning trajectory are smaller than in scenario 1(a). In Figure 14b, the shorter cabinet rarely interferes with person detection in the images; however, it introduces bounding box height errors. These visual errors are reduced by the fused IMU/MVRM solution, and improved positioning results are achieved. The UWB results are similar to those in scenario 1.
As shown in Figure 15, for blockage phase (a), the outliers result from the visual measurements acquired by the MVRM method. These measurements depart from the normal range because the detected bounding boxes are shortened by the stool. Nevertheless, the fused IMU/MVRM solution substantially reduces or removes the position estimation errors. In phase (b), the test person is fully visible; therefore, the two solutions give similar results on the east side of the trajectories. For UWB, apart from the similarities to scenarios 1 and 2, there are greater positioning errors in the north part of the trajectory because of the multipath effect from a row of metal cabinets on the north side of the test field.
Figure 16 shows results similar to those of scenario 3. The garbage can has the smallest dimensions; therefore, the MVRM solution yields the smallest position estimation errors among all blockage scenarios, with only a slightly less smooth trajectory in phase (a) than the fused IMU/MVRM solution. The UWB results are similar to those in scenario 3.
In Figure 17, for (a) and (b), the MVRM solution shows greater positioning errors due to the lack of visual measurements caused by the pedestrian during these two blockage periods. Similarly, the fused IMU/MVRM solution significantly decreases the positioning errors by integrating the position estimates predicted by the IMU during the blockage periods. The UWB results are also similar to those in scenario 3. Specifically, the UWB solution performs poorly in phase (a) because of the effect of the pedestrian.
In summary, the pure MVRM solution is more sensitive to both stationary obstacles and pedestrians, whereas the robust fused IMU/MVRM solution, which integrates the IMU's predictions with the visual measurements, can effectively and significantly decrease the positioning errors over short periods. Indoor objects commonly lead to multipath effects. In dense multipath environments, such as corners, the UWB solution performs poorly compared with the proposed IMU/MVRM solution. Therefore, the fused IMU/MVRM solution is more applicable to dense multipath scenarios where obstacles may partially block the view of the cameras.
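How IMU prediction bridges a blockage period can be sketched with a deliberately simplified filter. The sketch below is not the filter used in this work: it assumes a position-only state, an IMU-derived displacement per step, noise-free range and bearing measurements, and hypothetical noise covariances. Only the camera position (−0.6, 2.4) and the start position (6.6, 1.2) are taken from the test setup. When detection fails, the measurement is `None` and only the prediction is applied.

```python
import math


def mat_mul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def mat_add(a, b):
    return [[a[i][j] + b[i][j] for j in range(2)] for i in range(2)]


def transpose(a):
    return [[a[j][i] for j in range(2)] for i in range(2)]


def inverse_2x2(a):
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [[a[1][1] / det, -a[0][1] / det], [-a[1][0] / det, a[0][0] / det]]


def wrap_angle(a):
    return math.atan2(math.sin(a), math.cos(a))


def predict(x, P, imu_step, Q):
    """Propagate the position with an IMU-derived displacement; P grows by Q."""
    return [x[0] + imu_step[0], x[1] + imu_step[1]], mat_add(P, Q)


def update(x, P, z, camera, R):
    """Correct with a camera (range, bearing) pair; skip if detection failed."""
    if z is None:
        return x, P
    dx, dy = x[0] - camera[0], x[1] - camera[1]
    r = math.hypot(dx, dy)
    H = [[dx / r, dy / r], [-dy / r ** 2, dx / r ** 2]]  # measurement Jacobian
    innov = [z[0] - r, wrap_angle(z[1] - math.atan2(dy, dx))]
    S = mat_add(mat_mul(mat_mul(H, P), transpose(H)), R)
    K = mat_mul(mat_mul(P, transpose(H)), inverse_2x2(S))
    x_new = [x[i] + K[i][0] * innov[0] + K[i][1] * innov[1] for i in range(2)]
    KH = mat_mul(K, H)
    I_KH = [[(1.0 if i == j else 0.0) - KH[i][j] for j in range(2)]
            for i in range(2)]
    return x_new, mat_mul(I_KH, P)


def run_demo():
    camera = (-0.6, 2.4)   # camera position from the test setup
    truth = [6.6, 1.2]     # initial user position from the test setup
    x, P = [6.4, 1.3], [[0.25, 0.0], [0.0, 0.25]]  # deliberately offset start
    Q = [[0.01, 0.0], [0.0, 0.01]]                         # hypothetical
    R = [[0.1 ** 2, 0.0], [0.0, math.radians(1.0) ** 2]]   # hypothetical
    step = (-0.1, 0.0)     # hypothetical per-step displacement from the IMU
    for k in range(20):
        truth = [truth[0] + step[0], truth[1] + step[1]]
        x, P = predict(x, P, step, Q)
        blocked = 8 <= k < 13  # simulated detection failure behind an obstacle
        z = None if blocked else (
            math.hypot(truth[0] - camera[0], truth[1] - camera[1]),
            math.atan2(truth[1] - camera[1], truth[0] - camera[0]),
        )
        x, P = update(x, P, z, camera, R)
    return math.hypot(x[0] - truth[0], x[1] - truth[1])


print(f"final position error: {run_demo():.4f} m")
```

During the simulated blockage the covariance keeps growing by Q at every step, so the first visual measurement after the obstacle receives a large gain. With real IMU drift, this is why the prediction can bridge only short blockage periods.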

Conclusions and Future Work
A novel fusion of IMU data with visual measurements acquired by cameras is presented to determine robust and accurate poses of the test person for indoor positioning. The visual measurements, including ranges and angles, are obtained by the proposed MVRM method with the assistance of a fine-tuned Faster R-CNN, which is used to detect and locate the target in images captured by the cameras. We developed an extended Kalman filter that integrates the IMU data with the ranges and angles to obtain more robust and accurate estimates. We designed several tests to evaluate the performance. The experimental results show that the ranging accuracy is affected by both the bounding box's height errors and the measuring distance, whereas the heading accuracy is affected solely by the bounding box's horizontal biases. The blockage cases decrease the accuracy of ranging more significantly than that of heading in our tests. The positioning experiments for a single user with one camera were carried out in five indoor scenarios. The robust fused IMU/MVRM solution effectively and significantly decreases the positioning errors and performs better in dense multipath scenarios than the pure MVRM and UWB solutions.
We think that the proposed approach can be applied in the near future to location-based applications in buildings equipped with IP cameras. To implement it in real applications, developing faster and more accurate detection in more complicated indoor environments is a critical task, and acquiring more accurate poses of the installed cameras can further enhance the performance of the proposed approach. In our proposed framework, the position of each camera is independent. In future work, we will consider a multi-camera model with known relative poses to improve the results.

Figure 1. The concept of the proposed inertial measurement unit (IMU)/monocular vision relatively measuring (MVRM) integrated system.

Figure 2. Schematic view of fine-tuning Faster R-CNN (Region Convolutional Neural Network).

d
are distances from image of a user to the center of the film measured in pixels in horizontal and vertical directions, respectively.

l
of user i respectively. i r h is user i's height. In this initial process for determining the image distance of a camera, we can commonly obtain an image of multiple users captured by the camera. With knowledge of the initial range from each user to the camera, we can get a group of image distance values. With use of i W, the mean value of this group is regarded as the final image distance for this camera.
is considered to be a set of cameras which can observe user i u at current time t .Each

Figure 5. The performance of the fine-tuned Faster R-CNN. (a) The Precision-Recall Curve; (b) The cumulative probability distribution (CDF) of IoU.

Figure 8.The cumulative distribution functions of errors for four phases.(a) CDF of ranging errors; (b) CDF of angulation errors.

Figure 9. Indoor obstacle environments. (a) Stationary obstacles (Case 1): an armchair, a cabinet, a stool, and a garbage can; (b) pedestrian blockage (Case 2).

Figure 10. The effect of obstruction by various stationary obstacles. (a) Ranging error; (b) angulation error.

Figure 11. The effect of obstruction under five overlap ratios of target and the pedestrian. (a) Ranging error; (b) angulation error.

Figure 12. The setup for the experiment in the test field. (a) The carried tag; (b) the setup for the test indoor positioning system.

Figure 13. The horizontal trajectory in the armchair scenario. (a) The armchair blocked the camera's view of the test person at the 10th marker, and the person was not detected; (b) the armchair blocked the camera's view of the test person near the 22nd marker, and the person was not detected.

Figure 14. The horizontal trajectory in the cabinet scenario. (a) The cabinet blocked the camera's view of the test person at the 10th marker, and the person was not detected; (b) the cabinet blocked a small part of the camera's view of the test person near the 21st marker; the person was detected, but with image height errors.

Figure 15. The horizontal trajectory in the stool scenario. (a) The stool blocked a small part of the camera's view of the test person at the 10th marker; the person was detected, but with image height errors; (b) the stool did not block the camera's view of the test person near the 21st marker, and the person was detected precisely.

Figure 16. The horizontal trajectory in the garbage can scenario. (a) The garbage can blocked a small part of the camera's view of the test person at the 10th marker; the person was detected, but with small image height errors; (b) the garbage can did not block the camera's view of the test person near the 21st marker, and the person was detected precisely.

Figure 17. The horizontal trajectory in the pedestrian scenario. (a) The pedestrian blocked the camera's view of the test person at the 10th marker, and the person was not detected; (b) the pedestrian blocked a small part of the camera's view of the test person near the 21st marker; the person was detected, but with larger image height errors.

Table 1. The estimated position error caused by ranging and heading error in four phases.

Table 2. The ranging and heading error and detection failure rates in obstacle environments.

Table 3. The horizontal positioning errors.