The Performance Analysis of INS/GNSS/V ‐ SLAM Integration Scheme Using Smartphone Sensors for Land Vehicle Navigation Applications in GNSS ‐ Challenging Environments

: Modern smartphones contain embedded global navigation satellite systems (GNSSs), inertial measurement units (IMUs), cameras, and other sensors which are capable of providing user position, velocity, and attitude. However, it is difficult to utilize the actual navigation performance capabilities of smartphones due to the low ‐ cost and disparate sensors, software technologies adopted by manufacturers, and the significant influence of environmental conditions. In this study, we proposed a scheme that integrated sensor data from smartphone IMUs, GNSS chipsets, and cameras using an extended Kalman filter (EKF) to enhance the navigation performance. The visual data from the camera was preprocessed using oriented FAST (Features from accelerated segment test) and rotated BRIEF (Binary robust independent elementary features) ‐ simultaneous localization and mapping (ORB ‐ SLAM), rescaled by applying GNSS measurements, and converted to velocity data before being utilized to update the integration filter. In order to verify the performance of the integrated system, field test data was collected in a downtown area of Tainan City, Taiwan. Experimental results indicated that visual data contributed significantly to improving the accuracy of the navigation performance, demonstrating improvements of 43.0% and 51.3% in position and velocity, respectively. It was verified that the proposed integrated system, which used data from smartphone sensors, was efficient in terms of increasing navigation accuracy in GNSS ‐ challenging environments.


Introduction
Navigation involves the determination of the time-varying position, velocity, and attitude of a moving body. Currently, navigation technologies are expanding at a phenomenal rate, especially within the civilian sector, due to the increased availability of devices such as vehicle/personal navigators, smartphones, tablets, and other handheld devices [1]. Modern mobile phones are powerful processing devices with a host of onboard technologies of interest to navigation system designers. Due to the increased accuracy and decreased cost of microelectromechanical sensors (MEMS) [2], smartphones have been outfitted with various embedded sensors, such as global navigation satellite systems (GNSSs), cameras, gyroscopes, accelerometers, and magnetometers. These improve the smartphone's usefulness to navigation systems.
In hostile environments, such as in narrow street canyons, tunnels, or underground parking garages, GNSS signals are prone to multipath errors or to being blocked, causing the data to be insufficient for the purposes of vehicular navigation. The accelerometers and gyroscopes within a smartphone can be used to provide a relative navigation solution. In terms of integration, an extended Kalman filter (EKF) is the most popular choice as the main estimation algorithm for data fusion. GNSS is used to correct for the systematic errors associated with the inertial sensors, which are composed of biases, scale factors, and drifts, whereas the inertial navigation system (INS) is used as a bridge for seamless navigation when the GNSS experiences an outage [3,4]. Although the integration of INS/GNSS can compensate for the main drawbacks associated with low-cost INS, the navigation errors can diverge rapidly in the absence of effective GNSS data. In recent years, the use of vision-based sensors such as cameras in navigation has strongly increased because of the numerous associated advantages. Their energy consumption is low, they can be manufactured in very small sizes, and their cost is being dramatically reduced. A key problem, which is also the relative characteristic of visual methods that directly impacts the success of these applications, is the estimation of location and distance through the use of the information gathered by these visual sensors [5]. However, the solution presented by visual sensors suffers from accumulated drift over time. Hence, there is a need to fuse visual sensors with other sensors for long-term navigation.
In the literature, the integration of data from smartphone sensors for navigation purposes has been investigated by many researchers. Walter et al. [6] proposed a system for car navigation involving the fusion of data from internal smartphone sensors (e.g., gyroscope data) with car sensor data (e.g., speed information) to support navigation using the Global Positioning System (GPS). Although the results showed that the system was able to maintain higher positioning accuracy during GPS dropouts, this system utilized only the gyroscope among the many other available smartphone sensors for the strapdown algorithm. Niu et al. [7] utilized data from the inertial sensor of an iPhone 4 to aid in GPS positioning for car navigation. The results from road test showed that the MEMS inertial measurement unit (IMU) could be used to enhance the GPS positioning. Additionally, the nonholonomic constraints (NHC) [8] could be used to improve the navigation performance significantly. However, in this research, data from another GPS receiver was used in the testing instead of the smartphone GPS data. Gikas and Perakis [9] presented a rigorous performance evaluation of smartphone GNSS/IMU sensors for intelligent transportation systems (ITS) applications. The research involved performing vehicle cruising and maneuvering tests with two kinds of smartphones. However, this study addressed only the individual performance of each sensor, and did not address an integrated solution. Al-Hamad and El-Sheimy [10] focused on using smartphones as part of a mobile mapping system (MMS). In this work, the solutions from the smartphone's MEMS-based IMU and low-cost GPS receiver were used as the initial values for the exterior orientation parameters (EOPs) of each image. However, this work mainly focused on a mapping application instead of trajectory evaluation. Zeng et al. [11] performed a study investigating seamless pedestrian navigation using smartphone sensors. Many smartphone sensors were utilized in this work. However, some of them were used for the purpose of indoor and outdoor environmental identification. Despite the importance of vehicular navigation, this work focused only on pedestrian navigation based predominantly on the pedestrian dead reckoning (PDR) algorithm. Yan et al. [12] proposed a novel three-dimensional (3D) passive vision-aided PDR to continuously track user movements on different floors of a building by integrating the results of inertial navigation and the faster R-CNN (Regions with convolutional neural network features)-based real-time pedestrian detection. The test results demonstrated that the proposed method achieved the levels of 3D accuracy required by several emergency services. However, the visual detection used in this work was applied to floor plan detection. Wu et al. [13] presented a square-root inverse sliding window filter (SR-ISWF) for a vision-aided inertial navigation systems (VINS) and conducted experiments using a commercial-grade Samsung S4 mobile phone. Although the experimental results demonstrated the same level of estimation accuracy as state-of-the-art VINS algorithms, this study only involved testing the system within an indoor environment over a short travel distance. Speroni et al. [14] proposed a low-cost visual simultaneous location and mapping (V-SLAM) prototype using webcams and an Android smartphone for outdoor applications. The authors performed an experiment using their system in a car and provided preliminary results. However, this study involved the use of stereoscopic cameras instead of smartphone cameras. Hence, time synchronization was arguably a problem. Moreover, this work lacked a reference system to enable the generation of a reference trajectory (ground truth trajectory) for accuracy evaluation.
To the best of our knowledge, there are no studies that have rigorously investigated fusing data from IMU, GNSS chipset, and camera from smartphone for long-term land vehicle navigation applications in GNSS-challenging environments. The three main contributions of this work can be summed up as follows. First, a self-developed smartphone app was used for experimental data collection to avoid the problem of time synchronization between sensors. Second, the problem of scale in monocular solutions was solved at the beginning using measurements from GNSS chipset before this information was utilized to update the integrated filter. Finally, the proposed scheme was tested in different GNSS-challenging environments together with a high-precision reference system to verify its performance.
The remainder of the paper is organized as follows. Section 2 presents the proposed integration scheme, EKF-based data fusion, and overview of V-SLAM. Section 3 describes the field test and data processing strategy. The experimental results and discussion are then presented in Section 4. Finally, some concluding remarks and a brief outline for future research are presented in Section 5.

Integration Scheme
The proposed integration scheme is illustrated in Figure 1. Taking the stability of the structure and the computational cost of multi-sensor fusion into consideration, the system made use of a loosely coupled scheme of INS/GNSS/V-SLAM integration. The IMU measurements were processed using an INS mechanization to provide a navigation solution which accounted for position, velocity, and attitude with an unbounded error in the navigation frame. The GNSS provided the absolute position as the predominant measurement update. For the camera, oriented FAST (Features from accelerated segment test) and rotated BRIEF (Binary robust independent elementary features)-simultaneous localization and mapping (ORB-SLAM) [15] were chosen as the V-SLAM method to process the data and provide measurement updates. Finally, these data were fused together using an EKF estimation algorithm.

Model Design
The core system dynamics estimation model incorporated the inertial navigation equations in a local level frame [16,17].
where is the time derivative of the position in the local level frame; is the time derivative of velocity; is the time derivative of attitude; is a specific force, sensed by the accelerometer; Ω is the angular velocity of the body frame relative to the inertial frame, parameterized in the body frame; is the transformation matrix used to convert the body frame to the local level frame; and Ω and Ω are the rotation rate of the Earth with respect to the inertial frame and the rotation rate of the local level frame with respect to the Earth, respectively. The matrix converts the linear velocity values to angular changes in latitude and longitude: where and are radii of curvatures in the meridian and prime vertical of the reference ellipsoid, respectively [18]. and ℎ are latitude and altitude, respectively. In practicality, to avoid a singularity, a quaternion is used to express the attitude. During operation, the measured acceleration and rotation rate suffer from biases, scale factors, and signal noise.
To determine and compensate for those errors in IMU measurement, the bias, scale factor, and noise of the accelerometer and gyroscope have to be included in the system dynamics estimation model.
The state model can be expressed in the following form [19]: Its discrete-time equation is as follows: where  is the state vector which includes the components of position, velocity, attitude error, biases, and the scale factors of the gyroscope and accelerometer, respectively.
is the state continuous time transition matrix,  is the state discrete-time transition matrix from epoch k to k + 1, and is the system noise [20,21]. For GNSS measurements, the solution available from a smartphone includes only position. The EKF measurement model can be expressed as follows: (7) where and are positional vectors provided by INS and GNSS, respectively; 0 0 is the mapping matrix describing the relationship between the measurement and the state vector; and is the noise in the GNSS positional measurements.

Estimation using EKF
EKF equations are divided into two groups: Time predictions and measurement updates [22]. The time prediction equations are responsible for the forward time transition of the current epoch (k) state to the following epoch (k + 1) state. The time prediction equations are as follows: where (ˆ) denotes estimation,  is the model transition matrix, is the estimated covariance matrix of the system state, is the system noise matrix, (−) denotes the estimated values after prediction, and (+) denotes the estimated values after update.
The measurement update equations incorporate new measurements into the state estimate equations and are given as: where is the Kalman gain and is the measurement covariance matrix. All noise terms were considered to be white sequences with known covariance. The components and sequence of the computational steps are shown in Figure 2.

V-SLAM
SLAM is the process of using sensors to build a map of an unknown environment and simultaneously estimating the sensors' pose in this environment [23]. SLAM has been an attractive research topic over the last few decades in the fields of robotics and computer vision [15,24], and has recently expanded into other application areas, such as vehicle navigation. There are many kinds of sensors that can be used in association with the SLAM technique, such as cameras, laser sensors, or sonar sensors [25,26]. Cameras are inherently cheap, low in energy consumption, and provide rich information about the surrounding environment. Therefore, many researchers focus on using cameras in SLAM applications [27,28]. Because cameras are used as the main sensor, the technique is termed V-SLAM. The flowchart describing the V-SLAM technique is shown in Figure 3. V-SLAM methods are mainly classified into two categories based on the number of cameras used: Monocular and stereo methods [29]. Each of them has its own advantages and disadvantages in terms of cost, computational complexity, and flexibility. Stereo V-SLAM, which involves two cameras, is an expensive and complex system. Although stereo systems can retrieve image scale and depth information to provide 3D vision, they can be degraded to a monocular system when the distance to the scene is much larger than the distance between the two cameras [30]. In contrast, monocular V-SLAM utilizes only one camera. It is thus a low-cost, lightweight system, and is suitable for applications involving small robots and smartphone cameras. Despite the advantages of a monocular system, it suffers greatly from scale problems. Therefore, it is necessary to integrate monocular V-SLAM along with other systems to achieve absolute scale.
There are two different methodologies that are predominant in V-SLAM: Filtering methods and keyframe methods [30]. Filtering methods involve the camera poses and all the map features being utilized in processing at every frame. However, keyframe methods involve the camera pose being estimated using a subset of the entire map without the need to update the map features at every processed frame. Processing time is thus significantly reduced for keyframe methods, which makes them potentially applicable to real-time applications. Moreover, the filtering methods make use of the linearization process, so they suffer from accumulation errors. Overall, keyframe methods possess certain advantages over filtering methods in terms of processing time and accuracy.
ORB-SLAM, a currently available keyframe-based V-SLAM system built on excellent algorithms developed in recent years using ORB features [31] for all SLAM tasks, allows for more robust vision based navigation. ORB-SLAM is based on three main tasks: Tracking, mapping, and loop closing. Tracking is used to localize the camera with every frame and determine when to insert a new keyframe, while the mapping processes new keyframes and performs local bundle adjustment (BA) to achieve an optimal reconstruction of the surroundings of the camera pose. Finally, loop closing searches are performed with every new keyframe [32]. The complete ORB-SLAM system includes three subsystems for the purpose of processing data from monocular, stereo, and RGB-D cameras. In this present study, a monocular ORB-SLAM system was selected as the main algorithm for processing the image sequence from the smartphone cameras to provide up-to-scale translation expressed in the initial camera frame.
As mentioned above, the translation provided by the monocular V-SLAM is of an undefined scale. In order to use this translation to update the integration system, the following transformation processes were implemented. First, the translation was rescaled with the assistance of the GNSS measurement to determine metric distances. We utilized a short distance, which was initially extracted using GNSS measurements in an open sky to calculate the scale coefficient. This scale coefficient was then used to scale the monocular V-SLAM by applying the following equations: where ‖ ‖ is the distance derived from the GNSS position and ‖ ‖ is the corresponding translation from the V-SLAM.
, and , are translations between consecutive images up-to-scale and absolute scale, respectively. This translation vector in the camera frame was then converted into velocity in the body frame using the following equation: ∆ where and ∆ are translation of the absolute scale and time period between consecutive images, respectively.
is the rotation matrix from the camera frame to body frame, determined using a calibration process.
The velocity in the body frame was formed in the measurement model together with the NHC [8]. is the velocity provided by the monocular V-SLAM in the x direction of the body frame and 0 0 is the design matrix describing the relationship between the measurement and the state vector.

Tested Smartphones and Reference Navigation System
Two ordinary smartphone devices were tested in this study, namely the Sony Xperia Z3 and Lenovo Tango. Although these devices are not the latest models released by their respective manufacturers, they are still very recent and used extensively worldwide. Both devices comprised a camera, GNSS chipset, and MEMS IMU that featured three-axial accelerometer and gyroscope sensors. Additionally, their processing power enabled seamless operation, minimizing the effect of errors induced by components other than the sensors being accessed. A summary of the technical characteristics of both devices is shown in Table 1. The navigation system used to provide the reference trajectory ground truth during the two tests was an integration of GNSS and IMU. The GNSS receiver that was used in the tests was the commercial device PwrPak7D-E1 from NovAtel. The IMU that was used in the tests was the navigation-grade iNAV-RQH (iMAR). The equipment layout and technical characteristics of these reference sensors are shown in Figure 4 and Table 2, respectively.

. Smartphone Setup and Field Experiment
Field tests were conducted using a sport utility vehicle (SUV) as the test platform. The smartphones were seated in cradles that were firmly attached to front windshield of the SUV as shown in Figure 5. Moreover, these smartphones were roughly aligned with the vehicle frame. The tests were conducted in an urban area of Tainan City, Taiwan. The area included many tall trees, buildings, and an underground parking lot. We started with a static period in open sky and then drove for distances of approximately 3 km and 2 km, corresponding to tests one and two, respectively. The total time taken for the first test was approximately 14 min, and that taken for the second test was approximately 12 min.

Reference Trajectory Establishment
In land vehicle navigation applications, several processing strategies are potentially useable for estimating the reference trajectory. Depending on the integration level, the loosely or tightly coupled INS/GNSS techniques can be used [9]. The loosely coupled technique, which uses post-processed kinematic GNSS data, may provide a better solution than the tightly coupled technique. However, it requires more than four satellites to be observed, a requirement that is difficult to satisfy in GNSSchallenging environments. In this present study, the tightly coupled strategy was applied together with a smoother using the commercial INS/GNSS processing software package Inertial Explore (IE) 8.90. There were lever-arms between the reference system and the smartphones, so the "true" position and velocity were transformed to the center of the smartphones when exporting the reference solutions from the IE software. Figure 6 shows the number of visible satellites (red circles) and the position dilution of precision (PDOP) [17,33] values (blue asterisks) in the first test. It can be seen from the figure that the system sometimes had access to only a few satellites or even suffered a complete loss of satellites visible to the receiver (e.g., the time interval from 400 s to 500 s). This led to large PDOP values. As shown in Figure 7a, due to these bad conditions, the standard deviation (std) of the GNSS-only solution became larger. Figure 7b shows the std of the INS/GNSS integration solution incorporating smoothing. As mentioned in Section 3.1.1, a navigation-grade IMU was used together with a tightly coupled integration strategy to generate the reference trajectory. Therefore, the std in the north, east, and up directions remained smaller than 0.2 m during the first test. The PDOP and number of satellites visible to the receiver in the second test are shown in Figure  8 and the std of the GNSS-only solution is shown in Figure 9a. It is clear from Figures 8 and 9a that the test was run for 650 s. However, from time 550 s and on, SUV was driven into an underground parking lot. This led to only a few satellites being visible to the receiver and no solution from the GNSS-only approach. Figure 9b shows the std of the solution from the tightly coupled INS/GNSS with smoother. It can be seen from the figure that until time 550 s, the std in the three directions was smaller than 0.2 m. During the time that the SUV was in the underground parking lot, the std of the solution in three directions rose to 0.6 m.

Smartphone Data Recording and Preprocessing
Many free smartphone apps have recently become available for sensor data collection. However, it is difficult to find an app that can simultaneously record data from the IMU, GNSS chipset, and camera. In order to overcome this problem, some of our laboratory staff members developed an Android smartphone app which enabled the simultaneous collection of data from these sensors. The raw data included a video file and a mixed file incorporating the GNSS position and IMU (accelerometer and gyroscope) observations. This observable file was parsed by sensor types. They were then resampled at the desired sampling rate. A sampling rate of 50 Hz was chosen for the purposes of this study. Finally, accelerometer and gyroscope data were synchronized together as IMU data. The flowchart of this process is shown in Figure 10.

Test Results Using Sony Xperia Z3
In the first experiment, the Sony Xperia Z3 was used to test our proposed method. The testing was carried out along the streets of Tainan City, Taiwan. The test trajectory is shown in Figure 11. In the performance evaluation, the solutions provided by V-SLAM, the integration of the INS/GNSS integration aided by the NHC, and the INS/GNSS/V-SLAM integration using an original EKF were analyzed. Figure 11 shows the horizontal positions of these solutions on the map and the vertical positions on the plot below the map. It is clear from the figure that the solution from V-SLAM accumulated drift over time. In comparison to the reference trajectory (the actual trajectory), the positional root-mean-square error (RMSE) of this solution in the north, east, and up directions was 39.6 m, 31.0 m, and 5.5 m, respectively. The 3D RMSE was 50.6 m.  Figure 12 shows graphs obtained by comparing the test solutions and the reference, showing the navigation errors for position, velocity, and attitude in the first, second, and third subplots, respectively. All plots show the error in the north, east, and up directions. The numerical statistics in terms of RMSE are shown in Table 3. As can be seen from the results, the 3D positional error of the INS/GNSS integration solution using EKF was 11.7 m, which was 76.9% higher than that of the V-SLAM solution. For the INS/GNSS/V-SLAM integration, the positional accuracy improved to 78.5% than that of the V-SLAM solution. However, in comparison to the navigation solution from INS/GNSS integration, the result from INS/GNSS/V-SLAM improved only slightly in terms of position, velocity, and heading errors with improvements of only 6.6%, 2.5%, and 9.5%, respectively.   In this experiment, the GNSS environment was not very challenging. The role of V-SLAM in improving the INS/GNSS system was thus not very clear. In order to demonstrate the efficiency of the proposed integration scheme, we planned two GNSS signal outages of 30 s in different scenarios. Figure 13 shows the trajectories of the various solutions with the planned GNSS signal outages. As    Table 4 summarizes the navigation RMSE errors in both scenarios. As the graphs show, the positional accumulation errors in the north direction of the INS/GNSS solution were approximately 180 m and 70 m during the GNSS signal outages on the side and the corner of the rectangular route, respectively, whereas these accumulated errors in the INS/GNSS/V-SLAM solution were approximately 35 m and 15 m, respectively. This clearly demonstrates that the overall navigation solution from the INS/GNSS/V-SLAM integration method was much better than the solution provided by the INS/GNSS integration method. The improvement in accuracy was reflected in a reduction in position and velocity error of 40.0% and 20.3%, respectively. In terms of heading, due to the NHC, the improvement was only 4.5%.

Test Results Using Lenovo Tango
In the second experiment, the Lenovo Tango was used to test our proposed method. The test field incorporated a street environment and an underground parking lot environment. The test trajectory is shown in Figure 15. In the performance evaluation, the solutions provided by V-SLAM, the INS/GNSS integration aided by the NHC, and the INS/GNSS/V-SLAM integration using EKF were analyzed. Figure 15a shows the horizontal positions of these solutions on the map and the vertical positions on a plot below the map. The magenta rectangle highlights the underground parking lot area. Figure 15b shows an enlarged detail of the trajectories in the parking lot area. It can be seen from the map that the V-SLAM trajectory, shown in cyan, accumulated over time. The trajectory of the INS/GNSS solution with NHC, shown in red, was uncontrollable when the SUV drove through the underground parking lot where the GNSS signal was blocked. The trajectory of the INS/GNSS/V-SLAM solution, shown in blue, was the best in terms of its comparison to the reference trajectory, shown in green.

Conclusions
This study proposed an integrated scheme incorporating INS, GNSS, and V-SLAM using data from smartphone sensors to overcome the weaknesses of each sensor working individually under environmental effects and the problem of unbounded error. Two ordinary smartphones incorporating a self-developed app for collecting data were used in the testing. Moreover, a system comprised of a navigation-grade IMU and a differential GNSS receiver from PwrPak7D-E1 was used to generate reference solutions.
The test results indicated that, in the GNSS-friendly environment, the solution from the INS/GNSS/V-SLAM integration using an EKF was slightly improved compared to the conventional INS/GNSS integration. However, in the GNSS-challenging environment, this improvement was significant, with up to 43.0% and 51.3% improvements in position and velocity, respectively, which verified the possibility of a robust smartphone navigation solution.
Since this work proposed the integration scheme and tested the algorithm as post-processing, for future work, this framework should be integrated into smartphone environment and run directly in this platform. In addition, nonlinear, non-Gaussian filtering will be applied to the integrated system to overcome the limitations of EKF in terms of error modelling and highly dynamic movement.