# Sensor-Fusion for Smartphone Location Tracking Using Hybrid Multimodal Deep Neural Networks


## Abstract


## 1. Introduction

- We model traditional methods of location estimation from sensor data with end-to-end machine learning approaches. The chosen networks avoid the need for hand-picked features; instead, the data-processing models are learned automatically from data.
- We deploy a recurrent neural network, the Long Short-Term Memory (LSTM), to model the sequential estimation of Pedestrian Dead Reckoning (PDR): starting from a known point, each subsequent location in the sequence is estimated from sensor observations.
- We perform modality fusion through a hybrid neural network that applies a different network structure to each sensing modality and fuses their representations via additional top layers. We use a recurrent neural network for the inertial sensor data and a fully-connected network for the WiFi modality.
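
To make the sequential estimation in the second contribution concrete, the sketch below shows a classical dead-reckoning update, the kind of hand-built pipeline that the LSTM is meant to learn end to end. It is an illustration only, not the method of this paper: the function name `dead_reckon`, the per-step inputs (step length in metres, heading in radians) and the convention that a heading of 0 points along the X axis are all assumptions.

```python
import math


def dead_reckon(start, steps):
    """Return the (x, y) positions reached from a known start point.

    `steps` is an iterable of (step_length_m, heading_rad) pairs. Both
    quantities are hypothetical inputs chosen for illustration; heading 0
    is assumed to point along the X axis.
    """
    x, y = start
    path = [(x, y)]
    for length, heading in steps:
        # Each new position depends only on the previous one plus the
        # displacement observed during the current step.
        x += length * math.cos(heading)
        y += length * math.sin(heading)
        path.append((x, y))
    return path


# Two 0.7 m steps: one along X, then one along Y.
path = dead_reckon((0.0, 0.0), [(0.7, 0.0), (0.7, math.pi / 2)])
```

Because every position builds on the previous one, an error in any step is carried into all later estimates.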

## 2. Motivation and Related Work

#### 2.1. Pedestrian Dead Reckoning (PDR) on Inertial Sensors

#### 2.2. WiFi Fingerprinting on Received Signal Strength

#### 2.3. Multimodal Approaches

## 3. Methodology

#### 3.1. Pedestrian Dead Reckoning with Recurrent Neural Networks

#### 3.2. WiFi Fingerprinting with Deep Neural Networks

#### 3.3. Sensor Fusion via Multimodal Deep Neural Networks

## 4. Data

#### 4.1. Data Collection

#### 4.2. Data Preprocessing

#### 4.2.1. Inertial Sensor Data

#### 4.2.2. WiFi Fingerprint Data

#### 4.3. Sensor Fusion Dataset Alignment

## 5. Model Configuration

#### 5.1. Recurrent Neural Network on Inertial Sensors

#### 5.1.1. Time Window

#### 5.1.2. Overlapping Ratio

#### 5.1.3. Data Compression

#### 5.2. Overview of RNN for Inertial Sensors

#### 5.3. Deep Neural Network on WiFi Fingerprints

#### 5.3.1. Model Structure

#### 5.3.2. Model Tuning

#### 5.4. Overview of DNN for WiFi Fingerprints

#### 5.5. Multimodal Deep Neural Networks on Sensor Fusion

#### 5.5.1. MDNN Integration

**Element-wise Fusion:** The MDNN with the element-wise fusion architecture is shown in Table 6. The LSTM and DNN sub-networks each produce a 128-dimensional modality-specific hidden output. The fusion layer combines these two outputs either by concatenation ($128\times 2$) or by element-wise multiplication (128). The fused representation then passes through fully-connected joint layers of size 128 and 64 and is finally regressed to a two-value prediction (${X}_{est}$, ${Y}_{est}$).
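
The two fusion options differ only in how the pair of 128-dimensional vectors is combined, which the following minimal sketch illustrates on plain Python lists. The function name `fuse`, the `mode` argument and the constant example values are assumptions for illustration; in the actual network these vectors are the learned hidden outputs of the LSTM and DNN sub-networks.

```python
def fuse(lstm_out, dnn_out, mode="concat"):
    """Combine two equal-length hidden vectors (hypothetical helper).

    mode="concat"   -> vector of length 2 * n (128 x 2 in the text)
    mode="multiply" -> element-wise product of length n (128 in the text)
    """
    if len(lstm_out) != len(dnn_out):
        raise ValueError("both hidden outputs must have the same length")
    if mode == "concat":
        return list(lstm_out) + list(dnn_out)
    if mode == "multiply":
        return [a * b for a, b in zip(lstm_out, dnn_out)]
    raise ValueError(f"unknown fusion mode: {mode}")


# Placeholder activations standing in for the two 128-dimensional outputs.
lstm_hidden = [0.5] * 128
dnn_hidden = [2.0] * 128

concatenated = fuse(lstm_hidden, dnn_hidden, mode="concat")
multiplied = fuse(lstm_hidden, dnn_hidden, mode="multiply")
```

Concatenation keeps both representations intact and doubles the input width of the first joint layer, whereas multiplication keeps the width at 128 and requires the two outputs to have the same dimension.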

**Residual Connection Fusion:** Table 7 shows the MDNN with a residual connection architecture. Unlike the element-wise fusion MDNN, this variant emphasises the WiFi features, whose representation is smaller than that of the time-sequential inertial sensor data. We add a residual connection layer that carries the hidden output (128) of the penultimate WiFi fully-connected layer to the joint layer, where it is fused with the outputs of the LSTM (128) and of the last DNN FC layer (together $128\times 2$). This yields a $128\times 3$ representation for the sensor-fusion component, which performs the final location estimation.
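
As a rough illustration of the resulting shape, the sketch below concatenates the three 128-dimensional vectors in the order listed for the fusion layer in Table 7 (LSTM, FC Layer.3, residual layer). The function name `residual_fusion` and the constant placeholder values are assumptions; the real inputs are learned activations.

```python
def residual_fusion(lstm_out, wifi_last_fc_out, wifi_penultimate_fc_out):
    """Concatenate the LSTM output, the last WiFi FC output and the
    skip-connected penultimate WiFi FC output (hypothetical helper)."""
    return (
        list(lstm_out)
        + list(wifi_last_fc_out)
        + list(wifi_penultimate_fc_out)
    )


# Placeholder activations, each 128-dimensional as in the text.
lstm_hidden = [0.1] * 128
wifi_fc3 = [0.2] * 128
wifi_fc2 = [0.3] * 128  # reaches the joint layer via the residual connection

joint_input = residual_fusion(lstm_hidden, wifi_fc3, wifi_fc2)
```

With this layout, two thirds of the joint input comes from the WiFi branch, which is how the residual connection gives the WiFi modality more weight.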

**Late Fusion:** Table 8 presents the MDNN architecture with the late fusion strategy. It combines the outputs of two separate models, the LSTM and the DNN, whose predictions are the lat-long coordinate estimates (${X}_{Sensor}$, ${Y}_{Sensor}$) and (${X}_{WiFi}$, ${Y}_{WiFi}$), respectively. These estimates form a four-dimensional input feature vector, from which the top layers estimate the final latitude and longitude (${X}_{Fusion}$, ${Y}_{Fusion}$).
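
A minimal sketch of the late fusion data flow follows. The four-dimensional input vector matches the description above, but the single linear layer and its fixed averaging weights are assumptions used only to make the example runnable; in the network, the top layers and their weights are learned during training.

```python
def late_fusion_input(sensor_xy, wifi_xy):
    """Stack the two per-modality predictions into one 4-D feature vector."""
    return [sensor_xy[0], sensor_xy[1], wifi_xy[0], wifi_xy[1]]


def linear_head(features, weights, bias):
    """Apply one linear layer: output[i] = sum_j weights[i][j] * features[j] + bias[i]."""
    return [
        sum(w * f for w, f in zip(row, features)) + b
        for row, b in zip(weights, bias)
    ]


# Hypothetical per-modality predictions (X, Y).
features = late_fusion_input(sensor_xy=(10.0, 4.0), wifi_xy=(12.0, 6.0))

# Hypothetical weights that average the two modalities; a trained model
# would learn these values instead.
weights = [
    [0.5, 0.0, 0.5, 0.0],  # X_Fusion from X_Sensor and X_WiFi
    [0.0, 0.5, 0.0, 0.5],  # Y_Fusion from Y_Sensor and Y_WiFi
]
fused_xy = linear_head(features, weights, bias=[0.0, 0.0])
```

Unlike the two feature-level strategies, the fusion network here only sees coordinates, so it cannot exploit the intermediate representations of either modality.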

#### 5.5.2. MDNN Implementation

#### 5.6. Overview of MDNN on Sensor Fusion

## 6. Evaluation

#### 6.1. MM-Loc Comparative Study

#### 6.2. MM-Loc Visualisation

## 7. Discussion

## 8. Conclusions

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Conflicts of Interest


**Figure 1.** Gyroscope drift disturbing the direction calculation for samples from a straight-line walk.

**Figure 2.** Histogram of Received Signal Strength for 5 Access Points observed at a single location, showing the complexity of WiFi fingerprints, with various distributions (binomial and skewed).

**Figure 3.** Histograms of one AP signal strength over a small time window (1 h), captured at different moments over a day and a week. They show the fluctuating nature of the WiFi signal, which makes it hard to model with simple function fitting.

**Figure 6.** Unrolled chain of the LSTM neural network, which uses the same block for each new observation together with the internal state of the previous time-step.

**Figure 8.** MM-Loc: our proposed multimodal deep neural network architecture for indoor localization, with two parallel single-modality feature extractors and a joint network structure to merge latent features at the top.

**Figure 9.** Screenshots of the Android application used to collect multisensory data. (**a**) Location Input Interface; (**b**) Sensors Control Options.

**Figure 10.** Comparison between the original data and the downsampled data. The loss of information is minimal across two time windows, and the downsampled data still follows the trend in the signal for the walking activity. (**a**) Sensor values over a time window of 1 s; (**b**) Sensor values over a wider time window of 7 s.

**Figure 11.** LSTM model performance with different time window settings. (**a**) Validation Set CDF; (**b**) Test Set CDF.

**Figure 12.** LSTM model performance with different overlapping ratios on the validation and test sets. (**a**) Validation Set CDF; (**b**) Test Set CDF.

**Figure 13.** LSTM model performance with different data compression strategies. (**a**) Validation Set CDF; (**b**) Test Set CDF.

**Figure 14.** Overall comparison of model performance with different calibration strategies. (**a**) Overall Validation Set CDF; (**b**) Overall Test Set CDF.

**Figure 15.** WiFi-based DNN model performance with different network structures. (**a**) Validation Set CDF; (**b**) Test Set CDF.

**Figure 16.** Comparison of MDNN model performance with different fusion architectures. (**a**) Comparison CDF on Scenario A; (**b**) Comparison CDF on Scenario B.

**Figure 17.** MM-Loc prediction CDF compared with modality-specific models. (**a**) MM-Loc Performance CDF on Scenario A; (**b**) MM-Loc Performance CDF on Scenario B.

**Figure 18.** Comparison of MM-Loc model performance at different WiFi sampling rates. (**a**) MM-Loc Performance CDF on Scenario A; (**b**) MM-Loc Performance CDF on Scenario B.

**Figure 19.** MM-Loc Prediction Box Plot. (**a**) MM-Loc Box Plot on Scenario A; (**b**) MM-Loc Box Plot on Scenario B.

**Figure 20.** Comparison of the performance CDF of MM-Loc and SOTA models. (**a**) SOTA Performance CDF on Scenario A; (**b**) SOTA Performance CDF on Scenario B.

**Figure 21.** MM-Loc Footpath Visualisation. (**a**) MM-Loc Predicted Footpath on Scenario A; (**b**) MM-Loc Predicted Footpath on Scenario B.

Datasets | Inertial Samples | WiFi Samples | Access Points | Time Duration |
---|---|---|---|---|
Scenario A | 24,450 | 25,541 | 102 | 407 min |
Scenario B | 29,836 | 8,390 | 750 | 497 min |

**Table 2.** The raw WiFi fingerprint data format. APs missing from the current scan are indicated with Null. Each WiFi scan has an assigned collection location (X, Y) as its label.

Time | AP${}_{0}$ | AP${}_{1}$ | … | AP${}_{\mathit{n}}$ | X | Y |
---|---|---|---|---|---|---|
${t}_{0}$ | Null | $-85$ | … | Null | ${x}_{0}$ | ${y}_{0}$ |
${t}_{1}$ | $-92$ | Null | … | Null | ${x}_{1}$ | ${y}_{1}$ |
${t}_{2}$ | Null | Null | … | Null | ${x}_{2}$ | ${y}_{2}$ |
${T}^{\prime}$ | $-92$ | $-85$ | … | Null | ${X}^{\prime}$ | ${Y}^{\prime}$ |

**Table 3.** Cross-sensor data format, showing the normalised and filtered/interpolated values. A one-second time window holds 1000 samples from each sensor, one WiFi scan, and the ground-truth location. APs missing from the WiFi scan are indicated with a value of −100.

Time | Accelerometer | Gyroscope | Magnetometer | AP${}_{0}$ | AP${}_{1}$ | … | AP${}_{\mathit{n}}$ | X | Y |
---|---|---|---|---|---|---|---|---|---|
${T}_{0}$ | ${a}_{0}$∼${a}_{999}$ | ${g}_{0}$∼${g}_{999}$ | ${m}_{0}$∼${m}_{999}$ | −100 | −85 | … | −100 | ${X}_{0}$ | ${Y}_{0}$ |
${T}_{1}$ | ${a}_{999}$∼${a}_{1999}$ | ${g}_{999}$∼${g}_{1999}$ | ${m}_{999}$∼${m}_{1999}$ | −100 | −100 | … | −100 | ${X}_{1}$ | ${Y}_{1}$ |
${T}_{2}$ | ${a}_{1999}$∼${a}_{2999}$ | ${g}_{1999}$∼${g}_{2999}$ | ${m}_{1999}$∼${m}_{2999}$ | −70 | −100 | … | −65 | ${X}_{2}$ | ${Y}_{2}$ |
… | … | … | … | … | … | … | … | … | … |
${T}_{n}$ | ${a}_{n}$∼${a}_{n+999}$ | ${g}_{n}$∼${g}_{n+999}$ | ${m}_{n}$∼${m}_{n+999}$ | −100 | −100 | … | −100 | ${X}_{n}$ | ${Y}_{n}$ |
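
The step from the raw scan format (Null for a missing AP) to the fixed-length fingerprint of Table 3 (−100 for a missing AP) can be sketched as follows. The dictionary representation of a scan, the AP identifiers and the function name `to_fingerprint` are assumptions for illustration; only the −100 placeholder comes from the table caption.

```python
MISSING_RSS = -100  # placeholder for APs absent from a scan (Table 3)


def to_fingerprint(scan, ap_ids, missing=MISSING_RSS):
    """Turn one WiFi scan into a fixed-length vector ordered by `ap_ids`.

    `scan` maps an AP identifier to its signal strength, or to None when
    the AP was not observed. APs that are absent from the mapping are
    treated the same way as None entries.
    """
    vector = []
    for ap in ap_ids:
        rss = scan.get(ap)
        vector.append(missing if rss is None else rss)
    return vector


# Hypothetical AP identifiers and one raw scan.
ap_ids = ["AP0", "AP1", "AP2"]
raw_scan = {"AP0": None, "AP1": -85}

fingerprint = to_fingerprint(raw_scan, ap_ids)
```

A fixed ordering of `ap_ids` across all scans keeps each input position of the WiFi network tied to the same access point.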

Parameter | Settings |
---|---|
Epoch | 100 |
Batch Size | 100 |
Hidden Units | 128 |
LSTM Layer | 1 Layer |
Learning Rate | 0.005 |
Learning Rules | RMSprop |

Parameter | Settings |
---|---|
Input Size | AP Number |
Epoch | 100 |
Batch Size | 100 |
Hidden Units | 128 |
DNN Layer | 3 Layers |
Dropout Rate | 0.5 |
Learning Rate | 0.001 |
Learning Rules | RMSprop |

**Table 6.** MDNN with element-wise fusion architecture.

Layers | Output Shape |
---|---|
LSTM Layer (sensor) | (Batch Size, 128) |
FC Layer.1 (WiFi) | (Batch Size, 128) |
Dropout Layer.1 (WiFi) | (Batch Size, 128) |
FC Layer.2 (WiFi) | (Batch Size, 128) |
Dropout Layer.2 (WiFi) | (Batch Size, 128) |
FC Layer.3 (WiFi) | (Batch Size, 128) |
Fusion Layer (concat or multiply) | (Batch Size, 128 × 2 or 128 × 1) |
FC Layer.4 (joint) | (Batch Size, 128) |
FC Layer.5 (joint) | (Batch Size, 64) |
FC Layer.6 (joint) | (Batch Size, 2) |

Batch Size | Learning Rate | Learning Rules | Dropout Rate |
---|---|---|---|
100 | 0.001 | RMSprop | 0.5 |

**Table 7.** MDNN with residual connection architecture.

Layers | Output Shape |
---|---|
LSTM Layer (sensor) | (Batch Size, 128) |
FC Layer.1 (WiFi) | (Batch Size, 128) |
Dropout Layer.1 (WiFi) | (Batch Size, 128) |
FC Layer.2 (WiFi) | (Batch Size, 128) |
Dropout Layer.2 (WiFi) | (Batch Size, 128) |
FC Layer.3 (WiFi) | (Batch Size, 128) |
Residual Layer (FC Layer.2 WiFi) | (Batch Size, 128) |
Fusion Layer (LSTM, FC Layer.3, RL) | (Batch Size, 128 × 3) |
FC Layer.4 (joint) | (Batch Size, 128) |
FC Layer.5 (joint) | (Batch Size, 64) |
FC Layer.6 (joint) | (Batch Size, 2) |

Batch Size | Learning Rate | Learning Rules | Dropout Rate |
---|---|---|---|
100 | 0.001 | RMSprop | 0.5 |

**Table 8.** MDNN architecture with the late fusion strategy.

Layers | Output Shape |
---|---|
LSTM Layer (sensor) | (Batch Size, 128) |
Sensor Regression Output.1 (${X}_{1}$, ${Y}_{1}$) | (Batch Size, 2) |
FC Layer.1 (WiFi) | (Batch Size, 128) |
Dropout Layer.1 (WiFi) | (Batch Size, 128) |
FC Layer.2 (WiFi) | (Batch Size, 128) |
Dropout Layer.2 (WiFi) | (Batch Size, 64) |
FC Layer.3 (WiFi) | (Batch Size, 32) |
WiFi Regression Output.2 (${X}_{2}$, ${Y}_{2}$) | (Batch Size, 2) |
Fusion Network (input: ${X}_{1}$, ${Y}_{1}$, ${X}_{2}$, ${Y}_{2}$) | (Batch Size, 2 × 2) |
Regression Output.3 (${X}_{3}$, ${Y}_{3}$) | (Batch Size, 2) |

Batch Size | Learning Rate | Learning Rules | Dropout Rate |
---|---|---|---|
100 | 0.001 | RMSprop | 0.5 |

Method | Scenario A Min | Scenario A Max | Scenario A Mean | Scenario A Std | Scenario B Min | Scenario B Max | Scenario B Mean | Scenario B Std |
---|---|---|---|---|---|---|---|---|
MM-Loc | 0.0331 m | 30.1591 m | 1.5530 m | 1.7790 m | 0.0031 m | 20.2881 m | 1.8859 m | 1.9679 m |
P-MIMO | 0.0866 m | 30.3014 m | 2.4021 m | 2.9929 m | 0.0059 m | 21.3284 m | 2.4363 m | 2.0115 m |
HDLM | 0.0337 m | 30.9279 m | 1.9946 m | 2.5993 m | 0.0031 m | 18.4450 m | 2.1128 m | 2.0198 m |
GRU-CNN | 0.0348 m | 33.1648 m | 1.5446 m | 1.7886 m | 0.0048 m | 22.7855 m | 2.1867 m | 1.8698 m |
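
Summary statistics of this kind can be derived from per-sample localisation errors, as the sketch below shows. It assumes that the error is the Euclidean distance in metres between the predicted and the ground-truth (X, Y) position and that the standard deviation is the population form; neither choice is confirmed by the table, and the function name `error_stats` and the sample coordinates are hypothetical.

```python
import math
import statistics


def error_stats(predicted, ground_truth):
    """Return min, max, mean and std of the per-sample Euclidean errors.

    Both arguments are sequences of (x, y) positions in metres. The
    population standard deviation is used here as an assumption.
    """
    errors = [math.dist(p, t) for p, t in zip(predicted, ground_truth)]
    return {
        "min": min(errors),
        "max": max(errors),
        "mean": statistics.fmean(errors),
        "std": statistics.pstdev(errors),
    }


# Hypothetical predictions against hypothetical ground truth.
stats = error_stats(
    predicted=[(0.0, 0.0), (3.0, 4.0)],
    ground_truth=[(0.0, 0.0), (0.0, 0.0)],
)
```

For the two sample points the errors are 0 m and 5 m, so the minimum is 0, the maximum is 5, and both the mean and the population standard deviation are 2.5.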

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Wei, X.; Wei, Z.; Radu, V. Sensor-Fusion for Smartphone Location Tracking Using Hybrid Multimodal Deep Neural Networks. *Sensors* **2021**, *21*, 7488.
https://doi.org/10.3390/s21227488
