Smartphone-Based Traveled Distance Estimation Using Individual Walking Patterns for Indoor Localization

We introduce a novel method for indoor localization with the user's own smartphone by learning personalized walking patterns outdoors. Most smartphone and pedestrian dead reckoning (PDR)-based indoor localization studies have combined step count and stride length to estimate the distance traveled, via generalized formulas based on manually designed features of the measured sensory signal. In contrast, we apply a different approach that learns the velocity of the pedestrian from a segmented signal frame with our proposed hybrid multiscale convolutional and recurrent neural network model, and we estimate the distance traveled from the estimated velocity and the elapsed time. We measured the inertial sensor and Global Positioning System (GPS) position at synchronized times while walking outdoors with a reliable GPS fix, and we assigned the velocity, obtained from the displacement between the current position and the prior position, as the label of the corresponding signal frame. Our proposed real-time and automatic dataset construction method dramatically reduces the cost and significantly increases the efficiency of constructing a dataset. Moreover, our proposed deep learning model can be naturally applied to all kinds of time-series sensory signal processing. The performance was evaluated on an Android application (app) that imported the trained model and parameters. Our proposed method achieved traveled distance errors between 1.5% and 2.4% in indoor experiments.


Introduction
Recently, the proportion of daily life spent in indoor spaces, such as homes, offices, and shopping mall complexes, has gradually increased, and services that have normally been provided outdoors, such as car navigation systems, are gradually expanding to indoor positioning and navigation. Although indoor positioning is difficult to conduct using the Global Positioning System (GPS) because GPS signals cannot be received indoors, technical studies are now being conducted to provide various indoor location-based services [1]. In particular, smartphones with various sensors, such as accelerometers, gyroscopes, magnetometers, and barometers, have become popular, and network infrastructures (e.g., Wi-Fi) have expanded to a point at which location-based services are achievable for indoor spaces [2].
Indoor localization techniques can be classified into various categories according to the required location accuracy, available service area, specific application, and available sensors, and these can be implemented using propagation model-based, fingerprint, and dead reckoning (DR) methods [2]. A propagation model method estimates the user's location based on the received signal strength.
(2) Estimation of the average moving speed for segmented IMU sensor signal frames. In conventional PDR, the traveled distance is estimated by calculating the step count and stride length using handcrafted features of the IMU signal. In contrast, we propose a scheme that estimates the traveled distance by calculating the average moving speed and duration of the signal frame. (3) Combination of multiscaling for automatic pre-processing at different time scales, CNNs for nonlinear feature extraction, and RNNs for capturing temporal information along the walking patterns. Multiscaling captures the overall trend of the input signal at different time scales, several stacked convolutional operations create feature vectors from the input signal automatically, and a recurrent neural network model deals with the sequence problem. (4) An end-to-end time series classification model without any handcrafted feature extraction or any signal- or application-specific analysis. Many of the existing methods are time-consuming and labor-intensive in feature extraction and classification, and they are limited to their domain-specific application. In contrast, our proposed framework is a general-purpose approach, and it can easily be applied to many kinds of time-series signal classification, regression, and forecasting.
A PDR method estimates the user's location by estimating the step count, stride length, and direction [7]. In the case in which the IMU is attached to the ankle, the zero velocity update (ZUPT) and zero angular rate update (ZARU) methods based on the walking model can be used to estimate the number of steps, stride length, and direction, and can compensate for the drift error that occurs in the IMU sensor [8]. However, it is difficult for the PDR method with a smartphone to reflect the walking models or patterns of a pedestrian (e.g., holding the device in hand, swinging it, or keeping it in a pocket or bag) [9]. In addition, most of the literature suggests a generalized formula that reflects a feature of the IMU signal from several subjects in a given experimental walking and running scenario [10][11][12].
In contrast with other studies, our method does not require any manually designed features, and we explore a learning-based approach for indoor localization. Figure 1 shows an overview of our proposed system, which is composed of three phases. The first phase involves data collection and labeling. The main motivation of our study was to estimate the traveled distance regardless of the pedestrian's walking type, smartphone orientation, position, and shaking. In order to assign a consistent label to the variable sensor signal patterns under these conditions, we applied the average velocity over a unit time (e.g., 1 s) based on the corrected GPS trajectory. In other words, the velocity of the pedestrian is calculated in the GPS-available area, and the dataset is constructed by mapping the velocity as a label to the IMU sensory signal generated at the same time step. A hybrid convolutional-recurrent neural network (CNN-RNN)-based supervised learning phase is next. The sensory inputs are automatically transformed by multiple convolutional and subsampling layers into nonlinear feature vectors that are utilized as the input of recurrent neural networks (RNNs), such as long short-term memory (LSTM) and gated recurrent unit (GRU) networks. The GRU [13] model recurrently learns the relationship between the input feature vector and the velocity label, considering temporal sequence information. Finally, based on the learned deep learning model parameters, the average velocity and cumulative traveled distance per time step are estimated from the raw sensory signal when moving indoors. Figure 1. An overview of our indoor localization system using the proposed outdoor walking pattern learning scheme. (a) Inertial measurement unit (IMU) signal and velocity mapping using the Global Positioning System (GPS)-based user trajectory by walking pattern outdoors. The collected dataset is used as training data for the following proposed deep learning model; (b) Hybrid CNN-RNN deep learning model.
The nonlinear features of the training data are automatically extracted using multiple convolutional layers with activation and pooling layers; (c) The RNN uses the extracted feature vector at each time step as an input, and learns using the signal patterns and labeled velocities; (d) The pedestrian's average velocity at each time step and the cumulative traveled distance are estimated using the learned deep learning model parameters when moving indoors.
The remainder of the paper is organized as follows. Section 2 provides the related work and motivation of this work. Section 3 describes the proposed methods for personalized pedestrian traveled distance estimation for indoor localization using a smartphone. Section 4 then presents the indoor localization performance of our proposed model. Finally, Section 5 summarizes and concludes our work.

Indoor Localization
There are a number of existing methods for indoor localization, including radio frequency, magnetic field signal, IMU sensor, and hybrid approaches [2]. Smartphone-based methods can employ any type of indoor localization technique using built-in sensors (e.g., accelerometers, gyroscopes, magnetometers, camera, wireless local area network (WLAN), and Bluetooth low energy (BLE)), except where additional infrastructure and hardware components are required, such as ultra-wideband (UWB) [14] and WiFi channel state information (CSI) [15]. Therefore, we describe the need for and motivation of our study through the existing literature.
Indoor localization using smartphones has been widely studied based on wireless signals as a mainstream approach, since most buildings have WLAN access points (AP), and smartphones have WLAN connectivity and BLE as well. Using received signal strength (RSS) measurements, the user location can be estimated by multilateration over distances converted from RSS between the smartphone and several reference points [16], or by comparison with a pre-collected radio map [17]. However, the accuracy of the RSS-based approach can be affected by the device orientation, attenuation due to human bodies or walls, the density of reference points, and changes in the indoor environment (e.g., removed or added reference points).
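The RSS-to-distance conversion used in such multilateration is commonly based on a log-distance path-loss model. The sketch below is a minimal illustration of that standard model, not a method from the cited works; the reference RSS, reference distance, and path-loss exponent are placeholder assumptions.

```python
import math

def rss_to_distance(rss_dbm, rss_at_d0=-40.0, d0=1.0, n=2.5):
    """Invert the log-distance path-loss model:
    RSS(d) = RSS(d0) - 10 * n * log10(d / d0).
    rss_at_d0 (dBm at reference distance d0) and the path-loss
    exponent n are environment-dependent assumptions."""
    return d0 * 10 ** ((rss_at_d0 - rss_dbm) / (10.0 * n))
```

With these assumed parameters, a measured RSS of −65 dBm maps to a distance of 10 m; in practice the exponent n must be calibrated per environment, which is one reason RSS-based accuracy degrades with walls and bodies.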
Using the distortion in the Earth's magnetic field caused by metal structures in an indoor environment, a signal map that is more stable over time than radio frequency signals can be constructed at fingerprinting points [18]. In general, the fingerprinting approach builds a database of magnetic field distributions through offline collection and mapping, and then compares the signal measured at an arbitrary point with the database to find the corresponding location [19]. Therefore, the accuracy depends significantly on the resolution of the measurement points. Even though additional infrastructure, such as APs and beacons, is not required, the collection and maintenance of a high-quality signal map is a time-consuming and labor-intensive process.
The major advantage of a PDR approach is that it can operate standalone, requiring neither additional infrastructure nor any pre-collected database, because the current position of the pedestrian is estimated from the moving distance and direction relative to the previous position using an IMU sensor. Conventional PDR methods consist of three phases: (i) step counting, (ii) step length estimation, and (iii) walking direction estimation [20]. The number of steps is determined by the threshold, peak detection, or zero-crossing method applied to the periodic change of the accelerometer signal [21,22]. The step length is computed using generalized formulas related to the magnitude of the accelerometer and gyroscope signals, the frequency of the step, and the height of the pedestrian [11,12,23]. The walking direction estimation uses gyroscopes and magnetometers to detect the relative direction over a short time and the smartphone's orientation in a global frame [24]. Considering the low accuracy of MEMS sensors, the accumulated sensor error, and various pedestrian movements, indoor positioning is still a challenging problem, particularly when using smartphones.
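The step counting phase described above can be illustrated with simple peak detection on the accelerometer magnitude. This is a hedged sketch of the generic technique, not the detector from any cited paper; the threshold and minimum peak spacing are illustrative assumptions.

```python
def count_steps(acc, threshold=11.0, min_gap=20):
    """Count steps by detecting local peaks in the accelerometer
    magnitude signal (m/s^2) that exceed `threshold` (roughly gravity
    plus a margin) and are at least `min_gap` samples apart.
    Both parameter values are assumptions for illustration."""
    steps, last_peak = 0, -min_gap
    for i in range(1, len(acc) - 1):
        if (acc[i] > threshold and acc[i] >= acc[i - 1]
                and acc[i] > acc[i + 1] and i - last_peak >= min_gap):
            steps += 1
            last_peak = i
    return steps
```

The `min_gap` debounce plays the same role as the zero-crossing or threshold heuristics cited above: it rejects spurious double peaks within a single stride.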
Even though each of the technical approaches described above has its strong and weak points, there is no single method that can be applied to all indoor spaces and services. Naturally, several studies have blended two or more techniques. A typical combination involves Wi-Fi or BLE with PDR, which can improve the accuracy by combining the less accurate global absolute position obtained from the network infrastructure with the locally more accurate relative position from PDR [2,9]. In Bao et al. [25], map matching is presented as a localization solution. To compensate for the accumulating and drifting error in the PDR, map environment information, such as the length of a corridor and the positions of corners, was used (e.g., the length of the corridor is utilized to calibrate the average step length).
Although complementary techniques and calibration methods can increase the accuracy, the individual localization technologies themselves must be improved for indoor localization systems to become precise and practical enough for real life. Therefore, we propose a deep learning-based traveled distance estimation method that remains accurate even when the movement of the pedestrian and the placement of the smartphone change dynamically.

Deep Learning for Time-Series Sensory Signal Analysis
Deep learning is a branch of machine learning based on a set of algorithms that attempt to model high-level abstractions in data using multiple nonlinear transformations. Recently, deep learning has achieved state-of-the-art results in areas such as image classification, video recognition, and natural language processing [26]. In particular, CNNs and RNNs are excellent for automatic nonlinear feature extraction and for modeling the temporal behavior of a time sequence, respectively [27]. In time-series signal analysis, many studies have already presented remarkable performance with CNN, RNN, and hybrid models compared to traditional signal processing based on handcrafted feature extraction.
In Kiranyaz et al. [28], a CNN model was applied to anomaly detection with a one-dimensional (1D) electrocardiogram signal. Similarly, the goal of the approaches in Ince et al. [29] and Abdeljaber et al. [30] was to autonomously learn useful features to monitor mechanical conditions, such as motor and bearing faults, based on a raw 1D signal. CNN and RNN models are widely used in human activity recognition using multiple sensor signals from smartphones as well [31,32]. In addition, DeepSense, which integrates CNN and RNN, automatically extracts features and relationships in local, global, and temporal cases, and its experiments have shown that the feature vectors extracted by the CNN can be effective when fed into the RNN as inputs [33].
Deep learning for PDR-based indoor localization has been used to count the number of steps and estimate the stride length. Edel et al. [34] proposed a step recognition method based on BLSTM (bidirectional long short-term memory)-RNNs with three-axis acceleration from a smartphone. In Xing et al. [35], five manually designed parameters from an IMU attached to the top of the foot, closely related to the linear walking model, are fed into an artificial neural network (ANN) as input, and the desired network output is the stride length obtained through regression. Hannink et al. [36,37] proposed CNN-based stride length estimation and gait parameter extraction methods using publicly available datasets collected through an IMU attached to the shoe.
Recently, the number and types of sensors used in the internet of things (IoT), cyber physical systems (CPS), and wireless sensor networks (WSN) have been increasing, and the fields that utilize sensor data have broadened as well. In such a situation, deep learning is a good tool for exploiting and analyzing collected data without the need for special domain knowledge or signal processing techniques. Therefore, we considered a general time-series sensory signal classification framework that accommodates various fields of application, and designed our proposed model, a multiscale hybrid convolutional and recurrent neural network, accordingly. However, the deep learning approach does not always straightforwardly produce an optimal result compared to manually designed processing architectures.

The Proposed System Design
The main contribution of our study is to learn the outdoor walking pattern and then use it in an indoor environment. To enable this approach, the moving speed is calculated by the corrected user trajectory, and an appropriate deep learning model should be designed for feature extraction and training of the measured IMU sensory signal with labeled speed.

Automatic Dataset Collection Using the Corrected Pedestrian Trajectory with Kalman Filter
Our proposed method requires calculating the velocity per unit time using the corrected and filtered pedestrian trajectory based on raw GPS positioning. In most of the prior literature applying general purpose signal processing or deep learning models, the traveled distance is estimated by calculating the step count and stride length. However, we proposed a novel scheme to estimate the traveled distance by calculating the average moving speed and time duration using a deep learning approach with segmented IMU signal samples as the input.
One of the most important factors that can improve performance in deep learning is the collection of a large, high-quality dataset. It is quite a challenge to assign a label in real time, or automatically, to a time-series signal measured by a sensor densely attached to the body or a carried smartphone. Therefore, many studies manually assign a label to the measured sensor signal under limited experimental conditions (e.g., walking or running for a certain period with a fixed gait, smartphone placement, and orientation, and assigning a fixed stride as the label for the entire measured signal).
We proposed a method to automatically construct the dataset outdoors, where a GPS signal can be received with high reliability. However, since the GPS position contains error, we use a Kalman filter to correct it. The Kalman filter estimates the optimal state by removing the statistical noise included in a series of measurements observed over time.
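As a rough illustration of this correction step, a minimal one-dimensional Kalman filter applied per UTM axis might look as follows. This is a sketch under a simple random-walk motion model; the process and measurement noise variances are assumptions, not the paper's tuned values.

```python
class ScalarKalman:
    """Minimal 1D Kalman filter for smoothing one GPS coordinate
    (e.g., UTM easting). q and r are assumed process/measurement
    noise variances, not values from the paper."""

    def __init__(self, q=0.5, r=9.0):
        self.q, self.r = q, r
        self.x, self.p = None, 1.0

    def update(self, z):
        if self.x is None:              # initialize on the first fix
            self.x = z
            return self.x
        self.p += self.q                # predict (random-walk model)
        k = self.p / (self.p + self.r)  # Kalman gain
        self.x += k * (z - self.x)      # correct with measurement z
        self.p *= (1.0 - k)
        return self.x
```

Running one filter per axis (easting and northing) yields the corrected trajectory from which displacements are computed; larger r relative to q smooths more aggressively.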
Our proposed dataset collection scheme is shown in Figure 2. Based on the corrected and filtered pedestrian trajectory, we calculated the velocity using the displacement between the current time position and the previous time position at every measurement time, and we assigned the velocity as the label of the segmented IMU sensor signal frame. The displacement and moving speed can be computed using the following equation:

∆d_t = √((x_t − x_{t−1})² + (y_t − y_{t−1})²),  s_t = ∆d_t / ∆t  (1)

where p_t = (x_t, y_t) denotes the corrected GPS position at time t, transformed from longitude and latitude into the universal transverse mercator (UTM) coordinate system. ∆d_t and s_t are the displacement and the velocity computed using the positions at time t and t − 1, respectively. In addition, s_t is assigned as the label of the IMU sensor signal measured from t − 1 to t. The above process is repeated every second when walking outdoors where GPS is available, and the automatically generated dataset is used for learning.
Figure 2. The moving speed is obtained from the displacement between the current position at t and the previous position at t − 1, and it is assigned directly as the label of the segmented IMU signal every 1 s (i.e., the GPS data update rate is 1 Hz).
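The labeling step above can be sketched directly from the displacement relation between consecutive UTM fixes. The helper name below is illustrative, not from the paper.

```python
import math

def velocity_labels(positions, dt=1.0):
    """Given Kalman-corrected UTM positions [(x, y), ...] sampled
    every dt seconds (the GPS update rate is 1 Hz), return the
    average speed s_t between consecutive fixes, which is used as
    the label for the IMU frame measured over the same interval."""
    labels = []
    for (x0, y0), (x1, y1) in zip(positions, positions[1:]):
        d = math.hypot(x1 - x0, y1 - y0)  # displacement in meters
        labels.append(d / dt)             # average speed in m/s
    return labels
```

For example, moving from (0, 0) to (3, 4) in one second yields a 5 m displacement and hence a 5 m/s label for that one-second IMU frame.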

Multiscale and Multiple 1D-CNN for Feature Extraction
The main difference between conventional CNNs and the 1D-CNN is the use of a one-dimensional time-series signal as the input data instead of two-dimensional (2D) pixels or three-dimensional (3D) voxels. Multiple time-series data (i.e., IMU sensor data) are fed into the CNN as the input in our proposed system. The key role of the CNN is to extract the correlation of spatially and temporally adjacent signals using nonlinear kernel filters. The local features of the input data can be extracted by applying several kernel filters, and the global feature vector can be generated through several pairs of convolutional and pooling layers [26]. Our study focuses on indoor localization based on time-series sensor signal pattern learning under varying walking conditions. In such a case, an appropriate input size that reflects the temporal properties of the signal is required to extract features well. Therefore, we propose a multiscale, multiple CNN architecture to incorporate inherent feature learning over various signal durations and to overcome the limitation of using only a single input scale as in conventional CNNs. This approach can simultaneously reflect the features of input data measured over different time durations. The overall scheme of the multiscale and multiple CNN model for feature extraction is illustrated in Figure 3.
We incorporated multiscale feature learning into the CNN architecture, motivated by [38], to improve the learning efficiency and extract richer features. However, the proposed multiscale method possesses the following two characteristics compared to conventional multiscale studies. First, the size of the converted signal is always the same for each scaling factor. Second, the feature vectors extracted by the multiple CNNs are not combined into one feature map; these feature vectors are used independently. The reason for our multiscale and multiple CNN concept is to provide inputs of the same dimension to the RNN model, which will be introduced in Section 3.3.
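The coarse-graining at the heart of this multiscaling is non-overlapped window averaging. The sketch below shows only that averaging step and returns N/s points; the authors additionally keep the converted signal length fixed across scaling factors, and that resizing is not reproduced here.

```python
def multiscale(x, s):
    """Non-overlapped window averaging with scaling factor s: each
    output point is the mean of s consecutive samples, producing
    len(x) // s points that capture the coarse trend of the signal.
    Trailing samples that do not fill a whole window are dropped."""
    return [sum(x[j * s:(j + 1) * s]) / s for j in range(len(x) // s)]
```

With s = 1 the signal is unchanged; larger s acts as a moving average without overlap, which is also why the transform behaves as a low-pass filter.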
Multiscale operation is performed using the following expression:

ˢx_j = (1/s) ∑_{i=(j−1)s+1}^{js} x_i,  1 ≤ j ≤ N/s  (2)

where x = {x_1, x_2, · · · , x_N} denotes a measured one-dimensional discrete time-series signal, x_i is the sample value at time index i, s is a scaling factor, and N denotes the number of samples in each segmented signal frame. The time-series signal x is divided into non-overlapped windows of length s (i.e., the scaling factor), and the data points in each window j are consecutively averaged. We can thus construct a multi-scaled signal ˢx = {ˢx_1, · · · , ˢx_j, · · · , ˢx_{N/s}}.
Typically, CNNs consist of several pairs of convolutional and pooling layers to extract the feature map, and learning generates the desired output by optimizing learnable parameters through feedforward and backpropagation passes [26]. The objective of using the CNN architecture in our study is feature extraction.
The signals transformed by the multiscale operation are fed into the multiple CNNs as the initial input data, and the feature map of each CNN based on the multiscaled signal can be expressed as:

ˢz_j^l = ReLU( ∑_{i=1}^{M^{l−1}} 1dconv(ˢx_i^{l−1}, k_{ij}^{l−1}) + b_j^l ),  ˢy_j^l = maxpool(ˢz_j^l)  (3)

where ˢx_i^{l−1} and ˢz_j^l denote the input and output of the convolutional layer with the ReLU(x) = max(0, x) activation function, respectively; k_{ij}^{l−1} denotes the learnable kernel filter from the ith neuron in layer l − 1 to the jth neuron in layer l; and 1dconv() indicates the convolutional operation. b_j^l is the bias of the jth neuron in layer l, and M^{l−1} denotes the number of kernel filters in layer l − 1. ˢy_j^l denotes the output of the max pooling layer l, as well as the input of the next convolutional layer l + 1. Consequently, pairs of convolutional and pooling layers construct a feature vector that serves as the input of the following recurrent neural network model. Additional details of backpropagation by minimizing the cost function are available in LeCun et al. [39].
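One convolution-plus-pooling stage of this feature extractor can be sketched in NumPy as below. This is an illustrative single-channel stage under assumed parameters; the kernel and bias are placeholders, not learned values from the trained model.

```python
import numpy as np

def conv1d_relu_maxpool(x, k, b=0.0, pool=2):
    """One stage of the feature extractor: valid 1-D convolution of
    signal x with kernel k plus bias b, ReLU activation, then
    non-overlapping max pooling of width `pool` (leftover samples
    that do not fill a pooling window are dropped)."""
    n = len(x) - len(k) + 1
    z = np.array([np.dot(x[i:i + len(k)], k) + b for i in range(n)])
    z = np.maximum(z, 0.0)                         # ReLU activation
    m = len(z) // pool
    return z[:m * pool].reshape(m, pool).max(axis=1)  # max pooling
```

Stacking several such stages, one per multiscaled input, yields the fixed-length feature vectors that are fed to the GRU cells described in the next subsection.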
Moreover, the multiscale operation can be considered a moving average without overlapping windows, i.e., a low-pass filter. In other words, high-frequency components and random noise in the raw time-series sensory signal are naturally filtered out. Although it has not been verified that CNNs eliminate noise well on their own, lightweight preprocessing (such as smoothing and averaging) can naturally improve the performance [40].

Hierarchical Multiscale Recurrent Neural Networks
Our walking pattern learning model is mainly motivated by the following observations and considerations. According to studies on pedestrian movement characteristics [41,42] and our experimental experience, the stride and walking speed tend to remain stable as long as there is no change in the external environment or internal user behavior. In other words, within a relatively short period (i.e., a few seconds), the previously estimated walking speed may affect the current walking speed.
Learning the temporal representations and dependencies is one of the challenges of deep neural networks. Recurrent neural networks, such as LSTM and GRU, have been considered as promising methods to solve this problem [13]. Therefore, we designed a new scheme, which is referred to as a hierarchical multiscale recurrent neural network, that can capture temporal dependencies with different timescales using a novel update mechanism. Figure 4 presents our proposed enhanced GRU cell.
The GRU mainly consists of an update gate and a reset gate. The inputs of the GRU cell are x_t, ˢm_t, and h_{t−s}, and the outputs are h_t and h_{t+s} for time t. Both gates are computed similarly, as in the following equation:

z_t = σ(W_x^z x_t + U_h^z h_{t−s} + W_m^z ˢm_t + b^z),  r_t = σ(W_x^r x_t + U_h^r h_{t−s} + W_m^r ˢm_t + b^r)  (4)

where W_x, U_h, and W_m denote the learnable parameters that linearly combine the input vector x_t, the previous hidden state output vector h_{t−s}, and the additional multiscaled input ˢm_t, respectively; b is a bias; the superscripts z and r indicate that the corresponding parameter belongs to the update gate or the reset gate; and the activation function σ is a sigmoid. The difference from the basic GRU is that the scaling factor s determines whether the recurrent path is activated or not.
The candidate activation at the current state can be defined as follows:

h̃_t = tanh(W_x x_t + W_m ˢm_t + r_t • (U h_{t−s}) + b)  (5)

where r_t is the set of reset gates, • is element-wise multiplication, and tanh is used as the activation function. The candidate activation h̃_t is calculated from the current state W_x x_t, W_m ˢm_t, and the previous hidden state U h_{t−s}, but it depends on the reset gate r_t to activate the previous hidden state. The activation function of the reset gate is a sigmoid, σ(x) ∈ [0, 1]. If the reset gate value is 0, the previously computed state is forgotten; if it is 1, the previously computed state is maintained. The current state information is reflected regardless of the reset gate. The output h_t of the GRU at time t is computed as follows:

h_t = z_t • h_{t−s} + (1 − z_t) • h̃_t  (6)

where h_{t−s} is the previous activation, h̃_t is the candidate activation of the current state, and the update gate z_t determines how much each component is updated. The update gate also uses the sigmoid; if z_t is 0, all of the previous activation is forgotten, and only h̃_t is activated; if it is 1, the previous activation h_{t−s} is determined as the output of the GRU.
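One step of this modified GRU cell can be sketched in NumPy as below, following the gate descriptions in the text (update gate z_t, reset gate r_t, candidate activation, and the gated output). The parameter dictionary and its keys are illustrative placeholders, not trained values.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x, h_prev, m, P):
    """One step of the modified GRU cell with the extra multiscaled
    input m: z and r are the update and reset gates, `cand` is the
    candidate activation, and the output keeps h_prev in proportion
    to z (z = 1 keeps the previous activation, z = 0 uses only the
    candidate). P holds the weight matrices and biases."""
    z = sigmoid(P['Wxz'] @ x + P['Uhz'] @ h_prev + P['Wmz'] @ m + P['bz'])
    r = sigmoid(P['Wxr'] @ x + P['Uhr'] @ h_prev + P['Wmr'] @ m + P['br'])
    cand = np.tanh(P['Wx'] @ x + P['Wm'] @ m + r * (P['Uh'] @ h_prev) + P['b'])
    return z * h_prev + (1.0 - z) * cand
```

With zero-initialized parameters both gates evaluate to 0.5, so the output is simply the average of the previous hidden state and the (zero) candidate, which makes the gating behavior easy to check.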
Our proposed hybrid multiscale convolutional and recurrent neural network model, which learns and classifies the pedestrian walking velocity using a one-dimensional time-series signal measured during walking with a smartphone, is presented in Figure 5.
Figure 5. Our proposed hybrid multiscale convolutional and recurrent neural network model to train and estimate the moving speed for segmented and multiscaled sensory signals. The transformed signals with different timescales are fed into a corresponding CNN to extract the feature vector, and then each feature vector is fed into the corresponding GRU cell as an additional input, ˢm_t. Only the feature vector x_t (s = 1) is fed into the first GRU layer of the stacked RNNs as the input, and h_t is used as the input of the upper GRU layer. The recurrent activation at each GRU layer is determined by the scaling factor of the additional input ˢm_t. Finally, the probability distribution for the target moving velocity is computed by a softmax.


Experimental Setup
The dataset was automatically constructed while walking outdoors in the real world with smartphone sensors. We developed and utilized an Android application (app) that can measure and store the GPS and IMU sensor data. The main scenario for our experiment is walking in situations such as (i) handheld: holding the phone in front of the chest; (ii) swing: holding the phone in a swinging hand; and (iii) pocket: keeping the phone in a trouser (back/front) pocket. We tried to walk as naturally as in real life when collecting data, and some data recorded during calling, texting, and running were included in the dataset as well.
We used the tri-axial accelerometer in the IMU to measure the walking pattern at a 200-Hz sampling rate; the magnetometer was not used due to the magnetic distortion caused by metal structures. We used an NMEA 0183 standard parser to obtain the GPS position, and transformed WGS84 to UTM coordinates to compute in meter units once a reliable fix of the GPS signal was obtained. The raw GPS position was corrected by a Kalman filter, and the displacement and velocity were calculated by Equation (1). The velocity was then assigned as the label of the measured sensory signal at the corresponding time. For the experiment, the tri-axial acceleration signals and the GPS position were sampled at 200 Hz and 1 Hz, respectively, and stored on a Secure Digital (SD) memory card as log files; we therefore used a smartphone with an SD card slot. Each log file consists of a list of the data measured every 1 s (i.e., 200 × 3 × 4 bytes of acceleration data, 2 × 4 bytes of position data, and 1 × 4 bytes of velocity data, where the factor of 4 bytes corresponds to the float type). For instance, approximately 8.7 MB of memory space is required to store the signal as a training dataset for 1 h.
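The labeling and logging arithmetic above can be sketched as follows; this is a minimal illustration assuming Equation (1) is the planar displacement between consecutive 1-Hz UTM fixes divided by the elapsed time (the UTM coordinates are made up for the example):

```python
import math

def velocity_label(p_prev, p_curr, dt=1.0):
    """Velocity (m/s) from two UTM positions (easting, northing in metres)
    taken dt seconds apart; assigned as the label of the signal frame
    measured over the same interval."""
    dx = p_curr[0] - p_prev[0]
    dy = p_curr[1] - p_prev[1]
    return math.hypot(dx, dy) / dt

# One second of walking east at ~1.4 m/s (hypothetical coordinates):
v = velocity_label((500000.0, 4100000.0), (500001.4, 4100000.0))
print(round(v, 2))  # 1.4

# Log-file budget per second, matching the byte counts in the text:
bytes_per_second = 200 * 3 * 4 + 2 * 4 + 1 * 4   # accel + position + velocity
print(bytes_per_second)                           # 2412
print(round(bytes_per_second * 3600 / 1e6, 1))    # ~8.7 MB per hour
```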
The dataset was collected individually; it was composed of the IMU sensor and GPS data measured while walking 3.6 km to 41 km per subject, and 3% to 11% of the collected data was eliminated from the dataset when the GPS reliability flag was not fixed. Naturally, each experiment was carried out using the user's own walking-pattern signal measured by their own smartphone.
We utilized Ubuntu 16.04 LTS on a desktop, TensorFlow, and an Nvidia GTX 1060 for training, and then exported our trained deep learning model and learnable parameters (e.g., weights and biases) to an Android app. The training and validation sets were randomly allocated 80% and 20% of the entire dataset, respectively, and the test was performed indoors using an Android app, implemented by us, that displays the traveled distance in real time. Running the app requires a total of 600 MB of memory space for the data file, which includes the weight parameters and the metafile containing the model structure. In addition, it takes an average of 310 ms on a Galaxy S7 (Android Nougat or higher) to convert the input signal into the output velocity when no other apps are running.
As mentioned above, our proposed dataset-construction method requires the users themselves to generate the data, because the dataset should be composed only of the user's own walking signal, position, and velocity, which the proposed deep learning model uses to learn that user's walking patterns. This means that users need to run the dedicated app to collect data when they are outdoors. Although we stored the dataset on the server manually, it is also possible for users to send the collected dataset to a private cloud or private server via Wi-Fi or BLE. We managed each subject's walking dataset on one desktop server and generated a data file containing the weight parameters for each corresponding subject. Table 1 shows the structure and parameters of our proposed simplified hybrid CNN–RNN model.

Performance Evaluation
Before designing our proposed method, we needed to confirm that the accuracy of the corrected GPS positioning was appropriate for labeling the measured time-series sensory signal. Table 2 shows the displacement error between the actual displacement and the corrected GPS position for each distance along a pre-defined straight path. In general, a commercial GPS has a positioning error of more than several meters in the global coordinate system; according to our experiment, there was an average error of 10.2 m in the global coordinates in open space. However, we only needed a relative coordinate system to compute the displacement between the start and end points. Thus, we determined that it is appropriate to estimate the moving speed of the pedestrian through a trajectory corrected to less than 2 m of average displacement error, and the error did not increase linearly with distance. We used the smartphone's built-in microphone and a short whistle sound to mark the start and end times. In other words, while the GPS and IMU sensors were being measured, the microphone signal was also recorded, and the whistle served as a timing marker for computing the displacement error of the experimental result.
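A timing marker like the whistle could be located, for example, with a simple short-time energy threshold on the microphone signal. The sketch below is our own illustration, not the authors' implementation; the frame length, threshold, sampling rate, and synthetic clip are all assumptions:

```python
import numpy as np
np.random.seed(0)  # reproducible synthetic "silence"

def onset_index(audio, frame_len=256, threshold=0.1):
    """Return the sample index of the first frame whose RMS energy
    exceeds threshold -- a crude detector for a short, loud event
    such as a whistle used as a timing marker."""
    n = len(audio) // frame_len
    frames = audio[: n * frame_len].reshape(n, frame_len)
    rms = np.sqrt((frames ** 2).mean(axis=1))
    hot = np.nonzero(rms > threshold)[0]
    return int(hot[0]) * frame_len if hot.size else -1

# Synthetic clip: 1 s of near-silence, then a 1 kHz whistle (8 kHz rate).
sr = 8000
t = np.arange(sr) / sr
clip = np.concatenate([0.001 * np.random.randn(sr),
                       0.5 * np.sin(2 * np.pi * 1000 * t)])
print(onset_index(clip))  # close to sample 8000, within one frame
```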
Four metrics were computed to quantify the performance of the different models: Accuracy is the fraction of correctly classified samples out of the total population, (TP + TN)/(TP + TN + FP + FN); Precision is the ratio of correctly predicted positive conditions to the total predicted positive conditions for each class, TP/(TP + FP); Recall is the ratio of correctly predicted positive conditions to all of the true conditions for each class, TP/(TP + FN); and the F1 score is the harmonic mean of Precision and Recall, 2 × (precision × recall)/(precision + recall). These definitions use the true positive (TP), true negative (TN), false positive (FP), and false negative (FN) counts. Another performance metric, the distance error, is the difference between the actual and predicted traveled distance over the whole validation set.

Table 3 presents the performance of our proposed hybrid multiscale convolutional and recurrent neural network model as the mean of fivefold cross-validation, and the results are also compared to those of other models. The training and validation sets were randomly allocated 80% and 20% of the data, respectively. In the models other than the proposed one, the segmented 2-s signal frame was fed into each model as the input. In this experiment, we used the dataset of approximately 41 km of walking collected by subject 1. As expected before the experiment, the CNN outperformed both the ANN without any manually designed features and the basic RNN model, which means that appropriate features are automatically extracted through the CNN learning process. Even though the CNN's automatic feature extraction does not always outperform a handcrafted feature extraction scheme, it can drastically reduce the time and cost of feature analysis. The performance of the LSTM and GRU, which can handle long-term dependency, improved over that of the CNN, which does not consider temporal correlation, and of the basic RNN model.
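The four metrics follow directly from a multiclass confusion matrix; the following self-contained sketch (the 3-class matrix is made up for illustration) shows the computation:

```python
def per_class_metrics(cm):
    """Accuracy, plus per-class (precision, recall, F1), from a confusion
    matrix cm[i][j] = count of samples of true class i predicted as j."""
    k = len(cm)
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(k))
    accuracy = correct / total
    metrics = []
    for c in range(k):
        tp = cm[c][c]
        fp = sum(cm[i][c] for i in range(k)) - tp  # column sum minus TP
        fn = sum(cm[c]) - tp                       # row sum minus TP
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        metrics.append((precision, recall, f1))
    return accuracy, metrics

# Hypothetical 3-class confusion matrix:
cm = [[50, 2, 0],
      [3, 45, 2],
      [0, 4, 44]]
acc, per_class = per_class_metrics(cm)
print(round(acc, 3))  # 0.927
```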
According to our experimental results and other studies, both models show enhanced performance; since the performance difference between the LSTM and GRU was insignificant, hyperparameter tuning appears to matter more than the choice of model. Thus, we adopted the GRU as part of our proposed model, considering its smaller number of parameters and shorter training time compared to the LSTM. A hybrid of CNN and GRU was evaluated to reflect the effect of the CNN's automatic feature extraction, and it showed a slight improvement in performance. Our proposed model demonstrated the best performance: we reduced the traveled-distance prediction error to 1.27% through simple noise rejection and the trends of different timescales provided by the multiscale method, the enhanced GRU cell, and the novel hybrid model architecture.
In addition, we compared the classification and regression methods, which predict the output in different ways. The two methods differ in the dataset configuration. When the regression method is applied, the label for the corresponding signal in the dataset is directly assigned the velocity calculated from the corrected GPS trajectory. In the classification case, the calculated velocities were clustered using an expectation-maximization (EM) algorithm, and the representative value of each cluster was then assigned as the label. The performance of the regression method was slightly lower, even though a calibrated trajectory was used; it seems that slightly different velocities are assigned as labels to similar signal patterns.
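The clustering step that produces the representative velocity labels can be sketched as follows; as our own minimal stand-in we use 1-D k-means, the hard-assignment limit of the EM algorithm the paper applies, on synthetic velocities (the cluster centers and counts are made up):

```python
import random

def cluster_velocities(xs, k=3, iters=20):
    """1-D k-means: returns k sorted cluster centers, each usable as the
    representative velocity label for the frames in that cluster."""
    lo, hi = min(xs), max(xs)
    centers = [lo + (hi - lo) * j / (k - 1) for j in range(k)]
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for x in xs:                       # assign to the nearest center
            j = min(range(k), key=lambda c: abs(x - centers[c]))
            buckets[j].append(x)
        centers = [sum(b) / len(b) if b else c   # recompute cluster means
                   for b, c in zip(buckets, centers)]
    return sorted(centers)

# Synthetic velocities around standing (~0), slow (~0.9), fast (~1.7) m/s:
rng = random.Random(7)
vels = ([rng.gauss(0.0, 0.05) for _ in range(100)]
        + [rng.gauss(0.9, 0.05) for _ in range(100)]
        + [rng.gauss(1.7, 0.05) for _ in range(100)])
print([round(c, 2) for c in cluster_velocities(vels)])  # ~[0.0, 0.9, 1.7]
```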
The confusion matrices for the classification results are shown in Figure 6 for comparison with the other methods. Classes 1 and 5 are the stationary and fastest classes, respectively, so they are relatively well classified by all of the models; this means that these two classes are distinct compared to the signal patterns of the other classes. With the enhanced deep learning models, the classification accuracy for classes 2 and 3 gradually improved. We exported and modified the trained proposed deep learning model to an Android app using TensorFlow Mobile to carry out the experiment indoors. The app works online and displays the traveled distance for the measured signal in real time. We carried out an experiment with nine subjects to verify that the method of estimating the traveled distance after learning their own walking pattern via their own smartphone outperforms previously studied smartphone-based PDR methods that use step counts and step-length estimation.
The experiments were carried out in a 60-m indoor corridor; subjects 1 to 6 walked at a speed that varied every 20 m, and subjects 7 to 9 walked at a constant speed. Table 4 shows the duration and distance of the walking dataset collected by each subject and the corresponding smartphone devices used for training. The experimental results of our proposed model and the existing methods are shown in Table 5. The proposed method, which learns the user's walking pattern using their own smartphone, showed the best performance, because it accounts for changes in the device, moving speed, and walking pattern, as well as in the smartphone's orientation and placement. Since the proposed method is based on deep learning, the accuracy tends to improve as the amount of data increases. According to the experimental results, when more than 10 km of walking data is used, the distance error is less than 2.4% and as low as 1.5%. In the experiment based on Weinberg [12], although the K parameter was tuned for each subject, the performance deviation was large due to the change in the orientation and position of the smartphone according to the walking pattern, as well as the variation in walking speed. Ho et al. [11] suggested a way to adjust the K parameter of Weinberg [12] according to changes in walking speed; however, this did not solve the problem caused by unconstrained walking patterns either. Huang et al. [23] proposed a method to estimate the stride length when the smartphone is held in a swinging hand by exploiting the cyclic features of walking, in order to overcome the limitation of most smartphone-based PDR approaches, which require carrying the smartphone in front of the chest. However, manually designed cyclic features based on a biomechanical model cannot reflect the diversity of each pedestrian's swinging action or the variation in walking pattern; thus, a large performance deviation among the subjects is shown in the results.
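For reference, the Weinberg baseline [12] estimates each stride from the acceleration extrema within a step as K · (a_max − a_min)^(1/4), with K tuned per subject as in our experiment. A minimal sketch (step segmentation is assumed to be already done; K and the per-step extrema below are made-up values):

```python
def weinberg_stride(a_max, a_min, K=0.5):
    """Weinberg-style stride-length estimate: K * (a_max - a_min)^(1/4),
    where a_max/a_min are the acceleration extrema within one step and
    K is a per-user tuning constant."""
    return K * (a_max - a_min) ** 0.25

def traveled_distance(steps, K=0.5):
    """Sum of per-step stride estimates; steps = [(a_max, a_min), ...]."""
    return sum(weinberg_stride(hi, lo, K) for hi, lo in steps)

# Three steps with an acceleration swing of ~6 m/s^2 each:
steps = [(9.0, 3.0), (9.5, 3.2), (8.8, 2.9)]
print(round(traveled_distance(steps), 2))  # 2.35
```

The dependence on a single per-user constant is exactly why this baseline degrades when the phone's placement, orientation, or walking speed changes, as observed in Table 5.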
In addition, to verify the performance against another deep learning-based model, we performed the experiment using the ANN-based model [35], which takes five input parameters closely related to the stride length and produces one output that estimates the corresponding stride length. For this experiment, we reconfigured the existing dataset into pairs of the five parameters (e.g., stride frequency; maximum, standard deviation, and mean of the acceleration; and height of the subject) and the stride length; these parameters and the stride length were extracted from the measured IMU sensory signal and the corrected GPS trajectory, respectively. Although we cannot be sure that the ANN's model size, hyperparameters, and optimization techniques were designed well enough to achieve its best performance, we confirmed that the proposed multiscale hybrid CNN–RNN model, which considers nonlinear feature vectors from multiple timescale signals as well as temporal correlation, can be more effective at learning multivariate time-series signal patterns.

Discussion and Conclusions
Stride length estimation is one of the most important factors in PDR-based indoor localization, and it is affected by the pedestrian's movement behavior, speed, and physical properties, as well as by the smartphone's placement and orientation. Under these conditions, precise estimation of the stride length is a challenging research topic. Although many studies have used network infrastructure such as Wi-Fi and BLE to calibrate and overcome the drawbacks of PDR, this only works in indoor spaces equipped with such infrastructure. Therefore, we introduced a new, precise PDR approach that can be used without network infrastructure by learning the personal walking pattern during the various activities of everyday life.
Since the dataset used in our experiment is constructed automatically, it may be of lower quality than a dataset measured under controlled experimental conditions. However, the experimental results showed an approximately 2% distance error when more than 10 km of data was collected, and an approximately 1% error when more than 30 km was collected. Although our experiments were carried out under limited conditions (i.e., three walking types) so as to apply the same conditions to all subjects, we expect these results to improve if a larger dataset (e.g., a longer period of time, more variable walking patterns, and more smartphone placements) is collected and the accuracy of the data collection is improved.
In addition, although this paper focuses on solving the problems of an indoor localization application, the multiscale hybrid convolutional and recurrent neural network model can provide a general framework for time-series signal classification, regression, and forecasting that accommodates a wide range of applications; i.e., signals collected by other single or multiple sensors can be applied to our designed model directly. However, one thing needs to be improved: at present, the entire dataset must be relearned whenever additional data with a new output class (i.e., velocity) is generated, because our proposed learning scheme uses a conventional supervised learning approach.
In the future, we will need to focus on how to set the initial position using a minimum number of reference points, such as beacons and landmarks (e.g., corners or objects), how to estimate the heading direction based on the proposed model, and how to match and correct the final position to the map while walking indoors, in order to resolve the error that accumulates from wrong estimates of the moving speed. Moreover, we will study online and lifelong [43] learning to overcome the inherent limitations of supervised learning, such as fixed output classes and the need for additional datasets.
Finally, we hope that a major contribution of our study will be to motivate researchers who want to apply a new and easy approach for indoor localization as well as time-series sensory signal analysis.