Smartphone-Based Activity Recognition for Indoor Localization Using a Convolutional Neural Network

In the indoor environment, the activity of the pedestrian can reflect some semantic information. These activities can be used as the landmarks for indoor localization. In this paper, we propose a pedestrian activities recognition method based on a convolutional neural network. A new convolutional neural network has been designed to learn the proper features automatically. Experiments show that the proposed method achieves approximately 98% accuracy in about 2 s in identifying nine types of activities, including still, walk, upstairs, up elevator, up escalator, down elevator, down escalator, downstairs and turning. Moreover, we have built a pedestrian activity database, which contains more than 6 GB of data of accelerometers, magnetometers, gyroscopes and barometers collected with various types of smartphones. We will make it public to contribute to academic research.


Introduction
In the indoor environment, human activity contains rich semantic information, for example, if a user's activity is recognized as taking an elevator, the location of the user can be inferred to the elevator. These activities can be used as the landmarks for indoor localization and mapping [1][2][3][4][5]. For example, when a user's activity is detected as "elevator", his/her location may be at the elevator. The recognition of human activities has been approached in two different ways, namely ambient sensing methods and wearable sensing methods [6]. The ambient sensing methods make use of the devices fixed in predetermined point of interest to sense human activities, such as video camera [7] and WiFi signal [8]. Wearable sensing methods are based on the sensors attached to the user. Recently, with the development of sensor technology, wearable sensing methods have become more and more popular. The wearable sensing methods can be implemented directly in smartphones [9].
These ambient sensor-based methods are device-free and activity can be detected solely through the video/image or wireless signals. This advantage makes these methods especially suitable for security (e.g., intrusion detection) and interactive applications. The drawback of these methods is the predetermined infrastructures (video and WiFi devices) are difficult to attach to target individuals to obtain their data during daily living activities.

Wearable Sensors
The limitation of ambient sensors motivated the use of wearable sensors for human activity recognition. The first wearable sensor-based activity recognition was proposed last 1990s, which used an accelerometer for posture and motion detection [22]. The accelerometer is the most frequently used sensor for activity recognition [23][24][25]. Attal et al. [26] present a review of different techniques used for physical human activity recognition from wearable inertial sensors data.
Lara and Labrador surveyed the state of the art in human activity recognition making use of wearable sensors [6].
With the advance of sensing technology, a smartphone is equipped with several sensors, including an accelerometer, gyroscope, magnetometer, and barometer. These sensors can be fused to recognize activity. Shoaib et al. proposed an activity recognition method by fusing smartphone motion sensors [27].
Compared with ambient sensor-based methods, wearable sensors are not limited by coverage, which can used for capturing people's continuous activities.

Traditional Machine Learning Methods
Traditional machine learning methods usually consist of two parts: feature extraction and classification. The activity recognition performance of the traditional methods depends highly on the extracted features. The feature extraction process requires laborious human intervention, which is the major challenge. The feature used for activity recognition includes time domain features (e.g., mean, standard deviation, variance, interquartile range, etc.), frequency domain features (e.g., Fourier Transform, Discrete Cosine Transform), and others (Principal Component Analysis, Linear Discriminant Analysis, Autoregressive Model) [6]. For activity classification, there are numerous classic methods, such as decision tree [23], Bayesian [28], neural networks [29], k-nearest neighbors [30], regression methods [31], support vector machines [32], and Markov models [33].

Deep Learning Methods
Recently, deep learning has attracted considerable attention, and has been successfully applied in various fields, including human activity recognition. Zheng et al. [34] applied convnets to human activity recognition using sensor signals, but it just classified with a one-layer convolutional network which can hardly get high accuracy. Ronao and Cho proposed another deep learning method to classify six types of activities [35]. However, they use only accelerometers and magnetometers which can hardly differentiate vertical activities such as upstairs and downstairs. Gu et al. proposed a smartphone-based activity recognition method, which takes advantages of four types of sensor data, including an accelerometer, gyroscope, magnetometer, and barometer [36]. In their work, stacked denoising autoencoders is applied for activity recognition. The network just consists of two layers: encoding layer and decoding layer. The two-layer structure does not make full use of the no-linear advantage of deep learning, thus, leading to performance defects. Ravi et al. [37] propose a deep learning method for human activity recognition using low-power devices. To design a robust approach against transformations and variations in sensor properties, a manual feature selection strategy is adopted to extract features, which requires laborious human intervention.

Methodology
In this part, we will firstly introduce the activity categories for classification in detail, then give the experiment environment including software and hardware configuration. After that, the proposed CNN-based method is introduced as follows: the whole architecture of the CNN-based method, the data processing approach, the structure of our network, related theory and the vital training strategies. Finally, we are going to introduce the algorithm transplantation method which differs from the offline testing. Transplantation makes the online activity recognition with a Smartphone possible so that it can contribute to the indoor localization system.

Activities for Indoor Localization
In the indoor environment, human activity contains rich semantic information, which can be used for indoor localization. For example, if a user's activity is detected as taking an elevator, the location can be inferred to the elevator. In this paper, we focus on the activities which contain context information. There are nine types of activities, including down elevator, down escalator, downstairs, up elevator, up escalator, upstairs, turning, walking, and still. The description of each activity is shown in Table 1.

Hardware and Software Setup
The specific software and hardware configuration information is given in Table 2. All our experiments were conducted on a server with powerful computational capabilities. There is 64-GB memory in the server and it is equipped with two GeForce GTX 1080Ti graphics cards to accelerate computing. We have installed Ubuntu 16.04 in junction with Python. Python has very efficient libraries for matrix multiplication, which is vital when working with deep neural networks. Tensorflow is a very efficient framework for us to implement our CNN architecture, moreover, we have to install the other dependencies like CUDA Toolkit and cudnn before using tensorflow. The CUDA Toolkit provides a comprehensive development environment for NVIDIA GPU accelerated computing. CuDNN can optimize CUDA to improve the performance.

Proposed CNN-Based Method
3.3.1. Architecture Figure 1 describes the architecture of our proposed method for activity recognition. Firstly, the activity data is collected by the built-in sensors of a smartphone, including an accelerometer, gyroscope, magnetometer, and barometer. Secondly, the collected data are divided to different segments. Each segment is an activity sample. Then, the data sample is put into the CNN for activity recognition.

Acc
Gyro

Data Segmentation
Data recorded is a time-varying signal which needs to be separated into training examples; each example stands for an activity. Sliding-window is used for segmentation and the window size stands for how long we take to present an activity. In activity recognition problem, the window size is of high importance. A too small value may not precisely capture the full characteristics of the activity, while a too large sequence may include more than one activity. To explore the proper window size we should use, we have chosen the varying window size of 1 s, 1.5 s, 2 s, 2.5 s, 3 s, 3.5 s with 50% overlap. For example, when the window size is 2 s, 1000 values recorded from the four sensors at each timestamp, we choose 1000 values to represent an activity.

CNN for Activity Recognition
The structure of the convolutional neural network designed for activity classification mainly consists of three types of layers: convolutional layer, pooling layer and fully connected layer.
The input data is a vector containing data from an accelerometer, magnetometer, gyroscope and barometer, so 1D convolution is utilized to deal with the 1D data. If the input vector is x = [x 0 , x 1 , ..., x n−1 ] (n is the number of data in a sliding window), the filter is written as w = [w 0 , w 1 , ..., w m−1 ] (m is the filter size), and the output of the convolutional layer is z = [z 1 , z 2 , ..., z h ] (h is the length of the output vector), then for the jth element z j in z, it is satisfies the equation: where s refers to be the stride length of convolution. The relation between h, m, n, and s can be expressed as: After the convolutional layer, an extra activation layer is added to improve the express capability of the network, the output of activation layer is a = f (z), where f is the activation function. We use Relu as the activation function.
The pooling layer focuses on extracting more robust features by choosing the statistic value of nearby inputs. The maximum, minimum or mean value of a given region are computed as the output of the pooling layer, thus, reducing data shape, simplifying computing and avoiding overfitting. Max pooling is utilized in our proposed network [12].
After several layers of convolution and pooling operation, the fully connected layer follows as the classifier to recognize different activities. The softmax layer is applied to map the result of a fully connected layer to the range (0, 1), which can be seen as the probability of the sample being each type of activity. Suppose the output of fully connected layer is f = [ f 1 , f 2 , ..., f n ], the output of the softmax is y = [y 1 , y 2 , ..., y n ], it satisfies the equation that: We adjust the weights and biases in each layer by minimizing the loss between the initial prediction and the label to get the best prediction model, the loss can be described as follows: where y is the label of sample and y is the output of the network. It is a least square problem which can be solved by some existing optimization algorithms like SGD, Momentum, RMSprop and Adam. We choose Adam as our optimization algorithm for the complexity of our network and the necessary of quick convergence.

Training Strategies
The most time-consuming process of implementing activity recognition is training the model. There are two important tricks we have utilized in our classifier to accelerate training: Mini-batch Gradient Descent and normalization for the input data [38,39].
With the difference of the amount of training data, the optimization algorithm can be divided into Stochastic Gradient Descent, Mini-batch Gradient Descent and Batch Gradient Descent. The Stochastic Gradient Descent strategy just deals with one sample at a time while Batch Gradient Descent deals with all samples at a time. As one sample can hardly derive the right direction of gradient descent, the Stochastic Gradient Descent strategy always spends too much time to converge despite the high iteration speed. Batch Gradient Descent with low iteration efficiency is also not a good choice. Mini-batch Gradient Descent optimizing part of data is a proper strategy to filling the gap and we usually choose the power of 2 as the batch size.
Normalization of the input data can make the optimization easier and accelerate convergence. For the non-uniform distribution of input data in each dimension, there is much difficulty in finding the fastest gradient descent direction. Input data normalization can solve the problem by changing the distribution of the input data by this way: suppose an sample x = [x 1 , x 2 , ..., x m ] with mean µ and variance 2 , we normalize the sample tô where ε is a number close to zero for avoiding wrong division.

Dataset and Experimental Setup
For collecting the activity dataset, ten participants were invited to collect the data of nine activities defined in Table 1 using smartphones. The specific information of the participants is listed in Table 3. Data are collected from four types of sensors: an accelerometer, magnetometer, gyroscope and barometer. The data sampling rate is 50 Hz. 60% of the dataset are used to train and the rest 40% are used for the test data. The train and test data are from all the subjects. To avoid the effect of data dependence on the result, we shuffled the data in advance. Table 3. The specific information of the participants. 1  163  female  25  2  168  male  24  3  173  male  26  4  180  male  22  5  180  male  25  6  168  male  24  7  165  female  23  8  186  male  22  9  162  female  24  10  172  male  31 To evaluate the performance in identifying different types of activities, we choose the F-value as the evaluation metrics [40]. The F-measure is defined as F = (2 * P * R)/(P + R) where P refers to the precision and R refers to the recall. In our experiments, the precision for an activity is the number of correctly expected samples (true positives) divided by the total number of samples predicted as belonging to this activity (the sum of true positives and false positives). The recall is defined as the number of corrected predicted samples divided by the total number of sample that actually belong to this activity (the sum of true positives and false negatives). Both of the precision and the recall rate can hardly measure the performance accurately. The F-measure is computed according to the precision and the recall, which can evaluate the performance more accurately.

Hyperparameter Settings
We have trained the data on network with different convolutional layers to find the best architecture. In each architecture, we adjust the filter size, number of feature maps, the pooling size, the learning rate and the batch size in the hyperparameters tuning process to retain the best configuration. As a result, we choose the best architecture with the best parameter setting as the final configuration. Table 4 illustrates the list of hyperparameters and their candidate values. The value in bold is the best setting of each hyperparameter.

Number of Layers
To analyze the effect of number of layers, the CNN-based classifier was performed with different numbers of layers. Figure 2 shows the F-measure in each activity by using different number of layers. It is clear that the network with 5 convolutional layers outperforms others in each type of activities. The reason lies in the fact that network with convolutional layers less than 5 is not complex enough to extract appropriate features for activity recognition while network with six convolutional layers tends to cause over-fitting for the structure complexity. Network with five convolutional layers is just enough to obtain a good performance.

Filter Size
To obtain the most proper filter size, the classifier was performed with different filter size (the number of layer is set to 5). As shown in Figure 3, increasing the filter size from 2 to 10 improved the classification performance of each activity. When the filter size is larger than 10, the performance decreases with the increasing of filter size. Therefore, the filter size is set to 10.   Figure 4 shows the influence of the feature map on the classification performance. As shown in Figure 4, the best performance for each activity is achieved when the feature map is set to 100.   Figure 5 demonstrates the classification performance for each activity with different pooling size. It can be seen from Figure 5 that the performance improves with the increasing of the pooling size at first. The classification achieves the best performance when the pooling size increases to 5. After that, the performance decreases with the increasing of the pooling size. Therefore, the pooling size is set to 5.  Figure 6 shows the performance of each activity with different learning rate. It can be seen from Figure 6, when the learning rate is less than 0.001, the algorithm reaches steady and good performance while the learning rate bigger than 0.001 shows unstable results. The reason for the bad performance of big learning rate is that the variables update too fast to change to proper gradient descent direction timely. Though with good performance, the tiny learning rate is also not the best choice because it means the slow update of the variables and it leads to time consuming of the training. Therefore, the learning rate is set to 0.001.

Impact of Different Window Size
During the data segmentation process, window size is a key factor. To evaluate the impact of window size on the classification performance, we divide the data in different window sizes, namely 1 s, 1.5 s, 2 s, 2.5 s, 3 s and 3.5 s. Figure 8 shows the performance of each activity with different window sizes. As can be seen from Figure 8, a window size of 2 s shows the best performance. Therefore, we choose 2 s as the window size for data segmentation.   Figure 9 shows the performance of our method in recognizing each activity. Table 5 shows the performance of the proposed method with the optimal hyperparmeters. From Figure 9, we can see that the proposed method shows excellent performance in recognizing all the nine activities. The F-measures for all the activities are higher than 0.98 except A2 (down escalator) and A5 (up escalator), whose F-measures are 0.97. Table 5 shows that there is a little difficulty for the proposed method to distinguish A2 and A5.

Comparison with Other Classification Methods
We have compared the proposed method with four traditional machine learning methods, including IBK, J48, NaiveBayes and SVM. The traditional machine learning methods need to select appropriate features. We use 64 features including statistical features, time domain features and frequency domain features (refer to [6]). Moreover, we have also compared the proposed method with another deep learning recognition approach named Convnet [35], which has a proven better performance than traditional ones like DT (decision tree) and SVM (Support vector machine). Convnet classified six types of human activities such as walking, upstairs, downstairs, standing, laying and sitting just utilizing an accelerometer and gyroscope. Figures 10 and 11 shows the comparison result. Figure 11 shows that the proposed method achieves 0.987 average F-measures of nine types while J48 which performs best in other methods acquires the average F-measures of 0.955. The other four methods show poorer performance with the F-measures less than 0.85. Therefore, the proposed method performs better than state-of-the-art methods in dealing with indoor human activities. Figure 10 shows that only the proposed method and J48 can achieve relatively stable performance on every activity. However, J48 shows just 0.92 F-measures on A6, A7 and A8 which is obviously worse than the proposed method. Furthermore, traditional machine learning methods require laborious human intervention in extracting complex hand-crafted features, so the J48 is time-consuming in the feature extraction process. Convet can hardly recognize similar activities such as up elevator and down elevator. We think the main reason is that Convnet is trained without barometer readings. We think that barometer is of high importance to sense the changing of altitude, so it can easily distinguish activities such as upstairs and downstairs which causes the change of height in different ways.

Conclusions
This paper presents a smartphone-based activity recognition method using a convolutional neural network, which can be used for indoor localization. The proposed method can learn the features automatically to avoid time-consuming feature extraction and selection. An activity database is built, which contains more than 6 GB of data collected by smartphones. Experimental results show that the proposed method achieves approximately 98% accuracy in identifying nine types of activities, which outperforms other classification methods.
There are some limitations in this work. Firstly, we did not consider the energy consumption problem in designing the recognition algorithm. Secondly, the dataset is collected from ten persons, which may not be enough to build a good model for the general population. In the future, we aim to keep collecting training data from more people. Meanwhile, we will investigate a deeper and more complex deep learning framework to improve the current method. Moreover, we will take into account the energy consumption in designing the recognition algorithm, for example, reducing the data sampling rate and the number of sensors used.