A Framework of Combining Short-Term Spatial/Frequency Feature Extraction and Long-Term IndRNN for Activity Recognition †

Smartphone-sensors-based human activity recognition is attracting increasing interest due to the popularization of smartphones. It is a difficult long-range temporal recognition problem, especially with large intraclass distances, such as carrying smartphones at different locations, and small interclass distances, such as taking a train or subway. To address this problem, we propose a new framework combining short-term spatial/frequency feature extraction and a long-term independently recurrent neural network (IndRNN) for activity recognition. Considering the periodic characteristics of the sensor data, short-term temporal features are first extracted in the spatial and frequency domains. Then, the IndRNN, which can capture long-term patterns, is used to further obtain the long-term features for classification. Given the large differences when the smartphone is carried at different locations, a group-based location recognition is first developed to pinpoint the location of the smartphone. The Sussex-Huawei Locomotion (SHL) dataset from the SHL Challenge is used for evaluation. An earlier version of the proposed method won the second-place award in the SHL Challenge 2020 (first place if multi-model fusion approaches are not considered). The proposed method is further improved in this paper and achieves 80.72% accuracy, better than the existing methods using a single model.


Introduction
Human activity recognition has been an active research area for decades and has many practical applications such as video surveillance [1][2][3], human-computer interaction [4] and gaming [5]. Nowadays, with the ubiquity and portability of smartphones and the development of their built-in sensors, there has been growing interest in smartphone-sensors-based human action recognition [6][7][8][9][10]. Research on smartphone-sensors-based activity recognition for indoor localization [9], real-time smartphone activity classification [10] and transportation recognition [8] has been actively investigated.
Different from conventional video-based human action recognition [11], the data captured from smartphone sensors show some specific characteristics. For example, due to the mechanism of smartphone sensors, it has been shown [12] that the data are of a periodic nature. Moreover, the sampling rate of smartphone sensors is usually very high, resulting in a large amount of very long-range data. Furthermore, different users have different living habits, and people usually place their mobile phones at different locations on their bodies, which causes large differences in the distribution of the data. In addition to the large variance of the data, the activity categories used in smartphone-sensors-based classification also differ from those in conventional human action recognition. Besides the locomotion of a person, the transportation mode is also considered an important classification task, including taking a car, bus, train or subway, which can be very confusing.
To promote the development of smartphone-sensors-based activity recognition, the Sussex-Huawei Locomotion (SHL) Challenge [13] was organized for three years, from 2018 to 2020. It is based on the large-scale SHL dataset, which was recorded over seven months by three participants engaging in eight transportation activities in real-life settings, including Still, Walk, Run, Bike, Car, Bus, Train and Subway [14]. The 2020 edition of the challenge [13] aims to realize user independence and location independence.
Several works have been proposed in the literature for smartphone-sensors-based activity recognition, including conventional handcrafted-features-based and deep-learning-based methods. Especially with the rapid development of deep learning, many convolutional neural network (CNN) and recurrent neural network (RNN) based methods have been developed in the last few years. For the CNN-based methods, EmbraceNet [15] and DenseNet [16] have been proposed for the task. However, due to the nature of convolution, the receptive field in the time domain is relatively small and long-range temporal information cannot be well captured. On the other hand, due to their sequence-processing capability, RNNs are naturally appropriate for the task. In [17], an LSTM (long short-term memory) network is used to process the sequence information. However, conventional RNNs, including the simple RNN and LSTM, usually suffer from the gradient vanishing and exploding problem, or from gradient decay over layers due to the use of gates with non-saturated activation functions. Especially for smartphone-sensors-based activity recognition, a model with long-range processing capability is highly desired.
To address this long-range temporal processing problem, this paper develops a framework combining short-term spatial/frequency feature extraction and a long-term IndRNN recognition model. The contributions of this paper can be summarized as follows.
• A framework combining short-term spatial and frequency domain feature extraction and long-term IndRNN-based recognition is proposed. The long-range temporal processing problem is divided into two subproblems to take advantage of the periodic characteristics of the sensor data.
• A dense IndRNN model is developed to capture the long-term temporal information. Due to the capability of IndRNN in constructing deep networks and processing long sequences, the dense IndRNN model can effectively process the short-term features to obtain the long-term information.
Preprocessing that derotates the sensor data to the world coordinate system and postprocessing that transfers the learned model to the new users in the test set are also used in the proposed method. Experimental results show that the proposed method achieves state-of-the-art performance in the category of single-model-based methods. An earlier version of the proposed method appeared as a workshop paper for the SHL Challenge 2020 [18]. This paper significantly extends it by adding a detailed explanation of the proposed method and a thorough analysis of the experiments, with ablation studies on the models and parameters. Moreover, feature augmentation with temporal changes is further developed, which improves the performance over the earlier version.
The rest of this paper is organized as follows. Section 2 describes the related work, and the proposed method is presented and explained in Section 3. The experimental results and analyses are provided in Section 4. Section 5 concludes the paper.

Related Work
Vision-based human activity recognition has been widely studied for decades, with many methods proposed in the literature. However, environmental sensors such as cameras can become inconvenient in open or crowded areas for gathering the activity information of each individual. The distance between the human and the device also affects the quality of the signals, leading to differences in recognition accuracy. To address these issues, especially to collect daily activity information for each individual in all areas, wearable sensors have become an attractive option. Some earlier wearable sensors, requiring markers on people, can also be intrusive and make people uncomfortable. However, with the rapid popularization of the smartphone, smartphone-sensors-based human activity recognition is gaining interest since it does not require devices other than the smartphone (which most people already carry during the day). Many studies have been conducted on activity recognition tasks based on smartphone sensors, including recognizing indoor activities [9], nursing activities to better care for patients [19], and the movements people perform on their smartphones such as typing and scrolling [20]. Different approaches have been proposed for smartphone-sensors-based activity recognition, including the conventional handcrafted-features-based and the deep-learning-based methods, which are briefly described in the following.

Conventional Handcrafted Features based Methods
In the conventional handcrafted-features-based methods, spatial/temporal and frequency features are first extracted using techniques including statistical features such as the mean, variance, standard deviation, maximum value, minimum value, energy and entropy, as well as Fourier transform spectrums. Such features are engineered to capture the information in the sensor data. After the features are extracted, conventional machine learning methods such as decision trees [21], KNN (k-nearest neighbors) [22], hidden Markov models [23] and SVMs (support vector machines) [24] can be used for the classification of the activity. In [22] and [21], KNN and decision trees are used as classification models and the abovementioned spatial and frequency domain features are selectively used as input. In [25], a 'one-versus-one' SVM is used to perform pairwise combination selection and a Gaussian kernel is applied to process the features in a high-dimensional space. In [26], a random forest is first used to predict the activity category of each frame; then the activities are smoothed over time with a hidden Markov model, considering that activities in daily life are continuous.

Deep Learning based Methods
With the increasing applications and success of deep learning in many research areas, deep learning, including both CNNs and RNNs, has also been explored for smartphone-sensors-based activity recognition. For the CNN-based methods, Zeng et al. [27] and Zheng et al. [28] used just one convolution layer as a spatial feature extractor to obtain the features at each time step, and then pooling is applied in the time direction to summarize the temporal information. However, with a shallow network and a simple temporal processing technique, they cannot extract high-level spatial-temporal features and did not achieve a high accuracy. Charissa et al. [29] employed a CNN using filters with a large time span to explore the long temporal correlation, and pooling over time is gradually applied, alternating with convolutional layers, to reduce the loss over time. Zhu et al. [16] proposed to use a 1D DenseNet model in order to take advantage of deeper CNNs. The DenseNet is first applied to each sensor independently and the results are then combined. All the data in the time domain are sampled and provided as one input to the network to better explore the temporal information; considering the large volume of the temporal data, this also results in a large number of parameters. Choi et al. proposed EmbraceNet [15] to fuse multiple CNN models together. It also processes each sensor independently and then combines them. In all, the CNN-based methods usually process the temporal sequence with pooling or convolution, which is not very effective for the long-range problem.
Since smartphone-sensors-based human activity recognition is a temporal sequence processing task, RNNs are a natural choice given their temporal processing capability. Francisco et al. proposed a deep framework [17] using convolution and LSTM (long short-term memory) together, where the convolution extracts the spatial features and the LSTM helps learn the long-term temporal information. However, the gate mechanism in LSTM makes it difficult to construct deep networks. Some researchers have migrated the dense and residual architectures to LSTM to assist in constructing deep networks, but the performance improvement is not significant [30]. In [31], Rui et al. first used dilated convolutional neural networks to extract local short-term features; then a shallow dilated SRU is developed to model the long temporal dependencies. In short, the conventional RNN models used for classification are usually shallow and cannot effectively form deep models due to the gradient decay within each layer. On the contrary, the recently proposed IndRNN [32,33] has been shown to better explore the high-level and long-term information, and it has been used in the previous two years' SHL Challenges [34,35] as the base module with only the spatial information or FFT magnitudes using a relatively shallow network. This paper further proposes a framework combining short-term spatial and frequency features and long-term deep dense IndRNN models for activity recognition.

Overall Framework
This paper proposes an Independently Recurrent Neural Network based long-term activity recognition method built on short-term spatial and frequency domain features. The framework of the proposed method consists of four modules as shown in Fig. 1: preprocessing, short-term spatial and frequency feature extraction, the long-term IndRNN model and transfer learning for postprocessing. Among them, the preprocessing and short-term feature extraction modules process the input data into short-range spatial features and frequency domain features to accommodate the periodic nature of the smartphone sensor data. Then the IndRNN model, taking advantage of its capability of processing very long sequences and constructing deep models, is applied as the main recognition model to solve the long-range classification problem. Finally, transfer learning is adopted as postprocessing to fine-tune the model in order to realize user independence. Details on each module are presented in the following.

Preprocessing
For current smartphones such as the HUAWEI Mate 9 used to collect the data in the SHL dataset [14,36], the sensor data is measured in a coordinate system attached to the smartphone. The basis of the triaxial sensors is (x_b, y_b, z_b) where, for most phones, x_b is along the shorter side and points right, y_b is along the longer side and points up, and z_b is perpendicular to the screen and points out. The accelerometer and magnetometer, two of the smartphone sensors, measure the acceleration of the device and the magnetic field of the earth at the device location, respectively, each represented by a 3-dimensional vector. Since the data is measured in the coordinate system attached to the smartphone, the sensor data can be inconsistent in the world coordinate system when only the phone is rotating without the user's body moving. In turn, this will affect the classification accuracy of the user's activity if no preprocessing is applied. Therefore, to reflect the real movement of the user in the world coordinate system, the sensor data needs to be derotated to a consistent world coordinate system.
In this paper, the NED (North-East-Down) coordinate system is used to transform the sensor data as shown in Fig. 2, where x_n points toward East, y_n points toward magnetic North and z_n points up toward the sky. The transform is performed by multiplying the raw sensor data with the rotation matrix R derived from the orientation sensor of the device, given in quaternions [q_w, q_x, q_y, q_z], as shown in Equations (1) and (2).
R = [ 1 - 2(q_y^2 + q_z^2)    2(q_x q_y - q_w q_z)    2(q_x q_z + q_w q_y) ]
    [ 2(q_x q_y + q_w q_z)    1 - 2(q_x^2 + q_z^2)    2(q_y q_z - q_w q_x) ]        (1)
    [ 2(q_x q_z - q_w q_y)    2(q_y q_z + q_w q_x)    1 - 2(q_x^2 + q_y^2) ]

[x_n, y_n, z_n]^T = R [x_b, y_b, z_b]^T        (2)

where (x_n, y_n, z_n) represents the transformed data in the NED coordinate system, which is consistent with the user's movement. The transformed data can then be used for the following feature extraction.
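The derotation in Equations (1) and (2) can be sketched in a few lines of numpy. This is an illustrative implementation, not the paper's code; the function names and the (T, 3) data layout are our assumptions.

```python
import numpy as np

def quat_to_rotmat(qw, qx, qy, qz):
    # Standard rotation matrix built from a unit quaternion [qw, qx, qy, qz],
    # matching Equation (1).
    return np.array([
        [1 - 2*(qy*qy + qz*qz), 2*(qx*qy - qw*qz),     2*(qx*qz + qw*qy)],
        [2*(qx*qy + qw*qz),     1 - 2*(qx*qx + qz*qz), 2*(qy*qz - qw*qx)],
        [2*(qx*qz - qw*qy),     2*(qy*qz + qw*qx),     1 - 2*(qx*qx + qy*qy)],
    ])

def derotate(samples, quat):
    # samples: (T, 3) sensor readings in the phone frame.
    # Applies R to every row, i.e. Equation (2) for each time step.
    R = quat_to_rotmat(*quat)
    return samples @ R.T
```

With the identity quaternion [1, 0, 0, 0] the matrix reduces to the identity, so the data is unchanged; a rotation of the phone about its z-axis remaps the x/y readings accordingly.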

Short-term Spatial and Frequency Domain Feature Extraction
For the sensors used in the HUAWEI Mate 9, the sampling rate is 100 Hz. Data from a window of 5 seconds is used for each classification, resulting in 500 frames of data. Generally, processing very long-range data such as 500 time steps is difficult due to the complex temporal patterns. Also, the data from smartphone sensors has been shown to be periodic [12]. Therefore, short-term spatial and frequency domain features are extracted first, as explained in the following.
First, the data of each 500-frame (5-second) sample is segmented into 21 overlapping sliding windows of 100 frames (1 second) each, as shown in Figure 3. Each segmented window contains short-term signals, and long-term signals can be obtained by combining them over time. The data from 7 sensors are provided for classification, including the accelerometer, gyroscope, magnetometer, linear acceleration, gravity, orientation and ambient pressure, resulting in a total of 20 channels of data. Since the accelerometer data is a superposition of the linear acceleration and gravity, the linear acceleration and gravity data are not used in order to reduce the size of the data input. Also, since the orientation is used to derotate the other sensors' data, it is no longer used after the preprocessing. In all, the data from the gyroscope, the derotated data from the accelerometer and magnetometer, and the pressure are used in our method, comprising 10 channels.
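The segmentation above implies a stride of 20 frames, since (500 - 100) / 20 + 1 = 21 windows. A minimal sketch, assuming a (frames, channels) array layout:

```python
import numpy as np

def segment(sample, win=100, stride=20):
    # sample: (T, C) array of one 5 s recording.
    # Returns (num_windows, win, C) overlapping windows.
    T = sample.shape[0]
    starts = range(0, T - win + 1, stride)
    return np.stack([sample[s:s + win] for s in starts])

x = np.random.randn(500, 10)   # one sample: 500 frames, 10 channels
windows = segment(x)           # shape (21, 100, 10)
```

The window length and count follow the paper's description; the stride value is inferred from them.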
For each segmented window, spatial features over time are first extracted, including the mean, number of values above the mean, number of values below the mean, standard deviation, minimum value and maximum value, similarly as in [20]. For pressure, the data is normalized per sample and used as input to show the change within each sample. The pressure data does not perform well for activity recognition, but works well for the location recognition model introduced later. The descriptions of the features are shown in Table 1. On the other hand, due to the strong periodicity of the smartphone sensor data, the fast Fourier transform (FFT) is used to transform the data into the frequency domain. The FFT amplitude spectrums are then extracted as features, where only the magnitudes of the coefficients are used (half of the total data). Some examples of the FFT amplitude spectrums from all the classes are shown in Fig. 4. It can be seen that the distribution of the FFT amplitude spectrums can be quite different among different classes. Therefore, in addition to the amplitude spectrum, statistical features on top of the frequency features, including the mean and standard deviation, are also extracted and combined with the previous features.
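The per-window feature extraction can be sketched as follows. This is an illustrative numpy version of the statistical and FFT features described above; the exact feature ordering and any further normalization in the paper's pipeline are assumptions.

```python
import numpy as np

def window_features(win):
    # win: (L, C) one segmented window; all features are computed per channel.
    mean = win.mean(axis=0)
    feats = [
        mean,
        (win > mean).sum(axis=0),   # numbers above mean
        (win < mean).sum(axis=0),   # numbers below mean
        win.std(axis=0),
        win.min(axis=0),
        win.max(axis=0),
    ]
    # FFT amplitude spectrum: rfft keeps only half of the coefficients,
    # since the spectrum of a real signal is symmetric.
    amp = np.abs(np.fft.rfft(win, axis=0))
    feats += [amp.mean(axis=0), amp.std(axis=0)]   # stats on the spectrum
    return np.concatenate(feats), amp

f, amp = window_features(np.random.randn(100, 10))
# f: 8 feature types x 10 channels = 80 values; amp: (51, 10) spectrum
```

The amplitude spectrum itself and the concatenated statistics together form the short-term representation fed to the long-term model.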
Table 1. Extracted short-term features in the spatial-temporal domain and their definitions.

Mean: The average value of the data for each axis in the window
Numbers above Mean: The number of values above the mean of the window
Numbers below Mean: The number of values below the mean of the window
Standard Deviation: The standard deviation of each axis in the window
Minimum Value: The minimum value of the data for each axis in the window
Maximum Value: The maximum value of the data for each axis in the window
Per-Sample Normalized Pressure: The normalized pressure of each sample

Long-term IndRNN (Independently Recurrent Neural Network) Model
With the short-term spatial/temporal and frequency domain features extracted, a long-term recognition model is further proposed for the final recognition. In this paper, our previously proposed Independently Recurrent Neural Network (IndRNN) [32,33] is adopted as the basic model. The structure of the IndRNN [32,33] follows

h_t = σ(W x_t + u ⊙ h_{t-1} + b)        (3)

where x_t ∈ R^M and h_t ∈ R^N are the input and hidden state at time step t, respectively. W ∈ R^{N×M}, u ∈ R^N and b ∈ R^N are the weights for the current input, the recurrent weights and the bias of the neurons, respectively. ⊙ represents the Hadamard product and σ is the nonlinear activation function of the neurons. N is the number of neurons in this IndRNN layer. With this form, each neuron in an IndRNN layer is independent of the others, and the gradient backpropagation can be calculated for each neuron independently. Accordingly, by regulating the recurrent weights, the IndRNN well addresses the gradient vanishing and exploding problems and can therefore process very long sequences. It also works robustly with non-saturated activation functions such as ReLU, and is thus able to form very deep networks.
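Equation (3) can be written out as a short numpy forward pass. This is a minimal sketch for illustration (no training, batching, or batch normalization), with ReLU as the activation σ:

```python
import numpy as np

def indrnn_forward(x, W, u, b):
    # x: (T, M) input sequence; W: (N, M) input weights;
    # u: (N,) per-neuron recurrent weights; b: (N,) bias.
    # h_t = relu(W x_t + u * h_{t-1} + b): the elementwise product u * h
    # is the key difference from a vanilla RNN, which uses a full
    # recurrent matrix and so couples the neurons together.
    h = np.zeros(W.shape[0])
    for t in range(x.shape[0]):
        h = np.maximum(0.0, W @ x[t] + u * h + b)
    return h
```

Because each neuron only sees its own previous state, constraining the scalar |u_i| directly bounds the per-step gradient of that neuron, which is what makes very long sequences tractable.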
In this paper, we propose to use a deep dense IndRNN as the main classification model. The diagram of the proposed dense IndRNN model is shown in Fig. 5(b), and the detailed illustration of each dense layer and dense block is shown in Fig. 5(a). The overall architecture follows [33]. It consists of three dense blocks with 8, 6 and 4 dense layers, respectively, and each dense layer contains two IndRNNs as shown in Fig. 5(a). Batch normalization is used after each IndRNN layer to accelerate training. The dense architecture concatenates the feature outputs from all the previous dense layers in a dense block as the input for the next dense layer, which facilitates the feature reuse of the relatively shallow layers. After each dense block, a transition block with one IndRNN layer follows to compress the features as a bottleneck, where the outputs are usually reduced to half of the input features. Finally, a classifier with one linear function and softmax activation is applied at the last time step for the final classification.
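The dense-block wiring can be illustrated with a framework-agnostic sketch. Here each "layer" is an arbitrary callable standing in for the pair of IndRNNs, and the halving transition is mimicked by slicing; both simplifications are ours, not the paper's implementation.

```python
import numpy as np

def dense_block(x, layers):
    # x: (T, C) feature sequence. Each dense layer consumes the
    # concatenation of the block input and all previous layer outputs
    # and appends its new features (DenseNet-style feature reuse).
    feats = x
    for layer in layers:
        new = layer(feats)               # stands in for two IndRNN layers
        feats = np.concatenate([feats, new], axis=-1)
    return feats

def transition(feats):
    # Bottleneck sketch: compress to half the channels. In the real model
    # a single IndRNN layer performs this compression.
    return feats[:, : feats.shape[-1] // 2]
```

With a growth rate g and k layers, a block turns C input channels into C + k·g channels, which the transition then halves.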
The cross-entropy loss is used as the objective function for training:

L = - Σ_i t_i log(p_i)        (4)

where t_i is an indicator variable that equals 1 when i is the ground-truth class of the sample and 0 otherwise, and p_i is the predicted probability of class i for this sample. The categorical cross-entropy has been widely used for classification.
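For a single sample, Equation (4) reduces to the negative log-probability assigned to the true class, e.g.:

```python
import numpy as np

def cross_entropy(p, t):
    # p: predicted class probabilities (softmax output); t: one-hot target.
    # Only the true-class term survives the sum, giving -log(p_true).
    return -np.sum(t * np.log(p))

# A uniform prediction over two classes costs log(2) ≈ 0.693.
loss = cross_entropy(np.array([0.5, 0.5]), np.array([1.0, 0.0]))
```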

Transfer Learning for Post-processing
With the above preprocessing, short-term feature extraction and long-term IndRNN-based recognition, different activities can be classified. However, considering that the smartphone can be placed anywhere by the user, such as in the hand, in a bag or in a pocket, the sensor data can differ greatly. Directly classifying sensor data captured from different locations can be difficult, and the most appropriate features for classification may also differ across locations. Therefore, in view of these differences, the location of the sensor data is recognized first. Then, at test time, we can pinpoint the location of the data and use an appropriate model for classification. In this process, the labels of the sensor data are changed to the locations of the sensors. A simple plain IndRNN model stacking 6 IndRNN layers is used for this classification.
The location recognition result in terms of the confusion matrix is shown in Fig. 6(a), where four locations are used, including bag, hips, torso and hand. It can be observed that while different locations can be recognized with a relatively good accuracy, there is still some confusion among different classes, especially between bag and hand, and between hips and torso. If the locations are grouped into two groups (bag and hand; hips and torso), the classification of the two groups can be very accurate, as shown in Fig. 6(b). This indicates that the features of the data within each group can be similar, while the features from different groups are very distinguishable. Therefore, in the proposed scheme, a group-based location recognition is used, where the data is first classified into one of the two groups and then further recognized as different activities. Note that in the SHL dataset used in the experiments, all the data from the test set comes from one unknown location; it is thus first classified to one location group, and only one model is constructed for this recognized location group.
On the other hand, due to the limitation of the dataset, which only contains data from three users (although with a large amount of data: 196072 samples), transfer learning is used to quickly generalize the model to different users. In the SHL dataset, only user1 is used as training data, a portion of the data from the other two users is used as validation data, and the remaining data from user2 and user3 is kept for testing. To fully take advantage of the validation data (which is allowed in the challenge), the validation data is first split, and part of it is used to transfer the model learned on the training data of user1 to the test data of user2 and user3. For simplicity, the learned model is directly fine-tuned on the transfer data. The most common way of transfer learning is to use one half of the validation set as the transfer training set and the other half as the transfer validation set. However, in this challenge, splitting the validation set directly into parts may lead to over-fitting because the labels of the validation set are unevenly distributed, as shown in Figure 7. Therefore, the data with the same labels is first stacked together and then divided such that a similar proportion of data from all the classes is assigned to the transfer training set and the transfer validation set for the transfer learning process.
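The label-balanced split described above amounts to a stratified split. A minimal sketch, where the function name and the 50/50 ratio are our assumptions:

```python
import numpy as np

def stratified_split(labels, ratio=0.5, seed=0):
    # Group sample indices by label, then split each group with the same
    # ratio so both parts contain a similar proportion of every class.
    rng = np.random.default_rng(seed)
    part_a, part_b = [], []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        rng.shuffle(idx)
        cut = int(len(idx) * ratio)
        part_a.extend(idx[:cut])
        part_b.extend(idx[cut:])
    return np.array(part_a), np.array(part_b)
```

Swapping the two returned index sets gives the second transfer configuration (TransferA vs. TransferB) described below.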
When conducting the transfer learning process, using the first half versus the second half of the original validation set for training leads to different accuracies because of the limited size of the validation set. Accordingly, we further swap the transfer training and transfer validation sets to learn two models, denoted as TransferA and TransferB, and then fuse them together to take advantage of all the data. The diagram of the transfer learning is shown in Fig. 8.

SHL Dataset
The SHL dataset [14,36] is used for evaluation in this paper, which is also the dataset used in the SHL Challenge 2020. It was recorded over seven months in 2017 from three users (user1, user2 and user3). The aim of this dataset is to use machine learning methods and heuristics to recognize users' eight locomotion and transportation modes (Still, Walk, Run, Bike, Bus, Car, Train and Subway). The smartphones used to collect the data were placed at four locations on the body (Bag, Hips, Torso and Hand). The dataset aims to realize user independence and location independence. To be specific, the training set contains 272 × 4 hours of data from the four locations of user1. The validation set consists of 40 × 4 hours of data from the four locations of user2 and user3 combined. The test set contains 160 hours of data of user2 and user3 from an unknown location (revealed to be Hips after the Challenge results were published).
The data is collected from 7 sensors, including the accelerometer, gyroscope, magnetometer, linear acceleration, gravity, orientation and ambient pressure, comprising a total of 20 channels. The sampling rate is 100 Hz, and all of the data is segmented into 5-second windows. Thus, the data sizes of the training set, validation set and test set are 196072 × 500, 28789 × 500 and 57573 × 500, respectively.

Training Setup
For training, Adam [37] is used for optimization. The learning rate of our model is initially set to 2 × 10^-4. To restrain the slightly larger fluctuation at the beginning of the training process, it is set to 2 × 10^-5 for the first 10 epochs as a learning rate warmup strategy. The learning rate is reduced by a factor of 10 once the validation accuracy does not increase (with a large patience of 100 epochs). A mini-batch size of 128 is used to train our model. The dense block configuration is set to (8, 6, 4); that is, the first, second and third dense blocks contain 8, 6 and 4 dense layers, respectively. This keeps a relatively similar number of neurons in each dense block. The growth rate is set to 48.
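The warmup-then-drop-on-plateau schedule can be sketched as a small helper. This is an illustrative reconstruction of the schedule described above, not the paper's training code; the class and method names are ours.

```python
class LrSchedule:
    # Warm up at lr/10 for the first `warmup` epochs, then reduce the
    # learning rate by 10x whenever validation accuracy has not improved
    # for more than `patience` epochs.
    def __init__(self, lr=2e-4, warmup=10, patience=100):
        self.lr, self.warmup, self.patience = lr, warmup, patience
        self.best, self.wait = -1.0, 0

    def step(self, epoch, val_acc):
        if epoch < self.warmup:
            return self.lr / 10            # warmup phase
        if val_acc > self.best:
            self.best, self.wait = val_acc, 0
        else:
            self.wait += 1
            if self.wait > self.patience:
                self.lr /= 10              # drop on plateau
                self.wait = 0
        return self.lr
```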

In our model, ReLU is applied as the activation function. Compared to the tanh and sigmoid functions, it not only reduces the amount of computation but also helps to alleviate the gradient vanishing problem. In order to reduce over-fitting, dropout is applied after the input (0.5), each dense layer (0.5), each bottleneck layer (0.1) and each transition layer (0.3).

Evaluation
The final performance is evaluated using the F1 score, which better reflects the performance when the distribution of the data is imbalanced among the classes. Traditionally, the F1 score is used to evaluate binary classification and is defined with precision and recall as follows:

Precision = TP / (TP + FP),  Recall = TP / (TP + FN)        (5)

F1 = 2 × Precision × Recall / (Precision + Recall)        (6)

where TP denotes true positives, TN true negatives, FP false positives and FN false negatives. Precision assesses how many of the samples predicted as positive are truly positive, while recall assesses how many of the truly positive samples are successfully predicted as positive. In multi-class classification, the precision and recall are calculated for each class separately, and the overall precision, recall and F1 score are obtained by averaging the per-class values.
In the SHL Challenge, since the location is unknown, location recognition is first performed to recognize the location of the test set. In this paper, since the location has already been reported, the validation data from the known location (Hips) is used for validation. It is observed that there is no large difference between using a group-based location and a specific location. Moreover, in practical applications, we argue that the location is always unknown and the group-based location may better describe the data, as shown in Fig. 7.
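The per-class computation and macro averaging in Equations (5) and (6) can be written out directly (equivalent to scikit-learn's `f1_score` with `average='macro'`):

```python
import numpy as np

def macro_f1(y_true, y_pred, num_classes):
    # Per-class precision/recall from TP/FP/FN counts, then the
    # unweighted mean of the per-class F1 scores (macro F1).
    f1s = []
    for c in range(num_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return float(np.mean(f1s))
```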

Ablation Studies on Models, Augmentation and Learning Rates
Firstly, three different model architectures are evaluated, including the plain IndRNN, the residual IndRNN and the dense IndRNN. The results on the test set are shown in Table 2, and the confusion matrices are shown in Fig. 9. It can be seen that the dense IndRNN performs best. Therefore, in the following experiments, the dense IndRNN is used as the baseline model.
On the other hand, feature augmentation is also explored in the proposed method. In addition to the input data and the features at each time step, this paper augments the input of the network and of its deeper layers with the temporal difference. This augmentation can be viewed as a form of optical flow as used in video-based classification tasks; it provides first-order change information for better processing. The result is also shown in Table 2, and it can be seen that the feature augmentation improves the performance.
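The temporal-difference augmentation can be sketched in one line with `np.diff`; the zero-padding of the first step is our assumption for keeping the sequence length unchanged:

```python
import numpy as np

def augment_with_diff(feats):
    # feats: (T, D) per-window feature sequence. Appends the first-order
    # temporal difference (repeating the first row, so the diff at t=0
    # is zero) as extra channels, doubling the feature dimension.
    diff = np.diff(feats, axis=0, prepend=feats[:1])
    return np.concatenate([feats, diff], axis=-1)

x = np.random.randn(21, 80)        # 21 windows of 80 features
aug = augment_with_diff(x)         # shape (21, 160)
```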
Considering the large differences between the training data and the validation/test data (from different users), the learned model tends to overfit when the learning rate is too small. Therefore, the effects of different learning rates on the final performance are further studied. The results are shown in Fig. 10. It can be seen that the network performs similarly over a wide range of learning rates. The learning rate is set to 8 × 10^-5 in the experiments.

Transfer Learning
The dense IndRNN model trained above with the feature augmentation and selected learning rate is used for transfer learning [38] to further improve the performance on the final test dataset, as described in Subsection 3.5. The learning rate in the transfer learning is empirically set to 2 × 10^-5. In this paper, simple fine-tuning of the model on the transfer learning sets is used. The result is shown in Table 3. It can be seen that after transfer learning, the accuracy increases to 80.72%, which means that cross-user transfer learning is useful for testing on data from different users. It is noticed that the performance of the TransferB model is better than that of TransferA, which is due to the uneven distribution of the two transfer learning datasets. By comparing the confusion matrices before and after transfer learning shown in Fig. 9 and Fig. 11, it can be further observed that the recognition accuracy increases considerably for most classes (except Bike and Bus). For "Still", transfer learning brings an accuracy improvement of around 6%, eliminating the mistakes of being predicted as "Bike", "Car" or "Bus". For "Walk", the accuracy increases by around 3%, mainly reducing the confusion with Train and Subway. Moreover, the accuracy improvement for "Run" is significant, from 43% to 94%. Before transfer learning, over 40% of the "Run" samples are predicted as "Bike", while afterwards this is largely corrected, indicating that the activity "Run" is strongly user-dependent. The recognition accuracies of three motor-powered activities, including "Car", "Train" and "Subway", have also been improved, while "Bus" slightly decreases and is misclassified as "Car". While the proposed method achieves a relatively high performance on the other locomotion activities, the accuracies of the four motor-powered activities are still relatively low due to their strong similarities. Methods for distinguishing the small differences among them are highly desired, which will be investigated in the future.

Comparison with State-Of-The-Art Classification Methods
The proposed method is further compared with the existing methods [39][40][41][42][43][44][45][46][47][48][49][50]. The results are shown in Table 4, including comparisons with existing machine learning and deep learning methods. It can be seen that the proposed IndRNN-based long-term temporal recognition greatly improves the performance over the other single-model-based machine learning and deep learning methods. However, it is slightly worse than the model fusion method of DenseNetX + GRU [39] (the first place of the SHL Challenge 2020), which fuses CNN and RNN models together and also fuses the features of each sensor processed individually. This indicates that the spatial processing and the effective combination of all the sensors may be important for the recognition. On the other hand, the proposed IndRNN model can also be equipped with enhanced spatial processing and sensor combination to further improve the performance, which will be studied in the future.

Conclusion
In this paper, we have presented a framework combining short-term spatial/frequency feature extraction and a long-term IndRNN model for smartphone-sensors-based activity recognition. The short-term spatial and frequency domain features are first extracted, with the Fourier transform used to deal with the periodic nature of the sensor data. Together with the conventional statistical features, the FFT amplitude spectrums and the statistical features of the FFT spectrums are extracted to characterize the data of a short-term window. Then a dense IndRNN model is further developed to learn the long-term temporal features on top of the short-term spatial and frequency domain features. Finally, transfer learning is adopted in the experiments to realize user independence, which further improves the performance on the test set. Experiments show that our model achieves an accuracy of 80.72% on the SHL dataset, which is better than the existing single-model-based methods.

Figure 2. Derotation of coordinates from the smartphone coordinate system to the NED (North-East-Down) coordinate system.

Figure 4. Example FFT amplitude spectrums from one segmented window of different classes.
Figure 5. Illustration of the proposed dense IndRNN structure: (a) the structure of the dense IndRNN layer and dense IndRNN block; (b) the architecture of the dense IndRNN model.
(a) Confusion matrix of the location recognition on the validation set with four locations. (b) Confusion matrix of the location recognition on the validation set with two groups: Bag and Hand; Hips and Torso.

Figure 6. Confusion matrices of the location recognition on the validation set.
(a) Distribution of user2's labels over the validation set. (b) Distribution of user3's labels over the validation set.

Figure 7. Distribution of labels over the validation set.

Figure 8. Diagram of the fused transfer learning.

Figure 10. Illustration of using different learning rates.
(a) The TransferA model. (b) The TransferB model. (c) The final model.

Figure 11. Confusion matrices of the different transfer models.

Figure 1. Framework of the proposed method.

Table 2. Results on using different model architectures and augmentation.

Table 3. Results of the different transfer learning models.

Table 4. Results of the proposed method in comparison with the existing methods.