In this section, we evaluate our framework on the joint activity and position recognition task. The experimental datasets are introduced first, followed by the experimental settings, including the basic evaluation rules and optimization hyper-parameters. After that, the benefits of the two main components of our framework, coordinate transformation and joint recognition with the MTL strategy, are reported separately.
The two components of our framework are evaluated on two datasets individually. These datasets cover human activities and smartphone positions that have been widely investigated in the literature.
To verify our solution to the smartphone orientation variation problem, we collected a dataset labeling the sensor data with smartphone orientations, positions, and human activities. The dataset contains motion sensor data of 4 activities from 2 subjects: walking, ascending stairs, descending stairs, and running. For each activity, 3 different Android devices were placed at different body positions: a Huawei Nova 2s in a pants pocket, a Samsung A9 Star held in hand, and a XiaoMi RedMi 4X inside a backpack pocket. The smartphones in the pants pocket and hand were placed in 4 orientations, and 8 orientations were labeled when a smartphone was stored in the backpack. In total, each activity was measured for around 80 min at a sampling frequency of 50 Hz. The sensor types include acceleration, gyroscope, magnetic field, gravity, linear acceleration, and rotation vector. Although this dataset covers a relatively small number of activities, on-body positions, and users, it is still sufficient for testing algorithms that consider smartphone orientation in the recognition task, because the smartphone orientation variation problem is not affected by these other factors.
To evaluate our joint recognition strategy, we exploited the RealWorld HAR dataset [6], which collected the sensor data of 8 activities at 7 on-body positions from 15 volunteers. To the best of our knowledge, this dataset includes the most smartphone positions so far. The experimental data was collected from eight males and seven females with different physical characteristics (varying in age and weight). The activities performed by each subject include climbing downstairs (A1), climbing upstairs (A2), jumping (A3), lying (A4), standing (A5), sitting (A6), running/jogging (A7), and walking (A8). Every user was equipped with a set of smartphones (Samsung Galaxy S4) and a smartwatch (LG G Watch R). These devices were located at seven different on-body positions: chest (P1), forearm (P2), head (P3), shin (P4), thigh (P5), upper arm (P6), and waist (P7). For each activity, the sensor data at the different on-body positions were collected concurrently at a sampling rate of 50 Hz. More importantly, the recorded videos show that the data was collected in a real-world setting. For instance, users could stroll in the city or jog in the forest, and movements were performed in the users' preferred ways, such as walking at different speeds, or sitting/standing while eating or holding the phone. However, in this dataset the subjects commonly placed the smart devices in fixed orientations, and the rotation vector sensor is unavailable. Therefore, it cannot be used to verify our assumption about coordinate transformation.
4.3. Optimization Hyper-Parameters
A machine learning model requires several hyper-parameters for the optimization procedure. Normally, hyper-parameters are selected from a set of candidates based on prior experience or with systematic methods such as grid search. However, due to the large computational demands of deep learning architectures, such a hyper-parameter search is infeasible here.
In this study, we manually set the same hyper-parameters for all recognition models. The maximum number of training epochs was set to 1000. In each epoch, the recognition model was trained over several iterations, where each iteration processes one batch of training data with batch size 64. To minimize the joint cost function, we applied stochastic gradient descent using the Adam optimizer [45] with a fixed learning rate. The gradient was clipped by limiting its global norm [46] to 5. Meanwhile, dropout was applied to the learned representation before the output layer. We used early stopping to select the best model: when the loss on the test dataset no longer decreased for more than 50 epochs, we selected the model with the minimum test loss.
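The early-stopping rule above can be sketched as follows; this is an illustrative snippet, not the authors' code, and the class name `EarlyStopper` is ours:

```python
class EarlyStopper:
    """Stop training once the test loss has not improved for `patience` epochs."""

    def __init__(self, patience=50):
        self.patience = patience        # epochs to wait without improvement
        self.best_loss = float("inf")   # minimum test loss seen so far
        self.best_epoch = -1            # epoch at which it was observed
        self.bad_epochs = 0             # consecutive non-improving epochs

    def step(self, epoch, test_loss):
        """Record one epoch; return True when training should terminate."""
        if test_loss < self.best_loss:
            self.best_loss = test_loss
            self.best_epoch = epoch     # a checkpoint would be saved here
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs > self.patience
```

In each of the (at most 1000) epochs, `step` would be called with the current test loss; the checkpoint stored at `best_epoch` is the model finally selected.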
4.4. Coordinate Transformation
In this section, the coordinate transformation preprocessing is evaluated on the collected dataset. To this end, the acceleration data before and after coordinate transformation are employed separately as model input. The baseline performance was established with a 10-fold cross validation, where the training and test data are Independent and Identically Distributed (i.i.d.). Moreover, a Leave-one-orientation-out cross validation was taken as the main evaluation mode, in which the training and test data are sampled from mutually exclusive orientations. Additionally, the coordinate transformation method was quantitatively evaluated with different window sizes of acceleration data.
The benefit of coordinate transformation can be observed in Figure 8, which shows the F1-scores of single-task models for activity and position recognition. The input instances of these models are original or transformed acceleration data within 10-second windows. Using the original acceleration data without coordinate transformation, the recognition models produce a strong baseline under 10-fold cross validation. However, the performance under Leave-one-orientation-out cross validation decreases dramatically. As discussed in Section 3.2, this phenomenon is caused by the smartphone orientation variation problem, which makes the test samples Out-of-Distribution. After coordinate transformation, there is no longer any apparent gap between the two evaluation modes. Evidently, coordinate transformation improves the generalization ability of the models on test data collected from unseen smartphone orientations. This is mainly achieved by transforming all acceleration data into a unified earth coordinate system, so that the data of all orientations follow an identical distribution. Although our dataset contains a relatively small number of activities and positions, these results are still persuasive, because the coordinate transformation technique is independent of activities, positions, and subjects.
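The transformation itself amounts to rotating each device-frame acceleration sample into the earth frame using the orientation reported by the rotation vector sensor. A minimal sketch, assuming the sensor supplies a unit quaternion q = (w, x, y, z) describing the device-to-earth rotation (component order and sign conventions vary across platforms):

```python
import numpy as np

def quat_rotate(q, v):
    """Rotate a device-frame vector v into the earth frame using the unit
    quaternion q = (w, x, y, z) from the rotation vector sensor."""
    w, x, y, z = q
    # Standard quaternion-to-rotation-matrix conversion.
    R = np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])
    return R @ np.asarray(v, dtype=float)

def to_earth_frame(acc, quats):
    """Transform a window of acceleration samples (N, 3) using the
    per-sample quaternions (N, 4) recorded alongside them."""
    return np.stack([quat_rotate(q, a) for q, a in zip(quats, acc)])
```

Because each sample carries its own orientation estimate, the transformed window no longer depends on how the phone was held or stored.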
The results in Figure 9 evaluate the impact of window size with respect to coordinate transformation. All results were produced under the Leave-one-orientation-out cross validation, in which the acceleration data under the earth coordinate system are taken as input. Given acceleration data with different window sizes, the total number of data instances varies accordingly. In general, the longer the window size, the higher the recognition performance. Interestingly, the models present different tendencies as the window size progressively shrinks: the performance of CNN and MLP tends to decline with smaller window sizes, whereas the performance of LSTM improves as the sequence length shortens, reaching its highest score at the shortest window size.
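For reference, fixed-length windows such as those varied in Figure 9 can be cut from a 50 Hz stream as follows; this is an illustrative sketch and the parameter names are ours:

```python
import numpy as np

def segment(stream, fs=50, window_s=10.0, overlap=0.5):
    """Cut a (T, channels) sensor stream into fixed-length windows.
    fs: sampling rate (Hz); window_s: window length (s); overlap: fraction
    of overlap between consecutive windows. Returns (n, win, channels)."""
    win = int(fs * window_s)
    step = max(1, int(win * (1.0 - overlap)))
    n = (len(stream) - win) // step + 1
    return np.stack([stream[i * step : i * step + win] for i in range(n)])

# e.g. one minute of tri-axial data at 50 Hz with 10 s half-overlapping windows
windows = segment(np.zeros((3000, 3)))
```

Shrinking `window_s` multiplies the number of instances while shortening the sequence each model sees, which is the trade-off the figure examines.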
4.5. Multi-task Learning for Joint Recognition
We evaluated our joint recognition approach on the RealWorld HAR dataset [6], which contains more human activities, smartphone positions, and subjects. To make a fair comparison, we adopted the same settings as the original research [6], segmenting the data with a one-second-long window and 50% overlap.
The performance of our solution was evaluated with three types of experiments. First, we contrasted our results with the original results, the state-of-the-art solution [6], which recognizes dynamic and stationary activities by employing smartphone positions as prior knowledge. Second, to show the advantage of our method, we added an extra related task of identifying smartphone users from human movement. Finally, considering that recognizing smartphone positions and users from dynamic movement is more reasonable, we applied our solution to dynamic activities only, to evaluate actual use.
To further examine how MTL leverages useful information from related tasks, the experiments were conducted with two types of learning strategies. Taking the results in Table 1 as an example, for each type of model, ST and MT respectively denote the single-task model and the MTL-based joint recognition model, and the number of task-specific layers (hidden layers of the classifiers) is marked at the end of the learning type. For instance, ST-3L represents a single-task model using a backbone network for feature extraction and a 3-layer classifier for prediction, while MT-0L and MT-3L denote multi-task models using 0 and 3 task-specific layers, respectively. In other words, MT-0L uses the shared representation directly as the input for multi-task recognition, whereas MT-3L further transforms the shared feature vector into multiple task-specific features, with 3 simple non-linear hidden layers per task. Additionally, S-A/P refers to the results of the subject-specific and activity/position-specific models used by Sztyler et al. [6].
Table 1 presents the F1-scores of the models when the sensor data are collected from both stationary and dynamic activities. The performance of our models surpasses the original results significantly. Our best F1-scores for activity and position recognition are much higher than the original ones, even with a simple MLP model of the kind also used by Sztyler et al. [6
]. Moreover, based on the MTL strategy, using one global model for joint recognition reduces the number of required models compared with S-A/P. The S-A/P approach is a position-aware HAR solution whose best performance was produced by stacking three levels of Random Forest classifiers: first, dynamic and static activities were distinguished by the first-level classifier; next, the second-level classifier identified where the smartphone was placed on the body; finally, the third level comprised a set of activity recognition classifiers, each belonging to a specific position. Meanwhile, all of their results were reported with subject-specific models, where a position-aware model was trained for each subject on the data of all activities and positions.
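For concreteness, the shared-backbone structure implied by the MT-3L naming, one shared feature extractor feeding several task-specific heads, can be sketched as follows; the layer sizes and names are illustrative, not the paper's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def dense(in_dim, out_dim):
    # One fully connected layer; small random init for illustration only.
    return rng.normal(0.0, 0.1, (in_dim, out_dim)), np.zeros(out_dim)

def forward(x, layers):
    # ReLU on the hidden layers, linear output on the last layer.
    *hidden, (W_out, b_out) = layers
    for W, b in hidden:
        x = np.maximum(x @ W + b, 0.0)
    return x @ W_out + b_out

# One shared backbone (feature extractor)...
backbone = [dense(6, 32), dense(32, 32)]
# ...and one 3-layer head per task, mirroring the MT-3L naming.
heads = {
    "activity": [dense(32, 16), dense(16, 16), dense(16, 8)],  # 8 activities
    "position": [dense(32, 16), dense(16, 16), dense(16, 7)],  # 7 positions
}

x = rng.normal(size=(4, 6))                     # a batch of 4 input vectors
shared = forward(x, backbone)                    # shared representation
outputs = {name: forward(shared, head) for name, head in heads.items()}
```

An MT-0L model would instead feed `shared` straight into each task's output layer, and an ST model would train a separate backbone per task.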
Although the models maintain a certain degree of accuracy, their performance can still be further improved, because the sensor data of stationary activities (e.g., lying (A4), standing (A5), and sitting (A6)) seem to confuse the classifiers. Indeed, identifying smartphone positions or users from stationary activities is not reasonable; in such cases, the models still work because the gravity measurements contained in the acceleration data might be informative. Therefore, to observe the feasibility clearly, we further evaluated our models on dynamic activities only.
As shown in Table 2, the performance of all tasks improves when only the dynamic activity data are employed. In particular, activity recognition and user identification with LSTM-MT-3L reach the highest F1-scores. Meanwhile, the best average F1-scores are all higher than the corresponding results in Table 1. This suggests that jointly mining human activity, smartphone position, and user information from dynamic movement is more reasonable in practice, which is in line with intuition.
The benefit of MTL can be observed by contrasting the single-task models with the joint recognition models. On the whole, the average F1-scores of the joint models are slightly higher than those of the single-task models, except for LSTM-MT-3L in Table 1. For specific tasks, such as user identification, performance is consistently improved under the joint models, with the largest improvement observed in Table 1. Even though the MTL strategy cannot improve the performance of every task, it works well for reducing computation demand and latency by efficiently sharing parameters: in contrast with the single-task models, the joint model employs only one backbone network yet achieves comparable results. Additionally, adding task-specific layers has an obvious effect on the results. As can be seen, the performance of the MT-0L learning type in Table 2 is not optimal, because on a small dataset the shared multi-task representation mainly plays a regularizing role.
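Finally, the joint cost function minimized during training (Section 4.3) can be sketched as a sum of per-task cross-entropies; equal task weights are our assumption, as the exact weighting is not restated here:

```python
import numpy as np

def softmax_xent(logits, labels):
    """Mean cross-entropy for one task; labels are integer class ids."""
    z = logits - logits.max(axis=1, keepdims=True)          # numerical stability
    logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True)) # log-softmax
    return -logp[np.arange(len(labels)), labels].mean()

def joint_cost(task_logits, task_labels):
    """Joint cost over all tasks: an unweighted sum of the per-task
    cross-entropies (equal task weights are an assumption here)."""
    return sum(softmax_xent(task_logits[t], task_labels[t])
               for t in task_logits)
```

Because every task's loss backpropagates through the same backbone, the shared representation is pulled toward features useful for all tasks at once, which is the regularizing effect noted above.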