Investigating the Impact of Possession-Way of a Smartphone on Action Recognition

For the past few decades, action recognition has been attracting many researchers due to its wide use in a variety of applications. Especially with the increasing number of smartphone users, many studies have been conducted using sensors within a smartphone. However, a lot of these studies assume that the users carry the device in specific ways such as by hand, in a pocket, in a bag, etc. This paper investigates the impact of providing an action recognition system with the information of the possession-way of a smartphone, and vice versa. The experimental dataset consists of five possession-ways (hand, backpack, upper-pocket, lower-pocket, and shoulder-bag) and two actions (walking and running) gathered by seven users separately. Various machine learning models including recurrent neural network architectures are employed to explore the relationship between the action recognition and the possession-way recognition. The experimental results show that the assumption of possession-ways of smartphones does affect the performance of action recognition, and vice versa. The results also reveal that a good performance is achieved when both actions and possession-ways are recognized simultaneously.


Introduction
With the advances in technology, a lot of research efforts have been put into developing autonomous systems for convenient human lifestyles [1][2][3]. Developing such systems involves effective modeling of human behavior as the systems are required to conduct many things for humans. Action recognition is one of these techniques, and has been utilized in diverse applications such as health-care monitoring systems [4,5] and surveillance systems [6,7].
Although there are many existing studies on developing action recognition systems using wearable sensors, most of them are not so practical as the wearers may find it troublesome to wear and carry the sensors around with them. In order to meet practical needs, along with the growing number of smartphone users, many studies in action recognition have been conducted using smartphones, which are equipped with various types of sensors such as accelerometers, gyroscopes, light sensors, etc.
Since many users carry their phones closely with them all the time, utilizing the devices for action recognition seems more practical than wearing extra sensors.
However, many existing works in action recognition using smartphones are somewhat limited in the sense that they commonly assume that the users possess or carry the devices in certain ways, for example, by hand, in a pocket, in a bag, etc. This assumption is difficult to hold in the real world as different users carry their phones in different ways. Furthermore, even the same user may carry the phone differently in various situations. For example, a user may prefer to use the phone while walking, in which case he/she holds the phone with a hand; however, while running, the user may prefer to put the device in a pocket or a bag.
In this paper, we extend our previous work [8] to investigate the relationship between the action recognition and possession-way recognition using smartphones, adopting a larger dataset than the previous work and additionally utilizing state-of-the-art algorithms that are designed to learn the temporal dependencies among the sensed data. Seven users were recruited to gather the sensor data separately; they were asked to perform two actions (walking and running) while carrying their phones in five different ways (by hand, in a backpack, in an upper-pocket, in a lower-pocket, and in a shoulder-bag).
Experiments with the larger dataset confirm our previous findings that simultaneously recognizing both actions and possession-ways improves the overall performance. In addition, we closely investigate how the performance varies in accordance with the length and the number of time intervals (windows) that features are computed for, as well as with various machine learning algorithms.
The remainder of the paper is organized as follows. Section 2 outlines the background for this study, while Section 3 describes the proposed approaches in detail. We illustrate the experiments and present the results in Section 4, and discuss the experimental findings in Section 5. Finally, the paper is concluded in Section 6 with some directions for future work.

Background
Over the past few decades, a lot of studies have been conducted in the field of action recognition. These works can be arranged into three categories according to the form of data they employ: (1) images [9][10][11]; (2) videos [12][13][14]; and (3) a variety of sensor data [15][16][17][18][19][20]. This paper comes under the third category: action recognition using sensor inputs. Typically, the input streams utilized in such works are obtained from wearable sensors in the form of wrist-type, pad-type, or necklace-type sensors. Such studies that make use of these wearable sensors have demonstrated high accuracy in action recognition.
However, one major drawback of using these wearable sensors in everyday application is that the users often feel uncomfortable wearing them. In addition, having to remember to wear the devices can be troublesome as well. This issue is especially critical in the health-care domain; an effective sensor device does not only refer to a device with technical and clinical advantage, but also the one that the end-consumers find acceptable to wear [21]. In other words, designing a good wearable sensor requires a careful assessment of user wearability (usability), which is another important topic but quite different to the topic of improving the technical performance of sensor applications.
In this work, we instead focus on an alternative device that most people do not live without: the smartphone. In South Korea, it is estimated that 83.0% of Koreans own a smartphone, and on average, spend 3 h and 39 min a day purely on using smartphones [22]. Smartphones have become so integral to our life that most people find it natural to carry them around all the time. Furthermore, as they are equipped with various types of sensors (e.g., accelerometers, gyroscopes, proximity sensors, etc.), exploring diverse combinations of sensor inputs is possible.
As a matter of fact, much research has been done on action recognition using the sensors of smartphones over the past few years. For example, Dai et al. [23] proposed a pervasive fall detection system on mobile phones called PerFallD, utilizing the accelerometer sensor and magnetic field sensor of smartphones, and an additional magnetic field accessory.
More recently, He et al. [24] developed another fall detection system solely on smartphones, which notifies the caregivers of the fall accidents through Multimedia Messaging Service (MMS) containing a map of suspected location and time. Using the built-in tri-accelerometer, their system classifies five body motions: vertical activity, lying, sitting or standing, horizontal activity, and fall. However, it assumes that the smartphone is mounted on the user's waist.
Song et al. [25] analyzed users' daily behaviors in terms of movements (e.g., sit down, run), actions (e.g., phone call, read mail), and situations over time (e.g., home, car, subway). An HMM (Hidden Markov Model)-like model is trained from users' activities over a series of days, and utilized to extract behavior patterns by time-movement correlation, time-action correlation, etc.
The aforementioned studies have shown the possibility of action recognition using the smartphones. However, they commonly assume that the possession-way of a smartphone is fixed, for example, by hand, in a pocket, in a bag, etc. In reality, users carry their phones differently in various situations; thus, it is necessary to develop action recognition systems that are invariant to the possession-ways. Inspired by such a viewpoint, this paper investigates the following questions:

1.
"How much impact does the information of possession-way of smartphones have on the action recognition?"

2.
"Likewise, how much does the recognized action influence the recognition of the possession-way of a smartphone?"

3.
"If each task does indeed influence the other, which one should be carried out first (performance-wise)?"

To the best of our knowledge, our published conference paper [8] was the first study to delve into these questions. We hope that our findings can help the other related tasks that involve action recognition using smartphones, as the results provide information in dealing with the possession-ways of the device.

Overall Approach
The purpose of this study is to investigate the relationship between the action recognition and the possession-way recognition using the smartphones. To do so, we propose three experimental approaches, and compare the results:

1.
Conducting the possession-way recognition followed by the action recognition.

2.
Conducting the action recognition followed by the possession-way recognition.

3.
Conducting both of the recognition tasks simultaneously.
Let us assume that there are A actions and P possession-ways. Given an unseen piece of data X, these approaches aim to find which action is performed (action recognition), and how the smartphone is carried by the user (possession-way recognition). Figure 1 summarizes these approaches.
The first approach, possession-action recognition, consists of two steps. In the first step, it recognizes the possession-way of the unseen data X without considering the action. This step is simply a classification task over P classes. The second step is to recognize the action given the recognized possession-way, so it can be seen as a classification task over A classes. Note that if the recognized possession-way in the first step is incorrect, then it may deteriorate the performance of the action recognition in the second step.
The second approach, action-possession recognition, also has two steps. The first step refers to the classification over A classes, and the second step, the classification over P classes. If one wishes to develop only an action recognition system, then the second step is not necessary as the action is recognized in the first step.
The third approach, concurrent recognition, involves classifying the action and the possession-way simultaneously. Therefore, it is a one-step classification over A × P classes. One may argue that the action recognition using smartphones is inherently a challenging task as it involves dealing with the direction or the angle of the smartphone carried by the user. Fortunately, advances in hardware have equipped the phones with gyroscope sensors which compute the angle of the device, making it easy to access the angular data.
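To make the three label spaces concrete, the following is a minimal sketch of how the A × P joint classes of the concurrent approach and the two-step possession-action pipeline could be wired together. The function names, the integer label encoding, and the idea of keeping one action classifier per possession-way are illustrative assumptions, not details taken from the study.

```python
from typing import Callable, Dict, Sequence, Tuple

Features = Sequence[float]
Classifier = Callable[[Features], int]  # hypothetical: maps features to a class id


def encode_joint(action: int, possession: int, num_possessions: int) -> int:
    """Map an (action, possession-way) pair to one of the A x P joint class ids."""
    return action * num_possessions + possession


def decode_joint(joint: int, num_possessions: int) -> Tuple[int, int]:
    """Invert encode_joint: recover (action, possession-way) from a joint id."""
    return divmod(joint, num_possessions)


def possession_then_action(
    x: Features,
    possession_clf: Classifier,
    action_clfs: Dict[int, Classifier],
) -> Tuple[int, int]:
    """Two-step possession-action recognition.

    Step 1 predicts the possession-way over P classes. Step 2 applies the
    action classifier associated with that possession-way (an assumption of
    this sketch), so a wrong Step 1 result selects the wrong Step 2 model.
    """
    possession = possession_clf(x)
    action = action_clfs[possession](x)
    return action, possession
```

With A = 2 and P = 5, `encode_joint` yields ten distinct class ids (0 to 9), which is the label space the concurrent approach classifies in one step. The action-possession pipeline would mirror `possession_then_action` with the roles of the two classifiers swapped.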

Feature Definitions
Let us denote the number of sensor dimensions to be S. The exact number may vary with different models of smartphones, as well as the versions of Android API (Application Programming Interface); we describe the details of the sensor types in Section 4.3.
Given a window size of W seconds, we compute the mean, minimum, maximum, and variance values for each window. Therefore, for each window, we have 4 × S distinct features for S sensor dimensions. The feature engineering is kept simple as the main focus is on the verification of the relationship between the action recognition and possession-way recognition using the smartphones.
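The per-window feature computation can be sketched as follows. This is only an illustration under stated assumptions: the function name is hypothetical, the features are grouped per sensor dimension, and the population variance is used because the text does not say whether the population or the sample variance was computed.

```python
from statistics import fmean, pvariance
from typing import List, Sequence


def window_features(window: Sequence[Sequence[float]]) -> List[float]:
    """Compute mean, minimum, maximum, and variance per sensor dimension.

    `window` holds the rows that fall inside one window; each row carries
    S sensor values. The result has 4 * S entries.
    """
    features: List[float] = []
    for column in zip(*window):  # one column per sensor dimension
        features.extend(
            [fmean(column), min(column), max(column), pvariance(column)]
        )
    return features
```

For a window of three rows with S = 2, the function returns 8 values; with S = 19 it would return the 4 × 19 = 76 features per window.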

Classification Algorithms
We employ five classification algorithms: naive Bayes (NB), random forests (RF), support vector machine (SVM), deep neural networks (DNN), and recurrent neural networks (RNN) to compare the performances of the three approaches. Note that, in our previous work, we utilized three classification algorithms: naive Bayes, decision trees (DT), and artificial neural networks with one hidden layer. However, we empirically found that the extended dataset of this study required more powerful machine learning algorithms to effectively learn the greater variances in the data gathered by the seven users. Therefore, we added RF instead of DT, along with SVM, DNN, and RNN models. The hyper-parameter settings of these algorithms, which were determined by grid searching, are as follows:
• NB: Gaussian.
• RF: 100 decision trees, Gini impurity.
• SVM: Liblinear, l2 penalty, hinge loss, tolerance of 0.0001.
The structures of the DNN and RNN models are depicted in Figure 2. The input layer of the DNN model takes n feature windows, each consisting of 4 × S distinct features as described in Section 3.2. Note that an experimental analysis on this parameter n is conducted in Sections 4.4-4.6, where we observe the changes in the performance of each approach when n is varied.
The DNN model has two fully connected hidden layers with a hyperbolic tangent (tanh) as the activation function. The dimensions of the first and the second hidden layers are n|W|/2 and n|W|/3, respectively. The softmax output layer classifies the instances as one of L class labels, where L can be A, P or A × P depending on the classification tasks.
The RNN model consists of a gated recurrent unit (GRU) [26] layer and two fully-connected layers. A GRU works in a similar fashion to its older cousin, long short-term memory (LSTM) units [27], in the sense that it adaptively updates or resets its memory content via a gating mechanism. Nevertheless, the GRU has a slightly simpler structure than LSTM, often reducing the overall time in training. For this reason, we chose to employ GRU, and it indeed converged faster than LSTM without sacrificing the performance. The GRU layer is structured in a many-to-one fashion, meaning that only the last hidden state is passed on to the next layer. Similar to the input dimension of the DNN model, n feature windows are fed in as n time steps. The fully connected hidden layer outputs a vector of n|W|/2 dimensions, which is, in turn, classified into one of A, P or A × P class labels.

Dataset Construction
As there is no publicly available dataset for this study, we have gathered a dataset by implementing an Android application that continuously logs the sensor values of a smartphone. We target two actions, walk and run, and five possession-ways of a smartphone: hand, backpack, upper-pocket, lower-pocket, and shoulder-bag (Figure 3). In our previous study [8], the dataset was collected for a single user only, using a Samsung Galaxy Nexus smartphone (Samsung Electronics, Suwon, South Korea); in this study, the new dataset is gathered by seven users, using various Android-based smartphones. We note that the seven participants were volunteers from the department of computer science. The participants had a wide range of physical attributes in terms of gender, age (23 to 33 years old), height (167 cm to 182 cm), and weight (45 kg to 101 kg), which we believed to be relevant factors in our tasks as these attributes may influence the frequency and amplitude of the gathered sensor data.
Identical to the previous study, each user performed each action for 10-11 min. However, in this study, each sensor value is recorded at the sampling rate of 40 Hz instead of 10 Hz.
The statistics of the raw dataset are described in Table 1 where the values represent the number of gathered samples; as each action is performed for 10-11 min by seven users, the number of gathered samples is approximately 40 × 60 × 10 × 7 for each action-possession combination.
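As a quick check of the approximation above, the product can be worked out explicitly. The second line is our own upper bound, obtained by assuming that every session lasted the full 11 min of the stated 10-11 min range.

```latex
\begin{align*}
40\ \text{Hz} \times 60\ \text{s/min} \times 10\ \text{min} \times 7\ \text{users} &= 168{,}000\ \text{samples} \\
40\ \text{Hz} \times 60\ \text{s/min} \times 11\ \text{min} \times 7\ \text{users} &= 184{,}800\ \text{samples}
\end{align*}
```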

Preprocessing the Dataset
Prior to generating features, the raw dataset has undergone three preprocessing steps. Firstly, the sensor values from the eight sensors are interpolated with 100 millisecond intervals. Although the sensors are programmed to measure values at 40 Hz, in reality, the measurements are not perfectly synchronized. Therefore, we take the initial time stamp of the very last sensor that begins to measure as our starting time stamp for all sensors. Similarly, the earliest final time stamp of a sensor is taken as the finishing time stamp for all sensors as well. Given the time intervals, linear interpolation is conducted for all sensors that measure continuous real values. The three sensors, light, proximity, and pressure, provide two discrete values, either 0 or a fixed integer smaller than 10. For the three sensors, the nearest neighbor approach is taken to fill the missing values in the time range.
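The alignment step can be sketched as below: the common time range runs from the latest first time stamp to the earliest last time stamp, continuous sensors are linearly interpolated, and discrete sensors take the nearest recorded value. The function names, the integer-millisecond time stamps, and the tie-breaking rule (ties go to the earlier sample) are assumptions made for this illustration.

```python
from bisect import bisect_left
from typing import Dict, List, Sequence, Tuple

# One sensor stream: (time stamp in ms, value) pairs sorted by time.
Series = Sequence[Tuple[int, float]]


def common_grid(streams: Dict[str, Series], step_ms: int = 100) -> List[int]:
    """Build the shared time line covered by every sensor stream."""
    start = max(series[0][0] for series in streams.values())   # latest start
    end = min(series[-1][0] for series in streams.values())    # earliest end
    return list(range(start, end + 1, step_ms))


def resample(series: Series, grid: Sequence[int], discrete: bool = False) -> List[float]:
    """Resample one stream onto `grid` (which must lie inside its time range)."""
    times = [t for t, _ in series]
    values: List[float] = []
    for t in grid:
        i = bisect_left(times, t)  # index of the first sample at or after t
        if times[i] == t:
            values.append(series[i][1])
            continue
        (t0, v0), (t1, v1) = series[i - 1], series[i]
        if discrete:
            # Nearest neighbour; ties are resolved toward the earlier sample.
            values.append(v0 if t - t0 <= t1 - t else v1)
        else:
            # Linear interpolation between the two surrounding samples.
            values.append(v0 + (v1 - v0) * (t - t0) / (t1 - t0))
    return values
```

Using integer milliseconds avoids floating-point drift when stepping through the grid; any other exact time unit would serve equally well.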
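Secondly, the raw values are normalized by min-max normalization where the values ($v_{\mathrm{raw}}$) are linearly transformed to fit a given range, $[r_1, r_2]$:
$$v_{\mathrm{norm}} = r_1 + \frac{(v_{\mathrm{raw}} - v_{\min})(r_2 - r_1)}{v_{\max} - v_{\min}}$$
The minimum and maximum values of each sensor ($v_{\min}$ and $v_{\max}$) are obtained by consulting the relevant materials in the Android API documentation.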
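Min-max normalization to a target range can be sketched as follows. The function name and the default range of [0, 1] are assumptions for illustration only; the range actually used in the study is not recoverable from the text.

```python
def min_max_normalize(
    v_raw: float,
    v_min: float,
    v_max: float,
    r1: float = 0.0,
    r2: float = 1.0,
) -> float:
    """Linearly map v_raw from [v_min, v_max] to [r1, r2].

    v_min and v_max are the sensor's documented limits, not the limits
    observed in the data. The default target range is an assumption.
    """
    if v_max <= v_min:
        raise ValueError("v_max must be greater than v_min")
    return r1 + (v_raw - v_min) * (r2 - r1) / (v_max - v_min)
```

Taking the limits from the sensor specification rather than from the recorded data keeps the scaling identical across users and devices that report the same sensor range.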
Lastly, we disregard the first and the last 15 s of the data, as users generally took such time to begin or end the data logging. At the end of preprocessing, the normalized values from the eight sensors (19 distinct values in total) are lined up on one coherent time line with a 100 millisecond interval.

Feature Generation
In Section 3.2, we explained that the four features (mean, minimum, maximum, and variance) are computed using the S sensor dimensions for each window W. We employ eight sensors that are equipped in a typical smartphone: light, proximity, pressure, gravity, accelerometer, linear accelerometer, gyroscope, and rotation sensors (we note that two sensors, orientation and magnetic, are no longer utilized in this study due to API mismatch between the Android application and the different phones). The light, proximity, and pressure sensors each generate a one-dimensional real value, while the rotation sensor generates four-dimensional real values. The remaining sensors produce three-dimensional real values. In total, we have S = 19 sensor dimensions, and 76 features (4 × 19) obtained based on the feature definition.
We experiment with varying the length of window |W| for which the 76 features are computed, and the window sizes are 0.3, 0.5, 1, 3, 5, and 7 s. For example, when |W| = 0.3, the features are computed using three rows of the data as each row is 100 milliseconds apart. The statistics of the generated feature samples are summarized in Table 2, where the values represent the number of samples after the feature generation. The number of generated feature samples decreases as the length of the window increases.
In addition to varying the length of the time window, we experiment with varying the number of windows while fixing the length of the window as 1 s (|W| = 1). We clarify that the number of windows refers to how many consecutive sets of generated features are taken as the input, while the length of the window represents how many seconds of the raw data are used to generate one set of features. Evidently, varying the number of windows will produce input vectors with different lengths. For each classification task, experiments are performed by setting the number of windows n as 1, 3, 5, and 7.
Besides the two experiments, we also look at the performance of each model when applied to a new user who has not been considered by the model in the training process, i.e., one-user-out cross validation (CV). The data gathered by six users are utilized in the training process, while the data from the remaining user are used as the testing data. Since there are seven users in total, we have seven rounds of validation, and the results are averaged for each model. Note that for the one-user-out CV, we set the length of window |W| to be 1 and the number of windows n to be 3.
Figure 4 illustrates two sample graphs of the linearized tri-accelerometer values of a user while (a) running and (b) walking with the phone placed in a lower-pocket. The graphs are plotted with 50 samples, where each sample represents the mean value of a 0.5 s interval. We can observe that the amplitude of running is greater than that of walking.
While differentiating the two actions from a single user seems to be apparent, the task gets harder when more users are involved. For example, some of the users were fast-walkers, and their patterns were closely akin to the running patterns of slow runners.
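The window length, the number of windows, and the one-user-out split described in this section can be sketched as follows. The helper names are hypothetical, and the choice of non-overlapping groups of n windows is an assumption; the text does not state whether consecutive inputs overlap.

```python
from typing import Dict, Iterator, List, Sequence, Tuple


def rows_per_window(window_seconds: float, row_interval_ms: int = 100) -> int:
    """Number of preprocessed rows covered by one window of the given length."""
    return round(window_seconds * 1000 / row_interval_ms)


def stack_windows(features: Sequence[Sequence[float]], n: int) -> List[List[float]]:
    """Concatenate n consecutive per-window feature vectors into one input.

    Groups do not overlap here (an assumption); leftover windows are dropped.
    """
    stacked: List[List[float]] = []
    for start in range(0, len(features) - n + 1, n):
        merged: List[float] = []
        for vector in features[start:start + n]:
            merged.extend(vector)
        stacked.append(merged)
    return stacked


def one_user_out(
    samples_by_user: Dict[str, list],
) -> Iterator[Tuple[str, list, list]]:
    """Yield (held-out user, training samples, test samples) for each user."""
    for held_out, test in samples_by_user.items():
        train = [
            sample
            for user, samples in samples_by_user.items()
            if user != held_out
            for sample in samples
        ]
        yield held_out, train, test
```

With 76 features per window, `stack_windows(features, 3)` produces input vectors of length 228, and seven users give seven validation rounds whose results are then averaged.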

Possession-Action Recognition
The first approach, possession-action recognition, consists of two steps: (1) possession-way recognition, and (2) action recognition. The five classification algorithms described in Section 3.3 are employed to compare the performances under fivefold (Tables 3 and 4) and one-user-out (Table 5) CV after shuffling the dataset. Table 3 presents the experimental results for the different lengths of the windows, while Table 4 shows the results for the different number of windows. The results for the one-user-out CV are presented in Table 5.
Step 1 and Step 2 in the tables indicate the accuracies of possession-way recognition and action recognition, respectively. For example, Step 2 (hand) shows the accuracies of action recognition when the possession-way is by hand.
We specify that the following abbreviations are used in the subsequent tables: support vector machine (SVM), random forests (RF), naive Bayes (NB), deep neural network (DNN), and recurrent neural network (RNN). Notably, under the fivefold CV, the random forests (RF) model works consistently well, outperforming the two deep learning approaches on almost all settings. However, as we have not sufficiently explored the many possible layouts and hyper-parameter settings of the deep learning models, we cannot conclude that one model is superior to another in terms of accuracy. Moreover, it is often acknowledged in the literature [28,29] that a machine learning model typically requires at least 10 times as many training samples as its degrees of freedom. As the deep learning models consisted of a greater number of weight parameters than the other machine learning models, many more data samples would have been necessary for effective learning.
Nevertheless, it is important to note that the RF model only took a few seconds to train, while the deep learning models took considerably longer (from a dozen minutes to hours depending on the number of inputs).
As depicted in both tables, the accuracies of Step 1 are lower than those of Step 2. This implies that the action recognition task (Step 2) becomes easier when the possession-way is given or assumed, as many existing studies have done. The overall performance of action recognition is calculated by multiplying each (possession-way) accuracy of Step 1 by the corresponding accuracy of Step 2, and computing the mean of the multiplied accuracies:
$$\mathrm{Acc}_{\mathrm{overall}} = \frac{1}{P} \sum_{p=1}^{P} \mathrm{Acc}_{\mathrm{Step\,1}}(p) \times \mathrm{Acc}_{\mathrm{Step\,2}}(p)$$
where $\mathrm{Acc}_{\mathrm{Step\,1}}(p)$ and $\mathrm{Acc}_{\mathrm{Step\,2}}(p)$ denote the Step 1 and Step 2 accuracies for possession-way p. In the case of RF models, we can roughly see that the overall accuracy is 66 to 68.
Table 3 shows the accuracies of the five algorithms at varying lengths of the window. Each classification algorithm exhibits a slightly different pattern; however, in general, the accuracies increase until the window length reaches 3 s, and slightly deteriorate afterwards. The exception is the RF models, whose accuracies are consistently high at 82, reaching their best at |W| = 0.3.
Table 4 shows the accuracies of the five algorithms for the different number of windows. Again, each classifier behaves differently with the increasing number of windows. For example, the accuracy of possession-way recognition (Step 1) of SVM decreases as n increases, in contrast to that of NB. Overall, the best performance is achieved by RF when n = 1.
As shown in Table 5, the results of the one-user-out CV show quite a different trend. Firstly, the overall performance of the top three classification models, RF, RNN, and DNN, has dropped significantly due to the sharp decrease in the performance for Step 1, possession-way recognition. On the contrary, the performance for Step 2, action recognition, has actually increased by a small amount compared to the results obtained from the fivefold CV. The results show that the task of possession-way recognition for an unseen user is not a trivial one.
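The overall accuracy of the two-step approach, defined in this section as the mean over possession-ways of the Step 1 accuracy multiplied by the corresponding Step 2 accuracy, can be sketched as follows. The function name is hypothetical, and the accuracies are assumed to be given as fractions in [0, 1] keyed by possession-way.

```python
from statistics import fmean
from typing import Dict


def overall_two_step_accuracy(
    step1: Dict[str, float],
    step2: Dict[str, float],
) -> float:
    """Mean over possession-ways of (Step 1 accuracy x Step 2 accuracy)."""
    if step1.keys() != step2.keys():
        raise ValueError("both steps must report the same possession-ways")
    return fmean([step1[key] * step2[key] for key in step1])
```

The product reflects that a sample counts as correct overall only when both steps succeed, which is why a weak Step 1 caps the overall figure even when Step 2 is very accurate.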
Currently, the process of our feature engineering is kept simple (Section 3.2) as the aim of the study is to explore the relationship between the three approaches. Improving the accuracy of the recognition tasks would require more thoughtful feature definitions. For example, as each participant carried his/her phone in an arbitrary orientation, it is possible that some specific orientations of the phones, rather than their more general representation, could have been reflected in the models' learning process. Therefore, a rotation-invariant feature [30] would be a good solution here.
It is also worth mentioning that the overall performance by the models on this dataset is lower than the performance on the previous dataset [8]. This is because the previous dataset only consisted of data from a single user, while this dataset is contributed by seven users. Therefore, the action recognition models learned from this dataset are more general than the models learned in the previous work.

Action-Possession Recognition
The second approach, action-possession recognition, also consists of two steps: (1) action recognition, and (2) possession-way recognition. Similar to the previous subsection, the five classification algorithms are employed and evaluated under both fivefold and one-user-out CV. The results are summarized in Tables 6-8.
Step 1 in the tables represents the accuracies of action recognition, while Step 2 shows the accuracies of possession-way recognition when the action is known. For instance, Step 2 (walk) shows the accuracies of possession-way recognition when the user is walking.
As shown in Table 6, the performance of action recognition (Step 1) generally gets better as the window length increases. Again, the RF is marked as an exception as its performance stays quite consistent throughout the experiments under fivefold CV.
Similar to the results of possession-action recognition (Section 4.4), the overall accuracies of one-user-out CV are lower than the ones obtained from fivefold CV, despite the increases in accuracy for Step 1, action recognition.
Under both evaluation criteria, most classifiers (with the exception of RF in fivefold CV) found it easier to recognize the possession-way of a smartphone when a running action is assumed. One possible explanation is that the influence from the surroundings of the smartphone is maximized when the user is running rather than walking, hence producing sensor values with richer information. Similar to the possession-action recognition, the best performance is achieved by RF when the length of windows |W| = 0.5 and the number of windows n = 1.
Another important point to note is that the accuracies of Step 1 (action recognition) are generally lower than those of Step 2 (action recognition) in the possession-action recognition shown in Section 4.4. This is because Step 1 of action-possession recognition involves directly classifying the actions regardless of the various possession-ways in which the actions are blended.

Concurrent Recognition
The third approach, concurrent recognition, aims to classify both actions and possession-ways of a smartphone simultaneously. The five classification algorithms are used again, and the results are described in Tables 9-11. In all tables, Conc., P-A, and A-P refer to the overall accuracies of the concurrent, possession-action, and action-possession recognitions, respectively. As the concurrent approach classifies A × P classes, the accuracies of most classifiers are lower than those of the individual steps in the previous approaches, where A or P classes are recognized separately. Therefore, for a fair comparison, the mean values of the combined accuracies of the possession-action and action-possession recognitions are presented in the tables. Comparing the overall performance, the concurrent approach performs better than the other two approaches even under the one-user-out CV (Table 11).
In general, the results show a similar trend with the previous two approaches: the performance increases with the lengths of windows, and decreases with the number of windows. The RF produces consistently high performance while the other algorithms are hindered by the increased number of class labels.
Similar to the previous two approaches, the extended dataset of this study results in lower performance than the single-user dataset from the previous study [8]. We also need to point out that the results of concurrent recognition in this study are slightly different from what we found in the previous study using the single-user dataset. In the previous study, the accuracy of concurrent recognition was as good as, or just slightly lower than, the accuracies of the separate tasks (Step 1 and Step 2) in the two approaches. In this study, however, the concurrent recognition does indeed seem to be a harder task than the two tasks. We believe that this was due to the fact that the single-user dataset had made all three of the tasks very easy. In the single-user dataset, the data samples for each action or possession-way are coherent with each other as they are from one user. In contrast, the samples gathered by the seven users are not necessarily coherent, as different users have their own style of walking, running, holding the phone, etc. Such difference in the datasets is illustrated clearly by the results of the one-user-out CV where the learned models had difficulty in recognizing the possession-ways of the smartphone for an unseen user.

Discussion
From the experiments, we have explored how the performance of a classifier varies according to the length and the number of windows for each classification task. Although the detailed patterns are a little different among classifiers, a few remarks about the general trend can be made:

•
The length of windows (|W|) seems to be a more important factor in improving the performance than the number of windows (n). In other words, for the tasks of recognizing individual action and possession-way of a smartphone, the independent features computed from each time interval play a bigger role than a series of n features that represent the sequential patterns.

•
In general, the performance of action recognition increases as the classifiers observe data for a longer period of time, i.e., greater n and |W|. In contrast, the performance of possession-way recognition is not so much affected, perhaps because the patterns of the possession-ways do not differ greatly from one another. For example, the sensor patterns obtained from placing the smartphone in a shoulder-bag may actually be similar to the pattern gathered by placing the device in a backpack; and increasing the length and number of windows would not affect the classification performance significantly.

•
The concurrent approach seems to work best when n and |W| are around 3 to 5.

•
However, the RF remains an exception to these trends. While consistently producing the best results under the fivefold CV, it works particularly well when |W| is 0.3 to 0.5. This is probably due to the inner workings of the RF; similar to tree bagging, the RF repeatedly selects a random sample from the training set, and fits trees to these samples. When |W| is small, many data samples are available for the random sampling. We suspect that sampling from this larger pool results in the increased overall performance of the RF.

•
The RNN performs the second best in the tasks under the fivefold CV. It is interesting to see that when |W| or n is 1, its performance is as high as that of the model learned with longer data samples, illustrating the RNN's capability to learn temporal patterns even with the shorter inputs.

•
However, under the one-user-out CV, the performance of these top three models has decreased significantly for all three tasks. We suspect that, as these models tend to have a greater number of weight parameters to adjust, they had been overfitted to the training data. A greater number of training samples along with more careful feature definitions would be required to achieve a more stable performance.

•
Nevertheless, under both one-user-out and fivefold CV, the concurrent approach produces better results than the other two approaches (possession-action and action-possession recognition).

Conclusions
We investigated the relationship between the action recognition and possession-way recognition using smartphones. In order to further investigate our previous findings [8], we extended the previous dataset to encompass sensor data from seven users rather than a single one. We proposed three approaches (possession-action recognition, action-possession recognition, and concurrent recognition) and experimentally verified that the assumption of possession-way of the smartphone does affect the performance of action recognition, and vice versa. We observed that the concurrent recognition, which classifies both action and possession-way simultaneously, produces good results compared to the overall accuracies of the two other approaches.
For future work, conducting the same experiment with additional actions such as cycling would be interesting. Moreover, a series of user actions (i.e., a long term behavior) rather than a single one could be learned, possibly taking full advantage of the power of RNN architectures or conditional random fields. However, we would need to utilize more advanced methods in feature engineering to stabilize the performance of both action and possession-way recognition for unseen users.