Real-Time Physical Activity Recognition on Smart Mobile Devices Using Convolutional Neural Networks

Given the ubiquity of mobile devices, understanding the context of human activity with non-intrusive solutions is of great value. A novel deep neural network model is proposed, which combines feature extraction and convolutional layers, able to recognize human physical activity in real-time from tri-axial accelerometer data when run on a mobile device. It uses a two-layer convolutional neural network to extract local features, which are combined with 40 statistical features and are fed to a fully-connected layer. It improves the classification performance, while it takes up 5–8 times less storage space and outputs more than double the throughput of the current state-of-the-art user-independent implementation on the Wireless Sensor Data Mining (WISDM) dataset. It achieves 94.18% classification accuracy on a 10-fold user-independent cross-validation of the WISDM dataset. The model is further tested on the Actitracker dataset, achieving 79.12% accuracy, while the size and throughput of the model are evaluated on a mobile device.


Introduction
Human Activity Recognition (HAR) is an active area of research, since it can automatically provide valuable knowledge and context about the actions of a person using sensor input. Its importance is evidenced by the variety of the areas it is applied to, including pervasive and mobile computing [1,2], context-aware computing [3][4][5], sports [6,7], health [8,9], elderly care [9,10], and ambient assisted living [11][12][13]. In recent years, there has been a growing interest in the field, because of the increase of the availability of low-cost and low-power sensors, especially the ones embedded in mobile devices, along with the improvement of the data processing techniques. This growing interest can also be attributed to the increasingly aging population [14] in the case of health and elderly care applications.
At the same time, several datasets have been collected over the years, including a range of modalities [15]. There has been a great contribution from human activity recognition video datasets; some of the most important include the 20BN-something-something Dataset V2 [16], VLOG [17], and EPIC-KITCHENS [18]. However, using video datasets for activity recognition entails potential pitfalls. Compared to using inertial sensors, there is a greater risk of violating personal privacy; they require a camera setup or a person actively recording the video at runtime, and their processing is heavier. The ubiquity of mobile devices and their embedded sensors has produced a multitude of datasets, which include accelerometer, gyroscope, magnetometer, and ECG data, such as the MHEALTH [19,20], the OPPORTUNITY Activity Recognition [21], the PAMAP2 [22], the USC-HAD [23], the UTD-MHAD [24], the WHARF [25], the WISDM [26], and the Actitracker [27] datasets. A wide range of sensors has also been used to create Ambient Assisted Living (AAL) HAR datasets, including reed switches, pressure mats, float sensors, and Passive Infrared (PIR) sensors [28].
In the context of human physical activity recognition, competitive performance can also be achieved using only the input data from a mobile device's sensors. This offers potential unobtrusiveness and flexibility, but also introduces a number of challenges [29]. The motion patterns are highly subject dependent, which means that the results are heavily affected by the subjects participating in the training and testing stages. The activity complexity increases the difficulty of the recognition either because of multiple transitions between different motions or because of performing multiple activities at the same time. Energy and resource constraints should also be considered, since the capacity of mobile devices is limited, and demanding implementations will drain their battery. Localization is important to understand the context of a situation, but implementing it using GPS is problematic in indoor environments, while inferring distance covered with motion sensors usually results in the accumulation of errors over time. Of course, it is also crucial to adapt research implementations to real-life problems, such as elderly care, workers' ergonomics, youth care, and assistance for disabled people. At the same time, the privacy of sensitive data collected from the subjects must always be respected and protected, even at the expense of a solution's performance.
The conventional method to extract knowledge from these datasets is with machine learning algorithms such as decision trees, support vector machines, naive Bayes, and hidden Markov models [30]. They perform well in specific cases, but rely on domain-specific feature extraction and do not generalize easily [31]. In contrast to these conventional machine learning methods, deep learning methods perform the feature extraction automatically; in particular, Convolutional Neural Networks (CNNs), which are also translation invariant, have achieved state-of-the-art performance in many such tasks [15].
In this paper, a CNN model is proposed, which takes as the input raw tri-axial accelerometer data and passes them through the 1D convolutional and max-pooling layers. These layers effectively extract local and translation-invariant features in an unsupervised manner. The output of these layers is concatenated with an input of 40 statistical features, which supplement the previous with global characteristics. These are passed through a fully-connected and softmax layer to perform the classification. The model is trained and tested on two public HAR datasets, WISDM [26] and Actitracker [27], achieving state-of-the-art performance. Finally, the model is tested online on a mobile device, performing in real time, measuring and presenting significantly improved throughput and reduced size.
The remainder of this paper is structured as follows: Section 2 summarizes the related work. Section 3 presents the proposed solution, datasets, and metrics. Section 4 describes the experimental setup and presents the results, and Section 5 concludes this paper and proposes future improvements.

Related Work
Traditionally, machine learning approaches were followed in order to achieve human activity recognition [32]. Decision trees are preferred because of their interpretability and low computational cost compared to more complex models [30]. The J48 algorithm is an open source Java implementation of the C4.5 decision tree algorithm in [33] and is usually used for HAR from motion sensor data, using features extracted by time domain wave analysis [34] or statistical features [26]. Other popular classifiers also used for HAR from motion sensor data are K-nearest neighbors and Support Vector Machines (SVMs). K-nearest neighbor classifiers were used with raw accelerometer data and shown to be inferior to using a CNN [35], while Tharwat et al. [36] used particle swarm optimization to produce an optimized k-NN classifier. K-nearest neighbor is an Instance Based Learning (IBL) algorithm, which is relatively computationally expensive because it requires the comparison of the incoming instance with every single training instance. However, it offers the ability to adapt to new data and eliminate old data easily. Using an implementation of SVMs that does not require much processing power and memory, six indoor activities were recognized with an accuracy of 89% in [37]. Ensemble classifiers, such as the bagging and boosting ensemble meta-algorithms [3,38], combine the outputs of several classifiers of the same type in order to get better results. On the other hand, ensemble algorithms are more computationally expensive, since more base level algorithms need to be trained and evaluated [30].
The state-of-the-art performance of deep learning methods has been reported in various works coming from different approaches. For example, Hammerla et al. [39] used a bi-directional Long Short-Term Memory (LSTM), which contains two parallel recurrent layers that stretch both into the "future" and into the "past", to achieve a 92.7% f1-score on the Opportunity dataset [21], while a combination of convolutional and recurrent layers, named DeepConvLSTM, was used by Ordonez et al. [40], achieving 95.8% accuracy on the Skoda dataset [41]. However, both of the above datasets require multiple wearable sensors, which enable the recognition of more complex activities, but also render the approach more intrusive. This work will focus on non-intrusive implementations, where data coming only from a mobile device's sensors are used. A commonly used dataset is the UCI HAR dataset [42] with a waist-mounted mobile device. Using its embedded accelerometer and gyroscope, three-axial linear acceleration and three-axial angular velocity were captured at a constant rate of 50 Hz. It was partitioned into two sets, where 70% of the volunteers were selected for generating the training data and 30% for generating the test data, which standardized the results of the implementations. Sikder et al. [43] proposed a classification model based on a two-channel CNN that makes use of the frequency and power features of the collected human action signals, Ronao et al. [44] opted for a four-layer CNN with raw sensor data input augmented by the information of the Fast Fourier Transform (FFT) of each input channel, and Ignatov et al. [35] used a shallow CNN with raw sensor data input and 40 statistical features; these works achieved top performances on the test dataset of 95.2%, 95.75%, and 97.63%, respectively.
However, the sensor signals (accelerometer and gyroscope) were pre-processed by applying noise filters and then sampled in fixed-width sliding windows of 2.56 s and 50% overlap (128 readings/window) for the dataset, which means that it adds extra steps, thus time and processing power, for a real-time implementation.
Finally, the WISDM (WIreless Sensor Data Mining) lab has produced two datasets using a pocket-placed mobile device, the WISDM [26] and Actitracker [27] datasets. Both datasets contain time series of tri-axial accelerometer data sampled at 20 Hz during six physical activities, four of which are common to both (Walking, Jogging, Sitting, Standing), while the remaining two differ: Upstairs and Downstairs for the WISDM dataset, and Stairs and Lying Down for the Actitracker dataset. There is no pre-selection of training data and test data, so there is a lack of consistency in reporting the results of the implementations that use these datasets. For the WISDM dataset, all previous works developed user-dependent solutions, except for [35,45,46], who trained their models on the data of subjects different than the ones on which the models were tested. Kolosnjaji et al. [45] proposed using a combination of hand-crafted features and random forest or dropout classifiers on top of them and achieved 83.46% and 85.36%, respectively, using the leave-one-out testing technique. Huang et al. [46] proposed an architecture of two cascaded CNNs, performed a seven-fold user-independent cross-validation, where five subjects were left out as a testing dataset and the rest were the training dataset, and reported an average f1-score of 84.6%. Ignatov et al. [35] selected the first 26 subjects for their training set and tested the performance on the remaining 10 users, achieving 93.32% and 90.42% accuracy for an interval size of 200 and 50 data points, respectively. Alsheikh et al. [47] evaluated WISDM only and achieved 98.23% accuracy by using deep belief networks and hidden Markov models, but there was no mention of how the dataset was split into training and testing. Milenkoski et al. [48] trained an LSTM network to classify raw accelerometer data windows, but they reported an accuracy of 88.6% on an unspecified 80%/20% train/test split of the dataset.
Pienaar et al. [49] used a similar random 80%/20% train/test split, and they trained a combination of an LSTM and an RNN and achieved an accuracy of 93.8%. Wang et al. [50] proposed a Personalized Recurrent Neural Network (PerRNN), which is a modification of the LSTM architecture, and achieved 96.44% accuracy on a personalized, i.e., user-dependent, 75%/25% dataset split. Another LSTM architecture was proposed by [51], which achieved 92.1% accuracy on an unspecified testing dataset split. Xu et al. [52] trained a CNN model on randomly selected 70% of the dataset and evaluated on the remaining 30% of the dataset, reporting an accuracy score of 91.97%. Another CNN model was proposed by Zeng et al. [53], which scored 96.88% accuracy on a 10-fold user-dependent cross-validation of the WISDM dataset. As for the Actitracker dataset, which includes more subjects in the data collection, the problem persisted since Alsheikh et al. [54] scored 86.6% accuracy, but did not explain which part of the dataset was used for testing, while Shakya et al. [55], who achieved an impressive 92.22% accuracy, did not take the subjects into account and divided the dataset randomly, thus not excluding the data of subjects who were included in the training set. The seminal work of Ravi et al. [56,57] achieved high performance on both datasets, 98.6% on WISDM and 92.7% on Actitracker, using spectral domain pre-processing, deep learning, and shallow learning, but the 10-fold cross-validation was again user-dependent. Yazdanbakhsh et al. [58] transformed the input window for every channel into an image and passed these through a dilated convolutional neural network. They evaluated in both a user-dependent fashion against Ravi et al. [56] and a user-independent fashion against Ignatov et al. [35] on WISDM, achieving comparable performance.
Important factors regarding the dataset and the implementation are the sampling frequency and the window sizes used to create the time-series input. Most solutions use a sampling rate between 20 Hz and 50 Hz, while datasets sampled at 50 Hz include the MHEALTH dataset [19,20] and the UCI HAR dataset [42]. The custom dataset's frequency collected by Siirtola et al. [59] was 40 Hz, while the WISDM [26] and Actitracker datasets [27] were sampled at 20 Hz. According to [60], ninety-eight percent of the power for the walking activity was contained below 10 Hz, and ninety-nine percent was contained below 15 Hz. Furthermore, no amplitudes higher than 5% of the fundamental existed after 10 Hz. Regarding on-device implementations, the main frequencies were lower than 18 Hz when a mobile device was carried in a hip location [61]. Lower sampling rates can reduce battery usage and memory needs. Much work has been directed toward finding the optimal window size, but most implementations use window sizes in the 1-10 s range. Most implementations use fixed window sizes of 1. 28 [35]. An overview of the works presented in this section can be seen in Table 1.

Datasets
In order to ensure the reproducibility of our results, we chose to train and evaluate our methods on two separate publicly available datasets, the WISDM [26] and Actitracker [27] datasets. Details on each dataset are elaborated in the following sub-sections.

WISDM Dataset
The WISDM dataset [26] consists of tri-axial accelerometer data samples from 36 volunteer subjects while performing a specific set of activities. These subjects carried an Android phone in their front pants' pocket and were asked to walk, jog, ascend stairs, descend stairs, sit, and stand for specific periods of time. In all cases, the accelerometer data were collected every 50 ms, so there were 20 samples per second. We can see a detailed description of the dataset in Table 2.

Actitracker Dataset
The Actitracker dataset [27] is a real-world equivalent of the WISDM dataset. It also consists of tri-axial accelerometer data samples, but now, there were 563 volunteer subjects. They performed a similar set of activities, but in an uncontrolled environment, in contrast to the WISDM dataset. These subjects carried an Android phone in their front pants' pocket and walked, jogged, ascended or descended stairs, sat, stood, and lay down for specific periods of time. In all cases, the accelerometer data were collected every 50 ms, so there were 20 samples per second. We can see a detailed description of the dataset in Table 3.

Data Pre-Processing and Feature Generation
The proposed implementation combines the automatic feature creation of convolutional layers with statistical features, which are then fed into a fully-connected layer. Before feeding the raw data into the network, it is necessary to apply pre-processing and create statistical features.
The accelerometer data of each channel are first divided into time windows, in order to exploit the temporal information and periodicity of the signals. If the duration of each window is N seconds, there are $N_w = 20 \cdot N$ data points in each window, since the accelerometer data are sampled at 20 Hz. Furthermore, the step between time windows is constant at 20 data points, or 1 s. Consequently, each window yields three vectors $a_x$, $a_y$, and $a_z$, one for each axis.
Convolutional layers create a representation of the raw data, but the extracted features are local, so the global characteristics of the time series also need to be encoded. This is achieved by creating 40 statistical features for each window, which were proposed by the creators of the dataset themselves [26]. The features are described below, with the number of features generated for each feature type noted in brackets: The range of values for each axis (max − min) is calculated and divided into 10 equal-sized bins, and the fraction of the $N_w$ values that fall within each bin is recorded.
Each time window of each channel is first centered around its average, before being fed to the network. The description of the datasets after preprocessing with window sizes of 50 and 200 can be seen in Tables 4 and 5.
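The windowing, centering, and binning steps described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function names and the synthetic recording are hypothetical, and only the binned distribution is computed, since it is the one feature type whose definition is given in the text above (10 bins for each of the 3 axes, i.e., 30 values per window).

```python
def sliding_windows(samples, window_size, step=20):
    """Yield windows of (x, y, z) tuples, advancing by `step` samples (1 s at 20 Hz)."""
    for start in range(0, len(samples) - window_size + 1, step):
        yield samples[start:start + window_size]


def center(values):
    """Subtract the window mean from every value of one axis."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def binned_distribution(values, bins=10):
    """Fraction of values in each of `bins` equal-width bins spanning [min, max]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # A constant signal has zero range: place everything in the first bin.
        idx = 0 if width == 0 else min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    return [c / len(values) for c in counts]


def window_inputs(window):
    """Return (centered axes, binned-distribution features) for one window."""
    axes = [list(axis) for axis in zip(*window)]  # three lists: x, y, z
    features = [f for axis in axes for f in binned_distribution(axis)]
    return [center(axis) for axis in axes], features


# Synthetic 20 Hz recording (made-up values): 100 samples, i.e. 5 s of data.
recording = [(0.1 * i, float(i % 7), 9.8) for i in range(100)]
windows = list(sliding_windows(recording, window_size=50))
centered, features = window_inputs(windows[0])
```

The centered axes would feed the convolutional branch, while the feature vector would be concatenated later with the flattened convolutional output.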

Convolutional Neural Network
CNN is a hierarchical Feed-Forward Neural Network (FFNN) whose structure is inspired by the biological visual system. Its principal difference from standard neural networks is that apart from fully-connected layers, it has a number of convolutional layers, where it learns filters that are sliding along the input data and applied to its sub-regions.
The fully-connected layer is the building block of neural networks. There are different ways to describe the intuition behind this layer, but at its core, it is a non-linear function that maps the inputs to the outputs. In this case, the input is a one-dimensional vector x, sized $M \times 1$. Every layer has a weight matrix W, which contains the weights connecting every element of the input to the corresponding node of the layer, so W is an $N_n \times M$ matrix, where $N_n$ is the number of the layer's nodes. In addition to the weight matrix, a bias vector b, which outputs constant values for any input, is also part of every layer and is added to the output. Another integral part of a layer is its non-linear activation function $f(\cdot)$. Three commonly used activation functions are sigmoidal, hyperbolic tangent, and Rectified Linear Unit (ReLU) [62]. The third one is defined as $f(Z) = \max(0, Z)$, which is a thresholding operation. Finally, the overall function of the layer can be described as:
$$s = f(Wx + b) \tag{1}$$
where the output s is an $N_n \times 1$ vector and b is a bias vector.
The convolutional layer provides feature extraction by exploiting the temporal information of the data. 1D convolution is applied, which means that the applied filters are slid along only one dimension, and not necessarily that the data themselves are one-dimensional. The convolutional layer's parameters consist of a set of learnable filters. During the forward pass, each filter is slid (more precisely, convolved) across the temporal axis of the input volume, and dot products are computed between the entries of the filter and the input at any position. The output's width is calculated as:
$$W_{\text{out}} = \frac{W_{\text{in}} - K + 2P}{S} + 1 \tag{2}$$
where $W_{\text{in}}$ is the input's width, K is the filter's kernel size, S is the stride, and P is the amount of zero padding that is added, while the output depth will be equal to the number of filters F. The output of convolving the input with the layer's filters is the following:
$$h_j = f\left(\sum_{k} W_{kj} * x_k + b_j\right) \tag{3}$$
The matrices $W_{kj}$ represent the filters that are convolved with the input, while $b_j$ is the bias vector that is added to the output. Here, $x_k$ denotes the k-th channel of the input, $h_j$ the j-th output feature map, and $*$ the convolution along the temporal axis. Finally, j runs from 1 to F, the number of convolutional filters, and $f(\cdot)$ is the activation function just like in the fully-connected layer, which is the ReLU activation in this case.
Pooling layers are usually used after a convolutional layer to reduce the complexity of the implementation and compress the representation. There are two dominant variations, the max-pooling layer and the average pooling layer [63]; the former is used in this model. A max-pooling layer accepts an input sized $H_1 \times W_1 \times D_1$, has the kernel size F and stride S as parameters, and produces an output whose size along the pooled dimension is $(H_1 - F)/S + 1$, while the depth $D_1$ is unchanged. The output consists of the max of every window sized $F \times 1$, which is slid across the input with stride S.
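To make the sliding operations concrete, the following sketch applies a single filter and a max-pooling window to a toy single-channel signal and checks the resulting lengths against the standard output-size formulas, $(W - K + 2P)/S + 1$ for convolution and $(W - F)/S + 1$ for unpadded pooling. All function names and values are illustrative assumptions. As in common deep learning frameworks, the "convolution" is implemented as a sliding dot product without flipping the filter.

```python
def conv_output_width(width, kernel, stride=1, padding=0):
    """Output length of a 1D convolution: (W - K + 2P) / S + 1."""
    return (width - kernel + 2 * padding) // stride + 1


def pool_output_width(width, kernel, stride):
    """Output length of an unpadded 1D pooling layer: (W - F) / S + 1."""
    return (width - kernel) // stride + 1


def conv1d_relu(signal, kernel_weights, bias=0.0):
    """Slide one filter along a signal (stride 1, no padding), then apply ReLU."""
    k = len(kernel_weights)
    out = []
    for start in range(len(signal) - k + 1):
        dot = sum(w * x for w, x in zip(kernel_weights, signal[start:start + k]))
        out.append(max(0.0, dot + bias))
    return out


def max_pool_1d(signal, kernel, stride):
    """Keep the maximum of every window of `kernel` values, moving by `stride`."""
    return [max(signal[i:i + kernel])
            for i in range(0, len(signal) - kernel + 1, stride)]


signal = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
features = conv1d_relu(signal, [-1.0, 0.0, 1.0])  # responds to a rising signal
pooled = max_pool_1d([1.0, 3.0, 2.0, 5.0, 4.0, 0.0], kernel=3, stride=3)
```

A real layer repeats the filter step once per filter and sums over the input channels, which this single-channel sketch omits for brevity.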
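Finally, the output of the last layer is commonly passed to a softmax layer that computes the probability distribution over the predicted classes. It is a fully-connected layer, which has the softmax function (4) as an activation function.
$$\sigma(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}, \quad i = 1, \dots, K \tag{4}$$
where z is the vector of the layer's pre-activation outputs and K is the number of predicted classes.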
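As a small numerical illustration, the softmax function can be written as below. Subtracting the maximum score before exponentiating is a common stability trick and an addition of this sketch, not something stated in the text; the six input scores are made-up values standing in for the six activity classes.

```python
import math


def softmax(scores):
    """Map raw class scores to a probability distribution."""
    shift = max(scores)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for six classes.
probs = softmax([2.0, 1.0, 0.1, 0.0, -1.0, -2.0])
predicted_class = probs.index(max(probs))
```

The shift leaves the result unchanged mathematically, because the common factor cancels between numerator and denominator.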

System Architecture
The proposed CNN structure can be seen in Figure 1 and the layer details in Table 6. It consists of the following steps:
• The accelerometer data, sized N × 3, are fed to the first convolutional layer with 192 convolutional filters and a kernel size of 12, and the stride of the convolution is 1. The ReLU function is applied to its output.
• A max-pooling layer follows with a kernel size of 3 × 1 and a stride of 3, which reduces the feature representation by 3.
• Another convolutional layer is added with 96 convolutional filters and a kernel size of 12, and the stride of the convolution is 1. This will help to learn more abstract and hierarchical features. The ReLU function is applied to its output.
• A final max-pooling layer has a kernel size of 3 × 1 and a stride of 3, which further reduces the feature representation by 3.
• The output of the max-pooling layer is then flattened and concatenated with the statistical features described in Section 3.2. The joint vector is passed to a fully-connected layer that consists of 512 neurons. The ReLU function is applied to its output.
• A dropout layer is added with a dropout rate of 0.5 to avoid overfitting.
• Finally, the output of the fully-connected layer is passed to a softmax layer, which computes a probability distribution over six activity classes.
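The steps above can be traced as a simple shape walk-through. The sketch below only tracks tensor widths; it is not the authors' implementation. The padding mode is an assumption, since it is not stated in this section: "same" padding is used by default because, without padding, a 50-sample window would shrink below the pooling kernel before the second pooling layer. All names are hypothetical.

```python
KERNEL = 12            # convolution kernel size (both layers)
POOL = 3               # max-pooling kernel size and stride
FILTERS = (192, 96)    # filters of the first and second convolutional layer
N_STAT_FEATURES = 40   # statistical features concatenated before the FC layer


def conv_width(width, same_padding):
    """Width after a stride-1 convolution, with or without 'same' padding."""
    return width if same_padding else width - KERNEL + 1


def pool_width(width):
    """Width after unpadded max-pooling; 0 if the input is shorter than the kernel."""
    return max((width - POOL) // POOL + 1, 0)


def fc_input_size(window_size, same_padding=True):
    """Length of the vector fed to the 512-neuron fully-connected layer."""
    width = window_size
    for _ in FILTERS:
        width = pool_width(conv_width(width, same_padding))
    # Flattened output of the last pooling layer plus the statistical features.
    return width * FILTERS[-1] + N_STAT_FEATURES


size_200 = fc_input_size(200)
size_50 = fc_input_size(50)
size_50_unpadded = fc_input_size(50, same_padding=False)
```

Under these assumptions, the fully-connected layer would receive 2152 values for a window of 200 samples and 520 for a window of 50, whereas the unpadded variant leaves only the 40 statistical features for a window of 50.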

Optimization
The optimizer used for training the proposed model is stochastic gradient descent [64] with momentum β [65] and a constant learning rate λ. The update rule is described by Equations (5) and (6).
$$V_t = \beta V_{t-1} - \lambda \nabla_W L(W_{t-1}) \tag{5}$$
where L(·) is the loss function and $V_0 = 0$.
$$W_t = W_{t-1} + V_t \tag{6}$$
where W are the weights of the model and $V_t$ the update. The loss function used is the cross-entropy, which in the examined case resolves to the same thing as log loss. For the proposed approach, where there are K = 6 classes, cross-entropy is described by Equation (7).
$$L = -\sum_{c=1}^{K} y_{o,c} \log(p_{o,c}) \tag{7}$$
where $y_{o,c}$ is a binary indicator of whether class label c is the correct classification for observation o, and $p_{o,c}$ is the predicted probability of observation o being of class c.
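A minimal sketch of one momentum update and of the cross-entropy loss for a single observation follows. It assumes the common convention, also documented for the Keras SGD optimizer, in which the velocity accumulates the negative scaled gradient and is then added to the weights; other formulations differ in sign or scaling. The function names, the toy quadratic loss, and the example probabilities are illustrative only.

```python
import math


def cross_entropy(one_hot, probs, eps=1e-12):
    """Cross-entropy of one observation; `eps` guards against log(0)."""
    return -sum(y * math.log(max(p, eps)) for y, p in zip(one_hot, probs))


def sgd_momentum_step(weights, velocity, grads, lr=0.01, beta=0.9):
    """One update: V_t = beta * V_{t-1} - lr * grad, then W_t = W_{t-1} + V_t."""
    new_velocity = [beta * v - lr * g for v, g in zip(velocity, grads)]
    new_weights = [w + v for w, v in zip(weights, new_velocity)]
    return new_weights, new_velocity


# Toy problem: minimise L(w) = w^2, whose gradient is 2w, starting from w = 1.
w, v = [1.0], [0.0]  # V_0 = 0
for _ in range(200):
    w, v = sgd_momentum_step(w, v, grads=[2.0 * w[0]])

# Loss for an observation whose true class is the second of six.
loss = cross_entropy([0, 1, 0, 0, 0, 0], [0.1, 0.5, 0.1, 0.1, 0.1, 0.1])
```

The defaults mirror the values β = 0.9 and λ = 0.01 used in the experiments.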

Accuracy
The accuracy, which in this case is the classification accuracy, is the ratio of correct predictions over the total predictions.
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{8}$$
where TP (True Positive) is the sum of the predictions in which a specific class was predicted and this prediction was successful, TN (True Negative) the sum of the predictions in which another class was predicted and it was indeed another class, FN (False Negative) the sum of the predictions in which another class was predicted, but, in fact, it was this class, and finally, FP (False Positive) is the sum of the predictions in which the specific class was predicted while the real label was another class.
The precision metric computes the rate of correct predictions of a class over the total predictions of this class. It is defined as follows:
$$\text{Precision} = \frac{TP}{TP + FP} \tag{9}$$
The recall measures the fraction of correct predictions of a class to the total real data points of the class. It is defined as follows:
$$\text{Recall} = \frac{TP}{TP + FN} \tag{10}$$
F1-score is a combination of the precision and the recall metrics and is defined by the following equation:
$$F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \tag{11}$$
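These metrics can be computed directly from a confusion matrix, as in the sketch below. The convention that rows hold true classes and columns hold predicted classes is an assumption of this example, and the 3-class matrix contains made-up counts rather than results from the experiments.

```python
def accuracy(confusion):
    """Correct predictions (the diagonal) over all predictions."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total


def per_class_metrics(confusion, cls):
    """Precision, recall, and F1 for class `cls`.

    confusion[i][j] counts samples of true class i predicted as class j.
    """
    tp = confusion[cls][cls]
    fn = sum(confusion[cls]) - tp                  # true cls, predicted otherwise
    fp = sum(row[cls] for row in confusion) - tp   # predicted cls, true otherwise
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Made-up confusion matrix for three classes.
confusion = [
    [8, 2, 0],
    [1, 7, 2],
    [0, 1, 9],
]
overall = accuracy(confusion)
precision, recall, f1 = per_class_metrics(confusion, cls=0)
```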

Statistical Feature Analysis Methods
In order to validate that the features used as an intermediate input contribute to the performance of the implementation, we performed an ablation study and a dimensionality reduction of the features. The ablation study involves adding the groups of features of each feature type, as described in Section 3.2, one-by-one and comparing the results. The dimensionality reduction was performed by using principal components analysis on the normalized features. In effect, PCA is a linear dimensionality reduction using the singular-value decomposition of the data to project them to a lower dimensional space. It also provides the values of explained variance for each created component. Therefore, it is possible to choose the components that contribute most to the variance of the data, thus reducing the dimensions of the data.

Implementation
The goal is to create a solution that can be applied in a mobile, real-time environment. To this end, the trained models are exported in a mobile-compatible format, using the TensorFlow Lite framework. More specifically, the Keras and TensorFlow frameworks are used for the training and TensorFlow Lite and Android for the on-device inference. The exported models ran on a mobile device with a HiSilicon Kirin 970 CPU, 6 GB of RAM, and Android Version 9.0. The size of the exported models and their inference throughput were compared, where the inference throughput was the number of inferences per second that can be performed by the model.
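Throughput in the sense defined above, inferences per second, can be estimated with a simple timing loop. The sketch below is only an illustration of the measurement idea: `dummy_model` is a hypothetical stand-in for an exported model's inference call, and the warm-up and run counts are arbitrary choices, not the benchmark procedure used on the device.

```python
import time


def measure_throughput(infer, sample, n_runs=200, warmup=20):
    """Return the number of calls to `infer(sample)` completed per second."""
    for _ in range(warmup):      # warm-up runs are excluded from the timing
        infer(sample)
    start = time.perf_counter()
    for _ in range(n_runs):
        infer(sample)
    elapsed = time.perf_counter() - start
    return n_runs / elapsed


def dummy_model(window):
    """Hypothetical placeholder for a real model's inference call."""
    return sum(sum(row) for row in window)


window = [[0.0, 0.0, 9.8]] * 200   # one 200-sample tri-axial window
throughput = measure_throughput(dummy_model, window)
```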

Experimental Setup
The performance of the implementation is measured in terms of the classification quality, the on-device throughput, and the size of the network. To achieve the first, a set of experiments was run measuring a set of performance metrics described in Section 3.6. Each experiment ran with the proposed network, referred to from now on as DCNN (Deeper Convolutional Neural Network), and the reference network, described in [35], referred to from now on as RCNN (Reference Convolutional Neural Network).

Parameters
The training parameters of the model are the window size N w of the input, the epochs e of the training, and the parameters of the optimizer momentum β and learning rate λ. For all of the experiments, we had e = 100, β = 0.9, and λ = 0.01. The training ran for two different window sizes, N w = 50 and N w = 200, to replicate the evaluation of the reference implementation [35]. The datasets used for training and testing are described in Section 3.1.

Performance Experiments
All of the performance results were measured with the metrics described in Section 3.6. First, the experiment of the reference implementation's paper was replicated, in which the 26 subjects of the WISDM dataset were used for training, which left the remaining 10 subjects of the data for testing. This train/test split of the dataset, although user-independent, was arbitrary. Consequently, a 10-fold cross-validation was implemented, in which the dataset was split into 10 groups of users, in order to keep the evaluation user-independent. The process was repeated on the Actitracker dataset, which contains data from a considerably higher number of users and collected in an uncontrolled environment, thus being more realistic.
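A user-independent split can be sketched by assigning whole subjects, rather than individual windows, to folds. The round-robin assignment below is an assumption, since the text does not specify how users were grouped, and the subject IDs are synthetic.

```python
def user_independent_folds(subject_ids, n_folds=10):
    """Yield (train_idx, test_idx) pairs in which no subject appears in both."""
    subjects = sorted(set(subject_ids))
    # Round-robin assignment of subjects to folds (an illustrative choice).
    groups = [subjects[i::n_folds] for i in range(n_folds)]
    for group in groups:
        held_out = set(group)
        test_idx = [i for i, s in enumerate(subject_ids) if s in held_out]
        train_idx = [i for i, s in enumerate(subject_ids) if s not in held_out]
        yield train_idx, test_idx


# 36 hypothetical subjects contributing 5 windows each.
subject_ids = [s for s in range(1, 37) for _ in range(5)]
folds = list(user_independent_folds(subject_ids))
```

Every window is tested exactly once across the folds, and no fold tests on a subject that it also trained on.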

Implementation Evaluation
In Table 7, the results of the on-device evaluation of the models are presented. The size of the DCNN model was around five to eight times smaller than that of the RCNN [35], and the same holds for the number of parameters. This is particularly important for a mobile implementation, where storage space is limited. Additionally, its throughput was two to four times higher, ranging from 115 to 405 inferences per second, which is more than enough for a real-time implementation. Even on a mobile device with lower specifications than the one used here (see Section 3.8), and thus lower throughput, it will most probably still exceed one inference per second.
Table 7. On-device throughput and size of the models.

Original Train/Test Split
In Table 8, the accuracy scores of the two models for window sizes of 50 and 200 data points are presented, following the training and testing procedure of the original paper. More specifically, the first 26 users were used as a training set, and the remaining 10 were used as a test set. It is evident that although the parameters of the DCNN model are considerably fewer than those of the reference one, it not only matches, but surpasses the accuracy of the reference model by 0.35% and 1.04% for window sizes of 50 and 200, respectively. The mean and standard deviation of the accuracy for the DCNN model were calculated by running 10 trainings with different initializations. There is no standard deviation (or average accuracy) for the RCNN model, since for this comparison, we used the results published in their paper [35]. The standard deviation is quite small, which indicates that the accuracy is reliably higher for the DCNN model. In Figure 2, there is an overview of the precision scores of both models for every class. Both models struggle mostly with recognizing the Upstairs and Downstairs activities, which was expected, but DCNN manages to achieve considerably higher precision in both of these activities. It is, however, less precise in identifying the Standing activity. Similar conclusions can also be drawn from the recall scores in Figure 3. The recall is considerably better in the Downstairs and Upstairs activities, while it is slightly lower in the Standing one. The recall is also better in the Walking activity. The f1-score results in Figure 4 validate the above findings, since the score is higher for the Downstairs, Upstairs, and Walking activities, while lower for the Standing one. This means that DCNN achieves better results for the dynamic activities. This probably happens because they are periodic, and the two-layer structure offers a wider receptive field.
Finally, we can see the confusion matrices for window sizes of 50 and 200, respectively, in Table 9. The model can easily distinguish between stationary and dynamic activities, except for a small confusion between the Upstairs and stationary activities. This is reduced when using a wider window size, which is expected since it ensures that the whole periodic pattern of the dynamic movement is included in the window multiple times, and thus, the difference between the stationary (and non-periodic) and dynamic activities is more pronounced. There is naturally a common misclassification of the Standing and Sitting activities as each other. Surprisingly, Downstairs is more often confused with Walking, and vice versa, than with the Upstairs activity. Jogging is rarely misclassified, and when it is, it is mostly confused with the Walking activity. Using a window size of 200 helps decrease the confusion among all of the classes, except for the one between the stationary activities.

Cross-Validation
For more comprehensive testing, a 10-fold cross-validation was used, while still preserving the user-independent nature of the initial experiment. From the average accuracy scores in Table 10, it can be seen that this approach has an edge over the reference, despite its considerably smaller size. The standard deviation of the accuracies is high, which means that the difference and difficulty of each data split have a big impact on the performance of the network. However, it is important to underline that, as seen in Table 8, the variation of the results for different initializations of each network for the same dataset split is low. This holds for both the smaller and the bigger window sizes. Additional metrics were taken into account, but for the sake of conciseness, they are presented in Appendix A.

Performance Evaluation on the Actitracker Dataset
The Actitracker dataset was gathered in an uncontrolled environment, in contrast to the WISDM dataset, but has the same data structure, and thus, the same model structure can be used. Consequently, the model size and throughput will be the same, which means that another online evaluation is unnecessary, since the results in Section 4.3 still hold. The real-world nature of the data ensures that the results are realistic. Moreover, the 10-fold cross-validation mitigates the probability of the results being biased from an arbitrary train/test split.

Cross-Validation
Consistent with the previous results, the average accuracy of the proposed model in Table 11 is considerably higher than that of the reference for both window sizes. This confirms that the model performs better regardless of the specific dataset being used. The increase in accuracy ranges from almost 1% to 2%. The standard deviation of the accuracies is quite high due to the varying difficulty of the dataset splits, which means that we cannot know with certainty that this will be the actual accuracy produced by the networks for an arbitrary split of the dataset. However, it is quite expensive to re-run a 10-fold cross-validation with different data splits for a dataset as big as Actitracker, and more importantly, the networks were compared on the same dataset splits; this is enough to demonstrate the superiority of DCNN.
Figure 5 gives an overview of the precision scores of the models. Both perform very poorly in the Stairs activity, but the proposed model achieves a considerably higher score. They also perform poorly in the Lying Down activity, where the proposed model is slightly worse than the reference one. In the rest of the activities, their performance is comparable. The recall scores in Figure 6 indicate that the models perform poorly in the Stairs and Lying Down activities regarding this metric as well. However, the proposed model achieves an improvement of around 300% compared to the scores of the reference one. Regarding the rest of the activities, the results are similar. The f1-score in Figure 7 confirms the previous deductions. The increase in performance regarding the Stairs and Lying Down activities is evident, although the performance is still quite poor. The reason for this poor performance is probably the small number of data points for these two activities, since they constitute only 1.9% and 9.3% of the dataset, respectively. The performance in the rest of the activities is more or less the same, except for small differences.
Finally, Table 12 shows the confusion matrices for window sizes of 50 and 200. The Walking activity is mostly confused with Jogging and vice versa, while both are also confused with the Sitting activity. This is more pronounced in this dataset than in the WISDM one [26] because the data points are noisier and the Sitting activity data points make up a larger percentage of the dataset. The effect of the class imbalance is evident in the confusion of Lying Down, which makes up only 9.3% of the dataset, with Sitting (22.3%) and in the confusion of Stairs (1.9%) with Walking (42.1%). This is slightly alleviated by using a wider window size. Standing and Sitting are mostly misclassified as each other, while the latter is also commonly misclassified as Lying Down. It is important to note that the confusions occur mostly between dynamic activities or between stationary activities and less across these groups, which is desirable. The fact that this dataset was collected in an uncontrolled environment visibly affects the results compared to the ones obtained on the WISDM dataset. The noise of the collected accelerometer data and possible errors in the ground-truth labeling decrease the performance of the model for every activity. Additionally, the extreme imbalance of the Stairs activity (1.9% of the total dataset) and the similarity of the Lying Down and Sitting activities produce very low scores on these activities, which further lowers the overall scores of the model.

Statistical Feature Analysis
The statistical features that were input in the latter stage of the network were reported in [35] to increase the performance of the network. However, there was no further ablation study to explore the contribution of these features. Furthermore, we investigated the use of PCA to decrease the number of features and its effect on the performance of the network.

Ablation Study
The statistical features that describe the global characteristics of the input window can be grouped into the following feature types:
• Average, which consists of three features, one for each axis
• Standard deviation, which consists of three features, one for each axis
• Average absolute difference, which consists of three features, one for each axis
• Average resultant acceleration, which consists of just one feature, since it combines the input from all the axes
• Binned distribution, which consists of 30 features, 10 for each axis
All of the above are explained in detail in Section 3.2. In order to study the effect of each feature type, we first ran an experiment using no features (DCNN-0) and then added the feature types one by one. This resulted in five further experiments, where the number of features was 3, 6, 9, 10, and 40, respectively. The experiments are named DCNN-X, where X is the number of features.
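The five feature types listed above can be sketched as follows. This is an illustration under stated assumptions rather than the implementation of Section 3.2: the average absolute difference is taken as the mean absolute deviation from the per-axis mean, the standard deviation is the population form, and the binned distribution splits the [min, max] range of each axis into 10 equal-width bins and stores the fraction of samples per bin. The function name `statistical_features` and the toy window are hypothetical.

```python
import math


def statistical_features(window, n_bins=10):
    """Return the 40 features of one window of (x, y, z) accelerometer samples.

    Output order: 3 averages, 3 standard deviations, 3 average absolute
    differences, 1 average resultant acceleration, 30 binned-distribution values.
    """
    n = len(window)
    axes = list(zip(*window))  # one tuple of values per axis
    means = [sum(a) / n for a in axes]
    # Assumption: population standard deviation (divide by n).
    stds = [math.sqrt(sum((v - m) ** 2 for v in a) / n) for a, m in zip(axes, means)]
    # Assumption: absolute difference is measured against the axis mean.
    aads = [sum(abs(v - m) for v in a) / n for a, m in zip(axes, means)]
    resultant = sum(math.sqrt(x * x + y * y + z * z) for x, y, z in window) / n
    bins = []
    for a in axes:
        lo, hi = min(a), max(a)
        width = (hi - lo) / n_bins
        counts = [0] * n_bins
        for v in a:
            # A constant axis (width 0) puts every sample in the first bin.
            idx = min(int((v - lo) / width), n_bins - 1) if width > 0 else 0
            counts[idx] += 1
        bins.extend(c / n for c in counts)
    return means + stds + aads + [resultant] + bins


# Hypothetical toy window of 200 samples, for illustration only.
window = [(math.sin(i / 5), math.cos(i / 5), 9.8) for i in range(200)]
features = statistical_features(window)

# Prefix lengths matching the ablation experiments DCNN-0 ... DCNN-40.
ablation_inputs = {k: features[:k] for k in (0, 3, 6, 9, 10, 40)}
print({k: len(v) for k, v in ablation_inputs.items()})
```

With this ordering, each ablation experiment corresponds to a prefix of the feature vector, which mirrors adding the feature types one at a time.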
Table 13 shows the mean and standard deviation of the accuracy over 10 training runs with different initializations for each feature selection. The results were acquired with a window size of 200. There is a significant drop in performance when using only the raw input. Just adding the average of the time series in the three axes yields an appreciable increase in accuracy, which can be explained by the fact that the accelerometer data are centered. This means that there is no way for the network to infer this information from the raw data, even ignoring the fact that, being a convolutional neural network, it produces local features. Adding the standard deviation results in a slight increase, if any, while also adding the average absolute difference results in a more considerable increase. Both of them describe the variability of the data, so they would be expected to have a similar effect. The difference is that the standard deviation, being based on squared deviations, gives more weight to high variations of individual points. Adding the average resultant acceleration gives another slight boost, which is understandable, since it is a measure of magnitude and is partly described by the means of each axis. Finally, the addition of the binned distribution, which further improves the performance by providing more detailed information about the variability of the data, brings us to our proposed solution. It is the one with the highest accuracy, and since the addition of 40 data points to the input of the final fully-connected layer is not expensive, it justifies using all of the proposed statistical features.

PCA and Normalization
Another approach is to use PCA to reduce the dimensionality of the features and keep only the ones that contribute effectively. However, the main problem with this approach is that there will be a need to normalize the features, before applying PCA. This need stems from the fact that the individual features lie at different scales, and since PCA is sensitive to the variation of the data, it will not be correctly applied unless they are normalized. To underline the effect of the normalization, we will also produce results with the normalized features without using PCA, which will be reported as "DCNN norm".
We applied PCA with 40 principal components to imitate the 40 statistical features. A guide for the number of dimensions to keep is the explained variance ratio of each component. In our case, the first six axes account for 98.9% (42.97 + 27.49 + 18.65 + 4.73 + 2.84 + 2.22) of the total variance. Consequently, we kept the first six dimensions, and the respective results will be reported as "DCNN PCA-6".
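The choice of six components follows from the cumulative explained variance, which can be checked with a few lines of arithmetic. The sketch below uses only the six ratios quoted above; the ratios of the remaining 34 components are not listed and are therefore not included. The helper names and the 98% threshold are hypothetical and serve only to illustrate the selection rule.

```python
# Explained variance ratios (in percent) of the first six components, as quoted in the text.
ratios_percent = [42.97, 27.49, 18.65, 4.73, 2.84, 2.22]


def cumulative(values):
    """Return the running sum of a list of values."""
    total, sums = 0.0, []
    for v in values:
        total += v
        sums.append(total)
    return sums


def components_needed(ratios, threshold):
    """Return the smallest number of leading components reaching the threshold, or None."""
    for k, covered in enumerate(cumulative(ratios), start=1):
        if covered >= threshold:
            return k
    return None


cumulative_percent = cumulative(ratios_percent)
for k, covered in enumerate(cumulative_percent, start=1):
    print(f"first {k} components: {covered:.2f}% of the variance")

# Hypothetical threshold of 98%, used here only to illustrate the rule.
print(components_needed(ratios_percent, 98.0))
```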
Table 14 shows the mean and standard deviation of the accuracy over 10 training runs with different initializations for each approach. The results were acquired with a window size of 200. It is clear that the normalization of the features significantly lowers the performance of the network. The features keep important information about the global statistics of the input window, such as the mean and the standard deviation. These features lie at different scales and are distorted by the normalization. This results in a loss of information, which, in turn, explains the drop in performance. Consequently, even though PCA reduces the dimensionality, it only marginally increases the performance, since the normalized features do not offer significant supplemental information. It is telling that the accuracy with normalized features (93.69%) is inferior, or at best equal, to the accuracy with no statistical feature input (93.73%, Table 13).

Window Size Analysis
In order to explore the performance of the network with varying window sizes, we ran experiments with window sizes ranging from 20 to 200. More specifically, the chosen window sizes were {20, 50, 100, 150, 200}. As for the experiment itself, we used the proposed solution, performed 10 training sessions on the original train/test split of the WISDM dataset with different initializations for each window size, and report the average and standard deviation of the accuracy. The standard deviations of the accuracy for each window size were found to range from 0.1% to 0.25% and thus were excluded from the graph, since their impact is negligible.
In Figure 8, the rate of increase of the accuracy for each range of values is clear. The highest rate of increase is found in the range from 20 to 50, which explains the choice of 50 as the window size for the faster implementation, since it provides a favorable trade-off between efficiency and performance. The average accuracy continues to increase in all the ranges, but as we approach 200, the rate of increase is lower. This is demonstrated by the minor difference (0.36%) between the performance of the model with a window size of 150 and the one with a window size of 200, compared to the 2.12% difference between the ones with window sizes of 50 and 100.
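The effect of the window size on the amount of input per inference can be illustrated with a simple segmentation routine. This is a minimal sketch, assuming non-overlapping windows by default (the overlap used in the experiments is not restated here). The function name `segment`, its `step` parameter, and the 1000-sample toy recording are hypothetical.

```python
def segment(samples, window_size, step=None):
    """Split a sequence of samples into fixed-size windows.

    Assumption: windows do not overlap unless a smaller step is given.
    Trailing samples that do not fill a whole window are dropped.
    """
    step = step or window_size
    last_start = len(samples) - window_size
    return [samples[i:i + window_size] for i in range(0, last_start + 1, step)]


# Hypothetical recording of 1000 tri-axial samples.
signal = [(0.0, 0.0, 9.8)] * 1000
window_sizes = [20, 50, 100, 150, 200]
counts = {w: len(segment(signal, w)) for w in window_sizes}
print(counts)
```

A smaller window yields more, shorter inputs from the same recording, which is the efficiency side of the trade-off discussed above; a larger window covers more repetitions of a periodic movement in a single input.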

Discussion and Conclusions
When designing an implementation for mobile use, where the storage capacity is restricted, decreasing its size is a main goal. The size of a model depends on the number of its parameters, so this is the first variable that must be analyzed. For conciseness, we provide the analysis for the variant of the model with an input size of 200, but the results are similar for the one with an input size of 50. It is clear in Table 15 that the fully-connected layer is the one that contains the bulk of the model's parameters. The proposed model has almost ten times fewer parameters than the original one, which is due to two factors. The layer has half as many nodes as the original, and the addition of the second convolutional layer further decreases the size of the input to this layer. This convolutional layer does contribute additional parameters to the network, but this increase is negligible compared to the decrease in the fully-connected layer. This explains the considerably smaller model size of the proposed network. The smaller model size combined with the faster inference time makes the DCNN model a better fit for real-time mobile implementations. Furthermore, the smaller and more efficient design not only does not decrease the prediction performance, but consistently slightly increases it, achieving state-of-the-art performance amongst subject-independent implementations. It is important to note here the motivation behind choosing and comparing against only subject-independent solutions. The main goal is to correctly evaluate the generalization capability of the network, since "seeing" data points of a subject in the training set can help the inference of the same subject in the test set, and there is no guarantee that this will work as effectively for a subject that was never in the training set [66].
Moreover, targeting a real-time mobile use case means that the training will be performed on a desktop and the model will be used on a mobile device for inference. In most cases, the dataset for training the model will not contain data points of the end user, especially if the application becomes publicly available. Finally, the comparison of the results of subject-independent and subject-dependent solutions is unfair, since the scores of the subject-dependent solutions are expected to be higher.
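Returning to the parameter analysis of Table 15, the two factors that shrink the fully-connected layer can be illustrated with the standard parameter-count formulas for dense and one-dimensional convolutional layers. The layer shapes below are hypothetical and are not the architecture of either model; they only show how halving the number of nodes and reducing the flattened input size multiply, while an extra convolutional layer adds comparatively few parameters.

```python
def dense_params(in_features, out_features):
    """Weights plus biases of a fully-connected layer."""
    return in_features * out_features + out_features


def conv1d_params(in_channels, out_channels, kernel_size):
    """Weights plus biases of a 1D convolutional layer."""
    return in_channels * out_channels * kernel_size + out_channels


# Hypothetical shapes for illustration only (not taken from Table 15).
wide_fc = dense_params(in_features=4000, out_features=1000)
# Half as many nodes, and a smaller flattened input after an extra conv layer.
slim_fc = dense_params(in_features=1000, out_features=500)
extra_conv = conv1d_params(in_channels=32, out_channels=32, kernel_size=5)

print(wide_fc, slim_fc, extra_conv)
print(f"reduction factor: {wide_fc / (slim_fc + extra_conv):.1f}x")
```

Because the dense layer's parameter count is the product of its input and output sizes, reducing both at once has a multiplicative effect, whereas the convolutional layer's count depends only on its channel counts and kernel size.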
Statistical feature analysis is also performed to evaluate the importance of the features and the possible alternative approaches. More specifically, the results indicate clearly that the use of the 40 statistical features improves the performance of the network, and at the same time, all of the different feature types contribute, albeit with varying impact, to this improvement. Using PCA would help decrease the dimensionality of the feature input, but in our case, it results in inferior performance, instead. The main reason is that it requires normalization, in order to properly estimate the variance of each feature, and thus, it distorts the scale of these features. This is supported by the fact that using no features at all achieves higher performance than using normalized features.
Finally, the window size analysis enables each implementation of our solution to select the window size that most closely fits its specifications. It also justifies the selection of a window size of 50 as a good trade-off between performance and efficiency and the selection of a window size of 200 as the best performing one, if speed is not the most important factor.
In conclusion, a novel model to recognize physical activities in real time using only the accelerometer data from a mobile device is proposed. It improves upon the state-of-the-art solution by employing a deeper convolutional neural network structure and changing the optimization methods. This improvement is evident at many levels. Firstly, the number of parameters and the size of the model are five to eight times smaller than those of the previous state-of-the-art model, for instance 1.383 million model parameters instead of 10.092 million for a window size of 200. The throughput is also improved, since it is two to three times higher, namely 405 inferences per second compared to 206.88 for the previous state-of-the-art with a window size of 50. Moreover, the classification performance is better and more robust, since both models are tested on two datasets and 10-fold cross-validation is also used. Indicatively, the average accuracy on the 10-fold cross-validation on the WISDM and Actitracker datasets for a window size of 200 is 94.18% and 79.12%, which is 0.5% and 2% higher, respectively, than the previous state-of-the-art. This means that this solution is better in terms of size, throughput, and performance, all of which are crucial for a real-time on-device application, as can be seen in Figure 9. As an additional contribution, we investigated the impact of the statistical features in order to validate their importance and offer a comparison between the 40 statistical features and the ones created using PCA.
Among the various limitations acknowledged in this work, certain factors have been identified as directions for further research towards improving the results of the proposed system. Even deeper convolutional neural network structures can be explored, but the size of the network must always be kept in check. A possible solution would be to also use 1 × 1 convolutions. Another convolutional neural network with a larger receptive field could also be an alternative for substituting the global features. Finally, an implementation that also uses the gyroscope data would be interesting.
Figure 9. Overview of the performance of the proposed model. It is noted that in the accuracy and throughput categories, the higher the better, while the inverse holds for the size and the number of parameters.

Conflicts of Interest:
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; nor in the decision to publish the results.

Abbreviations
The following abbreviations are used in this manuscript: