A Lean and Performant Hierarchical Model for Human Activity Recognition Using Body-Mounted Sensors

Here we propose a new machine learning algorithm for the classification of human activities from accelerometer and gyroscope signals. Based on a novel hierarchical system of logistic regression classifiers and a relatively small set of features extracted from the filtered signals, the proposed algorithm outperformed previous work on the DaLiAc (Daily Life Activities) and mHealth datasets. The algorithm also represents a substantial reduction in computational cost and requires neither feature selection nor hyper-parameter tuning. It remained robust with only two (ankle and wrist) of the four devices (chest, wrist, hip and ankle) placed on the body (96.8% vs. 97.3% mean accuracy for the DaLiAc dataset). The present work shows that low-complexity models, when designed with a careful upstream inspection of the data, can compete with heavy, inefficient models in the classification of advanced activities.


Introduction
Physical activity monitoring with wearable sensors has various scientific, medical and industrial applications, such as physical activity epidemiology [1], fall detection in the elderly population [2] and for smartwatch applications [3]. Among the existing sensors, accelerometers (sometimes coupled with gyroscopes [4]) are regularly used for activity monitoring, mainly because of their relatively high accuracy, low price and small size [5,6]. Methods for human activity recognition (HAR) using wearable motion sensors were thoroughly investigated and reported in the scientific literature, and a large number of studies demonstrated their ability to predict activity with a high level of accuracy [7,8].
Despite these advances in the field, studies in physical activity epidemiology have mostly used opaque, proprietary algorithms [9][10][11], hence limiting comparability between studies and innovation in the spectrum of activities studied. This situation is probably due to the complexity of the algorithms proposed in the literature, which have grown long and difficult to implement as the HAR tasks became more challenging. Thus, there is a need for a simple yet performant algorithm that scientists could easily implement when analyzing accelerometer data.
Existing transparent HAR methods usually rely on supervised machine learning models to map between motion signals and activities. All these methods rely on the assumption that different physical activities produce distinguishable patterns in the recorded motion signals.

The DaLiAc Dataset
The DaLiAc (Daily Life Activities) dataset consists of the signals of accelerometers and gyroscopes placed on the chest, wrist, hip and ankle of 19 adults performing thirteen daily activities in semi-controlled conditions. The activities include a wide range of simple and complex activities: lying, sitting, standing, dish washing, vacuum-cleaning, sweeping, walking, running, ascending stairs, descending stairs, bicycling with a resistance of 50 Watts, bicycling with a resistance of 100 Watts and rope jumping. Details about the subjects and the experimental designs can be found elsewhere [20].

Processing
Acceleration signals are known to be composed of a dynamic component (acceleration of the body) and a gravitational one. As a consequence, some authors have suggested applying a low-pass filter to the acceleration signal in order to isolate the gravitational component and infer the inclination of the device in space [8,24]. Using a Butterworth filter (first order, with a cutoff frequency of 2 Hz), we separated the accelerometer signals into dynamic and gravitational components (AC and DC components, respectively). Unlike the widespread approach, we treated the raw acceleration, AC and DC components as three separate signals throughout the feature extraction process. The AC and DC components reflect two different aspects of physical activity, motion and orientation respectively, and as such should be treated as independent signals. For instance, periodicity metrics extracted from the signals can be different, but equally interesting, when looking at motion and orientation over time. Thus, we ended up, for each sensor, with the following time series: three total acceleration signals (one along each axis), three AC, three DC and three gyroscope signals. All signals were downsampled to 51.2 Hz (we sampled every fourth datapoint from the original data) and normalized.
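The AC/DC separation described above can be sketched with SciPy's Butterworth filter. This is a minimal illustration: the function name and the synthetic signal are ours, not part of the original pipeline.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def split_gravity(acc, fs=51.2, cutoff=2.0, order=1):
    """Split an acceleration signal into its gravitational (DC) and
    dynamic (AC) components with a first-order Butterworth low-pass."""
    b, a = butter(order, cutoff / (fs / 2), btype="low")
    dc = filtfilt(b, a, acc)   # gravitational (orientation) component
    ac = acc - dc              # dynamic (body-motion) component
    return ac, dc

# Synthetic example: a constant 1 g offset plus a 5 Hz oscillation.
fs = 51.2
t = np.arange(0, 5, 1 / fs)
acc = 1.0 + 0.2 * np.sin(2 * np.pi * 5 * t)
ac, dc = split_gravity(acc, fs)  # dc hovers near 1 g; ac carries the motion
```

Note that `filtfilt` runs the filter forward and backward, so it adds no phase lag; the paper does not state whether zero-phase filtering was used, so this is one reasonable choice among several.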
All signals were segmented along the time axis into windows of five seconds with a 50% overlap, as done by other authors [25], in order to make evaluation comparable with other algorithms tested on the same data [15].
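The segmentation into five-second windows with 50% overlap can be sketched as follows (a minimal NumPy implementation; names and defaults are illustrative):

```python
import numpy as np

def sliding_windows(signal, fs=51.2, win_sec=5.0, overlap=0.5):
    """Segment a 1-D signal into fixed-length, overlapping windows."""
    win = int(win_sec * fs)           # 256 samples at 51.2 Hz
    step = int(win * (1 - overlap))   # 128-sample hop for 50% overlap
    n = (len(signal) - win) // step + 1
    return np.stack([signal[i * step : i * step + win] for i in range(n)])

x = np.arange(1024)        # 20 s of dummy samples at 51.2 Hz
w = sliding_windows(x)     # 7 windows of 256 samples, one starting every 128 samples
```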

Feature Extraction
We define as x the signals (raw accelerometer and gyroscope data, AC and DC components) over an N-sample window (here, we used 5-s windows and a sampling frequency of 51.2 Hz, hence N = 256). For each windowed signal x, summary statistics were computed in the time domain. To the mean-subtracted signal x̃ = x − x̄, we applied the Fourier transform, and we define the amplitude vector x̂ as the absolute values of the Fourier transform. The following frequency-domain features were computed for all vectors x̂:
-Energy: E = sum(x̂²);
-Maximum frequency: the frequency at which the amplitude spectrum attains its maximum, argmax(x̂(ξ)).
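As an illustration of the frequency-domain step, the energy and maximum-frequency features can be computed as below. This is a sketch of just these two features; the paper's full feature set is larger.

```python
import numpy as np

def frequency_features(x, fs=51.2):
    """Energy and dominant frequency of a windowed signal."""
    x_tilde = x - x.mean()                  # mean-subtracted signal
    amp = np.abs(np.fft.rfft(x_tilde))      # amplitude vector
    freqs = np.fft.rfftfreq(len(x), d=1 / fs)
    energy = np.sum(amp ** 2)               # E = sum(amp^2)
    max_freq = freqs[np.argmax(amp)]        # frequency of the spectral peak
    return energy, max_freq

# A pure 3 Hz tone over a 256-sample (5 s) window peaks at 3 Hz,
# since the frequency resolution is 51.2 / 256 = 0.2 Hz.
t = np.arange(256) / 51.2
energy, fmax = frequency_features(np.sin(2 * np.pi * 3.0 * t))
```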

Classification
Classification was done using a two-level hierarchical system, as illustrated in Figure 1. For all classification tasks in the system, the following classifiers were tested: LR (with L2 regularization and a penalty coefficient equal to one); KNN with k = 5; gradient boosting (GB) (500 estimators, selecting 10 features at a time); and SVM. For additional comparability, a convolutional neural network (CNN) was also tested (architecture in Figure 2), taking as input the four signals (AC, DC, accelerometer and gyroscope) and their Fourier transforms. Classification was done using all 15 possible combinations of device locations on the subjects' body (e.g., ankle, ankle + chest and ankle + chest + wrist).
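A two-level hierarchy of LR classifiers of the kind described above can be sketched with scikit-learn. Note that the coarse groups used below ("static"/"dynamic") and the synthetic clusters are illustrative placeholders, not the actual grouping of Figure 1:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

class HierarchicalLR:
    """Base-level LR predicts a coarse activity group; a per-group LR
    then resolves the individual activity within the predicted group."""

    def __init__(self, groups):
        self.groups = groups  # {group_id: [activity labels]}
        self.label_to_group = {lab: g for g, labs in groups.items() for lab in labs}

    def fit(self, X, y):
        g = np.array([self.label_to_group[lab] for lab in y])
        self.base = LogisticRegression(C=1.0, max_iter=1000).fit(X, g)
        self.leaf = {}
        for gid, labs in self.groups.items():
            if len(labs) > 1:  # single-label groups need no second level
                mask = g == gid
                self.leaf[gid] = LogisticRegression(C=1.0, max_iter=1000).fit(X[mask], y[mask])
        return self

    def predict(self, X):
        g = self.base.predict(X)
        out = np.empty(len(X), dtype=object)
        for gid in np.unique(g):
            idx = np.where(g == gid)[0]
            out[idx] = self.leaf[gid].predict(X[idx]) if gid in self.leaf else self.groups[gid][0]
        return out

# Synthetic illustration: four well-separated clusters, two per group.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.1, size=(30, 2)) for c in ([0, 0], [0, 1], [5, 0], [5, 1])])
y = np.array(["sit"] * 30 + ["stand"] * 30 + ["walk"] * 30 + ["run"] * 30)
model = HierarchicalLR({"static": ["sit", "stand"], "dynamic": ["walk", "run"]}).fit(X, y)
```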
We used Python's Scikit-learn [26] and Tensorflow [27] libraries for the analysis, and unless otherwise specified, their default parameters. The Python scripts of the project are available on the Github repository (see Supplementary Materials).

Evaluation Method
In order to evaluate the performance of the proposed models, a leave-one-subject-out procedure was followed: models were tested against the data of one subject after being trained on all the others, for each of the 19 subjects in the dataset. This procedure was adopted by the first study on the dataset and followed by several subsequent studies (Table 1). Reserving a fraction of each subject's data for testing, instead of a fraction of the subjects themselves, can result in an upward bias of the estimated performance metric, since models learn patterns that are specific to the subjects and can better classify them during testing. Moreover, averaging the scores of all iterations in a leave-one-subject-out procedure is preferable to a single hold-out test on several subjects, as it reduces bias in the accuracy estimator, especially in small datasets [20].
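The leave-one-subject-out scheme maps directly onto scikit-learn's LeaveOneGroupOut splitter; a minimal sketch on synthetic data (the function and variable names are ours):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneGroupOut

def loso_accuracies(X, y, subjects):
    """Train on all subjects but one, test on the held-out subject,
    and repeat so that every subject is held out exactly once."""
    scores = []
    for train, test in LeaveOneGroupOut().split(X, y, groups=subjects):
        clf = LogisticRegression(C=1.0, max_iter=1000).fit(X[train], y[train])
        scores.append(np.mean(clf.predict(X[test]) == y[test]))
    return np.array(scores)

# Toy data: two well-separated classes shared across four "subjects".
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.2, (40, 2)), rng.normal(3, 0.2, (40, 2))])
y = np.array([0] * 40 + [1] * 40)
subjects = np.array(list(range(4)) * 20)
scores = loso_accuracies(X, y, subjects)   # one accuracy per held-out subject
```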
For all models, we reported the mean and standard deviation of the accuracy (rate of correctly classified samples) over the 19 leave-one-subject-out rounds. To present a complete picture, for the models based on all four devices, we also presented the confusion matrix and the f-score, which is the harmonic mean of precision (true positives/(true positives + false positives)) and recall (true positives/(true positives + false negatives)).
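These metrics can be reproduced with scikit-learn; a small hand-checkable example (the activity labels below are hypothetical):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score

y_true = np.array(["walk", "walk", "run", "run", "sit", "sit"])
y_pred = np.array(["walk", "run", "run", "run", "sit", "sit"])

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=["walk", "run", "sit"])

# Per-class f-score (harmonic mean of precision and recall),
# averaged over the classes ("macro").
f1 = f1_score(y_true, y_pred, average="macro")
```

Here "walk" has precision 1 and recall 1/2 (f-score 2/3), "run" has precision 2/3 and recall 1 (f-score 4/5), and "sit" is perfect, giving a macro f-score of about 0.822.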

Generalization on the mHealth Dataset
The algorithm presented in this article was designed to address the specific classification task of the DaLiAc dataset. It was therefore deemed desirable to validate this algorithm on other data, collected in different conditions and presenting a different classification task. To do so, we used the algorithm on the mHealth dataset [23], which contains labelled body-worn accelerometer, gyroscope and magnetometer signals collected while subjects were performing different activities. The accelerometer, gyroscope and magnetometer sensors were placed on the lower arm and the ankle. In addition, a device placed on the chest recorded accelerometer data only. Data for the activities were collected in an out-of-the-lab environment with no constraints on the way the activities had to be executed; subjects were asked to try their best when executing them. The activities were the following: standing still, sitting and relaxing, lying down, walking, climbing stairs, bending the waist forward, frontal elevation of the arms, bending the knees (crouching), cycling, jogging, running, and jumping forwards and backwards. We trained and tested on these data using the exact same algorithm, hyper-parameters and validation procedure as those presented here for the DaLiAc dataset. We used a flat classification, since the classes seemed clearly distinct from each other.

Results for the DaLiAc Dataset
For the five classification models (LR, GB, KNN, SVM and CNN), accuracy is reported for each combination of devices and for each task in the hierarchical system (Table 2, and Tables A and B in the Supplementary Materials). Overall classification accuracy was highest for LR (based on data from all four devices) with 97.30% accuracy, followed by GB (all devices) with 96.94%, SVM (all devices) with 96.84%, CNN (three devices: ankle, chest and wrist) with 95.42% and KNN (three devices: ankle, chest and wrist) with 91.82%. When looking at the sub-tasks in the hierarchical classification system, GB was very slightly better than LR in the base-level classification (99.23% vs. 99.21%). GB also outperformed LR in distinguishing between standing and washing dishes (97.40% vs. 97.06%) and between walking, ascending stairs and descending stairs (99.08% vs. 98.72%). When we combined the best classifiers for all sub-tasks, overall mean accuracy rose by 0.04%. As this improvement remains very marginal, we refer to the system based exclusively on LR as the best algorithm. The confusion matrix for the final classification with LR is shown in Table 3.
The training time varied significantly across the models studied. Using Google Colab (with a GPU accelerator) and the parameters mentioned above, training and predicting following the leave-one-subject-out procedure (i.e., 19 times) for the DaLiAc dataset lasted 4.5 min for LR and KNN, 7.2 min for SVM, 10.7 min for GB and over half an hour for CNN (Table 2). The entire feature extraction phase for the 19 subjects (over six hours of observations in total) took about 30 s.
Regarding the locations of the devices on the body, the best choices of one, two and three locations out of the four studied were the chest (93.39% with SVM), ankle + wrist (96.81% with LR) and ankle + wrist + chest (97.06% with LR), respectively (Table 2). Table 4 shows a comparison of the classification accuracies based on both accelerometers and gyroscopes with those obtained with accelerometers only. The loss in mean accuracy was relatively small when leaving out the gyroscopes (−0.4%, −0.4%, −0.1% and −2.5% for the best four, three, two and one locations using LR, respectively).

Discussion
Compared with previous works tested on the DaLiAc dataset, the proposed algorithm, based on carefully handcrafted features extracted from the signals, represents a threefold improvement. First, the proposed algorithm performs better than the major works tested against the DaLiAc dataset (97.30% accuracy with LR versus 96.40% for the best model so far, a CNN [15]) (see Table 1). Likewise, our algorithm with GB and LR yielded less than 2% classification error on the mHealth dataset. By comparison, Jordano et al. [31] identified seven studies evaluated against the mHealth dataset, and when applying the same leave-one-subject-out procedure, the accuracy of the best algorithm was 94.66%. Zdravevski et al. [18], using a hold-out dataset for testing (subjects 7-10), reached 99.8% accuracy. By applying the same procedure and the same windowing strategy, we reached an accuracy of 99.7% with our algorithm (LR).
Second, the proposed algorithm performed best with fast-training models such as logistic regression, in contrast to the state-of-the-art CNN (32 min of training for the CNN versus 4.5 min for LR).
Third, these superior results were obtained with simple and robust machine learning tools, such as LR, that do not require preliminary hyper-parameter optimization or feature selection. In fact, hyper-parameter optimization of classifiers (most notably neural networks) and feature selection can be daunting, time-consuming tasks, and have been shown to lead to over-fitting and poor generalization [32]. This was corroborated by the validation of the algorithm against the mHealth dataset: simple classifiers based on handcrafted features, which required little or no hyper-parameter tuning, generalized very well to a new dataset, while the CNN, which performed well on DaLiAc, for which it was tuned, yielded poor results on mHealth.
It is difficult to fully explain how our algorithm outperformed previous algorithms using classical machine learning classifiers by around 4%, as authors do not always specify all the decisions that they make during data processing before reaching their results. Using the DaLiAc dataset, we undertook a few steps to identify the innovations that made our algorithm more accurate. First, running our algorithm with a flat classification system instead of the hierarchical system proposed here resulted in a 1.81% decrease in mean accuracy. Second, when feature extraction was performed on the acceleration signal only, without including the AC and DC components as we did, the decrease in accuracy amounted to 2.63%. The additional 1.27% difference from the two best-performing algorithms using classical methods, by Chen [28] and by Zdravevski [18], can be attributed to a good trade-off between the number of features and their informativeness. In fact, the former study omitted very important features (i.e., no frequency-domain features were extracted), while the latter may have had too many of them (4871 before selection).
Large-scale past public health studies in activity monitoring, such as NHANES [1], have relied only on accelerometer sensors to derive activities. Yet, many of the state-of-the-art algorithms have been developed for a combination of accelerometer and gyroscope data. We have shown here that with our algorithm, the decrease in accuracy following the removal of gyroscope signals was marginal. This will help designers of future studies make an informed decision about the trade-off between cost and accuracy.
Despite this promising improvement, two caveats need to be highlighted. The first caveat relates to the nature of our data. HAR algorithms are tested against clean data of activities performed in a characteristic manner as part of a relatively structured protocol. Realistic data, however, can contain less characteristic activities (e.g., slouching), which represent a greater challenge to classify. To that extent, very recent attempts to create benchmark activity datasets simulating real conditions [33] are an important development in the field, and new algorithms should preferably be assessed using these data. In addition, people in real conditions tend to switch rapidly between activities. Consequently, windows of five seconds are probably too long to capture a single activity. A possible solution would be to view sets of activities that are often performed together (e.g., standing and walking around) as activities per se. Another solution is to consider smaller windows, for instance of one second. Smaller windows are known to be less effective at capturing cyclical activities [25] and can result in a decrease in total accuracy and longer training. In fact, running our algorithm on one-second windows resulted in a 2.9% drop in accuracy and lasted almost five times as long as with the commonly used five-second windows (data not shown). Limiting this loss in accuracy by applying dynamic windowing methods [25,34] is an interesting direction for future development.
A second caveat pertains to the ranking of the models tested in this study. A better choice of the hyper-parameters of the powerful SVM, GB or CNN models could have resulted in another ranking. Our points are to emphasize that a simple approach based on domain knowledge can result in a fast, robust and performant model; and that issues of generalizability and tedious processes of model selection must be acknowledged in the evaluation of a new algorithm.

Conclusions
In this paper, we propose a novel algorithm for HAR from motion signals (accelerometers and gyroscopes), which significantly improves upon previous work in terms of computational expense, inferential robustness and classification accuracy. Using a hierarchical classification system with LR, and a relatively small set of features extracted not only from the acceleration signal but also from its low-pass and high-pass filtered components, proved highly useful in solving our classification task. From a practical perspective, we showed that two devices placed on the wrist and the ankle resulted in an accuracy that is practically as good as with two additional accelerometers on the chest and the hip, and that using the method proposed here, the additional information brought by the gyroscope was marginal.
Future research should focus on data that better simulate real life conditions, with their swift transitions between activities and less characteristic behaviors. New, simple models should be developed to better adapt to these conditions, while relying, as much as possible, on domain knowledge.