Classical Machine Learning Versus Deep Learning for the Older Adults Free-Living Activity Classification

Physical activity has a strong influence on mental and physical health and is essential in healthy ageing and wellbeing for the ever-growing elderly population. Wearable sensors can provide a reliable and economical measure of activities of daily living (ADLs) by capturing movements through, e.g., accelerometers and gyroscopes. This study explores the potential of using classical machine learning and deep learning approaches to classify the most common ADLs: walking, sitting, standing, and lying. We validate the results on the ADAPT dataset, the most detailed dataset to date of inertial sensor data, synchronised with high frame-rate video labelled data recorded in a free-living environment from older adults living independently. The findings suggest that both approaches can accurately classify ADLs, showing high potential in profiling ADL patterns of the elderly population in free-living conditions. In particular, both long short-term memory (LSTM) networks and Support Vector Machines combined with ReliefF feature selection performed equally well, achieving around 97% F-score in profiling ADLs.


Introduction
Physical inactivity is classified as one of the four leading factors causing mortality. It contributes to 6% of worldwide deaths [1]. It is considered one of the primary causes of life-threatening diseases, since inactive lifestyles can trigger the prevalence of health conditions such as breast cancer, colon cancer, heart disease, and diabetes [1]. On the other hand, physical activity (PA) is essential to improve the quality of life and functional health of the elderly population. Promoting physical activity in daily life can improve physical and mental health, particularly at an older age [2,3]. A study by the European Commission suggested that the elderly population in the EU is expected to increase above 150 million by 2060 [4], and that this will require health and public infrastructures to take extraordinary measures to accommodate the ever-increasing elderly population and to promote healthy ageing and wellbeing. Therefore, there is a clear need to develop feasible and sustainable methods that can potentially monitor the activities of daily living (ADLs) of the elderly population. By capturing accelerations and angular velocities, wearable inertial measurement units (IMU) can provide unobtrusive, reliable, and low-cost measurement of ADLs.
Several wearable IMU-based physical activity classification (PAC) systems have been developed in the past. They can be broadly categorized into two primary machine learning (ML) branches, i.e., classical ML and deep learning.
The processing pipeline of classical ML-based PAC systems [5,6] consists of several stages: pre-processing (e.g., denoising, filtering), feature engineering (time and frequency domain descriptors), feature selection, and classification algorithms (e.g., support vector machines (SVM) [7], decision trees [8], k-nearest neighbours [9], and artificial neural networks [10]). In the feature engineering stage, handcrafted features are extracted by relying on the domain knowledge and, sometimes, on the biomechanical characteristics of human motion. Such a process provides an acceptable level of performance to classify ADLs. However, this manual stage could lead to potentially important information being missed [11].
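The feature engineering stage described above can be illustrated with a minimal sketch. The four descriptors below (mean, standard deviation, root mean square, range) are common time-domain features chosen here purely for illustration; the function name and the toy signal are assumptions, and the actual feature set used in this work is the one listed in Table A1.

```python
import math
import statistics

def time_domain_features(signal):
    """Return a few illustrative time-domain descriptors for one signal window."""
    mean = statistics.fmean(signal)
    std = statistics.pstdev(signal)                      # population standard deviation
    rms = math.sqrt(sum(x * x for x in signal) / len(signal))
    signal_range = max(signal) - min(signal)
    return {"mean": mean, "std": std, "rms": rms, "range": signal_range}

# Hypothetical toy window of 500 samples (one 5 s window at 100 Hz).
window = [0.0, 1.0, 0.0, -1.0] * 125
features = time_domain_features(window)
```

In a full pipeline, descriptors of this kind would be computed per signal and per window, concatenated into one feature vector, and passed to feature selection and the classifier.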
Conversely, deep learning [12] performs feature extraction automatically, without human intervention. Deep learning algorithms, or deep neural networks (DNNs), learn complex features by introducing non-linearity in the feature space (which is often overlooked in handcrafted feature extraction). This enables the DNN to learn complex patterns from the underlying raw data streams. The performance of such DNNs depends to a high degree on the hyperparameters of the optimization procedure and on the internal architecture of the DNN. Commonly used deep learning algorithms include (but are not limited to) convolutional neural networks (CNNs), recurrent neural networks (RNNs), and long short-term memory (LSTM) networks [13]. CNNs [14] and RNNs [15] date back to the 1990s; however, they gained little attention at the time because powerful computational resources and sufficient data were unavailable. More recently, owing to high-performance computing resources and a growing amount of labelled data for training, deep learning algorithms have seen unprecedented adoption in almost every domain, including digital health [16], energy forecasting [17], autonomous cars [18], speech recognition [19], and the finance industry [20]. These algorithms have also gathered significant attention from the research community working on PAC, and several deep learning-based PAC systems have been developed in the last few years to classify ADLs [21–28]. However, these systems were mostly trained and tested on young adults [22,26,29,30], while very few have been developed for older adults, focusing on PAC [31] and falls [32].
None of the PAC systems developed so far on older adults' data have been validated in free-living conditions. In a previous benchmark study [33], we highlighted that ADLs performed in free-living conditions differ from those performed in laboratory settings or constrained environments, and that the performance of existing classical ML-based PAC systems deteriorates markedly when they are tested in free-living conditions, because ADLs performed in a laboratory lack ecological validity. Therefore, PAC systems designed for elderly populations in free-living conditions should ideally be trained and tested on data recorded in the same age group and setting. The benchmark study [33] also highlighted that the performance of such PAC systems is highly dependent on several factors: the dataset, the number and placement of sensors, the feature set, the feature extraction window size, and the classifier.
In light of this, we previously developed a classical ML-based PAC system for older adults to classify their ADLs in free-living conditions [34]. The current work continues our previous efforts by developing deep learning-based PAC systems which have never been trained and/or tested on the elderly population, to the best of our knowledge. Using a fully validated free-living dataset of older adults' ADLs, we aim to compare classical ML-based PAC systems and deep learning-based PAC systems. Recently, only a couple of studies [35,36] have investigated the performance of classical ML versus deep learning algorithms. Nevertheless, these studies focused on young adults performing ADLs in a laboratory-constrained environment.
In summary, the objectives of the current study are:
1. To develop a physical activity classification (PAC) system for an older population in free-living conditions using a deep learning approach.
2. To compare the performance of the classical machine learning-based PAC system and the deep learning-based PAC system.

Dataset
The dataset used in this study is a subset of a larger dataset collected by the Department of Neuromedicine and Movement Science, Faculty of Medicine and Health Sciences at the Norwegian University of Science and Technology (NTNU) under the ADAPT project (A Personalized Fall Risk Assessment System for promoting independent living) [37]. The ADAPT dataset was collected in free-living conditions, where the subjects were free to perform ADLs in an unsupervised way. The way of performing activities was natural and unstructured. A total of 20 older adults (76.4 ± 5.6 years) participated in the protocol, performing various ADLs. The subjects were instrumented in the lab (i.e., they wore sensors to record movements and a chest-mounted camera to obtain labels of activities), after which they went home to perform the ADLs in free-living conditions. Subjects were instructed to naturally perform their usual ADLs, but to include a set of defined activities as a part of the free-living protocol, without any instruction or supervision on how to perform them. The activities classified in this work were: sitting, standing, walking, lying. A subset of four of the sensors used in the (out-of-the-lab) free-living protocol from the ADAPT dataset was analysed in this study. The choice of this subset was motivated by the highest performance (F-score) achieved in our earlier work [34]. The subset of sensors is presented in Figure 1, and the sampling frequency of sensors was 100 Hz. The chest-mounted camera shown in Figure 1 served as ground truth [37] to validate the performance (F-score) of sensor-based PAC systems. Five raters performed the video labelling of the subjects' movements using the video recordings obtained through the chest-mounted camera, achieving a very high inter-rater reliability of above 90% in labelling the free-living ADLs.

Splitting Training and Testing Data
Each IMU sensor provides six signals (three for linear acceleration, three for angular velocity), resulting in 24 signals across the four sensors. Windows of 5 s were used; at the 100 Hz sampling frequency, each window contains W = 500 samples. The window length of 5 s was chosen to maintain consistency with our earlier work [34] and provide comparable results. The N windows were divided into training and testing sets before developing the ML models and analysing their performance. The data of 16 of the 20 participants were used in this study; the data of the remaining 4 participants were not used due to technical issues with the wrist sensor. The dataset of the 16 participants contained a total of 36,139 windows. The data were split following the 70% (train)/30% (test) hold-out method, which is one of the common methods to validate the performance of machine learning models. Data from 11 participants were used to train the ML model (N = 26,115 windows), and the data from the remaining 5 participants (N = 10,024 windows) were used to test the performance of the trained model, as presented in Table 1. The F-score was used as a performance measure for the comparative analysis of PAC systems and is used interchangeably with performance throughout this study.
Table 1. Training and testing data split, giving the number of sample windows per participant and classified activity.
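The windowing and participant-wise split described in this section can be sketched as follows. This is a minimal illustration, not the pipeline used in the study: the function names, participant identifiers, and toy recordings are hypothetical, and non-overlapping windows are an assumption, since the text does not state whether windows overlap. Only the sampling frequency, window length, number of signals, and window counts are taken from the text.

```python
FS_HZ = 100                       # sampling frequency stated in the text
WINDOW_SEC = 5                    # window length stated in the text
WINDOW_LEN = FS_HZ * WINDOW_SEC   # 500 samples per window
N_SIGNALS = 24                    # 4 IMUs x 6 signals

def segment(recording, window_len=WINDOW_LEN):
    """Cut a recording (list of per-sample rows) into non-overlapping windows.

    Non-overlapping windows are an assumption; trailing samples that do not
    fill a whole window are dropped.
    """
    n_windows = len(recording) // window_len
    return [recording[i * window_len:(i + 1) * window_len] for i in range(n_windows)]

def split_by_participant(windows_by_participant, test_ids):
    """Assign every window of a participant to either the training or the test set."""
    train, test = [], []
    for participant, windows in windows_by_participant.items():
        (test if participant in test_ids else train).extend(windows)
    return train, test

# Hypothetical toy recordings: 3 participants, 1,234 samples each, 24 signals per sample.
recordings = {pid: [[0.0] * N_SIGNALS for _ in range(1234)] for pid in ("p1", "p2", "p3")}
windows = {pid: segment(rec) for pid, rec in recordings.items()}
train, test = split_by_participant(windows, test_ids={"p3"})

# Window counts reported in the text: the participant-wise split is close to 70/30.
N_TRAIN, N_TEST = 26115, 10024
train_share = N_TRAIN / (N_TRAIN + N_TEST)
```

Splitting by participant rather than by window keeps all data of a given person on one side of the split, so the test set measures performance on unseen individuals.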

Deep Learning Algorithm for PAC
The LSTM network (a variant of the RNN) was used as the deep learning algorithm to develop the PAC system. LSTM networks have been shown to perform better than simple RNNs [38], due to their ability to capture long-term dependencies in time series data. The LSTM network retains these dependencies through explicit memory cells in its architecture, whose gates control when to keep or forget information from long data sequences. The training data of the four wearable IMU sensors (Figure 1) was fed into the LSTM network. The input data structure is presented in Figure 2. N denotes the total number of windows across all participants in the training and testing scenarios (Table 1). The specifications of the proposed LSTM model for the PAC system are listed in Table 2.
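To make the keep/forget mechanism concrete, the sketch below runs the forward pass of a single standard LSTM cell over one window of 500 time steps with 24 input signals. It is an illustration only: the hidden size, the random weights, and the synthetic input are hypothetical, no training is performed, and the architecture actually used in this work is the one specified in Table 2.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def init_params(n_in, n_hidden, seed=0):
    """Small random weights for the four gates (illustrative, untrained)."""
    rng = random.Random(seed)
    def layer():
        W = [[rng.uniform(-0.1, 0.1) for _ in range(n_in + n_hidden)]
             for _ in range(n_hidden)]
        b = [0.0] * n_hidden
        return W, b
    return {name: layer() for name in ("forget", "input", "candidate", "output")}

def lstm_step(x, h_prev, c_prev, params):
    """One LSTM time step: returns the new hidden state h and cell state c."""
    z = x + h_prev  # concatenation of the current input and the previous hidden state
    def gate(name, activation):
        W, b = params[name]
        return [activation(sum(w * v for w, v in zip(row, z)) + bias)
                for row, bias in zip(W, b)]
    f = gate("forget", sigmoid)       # how much of the old cell state to keep
    i = gate("input", sigmoid)        # how much new information to write
    g = gate("candidate", math.tanh)  # candidate values to write
    o = gate("output", sigmoid)       # how much of the cell state to expose
    c = [f_k * c_k + i_k * g_k for f_k, c_k, i_k, g_k in zip(f, c_prev, i, g)]
    h = [o_k * math.tanh(c_k) for o_k, c_k in zip(o, c)]
    return h, c

N_SIGNALS = 24   # from the text: 4 IMUs x 6 signals
WINDOW_LEN = 500 # from the text: 5 s at 100 Hz
N_HIDDEN = 8     # hypothetical hidden size, chosen only to keep the example small

params = init_params(N_SIGNALS, N_HIDDEN)
rng = random.Random(1)
window = [[rng.gauss(0.0, 1.0) for _ in range(N_SIGNALS)] for _ in range(WINDOW_LEN)]

h, c = [0.0] * N_HIDDEN, [0.0] * N_HIDDEN
for x in window:
    h, c = lstm_step(x, h, c, params)
```

In a classifier, the final hidden state `h` would typically be passed to a dense softmax layer over the four activity classes; that layer is omitted here.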

Classical Machine Learning Algorithm for PAC
The methodology used in this study is the same as the one proposed previously [34]. However, instead of using leave-one-subject-out cross-validation, this study used the training and testing data split presented in Table 1. The performance analysis of classical machine learning-based PAC used the same set of sensors highlighted in Figure 1.
The set of features extracted from the wearable sensors is presented in Table A1 in Appendix A. Three feature selection approaches were used, each combined with a weighted SVM classifier, to compute the overall performance and the performance by class: correlation-based feature selection (CFS) [42], fast correlation-based filter (FCBF) [43], and ReliefF [44]. The performance obtained with all features, without any feature selection (Table A1, PAC-All-Feat), was also computed.
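As an illustration of how a ReliefF-style filter ranks features, the sketch below implements a simplified multi-class variant: each feature is rewarded when it differs between an instance and its nearest neighbours from other classes (misses) and penalised when it differs from its nearest neighbours of the same class (hits). This is not the implementation used in the study; the function name, the number of neighbours `k`, and the toy data are assumptions.

```python
import random

def relieff(X, y, k=3):
    """Return one relevance weight per feature (higher means more relevant)."""
    n, d = len(X), len(X[0])
    # Feature ranges normalise per-feature differences to [0, 1].
    spans = []
    for a in range(d):
        column = [row[a] for row in X]
        spans.append((max(column) - min(column)) or 1.0)

    def diff(a, r, s):
        return abs(r[a] - s[a]) / spans[a]

    def dist(r, s):
        return sum(diff(a, r, s) for a in range(d))

    classes = sorted(set(y))
    prior = {c: y.count(c) / n for c in classes}
    weights = [0.0] * d
    for i in range(n):
        for c in classes:
            # k nearest neighbours of instance i within class c (excluding i itself)
            neighbours = sorted((j for j in range(n) if j != i and y[j] == c),
                                key=lambda j: dist(X[i], X[j]))[:k]
            if not neighbours:
                continue
            if c == y[i]:
                scale = -1.0                              # nearest hits lower the weight
            else:
                scale = prior[c] / (1.0 - prior[y[i]])    # nearest misses raise it
            for a in range(d):
                mean_diff = sum(diff(a, X[i], X[j]) for j in neighbours) / len(neighbours)
                weights[a] += scale * mean_diff / n
    return weights

# Hypothetical toy data: feature 0 tracks the class label, feature 1 is pure noise.
rng = random.Random(0)
y = [label for label in range(4) for _ in range(10)]
X = [[label + rng.uniform(-0.1, 0.1), rng.uniform(0.0, 1.0)] for label in y]
weights = relieff(X, y)
ranking = sorted(range(len(weights)), key=lambda a: weights[a], reverse=True)
```

A feature subset is then obtained by keeping the top-ranked features; the cut-off is a design choice and is not specified here.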
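The F-score was computed as a performance measure to compare the classical machine learning with the deep learning PAC system, using the standard per-class definition:
$$\text{Precision}_c = \frac{TP_c}{TP_c + FP_c}, \qquad \text{Recall}_c = \frac{TP_c}{TP_c + FN_c}, \qquad \text{F-score}_c = \frac{2 \cdot \text{Precision}_c \cdot \text{Recall}_c}{\text{Precision}_c + \text{Recall}_c}$$
where TP = True Positive, TN = True Negative, FN = False Negative, and FP = False Positive. The subscript "c" is used to denote class metrics. The overall F-score was calculated by averaging the F-score of all classes.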
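The per-class and averaged F-score can be computed directly from a confusion matrix, as in the minimal sketch below. It uses the standard identity F-score = 2TP / (2TP + FP + FN), which equals the harmonic mean of precision and recall. The function names and the confusion matrix are hypothetical; the matrix is toy data and does not reproduce Table 4 or Table 5.

```python
def per_class_f_scores(confusion):
    """confusion[i][j] = number of windows of true class i predicted as class j."""
    n = len(confusion)
    scores = []
    for c in range(n):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp                        # class c windows predicted as another class
        fp = sum(confusion[r][c] for r in range(n)) - tp   # other classes predicted as class c
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return scores

def overall_f_score(confusion):
    """Unweighted average of the per-class F-scores."""
    scores = per_class_f_scores(confusion)
    return sum(scores) / len(scores)

# Hypothetical toy confusion matrix; rows and columns: walking, sitting, standing, lying.
toy_confusion = [
    [90,   0, 10,   0],
    [ 0, 100,  0,   0],
    [10,   0, 90,   0],
    [ 0,   0,  0, 100],
]
class_scores = per_class_f_scores(toy_confusion)
overall = overall_f_score(toy_confusion)
```

In this toy matrix only walking and standing are confused with each other, so those two classes score 0.90 while sitting and lying score 1.00, giving an overall F-score of 0.95.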

Performance Analysis of LSTM based PAC System
The LSTM-based PAC system performed well in classifying the ADLs of older people, achieving an overall F-score of 97.23%. The performance by class on the test set for walking, sitting, standing, and lying, as well as the overall performance, is presented in Table 3, in which the results of the classical machine learning and deep learning approaches are compared. The corresponding confusion matrix for the LSTM-based PAC system is shown in Table 4. The findings show that the LSTM-based PAC system can classify each ADL with a very high F-score of above 94%, which confirms the strength of deep learning methods. The sitting and lying classes achieved the highest F-scores, at around 99%, while the walking and standing classes showed lower scores (94.48% and 96.09%, respectively).
The detailed performance analysis of the LSTM-based PAC system using different sensor combinations is presented in Appendix B (see Table A2). The findings show that LSTM-based PAC systems developed using combinations of two or more sensors outperformed the single-sensor systems. Performance plateaus when three sensors are used, beyond which adding more sensors does not improve it.

Performance Analysis of Classical Machine Learning Based PAC System
The classification performances of the four classical machine learning-based PAC systems are presented in Table 3, and the corresponding confusion matrices are shown in Table 5. These performances were obtained using the same dataset and train/test data split used for the LSTM-based PAC system (Table 1). All the classical machine learning-based PAC systems performed at an acceptable level (F-score > 90%, Table 3). The best performance was obtained using the ReliefF-based PAC system, with an F-score of 96.83%, which is promising and shows the capability of the proposed PAC system in classifying ADLs. The second-best performance, 94.33%, was achieved using the full feature set. The PAC systems based on correlation-based feature selection methods, i.e., PAC-CFS and PAC-FCBF, achieved slightly lower F-scores of 93.25% and 91.17%, respectively. To illustrate the impact of feature selection on the PAC system's performance, the number of features used by each classical machine learning-based PAC system is presented in Table 6. Table 6 shows that CFS and FCBF selected the smallest number of features among all the feature sets analysed and still performed well in classifying the four analysed ADLs. The CFS- and FCBF-based PAC systems used 18 and 17 features, respectively, and the ReliefF-based PAC system used 105 features. The total number of features, without any feature selection approach, was 326. This substantial reduction in the feature sets of the correlation-based feature selection methods (CFS, FCBF) could explain their slight performance degradation compared to the other two approaches (all-feature set, ReliefF). However, the difference in the performance of these systems was less than 3% and, interestingly, the correlation-based feature selection methods reduced the feature set size by up to 94%.
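The size of the feature-set reduction follows directly from the feature counts reported in this section, as the short calculation below shows. All numbers are taken from the text; only the variable names are introduced here for illustration.

```python
TOTAL_FEATURES = 326  # full feature set, without feature selection

# Number of features kept by each feature selection approach (from the text).
selected = {"PAC-CFS": 18, "PAC-FCBF": 17, "PAC-ReliefF": 105}

# F-scores in percent reported for each system (from the text).
f_score = {"PAC-All-Feat": 94.33, "PAC-CFS": 93.25, "PAC-FCBF": 91.17, "PAC-ReliefF": 96.83}

# Percentage of the full feature set removed by each approach.
reduction = {name: 100.0 * (TOTAL_FEATURES - n) / TOTAL_FEATURES
             for name, n in selected.items()}

for name, pct in reduction.items():
    print(f"{name}: {selected[name]} features kept, {pct:.1f}% removed, "
          f"F-score {f_score[name]:.2f}%")
```

CFS and FCBF each remove a little over 94% of the features, whereas ReliefF removes roughly two thirds of them.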
The reduction in the feature set can significantly reduce the computational complexity, making the system more feasible and applicable in real-life conditions, which is in line with our earlier findings [34]. The high performance of ReliefF is also in line with our earlier analysis [34], where it was shown that ReliefF achieves better performance when the PAC system is implemented over multi-sensor feature sets (which is the scenario in the present study).
The overall performances obtained through classical machine learning algorithms and the LSTM-based deep learning algorithm (see Table 3) suggest that both methodologies can accurately classify the ADLs. The best classical machine learning PAC system is based on the feature set obtained through ReliefF, and its performance is close to that obtained through deep learning, with a difference of 0.4% (97.23% vs. 96.83%). To give better insight into class performance, the F-scores obtained by all PAC systems are depicted in Figure 3, for both the classical machine learning- and the deep learning-based approaches. All the ADLs, i.e., sitting, standing, walking, and lying, are accurately classified by these PAC systems (PAC-ReliefF, PAC-LSTM) with very high performance by class (above 90%), and the differences in performance between these PAC systems for all classified ADLs are minimal (less than 1%; Table 3, columns 2 and 6). Moreover, the confusion matrices obtained from the PAC systems (Tables 4 and 5) suggest that the walking and standing classes are quite often confused with each other in both cases, i.e., in classical machine learning and deep learning, which is the reason for their lower F-scores. This could be because three out of the four IMU sensors (chest, lower back, and thigh; see Figure 1) have a similar orientation during standing and walking, which could have contributed to this slight degradation in performance and to the confusion between the classes.
In contrast, the sitting and lying classes possibly have more distinctive properties, as three out of the four IMU sensors (thigh, chest, lower back) change their orientation from sitting to lying. Overall, neither of the approaches, i.e., classical or deep learning, outperformed the other in this work. This result could be related to the fact that a plateau in performance was reached: after a certain level of performance, further enhancement might not be possible regardless of which of the two machine learning approaches is used, as there is only a narrow margin left for improvement and for differentiating between the performances of the various PAC systems. Recently, Baldominos et al. [36] performed a similar analysis comparing classical machine learning with CNN-based PAC systems (although they analysed the ADLs of younger adults in a constrained environment, rather than in free-living conditions, and they used a CNN instead of an LSTM network). They concluded that the classical machine learning PAC system performed better than the deep learning-based PAC system, which suggests that deep learning methods are not always optimal for wearable sensor-based physical activity classification systems. Their finding is somewhat in line with our present work, as our proposed classical machine learning and deep learning PAC systems performed equally well, with a marginal performance difference (approximately 0.4%).
The findings of our study show the similar strength of classical and deep learning-based PAC systems in profiling the free-living activities of an elderly population. However, it is essential to mention that the dataset analysed in this study, although unique, is not very large, and the classified activities might not be challenging enough to benefit DNNs, which tend to perform better on larger datasets. PAC systems might behave differently when applied to datasets from larger cohorts and different populations, with a larger number of activity classes, but this requires further validation in a future study. These observations emphasize that the choice of an appropriate ML algorithm (classical ML or deep learning) depends, to a high degree, on the nature of the problem domain and on the quality and quantity of the labelled dataset. Nevertheless, it is important to highlight that the dataset used in the study is the first of its kind, in that it included older people in free-living conditions and underwent an extensive and detailed validation/ground truth annotation process by multiple raters [37]. Moreover, performing the free-living protocol in the home environment generated more natural patterns and distributions of ADLs than could be obtained in a laboratory-based setup [45]. Future work should focus on exploring other DNNs, such as CNNs or hybrid CNN-LSTMs, or on using a temporal CNN as a feature extractor and then feeding the results to a classical ML classifier, such as an SVM.

Conclusions
This study investigated the performance of classical machine learning-based PAC systems and a deep learning-based PAC system. The dataset used in this study was based on the activities of daily living performed by older people in free-living conditions. There were no constraints on how and when to perform a specific activity, and the participants performed the study protocol in their residential settings. A subset of four wearable inertial sensors from the ADAPT study was analysed in order to classify the daily living activities. The classical machine learning-based PAC system was developed by applying weighted SVM and feature selection. The deep learning-based PAC system was developed using the LSTM approach, by directly feeding in the raw data from the inertial sensors. This study demonstrated that both approaches (classical machine learning and deep learning) can accurately classify the daily living activities of the elderly population with very high performance (F-scores of around 97%). Neither approach was found to be clearly superior to the other, suggesting that both the machine learning and deep learning approaches can classify the activities equally well, in terms of the dataset used in this work.

Institutional Review Board Statement: The study was approved by the Regional Committee on Ethics in Medical Research in Central Norway (reference number 2014/1121).

Informed Consent Statement:
All participants provided written consent to participate.

Data Availability Statement:
The script from the ADAPT validation data set used in this study is available on request from Jorunn L. Helbostad (jorunn.helbostad@ntnu.no).
Conflicts of Interest: L.P. and L.C. are co-founders and own shares of mHealth Technologies. All other authors declare no competing interest.

Appendix B
Performance Analysis of LSTM based PAC System on Test Set