Significant Features for Human Activity Recognition Using Tri-Axial Accelerometers

Activity recognition using wearable sensors has become essential for a variety of applications. Tri-axial accelerometers are the most widely used sensor for activity recognition. Although various features have been used to capture patterns and classify the accelerometer signals to recognise activities, there is no consensus on the best features to choose. Reducing the number of features can reduce the computational cost and complexity and enhance the performance of the classifiers. This paper identifies the signal features that have significant discriminative power between different human activities. It also investigates the effect of sensor placement location, the sampling frequency, and activity complexity on the selected features. A comprehensive list of 193 signal features has been extracted from accelerometer signals of four publicly available datasets, including features that have never been used before for activity recognition. Feature significance was measured using the Joint Mutual Information Maximisation (JMIM) method. Common significant features among all the datasets were identified. The results show that the sensor placement location does not significantly affect recognition performance, nor does it affect the significant sub-set of features. The results also showed that with high sampling frequency, features related to signal repeatability and regularity show high discriminative power.


Introduction
Increasing Physical Activity (PA) is a significant part of daily life, which can have beneficial health effects [1]. Physical inactivity has been identified as one of the top four risk factors for global mortality: inactive people are vulnerable to several health conditions such as high blood pressure, diabetes, cardiovascular diseases and cancer [2]. Physical inactivity is a significant concern amongst the globally ageing population. In 2017, there were twice as many children under the age of 15 as the number of people aged 60+ in the world, but by 2050, these numbers are expected to be equal [3].
Managing the expected demand in social and health services will need to harness opportunities from new technologies. Ambient Assistive Living (AAL) technology has the potential to relieve some pressure from both families and social services, and provide more care in the home and community, which will reduce national health and care spending and has the potential to improve wellbeing [4]. Monitoring the activities of older people during their daily life can improve their safety, autonomy and quality of life [5].
In the past, a common way of assessing PA was the use of questionnaires, such as the Physical Activity Scale for the Elderly [6]. Such questionnaires are limited as they cannot provide accurate continuous data of lived experience, relying instead on individuals' rationalization of their behavior.
This has led to a great deal of interest in the potential of wearable technologies for monitoring PA. Currently, the accelerometer and gyroscope are the most widely used wearable sensors for PA monitoring. They have been used in a variety of applications, including

ADL Recognition with Accelerometers: Background and Literature Review
Typical ADL recognition using an accelerometer is usually achieved by a task that consists of several steps: signal pre-processing, feature extraction, dimensionality reduction and classification [19]. The classification is not performed on every single data point in the acceleration data; instead, the signal is segmented using a sliding window with some degree of overlap. This minimises the effect of noise and reduces the computational cost of the learning algorithm. There is no agreement about the window size, as each activity has a different duration. Therefore, there should be a trade-off: a small window is not enough for a long activity, while a large window may contain more than one activity. A variety of window lengths have been used in the literature [20]. The feature extraction step involves calculating attributes that are able to capture the patterns over a sliding window. The features can be categorised into three: time domain, frequency domain, and time-frequency domain. Various features have been used for the ADL recognition task in the literature, starting from simple statistical methods (e.g., signal mean, signal standard deviation, correlations, etc.) to frequency domain features. However, there is no consensus on the best features that should be extracted and used for classification [21]. Using a large number of features does not necessarily improve the classification. Moreover, using high dimensional data will increase the computational burden and make the classifier model complex. Therefore, dimensionality reduction techniques are usually used to reduce the number of features. Feature selection is a dimensionality reduction used to identify the informative features which are relevant to the classification task and eliminate redundant and irrelevant features.
It is well known that classification accuracy can be enhanced through reducing dimensionality [22]. In ADL recognition, using relevant high-quality features and excluding irrelevant and redundant features is essential for improving classification accuracy and reducing the computational cost. Therefore, in order to achieve the best performance, the dimensionality (number of features) should be reduced as much as possible.
The two main techniques used for dimensionality reduction are feature extraction and feature selection. Feature extraction involves the creation of a new feature space as a combination of the original features. The most common methods are Principal Component Analysis (PCA) [23], Linear Discriminant Analysis (LDA) [24] and Singular Value Decomposition (SVD) [25]. Feature selection, on the other hand, involves the selection of the best subset of the original features. It can be categorised in terms of its relation to the classification algorithm, such as filter, wrapping and embedded methods. Wrapping methods are classifier dependent and use classification accuracy as a measure of the performance of the selected subset of features. These methods are very specific to the classification algorithm used for the evaluation. Embedded methods are also classifier dependent methods; they are part of the classifier. Unlike the previous methods, filter methods are classifier independent. They measure the quality of selected features without the involvement of a classifier algorithm [26].
Researchers usually collect their own data to train and validate their proposed system. This makes it very difficult to compare the performance of the different proposed techniques in an objective way [27].
Accelerometers have been used in many ADL recognition approaches in the literature. However, each study focuses on different activities and extracts diverse features, which makes it difficult to compare performance between studies. Many machine learning techniques have been used for ADL recognition, such as Random Forest [28], Support Vector Machine (SVM) [29], Decision Trees [30], Hidden Markov Models [31], and Bayesian Networks [29]. Data can be collected at different sampling frequencies; however, a 20 Hz sampling frequency is enough for capturing simple activities such as walking, standing, and sitting which are fairly regular [32]. However, a low sampling frequency may hide significant details about complex activities that have less regularity in the acceleration signals [33].
Feature selection has also been employed for activity recognition applications. Preece et al. (2009) [34] compared the classification accuracy of fourteen feature extraction methods used in the literature. A specific dataset was collected for this study using three EBT tri-axial accelerometers attached to the waist, the thigh and the ankle. The results showed that Fast Fourier Transform (FFT) features perform the best. The authors studied the feature sets individually.
Relief-F and Sequential Forward Floating Search (SFFS) wrapping methods were applied to a range of features from the literature [35]. The authors collected the data using one Freescale MMA 7260 tri-axial accelerometer attached to the waist using a custom-made belt. Statistical features (mean, variance, maximum acceleration difference) and spectral features were used. They employed two classifiers to evaluate the selected subset (K-Nearest Neighbor (KNN) and Naïve Bayes). They identified a variance of the x,y,z axis, entropy of x,y,z energy uncorrelated, window mean x,y,z, mean trend x,y,z and fluctuation analysis coefficient as the most significant subset. Capela et al. (2015) [19] used three filter feature selection methods: Relief-F, Correlationbased Feature Selection (CFS) and Fast Correlation Based Filter (FCBF). The selected feature sets were evaluated using several classification algorithms. In this research, gyroscope data was included in the analysis, and 76 features were extracted. The authors used SVM, J48 and Naïve Bayes classifiers to evaluate the performance of the selected subsets. The selected subsets were a collection of accelerometer and gyroscope features. Atallah et al. (2010) [36] used six tri-axial accelerometers to collect data. This number was chosen, as it was considered appropriate to study the ideal location for the sensor. They used a set of features consisting of statistical time domain features and features based on FFT. Relief-F, Simba and minimum Redundancy Maximum Relevance (mRMR) methods were employed to assess the significance of each feature. The performance of the three methods was very similar in terms of classification accuracy. They also agreed that the total FFT energy was the most relevant feature. Chernbumroong et al. (2011) [37] used an eZ430-chrons watch to collect data with a sampling rate of 33 Hz. The study focused on five activities: sitting, standing, lying, walking and running. The CFS method was used to select the features that highly correlated to the class label. The C4,5 classifier and Artificial Neural Network were used for the classification. The study did not specify which features were ranked as the most significant. Suto et al. (2017) [38] used nine feature selection methods to rank the extracted features. The authors focused on the most common features used by previous studies. Three classification algorithms (artificial neural network, KNN, decision tree) were used to evaluate the performance of the selected features, using the wearable action recognition database [39], which is collected from 20 participants who were asked to perform 13 common activities. The authors did not explain the evaluation criterion. The features have been investigated for each participant individually which may make the findings subject to sensor placement and orientation. The results were not consistent between the participants and the employed methods and classification algorithms. Rosati et al. (2018) [40], divided the time domain and frequency domain features into two sets (FeatSet_A, and FeatSet_B), and evaluated the performance of each set on activity recognition. The study employed a data set that was collected using IMUs from 61 healthy participants as part of this study. A 5 s sliding window with overlap of 3 s was used to segment all accelerometer and gyroscope signals. A genetic algorithm was employed to perform feature selection and minimise each set. The SVM classifier was employed to classify the achieved 97.1% accuracy; the results also showed FeatSet_B out performed FeatSet_A, where FeatSet_B comprises a number of zero crossing, positive and negative peaks, and mean value, standard deviation, maximum, minimum, and range of duration for positive, negative, and total peaks. Orlov et al. (2019) [41] employed the Relief-F algorithm to rank the features, where 54 features were extracted from the accelerometer signals. This study extracted only the simple statistical time domain features such as mean, standard deviation, and skewness. The authors reported improvement of about 1% on the classification accuracy after they applied the feature selection and reduced the number of features. They also report a 4-fold improvement in the processing time of the classifier. Ahmed et al. (2020) [42] employed the SFFS method to study the significance of the features. The study focused on statistical and frequency features only. The authors employed one publicly available dataset to test their approach. Both accelerometer and gyroscope data were used to extract 138 features. The reported results showed that the optimal subset consists of 18 features which were extracted from the 3 axes. The common features between the axes are mean, standard deviation, skewness, and kurtosis. The rest of the features were different between the axes. The finding of this study is very specific to the SVM classifier and cannot be generalized.
Finally, Thakur et al. (2022) [43] employed Guided Regularized Random Forest (GRRF) feature selection method to analyse the significance of the smart phone sensor feature for HA recognition. 20 features were extracted from 3-axial signals. They include statistical, time domain, and frequency domain features. In this work, Random Forest (RF) as a feature selection algorithm. The results showed that reducing the set of features by removing the irrelevant and redundant features improved the classification accuracy and allow SVM classifier to outperform deep learning-based approach on the classification task. However, the authors did not list the most significant subset of features that was used to generate that accuracy.

Contribution
Most of the previous studies extracted limited sets of features. Moreover, they used a limited set of features, and most of them employed just one specific dataset, which was collected as part of the study. The missing of the collected datasets and other key details of those studies does not allow us to evaluate the proposed approach against some of those studies.
In contrast, this paper extracts a comprehensive set of features from accelerometer raw data, some of which has never been used for this application before. Then, a framework is employed to measure the significance of the features from four publicly available datasets. These datasets have different sampling frequencies and different sensor placement locations; we have used them to comprehensively investigate the significance of the features for the classification. The investigation was performed at the study level; the significance of the features which were extracted from all the participants' data have been analysed together. This framework consists of several steps and employs a state-of-the-art feature selection method: The Joint Mutual Information Maximisation (JMIM) method [22]. The paper also investigates the influence of sampling frequency, sensor placement location, and complexity of activities on the significance of the features. Another distinction from previous work, where researchers collected their own data, is that publicly available datasets have been used in this study to compare the features objectively and make the results reproducible.
The rest of the paper is organised as follows. Section 2.1 describes the datasets, Section 2.2 explains the experimental setup, Section 3 presents the results, and Section 4 discusses the results and concludes the paper.

Datasets
To define the significant features for the activity recognition task using accelerometers, an analysis was performed using four different datasets with various characteristics. Two datasets are publicly available from the UCI machine learning repository [44], and the other two from [45,46]. Table 1 lists the characteristics of each dataset. These datasets have been previously used in activity recognition research [46][47][48][49][50]. They were selected to be able to test the selected features with a variety of placement locations, sampling frequencies, the overall performance of the activity recognition task and the inclusion of simple and complex activities in the testing datasets. This dataset was collected in a controlled laboratory environment using an accelerometer sensor on a smartphone placed at the waist. The data was collected from 29 participants at a sampling frequency of 20 Hz. The data consisted of six activities (walking, jogging, ascending stairs, descending stairs, sitting and standing). No demographics about the participants were provided.

Heterogeneity Activity Recognition Dataset (HARDS)
The HARDS dataset (also called HHAR) has been collected using a variety of smartphones and smart watches from nine participants (age range 25-30 years) in a controlled setting using specific routes outside the laboratory. All of the participants were asked to perform five minutes of each activity following a given script. Several sampling frequencies were used to collect the data based on the device used. In this paper, only the part of the data collected using Samsung and LG smart watches were used. The watches used sampling rates of 100 and 200 Hz, respectively, to collect the data. Six activities were monitored (walking, biking, sitting, ascending stairs, descending stairs and standing).

Smartphone-Based Recognition of Human Activities and Postural Transitions Dataset (SBRHA)
A group of thirty participants (age range 19-48 years) were recruited to collect this dataset in a controlled environment. They were asked to perform six activities (walking, laying, sitting, climbing up the stairs, climbing down the stairs and standing). A smartphone placed on the waist was used to capture the activities at a sampling frequency of 50 Hz using the built-in tri-axial accelerometers and 3-axial gyroscopes. In this study, we are focusing only on the data from accelerometers.

Physical Activity Monitoring for Aging People (PAMAP2)
This dataset contains 12 different activities. Nine participants (1/9 female, total age range of 24-32 years) were recruited to collect this data using three Inertial Measurement Units and heart rate monitoring in a controlled environment outside the laboratory.
One inertial measurement unit (IMU) was placed on the wrist on the dominant arm, another on the chest, and the last one on the dominant side's ankle. The IMU is a device that consists of triaxial accelerometers, triaxial gyroscope, and triaxial magnetometer. The gyroscope is known for it is tendency to drift over time even when the sensor is not rotating [51]. In this study, only data collected from the accelerometer was used in this study; The data was collected at a frequency of 100 Hz. This dataset has been chosen to study the effect of sensor placement location on the predictive power of the features.

Evaluation Framework
The evaluation framework used in this study is shown in Figure 1; some of the steps are described in more detail below. effect of sensor placement location on the predictive power of the features.

Evaluation Framework
The evaluation framework used in this study is shown in Figure 1; some of the step are described in more detail below.

Pre-Processing
It has been reported in the literature that the best classification accuracy can b achieved using a sliding window between 2.5-3.5 s [52]. Therefore, each acceleration com ponent (signal) is segmented using a 2.5 s sliding window. Each window is considered sequentially. Each observation (2.5 s) in the raw data is considered as an occurrence. Fo example, walking for five seconds is considered as two occurrences. An Infinite Impuls Response (IIR) high-pass filter was employed on each window to remove gravitationa contributions.

Feature Extraction
Raw accelerometer data cannot be used directly by most ML algorithms. Feature that measure the characteristics of the signal windows are extracted to capture the pat terns among the data. Since this study investigates accelerometer signal features, the de cision was made to extract most of the features that were reported in the literature. Tim domain and frequency domain are commonly used measures as features fo

Pre-Processing
It has been reported in the literature that the best classification accuracy can be achieved using a sliding window between 2.5-3.5 s [52]. Therefore, each acceleration component (signal) is segmented using a 2.5 s sliding window. Each window is considered sequentially. Each observation (2.5 s) in the raw data is considered as an occurrence. For example, walking for five seconds is considered as two occurrences. An Infinite Impulse Response (IIR) high-pass filter was employed on each window to remove gravitational contributions.

Feature Extraction
Raw accelerometer data cannot be used directly by most ML algorithms. Features that measure the characteristics of the signal windows are extracted to capture the patterns among the data. Since this study investigates accelerometer signal features, the decision was made to extract most of the features that were reported in the literature. Time domain and frequency domain are commonly used measures as features for accelerometer data. In addition to the common features, features across the domain boundaries were included, which are not commonly used for activity recognition, such as:

•
Recurrence Quantification Analysis (RQA) [53]: Recurrence is an essential property of any dynamical system. It is used to describe the behavior of the system in phase space. Recurrence Plots (RPs) are a method used for visualizing the recurrence behavior of dynamical systems [49]. Assume that a dynamical system is described in phase space with a set of trajectories x → i , i = 1, 2, . . . , N, these vectors are used to describe a quantity of parameters. The development of the system state is described by series of the vectors. The dynamical system can be represented by recurrence matrix R(i, j): where Θ : R → (0, 1), N is the number of considered states, and ε are predefined distance. E is essential as the system usually recurs exactly visiting the same states. A periodic system is characterized by long and non-interrupted diagonals in RPs. The vertical distance between these diagonal lines reflects the period of oscillation. While in the chaotic system, diagonals are formed in RP, but shorter than periodic systems. Uncorrelated stochastic signals generate RP with many single points. Therefore, the shorter the diagonals in RP, the less predictable the system. Therefore, each activity is expected to have its diagonal characteristics. In order to quantify the characteristics of the recurrence plots and go beyond just visualizing them, RQA is used to produce quantitative measures based on the recurrence point density, diagonal lines, and vertical lines. Four variables for the quantification by RQA:

1.
Recurrence Rate (RR): The percentage of recurrence points in a recurrence plot. It is equivalent to the probability that a specific state will recur to its ε neighbourhood.
Determinism (DET): The percentage of recurrence points in a recurrence plot that form the diagonal lines of minimal length l min . Processes with chaotic behavior cause short or no diagonals, while deterministic processes lead to longer diagonals. Therefore, to measure the chaotic behaviour, the ratio between the recurrence points along the diagonal structures and the total all recurrence points.
where P(L) is the histogram of diagonal line of length L.

3.
Entropy (ENTR): The Shannon information theory entropy of the probability distribution of the diagonal line length.
which is the probability that the diagonal line has the length L. ENTR reflects the complexity of RP in respect of the diagonal line. The value of ENTR is small if the process is noisy. 4.
The average length of the diagonal lines (L): • Permutation Entropy (PE) [54]: PE is a non-parametric time series method which is used to quantify the complexity level of a time series data. Assume a given time series The PE method captures the probability distributions of patterns of symbol sequences, termed permutations. Each activity has different level of complexity. Therefore, measuring the complexity level can be very informative. PE quantifies the complexity by measuring the entropy of sequences: where p j is the frequency of the possible sequences. • Lyapunov Exponent (LE) [55]: LE quantifies the rate of divergence of nearby trajectories in the phase space of a dynamic system, which has been proven to be the most useful method in diagnosing the chaos in a dynamical system. A positive value of LE (λ) means that the orbits are chaotic, a zero value indicates that orbits retain in the same position, while a negative value demonstrates that orbits retain their relative positions. Lyapunov exponents quantify the exponential divergence of initially close state-space trajectories and estimate the amount of chaos in a system. In this study a small data quantity method is employed to calculate the largest Lyapunov exponent (LLE) [55]. The algorithm is described as follows: where L (t k ) denotes the distance between two points in the phase space at time t k . In this study, a small data quantity method is employed to calculate the largest Lyapunov exponent (LLE) [56]. The algorithm is described as follows: Suppose that the time series acceleration data is X = {x 1 , x 2 , . . . ., x N }, the face space can be represented using the following equation: where τ is the lag, and m is the embedding dimension. After the reconstruction of the dynamics, the algorithm searches for the point that minimizes the distance to the particular reference point. Then, L (t k ), and L (t k−1 ) are calculated and used to find the value of λ. The Lyapunov Exponent is calculated with the assumption that the acceleration signal of each feature has a different level of chaos. • Total Harmonic Distortion (THD) [57]: THD quantifies the distortion of a waveform relative to a pure fundamental frequency. The THD present in a signal is the ratio between square root of the sum of the powers of all harmonic components' values to the power of the fundamental frequency. where V 2 n is the Root mean square (RMS) amplitude of the nth harmonic, and V f und is the RMS of the fundamental frequency. In this study only the five first harmonics are included. Table 2 shows the full list of the extracted features. These features were extracted from each acceleration component of each signal segment. They were also extracted from the signal magnitude (MG = x 2 + y 2 + z 2 ). Each window is represented by 193 features in the feature matrix, all of which are calculated for each segment of each axis. In addition to the features discussed earlier, common features include mean, standard deviation, root mean square value, autocorrelation of each signal, Fast Fourier Transform (FFT), Wavelet Packet Decomposition (WPD), power peak features, and autocorrelation. All the features were calculated after the filtration except the mean and standard deviation, which were calculated using the raw data. Many FFT coefficients capture the high frequency part of the signal, these coefficients are likely to have low amplitude which are unlikely to carry significant information about the activity. Most of the information is contained below the frequency of 17 Hz. Activities like walking are low frequency activities. Therefore, WPD shown in Figure 2 is applied to the acceleration signal with decomposition at five levels to capture the lower end of the signals that contain significant information about the activities. A set of features extracted from the coefficients that are shown in Figure 2 are used extract 36 features which include: the sum of absolute value of coefficients from level 1 to 5, energy, and entropy of each acceleration signal along each axis (x,y,z) and MG. sensor readings. This can reduce the weight of abnormally high values among the data. Normalization was performed using the following equation: Continuous features in the data were discretised using the Equal Width Discretisation (EWD) method [50] to avoid any bias during the feature selection task.

Feature Selection
Feature selection is a dimensionality reduction technique. The current study employs a subset of this technique for two reasons. First, it helps to reduce dimensionality and improves the performance of the learning algorithm by removing the redundant and irrelevant features that may confuse the learning algorithm. Second, it measures the discriminative power of each feature and selects the most informative that produces the highest classification accuracy. The JMIM method was also employed in this research to select the significant feature subset for the discrimination task. This method was chosen, as it was proved that the method outperformed other state-of-the-art methods in the literature, and because it is computationally inexpensive.
JMIM is based on the use of mutual information and the 'maximum of the minimum' criterion. The method starts with the most significant feature, then it begins to select features one by one and add them into the selected subset. The feature is selected when it maximises the goal function: = arg ∈ ( ∈ ( ( , ; ))) (2) where F is the original feature set, S is the selected subset feature, is the candidate feature and is the selected feature.
The method selects the features that share the highest information with the class label in the context of the features that have already been selected. The method employs the following forward greedy search algorithm:  To simplify the presentation of the results, the features are divided into five groups, denoted by giving each group a different colour in Table 2. These groups include simple time domain features, frequency domain features, time-frequency domain features (WPD), repeatability and regularity-based features, and autorepression-based features.
All the features were normalized in order to minimize the effect of any abnormal sensor readings. This can reduce the weight of abnormally high values among the data. Normalization was performed using the following equation: Continuous features in the data were discretised using the Equal Width Discretisation (EWD) method [50] to avoid any bias during the feature selection task.

Feature Selection
Feature selection is a dimensionality reduction technique. The current study employs a subset of this technique for two reasons. First, it helps to reduce dimensionality and improves the performance of the learning algorithm by removing the redundant and irrelevant features that may confuse the learning algorithm. Second, it measures the discriminative power of each feature and selects the most informative that produces the highest classification accuracy. The JMIM method was also employed in this research to select the significant feature subset for the discrimination task. This method was chosen, as it was proved that the method outperformed other state-of-the-art methods in the literature, and because it is computationally inexpensive.
JMIM is based on the use of mutual information and the 'maximum of the minimum' criterion. The method starts with the most significant feature, then it begins to select features one by one and add them into the selected subset. The feature is selected when it maximises the goal function: where F is the original feature set, S is the selected subset feature, f I is the candidate feature and f s is the selected feature.
The method selects the features that share the highest information with the class label in the context of the features that have already been selected. The method employs the following forward greedy search algorithm: 1.
(Computation of the MI with the output class) For ∀ f i F compute I(C; f i ).

3.
(Choice of the first feature) Find a feature f i that maximises (Greedy selection) Repeat until|S| = k: (Selection of the next feature) Choose the (Output) Output the set S with the selected features.
The greedy algorithms may not find the optimal solution for some of the heuristic problems when each node in the problem has a fixed value. However, that is not the case for mutual information-based feature selections methods. Although the amount of information between each feature and the class label is theoretically fixed, when the feature is within a subset then this amount of information is no longer fixed, as there is factor called information interaction between the features that can reduce or increase this amount of information. Moreover, the feature may share redundant information that is covered by other features. In addition, the used goal function does not include any summation, it employs maximum of the minimum criterion. Therefore, the search for the features is not as simple as the common heuristic problems.
The traditional understanding of local minima problem does not apply here because of the information redundancy and information interaction between features which makes everything dynamic with the searched features.

Classification
The key point of this work is to investigate the significance of the accelerometer signal features. Classification accuracy is used to evaluate the significance of the selected subset of features. For this reason, two classification algorithms have been used; K-nearest neighbour classifier (KNN) (k = 3), and the Support Vector Machine (SVM) classifier with linear kernel; these classifiers have been chosen because they make few (if any) assumptions about the data. Therefore, the performance will be mainly due to the feature selection not the classifier's ability to handle the data. They were also chosen because they have been used in similar studies [58].
We used the Matlab Statistics Toolbox for SVM and KNN classifiers. Five-fold cross validation was used for the feature's significance evaluation. Each fold of the dataset (20%) was used for the validation once, while the rest of the data (80%) were used for the classifiers. This process was repeated five times, which means that the whole dataset was used for the validation.
The selected subset of features was evaluated after adding each new feature, starting with only the most significant feature in the selected subset. This continued until all 193 features were used together for the training and testing. The classification accuracy produced reflected the discriminative power of the whole subset after adding the new selected feature and was not based on the newly selected feature only.

Results
This section describes the results of applying the proposed framework to define the significant feature selection for each dataset.

Significant Features
The JMIM method has been used to define the most significant features when classifying the activity within the datasets. This method is classifier independent: it measures the shared information between the class label and the subset of features. Therefore, the features are ranked based on the amount of information that they add about the class, as the selected features add more information when classification accuracy improves.
The results of the WISDM dataset show that the Root Mean Square (RMS) features shared the highest information with class. Mean, FFT, autocorrelation and wavelet decomposition were among the most significant as well. These five features make up 29 out of the most significant 30 features.
The RMS feature is shown as the most significant feature in the results of SBRHA dataset. The results also showed that Mean, FFT, autocorrelation and wavelet decomposition are also among the most significant features. The 30 most significant features comprise 28 features from these four feature groups.
The results of both datasets show that uncommon features, such as RQA, permutation entropy, Lyapunov exponent and total harmonic disorder were not among the most significant features for these two datasets.
The results of the LG watch and Samsung watch datasets (HARDS) are slightly different. Features that measure the recurrence, regularity and degree of chaos were shown to be among the ten most significant features. These two datasets use relatively high sampling frequencies, which provide more values and more information about the details of the movement. Therefore, features related to repeatability, regularity and level of chaos appeared in the ten most significant features especially for the LG watch data which has a higher sampling frequency, 18 of the 30 most significant features belong to RQA, Permutation entropy, and Lyapunov exponent. Wavelet features also appeared within this list five times. While for the Samsung watch 11 of the 30 most significant features were wavelet features; repeatability related features appeared 10 times in this list. The most significant features were the standard deviation features for both datasets. Many of the twenty most significant features belong to wavelet decomposition features.
Due to the unique characteristics of PAMAP2 dataset in terms of the number of activities and the level of movement complexity of these activities, the order of the ranked features is different. Features related to Recurrence rate in the signal, WPD, permutation entropy, and autorepression make up most of the features that produce the highest classification accuracy. On the other hand, the results were different when the ankle dataset was down sampled to 20 Hz. The results showed the simple time domain features such mean, and frequency domain features such FFT, WPD, and autorepression features are among the most significant features. Features related to recurrence rate in the signal and permutation entropy appeared less significant when the sampling frequency was reduced. Therefore, features related to Recurrence rate in the signal, wavelet decompensation, permutation entropy can be important for discriminating the activities if the sampling frequency is high and more samples are available for each window. The results also showed that there is no major effect of the sensor placement location on the order of the selected features.

Classification
Two classifiers were used to evaluate the significance of the ranked selected subset of features for the classification accuracy. As mentioned earlier, each classifier was trained and tested after adding every single feature to the subset. Five-folds cross validation was used for every training and testing iteration. Classi f ication accuracy = TP + TN TP + TN + FP + FN Precision + Recall Extending these classification metrics to multi-class problems can be achieved by considering one of the classes as a positive class and the remaining ones as negative class [59]: where M is the number of distinct classes in the datasets. Figure 3 shows the classification accuracy and f-measures figures of the WISDM dataset using SVM and KNN classifiers. It is clear that SVM outperformed KNN and achieved 93% accuracy, while KNN achieved 90.65%. Table 3 shows the classification accuracy using 1, 10, 30, 60, 120 and 193 ranked features. The table shows that the most significant feature on its own is able to produce more than 74% accuracy, which is about 80% of the classification accuracy achieved using all 193 features. With just 30 features, the classification accuracy of the SVM classifier was more than 94% of the accuracy of the total features. The achieved accuracy outperforms the accuracy of the original study, which used the same dataset [60], which was 91.7%.

− = 2 × × +
Extending these classification metrics to multi-class problems can be achieved b considering one of the classes as a positive class and the remaining ones as negative clas [59]: where is the number of distinct classes in the datasets. Figure 3 shows the classification accuracy and f-measures figures of the WISDM da taset using SVM and KNN classifiers. It is clear that SVM outperformed KNN an achieved 93% accuracy, while KNN achieved 90.65%. Table 3 shows the classification ac curacy using 1,10,30,60,120 and 193 ranked features. The table shows that the most signi icant feature on its own is able to produce more than 74% accuracy, which is about 80% of the classification accuracy achieved using all 193 features. With just 30 features, th classification accuracy of the SVM classifier was more than 94% of the accuracy of the tota features. The achieved accuracy outperforms the accuracy of the original study, whic used the same dataset [60], which was 91.7%. The results of the SBRHA dataset showed that with only the most significant feature the SVM classifier was able to classify the activities with the different accuracy; joggin (93.8%), walking (67.89%), sitting (95%), standing (89.46%), while the classifier misse classifying all the walking upstairs and downstairs activities. The results of using the sub set of 30 features showed that walking upstairs and downstairs are also the most misclas sified, as walking and jogging have an accuracy of (57.55%), while standing is with th highest accuracy (99%), then jogging (97.02%), walking (92.92%) and sitting (95%).  The results of the SBRHA dataset showed that with only the most significant feature, the SVM classifier was able to classify the activities with the different accuracy; jogging (93.8%), walking (67.89%), sitting (95%), standing (89.46%), while the classifier missed classifying all the walking upstairs and downstairs activities. The results of using the subset of 30 features showed that walking upstairs and downstairs are also the most misclassified, as walking and jogging have an accuracy of (57.55%), while standing is with the highest accuracy (99%), then jogging (97.02%), walking (92.92%) and sitting (95%).
Although the dataset is not balanced, Figure 4 shows that f-measure is relatively high for both classifiers, the value of f-measure when the most significant 30 features have used to train the classifiers was 0.91, and 0.90 for SVM and KNN respectively. Although the dataset is not balanced, Figure 4 shows that f-measure is relatively hig for both classifiers, the value of f-measure when the most significant 30 features have use to train the classifiers was 0.91, and 0.90 for SVM and KNN respectively.  As noted earlier, only the data collected by the smart watches from HARDS datase was used in this study, the rest of HARDS dataset was not included. Figure 5 shows th results of the classification accuracy of the LG watch with the most significant subset o 30 features was 89.87%, and 89.01% for SVM and KNN classifiers respectively. The f-meas ure with the same number of significant features was 0.92 for both classifiers. Layin down is the activity that was classified least correctly with a recognition rate of 80.68%; was misclassified as standing and sitting. This accuracy stayed almost steady after th feature subset reached 120 features, which means the rest of the features were redundan the accuracy for SVM, and KNN were 93.05%, and 91.84% respectively.
In terms of classification rate of each activity when the most 120 significant feature are used, biking scored the highest classification (98.74%) followed by sitting (98.73% standing (92.91%), walking (88.18%), and walking up and down (84.06%). This order stay almost the same when the most 30 significant features were used, the only difference i swapping the order between walking and walking up the stairs. The results of the Samsung watch dataset are shown in Figure 6. They show that th  As noted earlier, only the data collected by the smart watches from HARDS dataset was used in this study, the rest of HARDS dataset was not included. Figure 5 shows the results of the classification accuracy of the LG watch with the most significant subset of 30 features was 89.87%, and 89.01% for SVM and KNN classifiers respectively. The f-measure with the same number of significant features was 0.92 for both classifiers. Laying down is the activity that was classified least correctly with a recognition rate of 80.68%; it was misclassified as standing and sitting. This accuracy stayed almost steady after the feature subset reached 120 features, which means the rest of the features were redundant, the accuracy for SVM, and KNN were 93.05%, and 91.84% respectively.
Although the dataset is not balanced, Figure 4 shows that f-measure is relatively hig for both classifiers, the value of f-measure when the most significant 30 features have used to train the classifiers was 0.91, and 0.90 for SVM and KNN respectively. As noted earlier, only the data collected by the smart watches from HARDS datase was used in this study, the rest of HARDS dataset was not included. Figure 5 shows th results of the classification accuracy of the LG watch with the most significant subset o 30 features was 89.87%, and 89.01% for SVM and KNN classifiers respectively. The f-meas ure with the same number of significant features was 0.92 for both classifiers. Layin down is the activity that was classified least correctly with a recognition rate of 80.68%; was misclassified as standing and sitting. This accuracy stayed almost steady after th feature subset reached 120 features, which means the rest of the features were redundan the accuracy for SVM, and KNN were 93.05%, and 91.84% respectively.
In terms of classification rate of each activity when the most 120 significant feature are used, biking scored the highest classification (98.74%) followed by sitting (98.73% standing (92.91%), walking (88.18%), and walking up and down (84.06%). This order stay almost the same when the most 30 significant features were used, the only difference i swapping the order between walking and walking up the stairs. The results of the Samsung watch dataset are shown in Figure 6. They show that th classification accuracy and f-measures of were lower than the LG watch. The highest clas sification accuracy was 88% and 86% for SVM, and KNN, respectively, which wa achieved using the 120 most significant features. In terms of classification rate of each activity when the most 120 significant features are used, biking scored the highest classification (98.74%) followed by sitting (98.73%), standing (92.91%), walking (88.18%), and walking up and down (84.06%). This order stays almost the same when the most 30 significant features were used, the only difference is swapping the order between walking and walking up the stairs.
The results of the Samsung watch dataset are shown in Figure 6. They show that the classification accuracy and f-measures of were lower than the LG watch. The highest classification accuracy was 88% and 86% for SVM, and KNN, respectively, which was achieved using the 120 most significant features. At the same time the f-measure using this subset of features was 0.88, and 0.85 for the SVM, and KNN classifiers. The figures of this dataset agree with the previous dataset about the redundancy of the 70 least significant features. The walking up the stairs activity had the lowest recognition rate of 58.70%, it was again misclassified also as standing and walking. The results for the PAMAP2 datasets for different sensor placement locations ar shown in Figures 7-9. The results shows that there is no significant influence of the senso placement on the classification accuracy. However, there is a slight advantage when sen sors are placed on the ankle. Due to the high number of activities in the PAMAP2 dataset some of which consist of complex movements, the classification accuracy for PAMAP datasets is slightly lower than the other datasets used in this study. For the ankle datase walking up and down the stairs were the most misclassified activities with a recognitio rate of 40.71%, and 57.64%. They were misclassified as walking, running, and ironing ac tivities.   The results for the PAMAP2 datasets for different sensor placement locations are shown in Figures 7-9. The results shows that there is no significant influence of the sensor placement on the classification accuracy. However, there is a slight advantage when sensors are placed on the ankle. Due to the high number of activities in the PAMAP2 datasets, some of which consist of complex movements, the classification accuracy for PAMAP2 datasets is slightly lower than the other datasets used in this study. For the ankle dataset, walking up and down the stairs were the most misclassified activities with a recognition rate of 40.71%, and 57.64%. They were misclassified as walking, running, and ironing activities. The results for the PAMAP2 datasets for different sensor placement locations ar shown in Figures 7-9. The results shows that there is no significant influence of the senso placement on the classification accuracy. However, there is a slight advantage when sen sors are placed on the ankle. Due to the high number of activities in the PAMAP2 dataset some of which consist of complex movements, the classification accuracy for PAMAP datasets is slightly lower than the other datasets used in this study. For the ankle datase walking up and down the stairs were the most misclassified activities with a recognitio rate of 40.71%, and 57.64%. They were misclassified as walking, running, and ironing ac tivities.  To evaluate the discriminative power of the uncommon features, which are the fea tures related to RQA, PE, LE, and THE, the classification algorithms were trained withou using the uncommon features. The drop of the classification accuracy is used as indicatio of the discriminative power of these features. Table 4 illustrate the results of this exper ment, Adding the uncommon features to the dataset enhanced the classification accurac by 2.3-8.6%. The results showed that the enhancement is larger when the sampling fre quency is higher, which means the uncommon features are able to extract more pattern when the sampling frequency is high.

Discussion
The significance of accelerometer features was successfully identified using the state of-the-art JMIM feature selection method. The evaluation of the selected subset was car ried out using SVM and KNN classifiers. Four publicly available datasets were used i this study because they have been previously used in activity recognition research an To evaluate the discriminative power of the uncommon features, which are the features related to RQA, PE, LE, and THE, the classification algorithms were trained without using the uncommon features. The drop of the classification accuracy is used as indication of the discriminative power of these features. Table 4 illustrate the results of this experiment, Table 4. Classification Accuracy without, and with uncommon features using SVM classifier.

HARDS PAMAP2
LG Adding the uncommon features to the dataset enhanced the classification accuracy by 2.3-8.6%. The results showed that the enhancement is larger when the sampling frequency is higher, which means the uncommon features are able to extract more patterns when the sampling frequency is high.

Discussion
The significance of accelerometer features was successfully identified using the stateof-the-art JMIM feature selection method. The evaluation of the selected subset was carried out using SVM and KNN classifiers. Four publicly available datasets were used in this study because they have been previously used in activity recognition research and also to study the effect of sampling frequency as well as on body placement location on the feature significance for classification accuracy. The sampling frequency of the datasets is between 20 and 200 Hz. Wearables are placed at the waist, wrist, chest, and ankle. The number of activities also vary, with most of the datasets including only simple activities, and one of them including complex activities.
This study focused on the type of the features rather than the acceleration axis to which the signal belongs. The features are divided into five groups: (simple time domain), (frequency domain which includes FFT, and spectral features), (repeatability and regularity which includes RQA, permutation entropy, and Lyapunov exponent), (time-frequency which includes WPD), and finally (autorepression features).
The results showed that the Root Mean Square (RMS) features produced 74% accuracy on their own for the WISDM and SBRHA datasets. FFT, Root Mean Square (RMS), mean, autocorrelation and wavelet were among the most significant features for human activity recognition for these two datasets. This result agreed with the results in previous literature [14].
On the other hand, when the sampling frequency is raised, features related to signal repeatability, regularity and level of chaos (RQA, Permutation Entropy, Lyapunov exponent), and WPD were shown among the most significant. The sampling rate was 100 and 200 Hz, which meant that each contained more data points. This high number of data points provided more detailed information about the movement. That is why these features appeared among the most significant in the subset. Wavelet decomposition features were also among the most significant features for this dataset. This has been shown in the results from the PAMAP2 and HARDS datasets.
Even though many participants were recruited to perform the activities during the data collection, which meant that the activities were performed at different speeds and styles, the achieved classification accuracy was very high, especially for SBRHA (more than 98%). The results also showed that SVM worked better than KNN.
In conclusion, simple time domain features can be very predictive for recognising simple activities, as the sampling frequency becomes high WPD, and repeatability and regularity features can be very predictive, especially when the recognised activities include complex activities. WPD features can have significant discriminative power for recognising both simple and complex activities.
It would be interesting in the future to study the influence of window size and overlapping on the selected feature subset, as well as the effect of applying invariant orientation techniques on the feature significance. Further improvements are needed for the selection method to be able to produce scores for significance degrees.
Future research could also be conducted using new collected data with more activities, and more participants could be included. In particular, datasets should report full demographic statistics for participants including at least age, gender, ethnicity, and body mass index. The public datasets have poor descriptions of their participant demographics with only one reporting the number of females (1/9), thus it is likely that the datasets are heavily biased towards young white men. Future public datasets should reflect the variation in real world, including additional participants from diverse age groups and with different ranges of movement capabilities in order to study the feature significance for data collected from a variety of populations, and different patients with movement impairments. A future study can be expanded to include all IMU sensors, which will require studying the gyroscope drift problem, and comparing various compensation approaches. The methodology of this study can also be employed to study controlling powered exoskeletons through developing machine learning-based algorithms to recognise the movement. Funding: This research was part funded by UK EPSRC grants EP/P01013X/1 (STRETCH) and EP/R013144/1 (SAUSE), EP/V027263/1 (SERVICE), and Science Foundation Ireland (SFI 13/RC/2094).

Conflicts of Interest:
The authors declare no conflict of interest.