A Comprehensive Analysis on Wearable Acceleration Sensors in Human Activity Recognition

Sensor-based motion recognition integrates the emerging area of wearable sensors with novel machine learning techniques to make sense of low-level sensor data and provide rich contextual information in real-life applications. Although the Human Activity Recognition (HAR) problem has long drawn the attention of researchers, it remains a subject of much debate due to the diverse nature of human activities and their tracking methods. Finding the best predictive model for this problem while accounting for different sources of heterogeneity is very difficult to analyze theoretically, which stresses the need for an experimental study. Therefore, in this paper, we first create the most complete dataset, focusing on accelerometer sensors, with various sources of heterogeneity. We then conduct an extensive analysis of feature representations and classification techniques (the most comprehensive comparison yet, with 293 classifiers) for activity recognition. Principal component analysis is applied to reduce the feature vector dimension while keeping the essential information. The average classification accuracy across eight sensor positions is 96.44% ± 1.62% with 10-fold evaluation, whereas an accuracy of 79.92% ± 9.68% is reached in the subject-independent evaluation. This study presents significant evidence that we can build predictive models for the HAR problem under more realistic conditions and still achieve highly accurate results.


Introduction
The maturity of pervasive sensing, wireless technology, and data processing techniques enables effective solutions for continuous monitoring and the promotion of individual health. Today, miniature sensors can be unobtrusively attached to the body or embedded in clothing items to observe people's lifestyles and behavior changes [1]. According to the study presented in [2], on-body sensing has proven to be the most prevalent monitoring technology for gait assessment, fall detection and activity recognition/classification. As such, extensive research has been undertaken to select or develop reasoning algorithms that infer activities from wearable sensor data. Human activity recognition, which targets the automatic detection of people's activities, is one of the most promising research topics in areas such as ubiquitous computing and ambient assisted living [3]. The low-cost yet highly reliable accelerometer is the most broadly used wearable sensor for activity recognition and can provide high classification accuracies of 92.25% [4], 96% [5], and 99.4% [6]. 3-D accelerations can be represented as a vector (x, y, z) of readings along the three sensing axes.

Background and Methodology
Human Activity Recognition (HAR) starts with collecting data from the motion sensors. The data are partitioned into windows for feature extraction, thereby filtering the relevant information in the raw signals. Afterward, the extracted features are used as inputs to each classifier, which ultimately yields the HAR model. To evaluate the effect of sensing heterogeneity on the classifiers, we do not perform any preprocessing steps. The problem is formulated as follows. Definition: Given a set W = {w_1, w_2, . . . , w_n} of labeled, equal-sized time windows and a set A = {a_1, a_2, . . . , a_q} of activity labels, the goal is to find the best classifier model C such that, for any w_k containing a feature set F_k = {f_k,1, f_k,2, . . . , f_k,p} of p features extracted from the motion sensors, the predicted label â_k = C(F_k) is as close as possible to the actual activity performed during w_k. Figure 1 depicts the whole system flow of sensor-based activity recognition for nine activities.
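As a minimal sketch of this formulation, the following toy example uses a deliberately trivial nearest-centroid stand-in for the classifier model C; the data and the classifier are illustrative only and are not the pipeline evaluated in this paper:

```python
import numpy as np

# Hypothetical toy data: each equal-sized window w_k yields a feature
# vector F_k of length p, with an actual activity label a_k.
rng = np.random.default_rng(0)
p = 4                                   # number of features per window
F = rng.normal(size=(100, p))           # feature vectors for 100 windows
labels = rng.integers(0, 3, size=100)   # actual activity labels

# A trivial nearest-centroid "classifier model C", for illustration only.
centroids = np.stack([F[labels == a].mean(axis=0) for a in range(3)])

def C(f):
    """Predict the activity label for one feature vector."""
    return int(np.argmin(np.linalg.norm(centroids - f, axis=1)))

predicted = np.array([C(f) for f in F])
accuracy = (predicted == labels).mean()  # fraction of windows labeled correctly
```

The goal of the experiments that follow is precisely to find which real classifier C maximizes this agreement between predicted and actual labels.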


Data Segmentation, Feature Extraction and Selection
The stream of sensory data must be divided into subsequent segments. The Fixed-size Sliding Window (FSW) is the most common method in the segmentation step, where the data stream is allotted into fixed-length windows with no inter-window gaps. If there is no overlap between adjacent windows, it is called a Fixed-size Non-overlapping Sliding Window (FNSW). The second method is the Fixed-size Overlapping Sliding Window (FOSW), which is similar to FNSW except that the windows overlap during segmentation [8,9]. The use of overlap between adjacent windows has been shown to be effective in classification problems using wearable sensor data [14,15]. Finding the optimal window size t is an application-dependent task. The window size should be determined such that each window is guaranteed to contain enough samples (at least one cycle of an activity) to differentiate similar movements. In addition, increasing the window size does not necessarily enhance accuracy but may add computational complexity (causing higher latency). To better address this challenge, we analyze the influence of window sizes (ranging from 1 s to 15 s) on classification performance.
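The FNSW/FOSW segmentation described above can be sketched as follows; this is a minimal NumPy sketch, and the function and parameter names are illustrative rather than taken from the paper:

```python
import numpy as np

def sliding_windows(signal, fs, window_s, overlap=0.0):
    """Split a signal into fixed-size windows.

    overlap = 0.0 gives non-overlapping windows (FNSW); 0 < overlap < 1
    gives overlapping windows (FOSW). fs is the sampling rate in Hz.
    """
    size = int(window_s * fs)                    # samples per window
    step = max(1, int(size * (1.0 - overlap)))   # hop between window starts
    return [signal[i:i + size] for i in range(0, len(signal) - size + 1, step)]

x = np.arange(100)                                         # toy signal: 10 s at 10 Hz
fnsw = sliding_windows(x, fs=10, window_s=2)               # no overlap
fosw = sliding_windows(x, fs=10, window_s=2, overlap=0.5)  # 50% overlap
```

With 50% overlap the same 10 s stream yields 9 windows instead of 5, which is why overlap gives the classifiers more training vectors, at the cost of processing each sample multiple times.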
Feature extraction obtains the important characteristics of the data and represents them as a feature vector used as the input of a classifier [16]. Table 1 details the most effective time/frequency-domain and heuristic features in the activity recognition literature. Due to their low computational cost and high discriminatory ability, time-domain features are the most frequently employed features for real-time applications. We compute all the features listed in Table 1 from each accelerometer reading, which consists of the 3-D accelerations (x, y, z). However, to minimize the effects of sensor orientation, we add another dimension to the sensor readouts, namely the magnitude of the accelerometer vector, i.e., √(x² + y² + z²), because it is less sensitive to orientation changes [17]. It is worth noting that the correlation features are calculated between each pair of axes, and the tilt angles are estimated from the combination of all three axes, as shown in Table 1. Each classifier is fed with the feature vectors obtained by fusing data at the feature level. As a result of the above feature extraction process, a total of 176 features are obtained for each segment and then scaled into the interval [0, 1] using min-max normalization for use in classification.
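A sketch of this feature-extraction step is shown below, computing only a small illustrative subset of the paper's 176 features; the exact feature list, function names, and window shape are assumptions, but the magnitude channel and the min-max scaling follow the description above:

```python
import numpy as np

def window_features(win):
    """Time-domain features for one window of 3-D accelerations (n x 3).

    Illustrative subset only; the paper's full set (Table 1) is larger.
    """
    # Fourth, orientation-robust channel: magnitude sqrt(x^2 + y^2 + z^2).
    mag = np.sqrt((win ** 2).sum(axis=1))
    axes = np.column_stack([win, mag])          # x, y, z, magnitude
    feats = []
    for a in axes.T:
        feats += [a.mean(), a.std(), a.min(), a.max()]
    # Pairwise correlations between the three raw axes.
    for i, j in [(0, 1), (0, 2), (1, 2)]:
        feats.append(np.corrcoef(win[:, i], win[:, j])[0, 1])
    return np.array(feats)

def minmax_scale(X):
    """Scale each feature column into [0, 1]."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    return (X - lo) / np.where(hi > lo, hi - lo, 1.0)

rng = np.random.default_rng(1)
windows = rng.normal(size=(50, 128, 3))         # 50 toy windows of 128 samples
X = minmax_scale(np.stack([window_features(w) for w in windows]))
```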
As not all features are equally useful for discriminating between activities, Principal Component Analysis (PCA) is applied to map the original features F_k = f_k,1, f_k,2, . . . , f_k,p into a lower-dimensional subspace (i.e., new mutually uncorrelated features) F'_k = f'_k,1, f'_k,2, . . . , f'_k,m, where m ≤ p [18]. It also significantly reduces the computational effort of the classification process. The principal components can be computed as X = YP, where Y is the centered input matrix, X is the transformed data, and P is the matrix of eigenvectors of the covariance matrix C_x = PΛP^T; Λ is a diagonal matrix whose diagonal elements are the eigenvalues corresponding to the eigenvectors [19]. The new feature vectors, the so-called principal components, are arranged according to their variance (from largest to smallest). To keep the essential information in the acceleration data that describes human activity, we take the first principal components that explain 95% of the total variance. Pairwise scatter plots of the first four components (transformed features) of one of the test cases are given in Figure 2. As expected, the first components (the first component against the second) are better clustered and more distinct across classes.
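The PCA step above (center the data, eigendecompose C_x = PΛP^T, and keep the leading components explaining 95% of the total variance) might be sketched as follows, with hypothetical correlated toy features in place of the real feature matrix:

```python
import numpy as np

def pca_95(X, keep=0.95):
    """Project feature vectors onto the principal components that
    explain `keep` of the total variance, as described above."""
    Y = X - X.mean(axis=0)                 # centered input matrix
    C = np.cov(Y, rowvar=False)            # covariance matrix C_x
    eigvals, P = np.linalg.eigh(C)         # C_x = P Lambda P^T
    order = np.argsort(eigvals)[::-1]      # sort by variance, largest first
    eigvals, P = eigvals[order], P[:, order]
    explained = np.cumsum(eigvals) / eigvals.sum()
    m = int(np.searchsorted(explained, keep)) + 1   # components to keep
    return Y @ P[:, :m]                    # transformed features X = Y P

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 10)) @ rng.normal(size=(10, 10))  # correlated toy features
Z = pca_95(X)
```

Because P is orthogonal, the resulting columns of Z are mutually uncorrelated, matching the "new mutually uncorrelated features" described above.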

Selected feature definitions from Table 1:

Peak Intensity: the number of signal peaks within a certain period of time.
Coefficient of Variation: the ratio of the standard deviation to the mean.
Percentiles: t = n·p_i/100 + 0.5, for p_i = 10, 25, 50, 75, 90.
Autocorrelation: the height of the first and second peaks and the position of the second peak of R(k).
Frequency-domain features: f denotes the f-th Fourier coefficient in the frequency domain; the positions and power levels of the six highest peaks of the PSD computed over the sliding window; and the total power in five adjacent, pre-defined frequency bands.

Datasets
To design a robust learning model that works under more realistic conditions, we combined 14 datasets, focusing on accelerometer sensors, that contain several sources of heterogeneity present in most real-world applications, such as measurement units, sampling rates and acquisition protocols. Table 2 lists the datasets and gives details of the data collected in each project. In total, the aggregated dataset has about 35 million acceleration samples from 228 subjects (aged 19 to 83) performing more than 70 different activities. This is the most complete, realistic, and transparent dataset in this context.
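Note that the paper deliberately applies no preprocessing, so the classifiers are exposed to these heterogeneities directly. Purely for illustration, the sketch below shows how such unit and sampling-rate heterogeneity could in principle be harmonized when aggregating datasets; the 50 Hz target rate, the function name, and the linear-resampling choice are hypothetical and not part of the paper's method:

```python
import numpy as np

G = 9.80665  # standard gravity, for unit conversion

def harmonize(acc, fs_in, fs_out=50.0, unit="g"):
    """Bring one dataset's 3-D acceleration stream to a common format.

    Hypothetical sketch: convert m/s^2 readings to g and linearly
    resample every axis to a shared rate fs_out (here 50 Hz).
    """
    acc = np.asarray(acc, dtype=float)
    if unit == "m/s^2":
        acc = acc / G
    t_in = np.arange(len(acc)) / fs_in
    t_out = np.arange(0.0, t_in[-1], 1.0 / fs_out)
    return np.column_stack(
        [np.interp(t_out, t_in, acc[:, k]) for k in range(acc.shape[1])]
    )

# e.g., a 32 Hz recording in m/s^2, resampled to 50 Hz in units of g:
raw = np.tile([[0.0, 0.0, G]], (320, 1))       # 10 s at rest (gravity on z)
out = harmonize(raw, fs_in=32.0, unit="m/s^2")
```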
We consider eight major on-body positions: Waist (W), Right/Left Lower Arm (RLA/LLA), Right/Left Upper Leg (RUL/LUL), Right/Left Lower Leg (RLL/LLL), and Chest (C). All sensor positions described in each dataset have been mapped onto these major positions. For example, a cellphone in the left front pants pocket is treated as the Left Upper Leg (LUL) position, and the wrist, which is a popular place for many commercial wearables, is treated as RLA/LLA in this paper. Numerous studies [31,32] have shown that the performance of HAR systems strongly depends on sensor placement, since the number and placement of inertial sensors directly affect the measurement of bodily motions. Each placement turns out to be more suitable for particular activities. Besides, having fewer sensors attached to the body is preferable, since wearing multiple sensors can become burdensome and is not well accepted. Therefore, we limited our modeling and analysis to single-accelerometer data while still expecting a sufficiently high recognition rate for the selected activities. According to the datasets, the most examined activities (top activities) are walking, running, jogging, cycling, standing, sitting, lying down, and ascending and descending stairs, which also represent the majority of everyday living activities. We also observe that all top activities have data in each of the eight major positions. Therefore, we choose them as target activities for the eight separate positions (W, RLA, LLA, RUL, LUL, RLL, LLL and C). We created a rectangular tree map that presents dense volumes of data in a space-filling layout, allowing a visual comparison of the datasets' contributions to each target position (see Figure 3). For example, as depicted in Figure 3, datasets 3, 5, 6, 7 and 8 contribute data for constructing the chest dataset with nine activities. We explore 293 different classifiers, including Decision Tree, Discriminant Analysis, Support Vector Machines, K-Nearest Neighbors, Ensemble Methods, Naïve Bayes and Neural Networks, with their different parameters. The methods and their parameter settings are described and given IDs in Appendix A.
The main objective of implementing different classification techniques is to review, compare and evaluate their performance on the most well-known heterogeneous datasets publicly available to the research community. We also intertwine the different issues involved and suggest the solutions needed to obtain reasonable results in practical applications.



Experimental Results and Discussions
In this section, we report the effects of the heterogeneities, arising from sensor characteristics, data collection scenarios and subjects, on various feature representation techniques and 293 classifiers under two cross-validation schemes. First, the 10-fold cross-validation strategy is applied as one of the most accurate approaches for model selection. Figure 4 shows the minimum and maximum accuracy obtained by each classifier over different window sizes with the waist accelerometer data. The algorithms are sorted according to the best obtained accuracy. Considering the best accuracy acquired for each classification category in this position, the ensemble methods KNN (Subspace) and Tree (Bagging) achieved the highest activity recognition rate, whereas DT performed the worst. Furthermore, DA, DA (Subspace), Tree (AdaBoost), Tree (RUSBoost) and NB performed almost equally, but worse than SVM, NN and KNN. As can be seen, some classification learning algorithms are more sensitive to parameter settings and window size and may thus be more likely to exhibit significant differences. For a deeper investigation, we extracted the classifiers with the top 5% accuracies, which we call the "topClassifiers", for each position. As can be seen in Figure 5, most of the recognition methods remain consistent in their relative performance across accelerometer data obtained from different positions. As explained in Section 2, finding the optimal window size is an application-dependent task. The window size should be determined such that each window is guaranteed to contain enough samples to differentiate similar activities or movements. Thus, we consider different window sizes ranging from 1 s to 15 s in steps of 1 s to ensure the statistical significance of the calculated features. This range comprises most of the values used in previous activity recognition systems [8].
Figure 6 shows the ranking of different window sizes in providing the best accuracy (among all classifiers) in each position. For example, a window of length 7 s provides the best classification accuracy when the sensor is attached to the RLA; the second best accuracy for the RLA is achieved with a window size of 8 s. The first line in this figure shows the window sizes of the best accuracies for each position. In contrast to the LLL, where the top four accuracy values are observed at small window sizes ranging from 2 s to 5 s, the chest yields its top-ranked accuracy values at large window sizes from 10 s to 15 s. This observation is most evident in the orange line (w = 1 s), where all positions obtain their worst-case accuracy values except for the LLL. However, in some cases we can change the window size (increase or decrease it) at the expense of a subtle performance drop. For example, in the chest position, the window size can be reduced from 15 s to 7 s while sacrificing only 0.16% of recognition performance. The bar charts in Figure 5 indicate the number of window sizes (1 s to 15 s) at which the topClassifiers provided good results (top 5%). An interesting observation from the bars is that some classifiers, such as KNN (Subspace), Tree (Bagging) and SVM, work well with most window sizes.
This means they could mitigate the effect of window size to extract meaningful information for the activity classification process. For a better understanding of the effect of window size on accuracy, Figure 7 shows the topClassifiers across all window sizes in each position. As can be observed, the interval 3-10 s provides the best accuracies in most cases for the target activities. This range can be reduced if fusion of multiple sensors is used for feature extraction. Another point worth mentioning is that, for the underlying periodic activities, large window sizes do not necessarily translate into better recognition performance. Although accuracy matters for all recognition algorithms, it is not the only parameter to consider when designing a recognition model. The runtime complexity (of the classification step) is another important challenge, as the model should be fast and responsive regardless of where it is deployed, i.e., on- or off-device. Thus, we make use of the concept of Pareto optimality to extract superior solutions from the topClassifiers, trading off classifier accuracy and runtime. We consider two objective functions, i.e., misclassification and runtime, to be minimized.
A feasible solution x dominates a feasible solution y when f_i(x) ≤ f_i(y) for all i and f_j(x) < f_j(y) for at least one j, where f_i is the i-th objective function. However, in many problems there is usually no single solution that is superior to all others, so the non-dominated solutions compose the Pareto front. For example, in Figure 8 we populate the runtime-accuracy plane with some topClassifiers for the waist position and depict the Pareto front. The shaded area represents the region in the f_1 × f_2 space that is dominated by the point x, which is non-dominated and hence belongs to the Pareto front [38]. All points in this region are inferior to x in both objectives. In addition, if we want to minimize one objective subject to a constraint, e.g., f_1(x) < c, the Pareto front provides the solution for all possible values of the cap c [38]. Therefore, the Pareto front contains significantly richer information than one obtains from single-objective formulations. Table 3 summarizes the non-dominated classifiers in each position.
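The dominance test above translates directly into code. A minimal sketch follows; the (error, runtime) pairs are made-up numbers for illustration, not results from Table 3:

```python
def pareto_front(points):
    """Return the non-dominated (misclassification, runtime) pairs.

    A point x dominates y when x is no worse in both objectives and
    strictly better in at least one (both objectives minimized).
    """
    front = []
    for x in points:
        dominated = any(
            all(yi <= xi for yi, xi in zip(y, x))
            and any(yi < xi for yi, xi in zip(y, x))
            for y in points
        )
        if not dominated:
            front.append(x)
    return front

# Hypothetical (error %, runtime ms) pairs for a few classifiers:
classifiers = [(3.6, 7.0), (3.5, 113.0), (4.5, 2.0), (5.0, 150.0)]
front = pareto_front(classifiers)   # (5.0, 150.0) is dominated by (4.5, 2.0)
```

Every surviving point represents a distinct accuracy/runtime tradeoff, which is why the front carries more information than a single "best accuracy" ranking.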
The results show a clear tradeoff between classifier runtime and accuracy. There is no strong relation between sensor position and classification performance. Overall, the highest classification accuracy was achieved by KNN (Subspace), with KNN in second place, followed by SVM and NN. Figure 9a depicts an overall view of the non-dominated classifiers and their power in providing high recognition accuracy.
The size of each classifier ID in this figure represents the number of times the corresponding classifier is reported in Table 3. Among all of them, KNN has the best classification runtime (7 ± 1.78 ms per feature vector), although in classification accuracy it always trails its ensemble counterpart, KNN (Subspace). In all cases, KNN (Subspace), with an average accuracy of 96.42% ± 1.63%, provided better results than all other non-dominated classifiers, with the exception of data from the RLL, where the SVM (95.52%) provided superior accuracy. However, the SVM's prominence is negligible considering its runtime (113.05 ms) and insignificant accuracy improvement (0.1%). Given the accuracy results stated in Table 3, although NN classifiers provide promising results in most cases, they are dominated by other techniques and appear among the selected methods in only three positions: RUL, RLL and LLL. A closer look at the classification results in Figure 5 and the tabulated results shows that the ensemble method Tree (Bagging) is a very strong method and is among the topClassifiers in all cases, but is always outperformed by other methods in terms of both accuracy and runtime. For the selected KNN classifiers, the distance metric affects performance the most: city block and Euclidean were the best choices, and too large a value of k does not improve performance, as it destroys locality. We can also conclude that there was no significant difference in KNN (Subspace) performance with different numbers of learners (i.e., 10, 20 and 40); therefore, using fewer learners is preferable due to the much lower runtime.
Regarding the position analysis, the best performance (98.85% and 98.03%) is achieved with the aggregated data on the RUL and LUL. The chest is the next best performing placement (97.72%), compared to the RLL (95.52%), LLL (96.38%), RLA (95.30%), LLA (94.06%) and waist (95.67%).
Different studies [15,26,39] show that the use of overlap between successive sliding windows helps the classifiers to be trained with more feature vectors and consequently improves the recognition performance. However, it adds computational work, as the overlapped data must be processed multiple times. To evaluate the effectiveness of the degree of overlap, overlaps of 10%, 25%, 50%, 75% and 90% are used, where the percentage is the amount the window slides over the previous window. For instance, a sliding window with 25% overlap will start the next window while the previous window is 75% complete. The value can range from 0% to 99%, since a 100% overlap would never advance the window. Figure 9b illustrates the recognition system capabilities for diverse overlap values while keeping the best window size in each position (see Figure 6). The results demonstrate that performance tends to increase in most cases as more data segments overlap. An average increase of 3.28% in the best accuracy was found between the 0% and 90% overlap scenarios, all obtained with the ensemble of KNN. Figure 10 also illustrates the number of classifiers that provide good results (90%-99%) for different overlap sizes. The larger the overlap, the more improvement is expected in performance: as described, more feature vectors can be used for training, so the predictive model almost certainly works better in the testing phase; however, it suffers from a longer training time.
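The windowing scheme above can be sketched in a few lines. This is an illustrative helper, not the paper's implementation; the function name `segment` and the toy signal are ours.

```python
# Sketch of sliding-window segmentation with a configurable overlap.
# `segment` and its parameters are illustrative, not the paper's code.

def segment(signal, window, overlap):
    """Yield fixed-size windows; `overlap` in [0, 1) is the fraction shared
    with the previous window, so the hop size is window * (1 - overlap)."""
    if not 0 <= overlap < 1:
        raise ValueError("overlap must be in [0, 1); 100% never advances")
    hop = max(1, int(round(window * (1 - overlap))))
    return [signal[i:i + window]
            for i in range(0, len(signal) - window + 1, hop)]

samples = list(range(10))
print(segment(samples, window=4, overlap=0.5))
# 50% overlap: each new window starts when the previous one is half complete.
```

Note how a 90% overlap makes the hop size a tenth of the window, multiplying the number of training vectors (and the training cost) accordingly.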
In the activity recognition problem, K-fold cross-validation is an accurate approach for model selection; however, Leave-One-Subject-Out (LOSO), also called subject-independent cross-validation, can be used to avoid possible overfitting and is one of the best approaches for estimating realistic performance. Because LOSO reflects inter-subject variability and tests on yet-unseen data, it consequently leads to a decrease in accuracy. Using the combined datasets in each position (see Figure 3), we conducted a LOSO evaluation where the KNN (Subspace) is trained on activity data for all subjects except one. The classifier is then tested on the data for only the subject left out of the training set. The procedure is repeated for all subjects in each position and the mean accuracy is reported. For each position, the window size and overlap are set based on the best results acquired from the 10-fold evaluation.
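The LOSO loop can be sketched as follows. The nearest-centroid "classifier" below is a deliberately simple stand-in for KNN (Subspace), and the subject IDs, feature values and labels are fabricated for illustration.

```python
# Minimal sketch of Leave-One-Subject-Out (LOSO) evaluation.
# The nearest-centroid predictor is a stand-in for KNN (Subspace);
# subjects and features are made up for illustration.

def nearest_centroid_predict(train, test_x):
    """Assign test_x to the class whose training mean is closest."""
    groups = {}
    for x, label in train:
        groups.setdefault(label, []).append(x)
    return min(groups, key=lambda c: abs(test_x - sum(groups[c]) / len(groups[c])))

# (subject, feature, activity) triples -- two subjects, two activities
data = [("s1", 0.1, "walk"), ("s1", 0.2, "walk"), ("s1", 1.0, "run"),
        ("s2", 0.15, "walk"), ("s2", 1.1, "run"), ("s2", 0.9, "run")]

accuracies = []
for held_out in {s for s, _, _ in data}:
    # Train on every subject except one, test on the held-out subject.
    train = [(x, y) for s, x, y in data if s != held_out]
    test = [(x, y) for s, x, y in data if s == held_out]
    correct = sum(nearest_centroid_predict(train, x) == y for x, y in test)
    accuracies.append(correct / len(test))

mean_accuracy = sum(accuracies) / len(accuracies)
print(mean_accuracy)  # averaged over all held-out subjects
```

In the real study, each fold trains the full KNN (Subspace) pipeline, so inter-subject variability directly shows up as the accuracy drop reported below.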
Based on the aggregated data for each position, the models trained on the LUL (92.35%) and LLL (90.03%) positions were not affected much by interpersonal differences in the subjects' body movement and achieved relatively good accuracy. They are followed by the Chest (86.31%) and RLL (83.51%), which are better performing placements compared to the lower arm positions. The RLA (64.62%) was influenced the most, and the LLA (77.12%) also accounted for a substantial decrease in accuracy. The result for the Waist (72.53%) was very similar to that reached by the RUL (72.91%), but both suffer a performance degradation of more than 25%. As expected, the recognition accuracy in all positions was reduced, since the trained model deals with data that may have new measurement characteristics. As the best classifier of each position loses 17.13% accuracy on average, further studies are needed to investigate new relevant features and novel machine learning models flexible enough to deal with inter-person differences in how activities are performed in a position-aware scenario.

Conclusions
In this study, different machine learning techniques were explored in depth under intrinsic heterogeneity of devices and their usage scenarios. In addition, each position was analyzed based on the aggregated tri-axial accelerometer data from different datasets. Hence, the quantitative comparison of the classifiers was hindered by the fact that each position is explored with a different aggregated dataset. In each position investigation, in addition to different sources of heterogeneity in the data, there are also factors such as body shape, clothing, straps, belts and accidental misplacements/disorientations (in the form of rotations or translations) that make it harder to build a solid model. The averaged results showed 96.44% ± 1.62% activity recognition accuracy when using K-fold and 79.92% ± 9.68% accuracy when using subject-independent cross-validation. According to the obtained results, it is clear that new data with different sources of heterogeneity can significantly reduce the accuracy, by more than 32% (e.g., in RLA) under the LOSO evaluation. Taking an overall look at the results, KNN and its ensemble methods showed stable results over different positions and window sizes, indicating their suitability for designing a robust and responsive machine learning model in wearables; they are followed by NN and SVM. However, as we showed in this paper, the choice of parameter values in each classifier can have a significant impact on recognition accuracy and should be taken into account (see Appendix A). Considering the promising results in this pilot study, we intend to work on novel feature extraction methods and classifiers that outperform classical classification methods while better dealing with inter-person differences and data diversity. Another point that deserves further assessment is optimizing the runtime performance, since it plays a great role in the efficiency of cloud-based machine learning deployments.
Author Contributions: Majid Janidarmian designed the study, implemented the methodologies and drafted the manuscript. Atena Roshan Fekr has contributed to the revision of this paper and provided insightful comments and suggestions. Katarzyna Radecka and Zeljko Zilic directed the study and contributed to the overall study design as well as the analysis of results.

Conflicts of Interest:
The authors declare no conflict of interest.


Appendix A.
In this section, the approaches utilized for the activity classification problem are summarized, considering different parameter settings.

Appendix A.1. Decision Tree
In this method, the discriminatory ability of the features is examined one at a time to create a set of rules. The Decision Tree (DT) has been used in many studies [15,39], and the results show that it performs well with time and frequency domain features. In the top-down tree structure, each leaf represents a classification label and each branch denotes a conjunction of attributes that leads to the leaves. In other words, decision trees classify instances by starting at the root of the tree and moving through it (with a decision made at each node) until a leaf node is reached. Generally, the growing and pruning algorithms utilized in decision tree induction are greedy and follow a recursive manner. The construction of a tree involves determining the split criterion, stopping criterion and class assignment rule [40]. The most common techniques to measure node impurity (for splitting nodes) are explained in Table A1 [41]. They define the node splits, where each split maximizes the decrease in impurity. In this study, we split branch nodes layer by layer until there are 4, 20 or 100 branch nodes. Therefore, we define nine different decision trees considering the mentioned split and stopping criteria. When a node is determined to be a leaf, it has to be given a class label. A commonly used class assignment function is the majority rule, meaning that class $k^*$ is assigned to a terminal node $t \in \tilde{T}$ as follows [42]:

$$k^*(t) = \arg\max_{k} p(k \mid t)$$

where $p(k \mid t)$ is the proportion of training samples in node $t$ belonging to class $k$, and $\tilde{T}$ denotes the set of terminal nodes (leaves).
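An impurity-based split can be illustrated concretely with the Gini criterion, one of the measures summarized in Table A1. The helper names and toy labels below are ours, not the paper's.

```python
# Illustrative Gini impurity and split gain; names and labels are invented.

def gini(labels):
    """Gini impurity: 1 - sum_k p(k)^2 over the class proportions."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def split_gain(parent, left, right):
    """Decrease in impurity achieved by splitting `parent` into two children,
    weighting each child's impurity by its share of the observations."""
    n = len(parent)
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted

parent = ["walk", "walk", "run", "run"]
print(gini(parent))                                          # 0.5 for a 50/50 node
print(split_gain(parent, ["walk", "walk"], ["run", "run"]))  # pure split: gain 0.5
```

At each branch node, the tree picks the split that maximizes this gain; a leaf then takes the majority class as in the rule above.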

Appendix A.2. Discriminant Analysis
Discriminant Analysis (DA) is widely used in classification problems [43,44]. In this algorithm, there are three main elements: prior probability, posterior probability and cost [45]. A prior probability $P(k)$ is the probability that an observation falls into class $k$ before the data are collected. There are two main choices, i.e., uniform and empirical. If the prior probability of class $k$ is 1 over the total number of classes, it is called uniform, whereas the number of training feature vectors of class $k$ divided by the total number of training feature vectors defines the empirical prior probability.
(Notation for Table A1: $L(i)$ and $R(i)$ denote the fraction of members of class $i$ in the left and right child node after a split, $p(L)$ and $p(R)$ are the fractions of observations that split to the left and right, and $p(i)$ is the probability that an arbitrary sample belongs to class $l_i$; the deviance criterion is based on the concept of entropy from information theory.)

A posterior probability is the probability of assigning observations to classes given the data. The product of the prior probability and the multivariate normal density gives the posterior probability that a point $x$ belongs to label $k$. With mean $\mu_k$ and covariance $\Sigma_k$, the density function of the multivariate normal at a point $x$ can be described as [45]:

$$P(x \mid k) = \frac{1}{\left((2\pi)^d |\Sigma_k|\right)^{1/2}} \exp\left(-\frac{1}{2}(x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k)\right)$$

The observed data $x$ (the feature vector of one analysis segment) is classified to the label $k$ with the largest posterior probability. The posterior probability that an observation $x$ belongs to label $k$ is:

$$\hat{P}(k \mid x) = \frac{P(x \mid k)\,P(k)}{P(x)}$$

where $P(x)$ indicates the probability of the feature vector $x$ and is the sum over $k$ of $P(x \mid k)P(k)$. Therefore, the predicted classification $\hat{y}$ with $m$ classes is:

$$\hat{y} = \arg\min_{y = 1, \dots, m} \sum_{k=1}^{m} \hat{P}(k \mid x)\, C(y \mid k)$$

where $C(y \mid k)$ is the cost of classifying an observation as $y$ when its correct label is $k$ [45]. In this paper, we consider five types of discriminant analysis classifiers: linear; and the diagonal and pseudo variants of the linear and quadratic types.
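The chain prior → posterior → minimum-expected-cost label can be sketched in one dimension. The class means, variances and the zero-one cost matrix below are invented for illustration.

```python
import math

# One-dimensional sketch of the DA decision rule: posterior is proportional
# to prior times a Gaussian likelihood, then the minimum expected-cost label
# is chosen. All parameter values here are fabricated examples.

def gaussian_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def predict(x, priors, params, cost):
    joint = {k: priors[k] * gaussian_pdf(x, *params[k]) for k in priors}
    total = sum(joint.values())
    post = {k: v / total for k, v in joint.items()}            # P(k|x)
    expected = {y: sum(post[k] * cost[y][k] for k in post)     # sum_k P(k|x) C(y|k)
                for y in post}
    return min(expected, key=expected.get)

priors = {"walk": 0.5, "run": 0.5}                 # uniform prior
params = {"walk": (0.0, 1.0), "run": (3.0, 1.0)}   # (mean, variance) per class
cost = {"walk": {"walk": 0, "run": 1},             # zero-one misclassification cost
        "run": {"walk": 1, "run": 0}}
print(predict(0.2, priors, params, cost))
```

With a zero-one cost, minimizing the expected cost reduces to picking the largest posterior, which is exactly the rule stated above.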

Appendix A.3. Support Vector Machine
A Support Vector Machine (SVM) is defined by one or a set of separating hyperplanes in a high-dimensional space and was first proposed for binary classification problems. SVM generates linear functions by considering the set of labels obtained from a training dataset. The linear separator is created to have the maximum margin from the hyperplane to the support vectors [46,47]. With $n$ training samples $(x_i, y_i),\ i = 1, 2, \dots, n$, we have:

$$\{x_i, y_i\},\ i = 1, \dots, n, \quad y_i \in \{-1, +1\}, \quad x_i \in \mathbb{R}^d$$

$y_i$ shows the binary nature of the classifier, with either $+1$ or $-1$ representing the class of $x_i$, and $\mathbb{R}^d$ is a $d$-dimensional vector space over the real numbers. The decision boundary of a linear SVM classifier is:

$$w^T x + b = 0$$

where $w$ and $b$ indicate a weight vector and a bias, respectively. There are many possible linear separators, though SVM targets the one with the maximum-margin hyperplane from any data point. The linear classifier is:

$$f(x) = \mathrm{sign}(w^T x + b)$$

The main goal is to find the $w$ and $b$ that maximize the geometric margin $\frac{2}{\|w\|}$ under the linear constraints $y_i(w^T x_i + b) \geq 1$ for all $(x_i, y_i)$. This optimization problem can be defined as a minimization problem:

$$\min_{w, b} \frac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y_i(w^T x_i + b) \geq 1,\ i = 1, \dots, n$$

To solve this problem, the optimization function is transformed into its Lagrangian dual under the Karush-Kuhn-Tucker (KKT) conditions, so that a Lagrange multiplier $\alpha_i$ is linked with each inequality constraint:

$$\max_{\alpha} \sum_{i=1}^{n} \alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j\, x_i^T x_j \quad \text{s.t.} \quad \alpha_i \geq 0,\ \sum_{i=1}^{n} \alpha_i y_i = 0$$

Thus, the optimal classification function is obtained as below, where $n_{sv}$ denotes the number of support vectors:

$$f(x) = \mathrm{sign}\left(\sum_{i=1}^{n_{sv}} \alpha_i y_i\, K(x_i, x) + b\right)$$

Different studies show that SVM can also provide efficient non-linear classification with very promising results [48,49]. $K(x_i, x_j) = \phi(x_i)^T \phi(x_j)$ is the kernel function, which provides the inner product of $x_i$ and $x_j$ in the feature space.
The most often used kernels in SVM are described in Table A2. The type of kernel function that transforms data from the input space to a higher-dimensional feature space has a direct impact on the performance of the SVM classifier. Although there exist no well-defined rules for selecting the kernel type [46], here three well-known kernel functions are applied in SVM with Error Correcting Output Codes (ECOC) using the One-Versus-One (OVO) technique to evaluate our multi-classification problem. ECOC breaks the multiclass task into a number of binary classification tasks, which are then combined to output the final result [50]. The OVO coding design exhausts all combinations of class pair assignments; therefore, if we have $K$ distinct classes, the number of learners is $\frac{K(K-1)}{2}$. For each binary learner, one class is positive, another is negative, and the rest are ignored.
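The OVO learner count is easy to verify by enumerating the class pairs; the activity names below are placeholders, not the study's label set.

```python
from itertools import combinations

# OVO trains one binary learner per unordered class pair: K(K-1)/2 in total.
# The class names are illustrative placeholders.

classes = ["walk", "run", "sit", "stand", "lie", "cycle"]  # K = 6
pairs = list(combinations(classes, 2))
print(len(pairs))  # 6 * 5 / 2 = 15 binary learners
```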

Appendix A.4. K-Nearest Neighbors
K-Nearest Neighbor (KNN) is based on a neighborhood majority voting scheme and assigns a new instance to the most common class amongst its K nearest neighbors. Simplicity and runtime are the main advantages of this method, which has been used in several research works [15,26,51]. There are different metrics to determine the distance $d(x_s, y_t)$ between two vectors $x_s$ and $y_t$; Table A3 describes the methods used in this study. The three applied distance weights are equal (no weighting), inverse ($1/d$) and squared inverse ($1/d^2$). If multiple classes have the same smallest cost, the smallest index, the class with the nearest neighbor, or a random tiebreaker among tied groups is used. Regarding the selection of k, a larger value may improve performance and reduce the effect of noise on the classification, but it makes the boundaries between classes less distinct [52]. In addition, setting k too large may destroy locality, and as a result KNN looks at samples that are not neighbors. There are different techniques to select and find a good k [53,54]. Here, we consider the three values 1, 10 and 100 for k; therefore, in total we run 243 KNNs with different settings.
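A toy version of KNN with the city-block and Euclidean metrics and inverse-distance weighting can be sketched as follows; the training points and query are fabricated for illustration.

```python
import math
from collections import defaultdict

# Toy KNN with selectable distance metric and inverse-distance weighting.
# Data, k and the small epsilon guard are illustrative choices.

def cityblock(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train, query, k=3, metric=euclidean, weighted=True):
    # Keep the k points closest to the query under the chosen metric.
    neighbors = sorted(train, key=lambda p: metric(p[0], query))[:k]
    votes = defaultdict(float)
    for x, label in neighbors:
        d = metric(x, query)
        votes[label] += 1.0 / (d + 1e-9) if weighted else 1.0  # inverse weight
    return max(votes, key=votes.get)

train = [((0.0, 0.0), "sit"), ((0.1, 0.2), "sit"),
         ((1.0, 1.1), "walk"), ((1.2, 0.9), "walk"), ((5.0, 5.0), "run")]
print(knn_predict(train, (0.9, 1.0), k=3, metric=cityblock))
```

Swapping `metric` and `weighted` mirrors the metric/weight grid explored in the 243 KNN configurations.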

Appendix A.5. Ensemble Methods
Generally, an ensemble classifier refers to a combination of different classifiers that are cooperatively trained on a data set and then classify new data by taking a weighted vote of their predictions, in order to obtain better predictive performance. Indeed, within the sensor-based recognition domain, different studies [6,15,20,25,26] report that an ensemble method outperformed a range of other classification models. Bagging, named from the phrase "bootstrap aggregating", is used to improve the results of classification algorithms and helps to avoid overfitting [55]. This ensemble method constructs bootstrap samples by repetitively resampling training instances with replacement. A sequence of classifiers $c_{1:b}$ ($b = 10, 30, 50$) with respect to variations of the training set is created by the bagging method. The prediction of the compound classifier, derived from the combination of $c_{1:b}$, is given as:

$$c(d_i) = \arg\max_{y} \sum_{j=1}^{b} \alpha_j\, \mathbb{1}\left[c_j(d_i) = y\right]$$

The above formula can be interpreted as classifying an example $d_i$ into the class with the majority of votes, where the weights $\alpha_j$ should be chosen so that more accurate classifiers have a stronger impact on the final prediction than less accurate classifiers [56]. More details about the theory of classifier voting can be found in [57].
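The weighted-vote combination can be sketched directly; the labels and the alpha weights below are made-up values, not fitted accuracies.

```python
from collections import defaultdict

# Sketch of the weighted-vote combination used by bagging: classifier j's
# predicted label counts with weight alpha_j. All values are illustrative.

def weighted_vote(predictions, alphas):
    """predictions[j] is classifier j's label; alphas[j] its voting weight."""
    scores = defaultdict(float)
    for label, alpha in zip(predictions, alphas):
        scores[label] += alpha
    return max(scores, key=scores.get)

# Three bootstrap classifiers; the more accurate middle one gets a larger alpha,
# but the two agreeing classifiers still outvote it here.
print(weighted_vote(["walk", "run", "walk"], [0.5, 0.9, 0.5]))
```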
Another approach is boosting, which attempts to turn a weak learner into a strong learner by gradually adapting how models are made. Each new model added to the ensemble is biased to pay more attention to training instances that earlier models misclassified [45]. AdaBoost.M2 is a very prevalent boosting algorithm for multi-class classification. The algorithm trains learners sequentially and requires the weak learner to output an array of confidences associated with each possible labeling of an example. For every learner with index $t$, AdaBoost.M2 computes the weighted pseudo-loss for $N$ observations and $k$ classes as below [45]:

$$\varepsilon_t = \frac{1}{2} \sum_{n=1}^{N} \sum_{k \neq y_n} d_{n,k}^{(t)} \left(1 - h_t(x_n, y_n) + h_t(x_n, k)\right)$$

where $h_t(x_n, k)$ and $d_{n,k}^{(t)}$ are the confidence of prediction by the learner and the observation weights at step $t$ for class $k$, respectively. The second sum is over all classes other than the true class $y_n$. For more details, the reader is referred to [58].
RUSBoost is designed to improve the performance of models trained on skewed data. It combines data sampling and boosting, providing an effective method for classifying imbalanced data. It applies Random Under-Sampling (RUS), a method that randomly removes examples from the majority class for each weak learner in the ensemble until a preferred class distribution is reached. If the smallest class has N training instances, classes with more instances are undersampled by taking only N observations. For reweighting and constructing the ensemble, it follows the AdaBoost.M2 procedure [59].
The random subspace ensemble (Subspace) is similar to bagging, except that the features are randomly sampled. Thus, subspace ensembles have the advantage of requiring less memory and computation than ensembles built on all features, resulting in considerably shorter model training times. To train a weak learner, this technique selects a random set of m predictors (in this study, m = 12) from the d possible values without replacement. In our study, this procedure is repeated until there are 10, 30 or 50 weak learners. Finally, the score predictions of the weak learners are averaged, and the observation is classified with the maximum mean score [60].
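The feature-sampling step can be sketched as follows; d = 30 and the seed are arbitrary, while m = 12 mirrors the study's setting.

```python
import random

# Sketch of random-subspace feature sampling: each weak learner is trained on
# a random subset of m of the d features. d and the seed are arbitrary here.

def subspace_indices(d, m, n_learners, seed=0):
    """For each weak learner, draw m distinct feature indices out of d."""
    rng = random.Random(seed)
    return [sorted(rng.sample(range(d), m)) for _ in range(n_learners)]

subspaces = subspace_indices(d=30, m=12, n_learners=10)
print(len(subspaces), len(subspaces[0]))  # 10 learners, 12 features each
```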
In this manuscript, the boosting and bagging algorithms are based on tree learners, and the subspace has been applied to discriminant analysis and k-nearest neighbor learners.
Appendix A.6. Naïve Bayes
Naïve Bayes (NB) is a powerful probabilistic classifier employing a simplified version of Bayes' formula to decide on the class of a new instance [61]. In activity recognition, NB proved to perform well in previous studies [20,62]. The following equation shows Naïve Bayes under the assumption of feature independence, though the assumption is usually violated in practice.
$$P(l \mid f_1, \dots, f_n) = \frac{p(l) \prod_{i=1}^{n} p(f_i \mid l)}{p(f_1, \dots, f_n)}$$

where $l$ represents the labels/classes ($l = 1, 2, \dots, L$) and $f_i$ is the $i$-th feature. The denominator of the right-hand side of the equation is a constant, and $p(l)$ is a prior. The posterior probability $P(l \mid f_1, \dots, f_n)$ is determined by the likelihood $\prod_{i=1}^{n} p(f_i \mid l)$, and $p(f_1, \dots, f_n)$ is the joint density of the predictors, so $p(f_1, \dots, f_n) = \sum_{l=1}^{L} p(l) \prod_{i=1}^{n} p(f_i \mid l)$.
The Naïve Bayes classifier combines the independent feature mode with a decision rule. The common rule is known as the maximum a posteriori or MAP decision rule. A typical assumption when dealing with data stream is that continuous values associated with each class are distributed according to a Normal (Gaussian) distribution. However, to alleviate this assumption, NB classifier computes a separate kernel density estimate for each class according to its training data [45]. There exists a large range of kernels that can be exploited for the kernel density estimate. Table A4 shows the kernel smoother types we applied in this study.
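The Gaussian variant of this scoring can be sketched as follows; the per-class feature means/variances and the test vector are fabricated for illustration.

```python
import math

# Gaussian Naive Bayes scoring sketch: posterior proportional to
# prior times the product of per-feature Gaussian likelihoods.
# All parameter values are invented examples.

def gauss(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def nb_posteriors(features, priors, params):
    """P(l|f) ~ p(l) * prod_i p(f_i|l), normalized over all labels."""
    joint = {}
    for label in priors:
        like = 1.0
        for f, (mu, var) in zip(features, params[label]):
            like *= gauss(f, mu, var)
        joint[label] = priors[label] * like
    total = sum(joint.values())
    return {l: v / total for l, v in joint.items()}

priors = {"walk": 0.5, "run": 0.5}
params = {"walk": [(0.0, 1.0), (1.0, 1.0)],   # (mean, variance) per feature
          "run":  [(2.0, 1.0), (3.0, 1.0)]}
post = nb_posteriors([0.1, 1.2], priors, params)
print(max(post, key=post.get))  # MAP decision rule: largest posterior wins
```

Replacing the Gaussian likelihood with a kernel density estimate per class gives the kernel-smoothed variants of Table A4.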
Appendix A.7. Neural Network
An Artificial Neural Network (NN) is generally presented as a system of interconnected neurons that are capable of machine learning. The basic processing unit of an NN is called a perceptron and is a decision-making unit with several inputs and a single output. Each input neuron $p_i$ is weighted with an appropriate $w_i$. The perceptron sums the dot product of the weight and input vectors and adds a bias $b$. The obtained total signal is then transformed by a function which can be linear, but is most often a nonlinear transformation (e.g., log-sigmoid and tan-sigmoid) [63]. This process is summarized as:

$$a = f\left(\sum_{i} w_i p_i + b\right) = f(\mathbf{w}^T \mathbf{p} + b)$$

Feedforward neural networks are among the most broadly used models in many real-world scientific problems. The network is divided into layers; therefore, it can learn nonlinear relationships between input and output vectors through nonlinear transfer functions. In the input layer, the nodes pass the values to the neurons, or hidden units, placed in the subsequent layer, which is called the hidden layer. In this paper, we considered 10, 20, and 40 hidden neurons. The final layer is the output layer, whose size depends on the number of class labels in the classification problem [64].
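The perceptron computation above can be sketched directly; the weights, bias and inputs are arbitrary example values.

```python
import math

# Forward pass of a single perceptron with a log-sigmoid transfer function.
# The weights, bias and inputs are arbitrary illustrative values.

def logsig(n):
    return 1.0 / (1.0 + math.exp(-n))

def perceptron(inputs, weights, bias, transfer=logsig):
    """a = f(w . p + b): weighted sum of inputs plus bias, then transfer."""
    n = sum(w * p for w, p in zip(weights, inputs)) + bias
    return transfer(n)

out = perceptron(inputs=[0.5, -1.0], weights=[2.0, 1.0], bias=0.0)
print(out)  # logsig(2*0.5 + 1*(-1.0) + 0) = logsig(0) = 0.5
```

A feedforward network simply chains layers of such units, feeding each layer's outputs as the next layer's inputs.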
In training the network, its parameters are adjusted incrementally until the difference between the output units and the target values is minimized. Resilient backpropagation [65], scaled conjugate gradient [66] and Levenberg-Marquardt backpropagation are the most well-known network training algorithms. For example, Levenberg-Marquardt optimization uses the Hessian matrix approximation $J^T J$ in the following Newton-like update [45]:

$$x_{k+1} = x_k - \left[J^T J + \mu I\right]^{-1} J^T e$$

where the Jacobian matrix $J$ holds the first derivatives of the network errors with respect to the weights and biases, $\mu$ stands for an adjustment factor, $e$ for a vector of network errors and $I$ for the identity matrix [63,67]. The resilient backpropagation training algorithm updates weight and bias values according to the algorithm explained in [65]. In this method, the magnitude of the derivative has no influence on the weight update; only the sign of the derivative defines the direction of the weight update.
Therefore, in total, there are 293 different classifiers with different settings, as listed in Figure A1.