To demonstrate our methodology, we use the publicly available daily and sports activities dataset acquired by our research group earlier [43]. To acquire the dataset, each subject wore five Xsens MTx sensor units [44] (see Figure 5), each unit containing three tri-axial devices: an accelerometer, a gyroscope, and a magnetometer. The sensor units are placed on the chest, on both wrists, and on the outer sides of both knees, as shown in Figure 6. Nineteen activities are performed by eight subjects. For each activity performed by each subject, there are 45 time-domain sequences (five units × three sensor types × three axes) of 5 min duration, sampled at 25 Hz and consisting of 7500 time samples each. The dataset comprises the following activities:
Sitting (A1), standing (A2), lying on back and on right side (A3 and A4), ascending and descending stairs (A5 and A6), standing still in an elevator (A7), moving around in an elevator (A8), walking in a parking lot (A9), walking on a treadmill in flat and inclined positions (A10 and A11), running on a treadmill at a speed of 8 km/h (A12), exercising on a stepper (A13), exercising on a cross trainer (A14), cycling on an exercise bike in horizontal and vertical positions (A15 and A16), rowing (A17), jumping (A18), and playing basketball (A19).
The activities can be broadly grouped into two categories: in stationary activities (A1–A4), the subject stays still without moving significantly, whereas non-stationary activities (A5–A19) are associated with some kind of motion.
4.3. Activity Recognition and Classifiers
A procedure similar to that in [34] is followed for activity recognition. The sensor sequences are divided into 9120 non-overlapping segments of 5 s duration each and transformed according to one of the seven approaches described in Section 4.2. Then, statistical features are extracted from each segment of each axis of each sensor type. The following features are calculated: minimum, maximum, mean, variance, skewness, kurtosis, 10 coefficients of the autocorrelation sequence (computed at 10 selected lag values), and the five largest discrete Fourier transform (DFT) peaks with the corresponding frequencies (the separation between any two peaks in the DFT sequence is taken to be at least 11 samples), resulting in a total of 26 features per segment of each axis. For the reference approach that does not involve any transformation, there are 45 × 26 = 1170 features that are stacked to form a 1170-element feature vector for each segment. The number of axes, as well as the number of features, varies depending on the transformation technique; however, the total number of feature vectors is fixed (9120). For instance, taking the Euclidean norm causes a three-fold decrease in the number of axes and hence in the number of features. The features are normalized to a fixed interval over all the feature vectors for each subject.
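The per-axis feature set described above can be sketched as follows. This is an illustrative Python/NumPy reconstruction, not the authors' MATLAB code: the exact autocorrelation lag values are not specified here, so lags 1 through 10 are assumed, and the peak search is a simple greedy pass over the DFT magnitudes.

```python
import numpy as np

def segment_features(x, n_lags=10, n_peaks=5, min_peak_sep=11):
    """Compute 26 per-axis features for one 5-s segment (125 samples at 25 Hz):
    min, max, mean, variance, skewness, kurtosis, 10 autocorrelation
    coefficients, and the 5 largest DFT peaks with their frequency indices."""
    x = np.asarray(x, dtype=float)
    mu, var = x.mean(), x.var()
    std = np.sqrt(var) if var > 0 else 1.0
    z = (x - mu) / std
    skew, kurt = (z ** 3).mean(), (z ** 4).mean()
    # autocorrelation at lags 1..n_lags (the paper's exact lag values may differ)
    ac = [np.corrcoef(x[:-k], x[k:])[0, 1] for k in range(1, n_lags + 1)]
    # greedy search for the n_peaks largest DFT magnitudes, at least
    # min_peak_sep bins apart
    mag = np.abs(np.fft.rfft(x))
    peaks, freqs = [], []
    for idx in np.argsort(mag)[::-1]:
        if all(abs(int(idx) - f) >= min_peak_sep for f in freqs):
            peaks.append(mag[idx])
            freqs.append(int(idx))
        if len(peaks) == n_peaks:
            break
    while len(peaks) < n_peaks:          # pad if too few separated peaks exist
        peaks.append(0.0)
        freqs.append(0)
    return np.array([x.min(), x.max(), mu, var, skew, kurt, *ac, *peaks, *freqs])
```

Stacking this vector over all 45 axes of the five sensor units yields the 1170-element feature vector of the reference approach.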
The number of features is reduced through PCA, a linear and orthogonal transformation in which the transformed features are sorted to have variances in descending order [52]. This allows one to keep only the features that exhibit the largest variances, thereby reducing the dimensionality. Thus, for each approach, the eigenvalues of the covariance matrix of the feature vectors are calculated, sorted in descending order, and plotted in Figure 8. Using the first 30 eigenvalues appears to be suitable for most of the approaches; hence, we reduce the dimensionality down to 30.
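This reduction step can be sketched as follows; an illustrative NumPy version, where the returned eigenvalue array corresponds to the kind of sorted spectrum plotted in Figure 8.

```python
import numpy as np

def pca_reduce(X, n_components=30):
    """Project feature vectors (rows of X) onto the n_components eigenvectors
    of the sample covariance matrix with the largest eigenvalues."""
    Xc = X - X.mean(axis=0)                  # center the features
    cov = np.cov(Xc, rowvar=False)           # sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]        # re-sort in descending order
    W = eigvecs[:, order[:n_components]]
    return Xc @ W, eigvals[order]
```

The transformed features are uncorrelated and ordered by variance, so truncating to the first 30 columns keeps the directions of largest variance.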
We perform activity classification with seven state-of-the-art classifiers that are briefly described below.
Support Vector Machines (SVM):
The feature space is nonlinearly mapped to a higher-dimensional space by using a kernel function and divided into regions by hyperplanes. In this study, the kernel is selected as a Gaussian radial basis function because it can perform at least as accurately as the linear kernel if the parameters of the SVM are optimized [53]. To extend the binary SVM to more than two classes, a binary SVM classifier is trained for each class pair, and the decision is made according to the classifier with the highest confidence level [54]. The penalty parameter C (see Equation (1) in [55]) and the kernel parameter are jointly optimized over all the data transformation techniques by performing a two-level grid search: a finer grid is constructed around the optimal parameter values found on a coarse grid, and the optimal values found by searching the fine grid are used in the SVM throughout this study. The SVM classifier is implemented by using the MATLAB toolbox LibSVM [56].
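The two-level (coarse-to-fine) search can be sketched generically as below. This is an illustrative sketch, not the paper's implementation: the `score` callable stands in for a LibSVM cross-validation accuracy, and the grid ranges and the one-decade refinement window are assumptions.

```python
import itertools
import numpy as np

def two_level_grid_search(score, coarse_C, coarse_gamma, refine=5):
    """Coarse-to-fine search for (C, gamma). `score(C, gamma)` should return
    a validation accuracy; higher is better. A finer logarithmic grid is
    built around the best point of the coarse grid."""
    def best(Cs, gammas):
        return max(itertools.product(Cs, gammas), key=lambda p: score(*p))
    C0, g0 = best(coarse_C, coarse_gamma)
    # refine one decade around the coarse optimum (an assumed window width)
    fine_C = np.logspace(np.log10(C0) - 1, np.log10(C0) + 1, refine)
    fine_g = np.logspace(np.log10(g0) - 1, np.log10(g0) + 1, refine)
    return best(fine_C, fine_g)
```

The coarse pass keeps the number of expensive SVM trainings low, while the fine pass recovers precision near the optimum.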
Artificial Neural Networks (ANN):
We use three layers of neurons, where each neuron has a sigmoid output function [57]. The numbers of neurons in the first (input) and the third (output) layers are equal to the reduced number of features, F, and the number of classes, K, respectively. The number of neurons in the second (hidden) layer is selected as the integer nearest to the average of two bounds, one corresponding to the optimistic case where the hyperplanes intersect at different positions and the other to the pessimistic case where the hyperplanes are parallel to each other. The weights of the linear combination in each neuron are initialized randomly, and during the training phase they are updated by the back-propagation algorithm [58]. The learning rate is selected as 0.3. The algorithm is terminated when the amount of error reduction (if any) compared to the average of the last 10 epochs is less than 0.01. The ANN has a scalar output for each class; a given test feature vector is fed to the input, and the class corresponding to the largest output is selected.
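As a rough illustration of this training procedure, the sketch below trains a single-hidden-layer sigmoid network with batch back-propagation on a mean-squared-error loss. It is a simplified sketch, not the authors' implementation: biases are omitted, and the hidden-layer sizing rule and the 10-epoch early-stopping rule are replaced by a fixed epoch count.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_mlp(X, Y, n_hidden, lr=0.3, epochs=200, seed=0):
    """Train input->hidden->output sigmoid layers by batch back-propagation.
    X: (n, F) inputs; Y: (n, K) one-hot targets, one output per class."""
    rng = np.random.default_rng(seed)
    W1 = rng.uniform(-0.5, 0.5, (X.shape[1], n_hidden))  # random small weights
    W2 = rng.uniform(-0.5, 0.5, (n_hidden, Y.shape[1]))
    losses = []
    for _ in range(epochs):
        H = sigmoid(X @ W1)                 # hidden activations
        O = sigmoid(H @ W2)                 # one scalar output per class
        E = O - Y
        losses.append(float((E ** 2).mean()))
        dO = E * O * (1 - O)                # output-layer delta
        dH = (dO @ W2.T) * H * (1 - H)      # back-propagated hidden delta
        W2 -= lr * H.T @ dO / len(X)
        W1 -= lr * X.T @ dH / len(X)
    return W1, W2, losses
```

At test time, the predicted class is the index of the largest of the K outputs, as described above.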
Bayesian Decision Making (BDM):
In the training phase, a multi-variate Gaussian distribution with an arbitrary covariance matrix is fitted to the training feature vectors of each class. Based on maximum likelihood estimation, the mean vector is estimated as the arithmetic mean of the feature vectors and the covariance matrix is estimated as the sample covariance matrix for each class. In the test phase, for each class, the test vector’s conditional probability given that it is associated with that class is calculated. The class that has the maximum conditional probability is selected according to the maximum a posteriori decision rule [52].
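A minimal sketch of this classifier, assuming equal class priors (so the maximum a posteriori rule reduces to maximum likelihood) and adding a small diagonal regularization term for numerical stability; both are assumptions of this sketch, not details from the paper.

```python
import numpy as np

class BDM:
    """Per-class full-covariance Gaussian; classify by maximum log-likelihood."""
    def fit(self, X, y):
        self.classes = np.unique(y)
        self.params = {}
        for c in self.classes:
            Xc = X[y == c]
            mu = Xc.mean(axis=0)                         # ML mean estimate
            cov = np.cov(Xc, rowvar=False)               # sample covariance
            cov += 1e-6 * np.eye(X.shape[1])             # regularize (assumed)
            self.params[c] = (mu, np.linalg.inv(cov),
                              np.linalg.slogdet(cov)[1])
        return self
    def predict(self, X):
        scores = []
        for c in self.classes:
            mu, P, logdet = self.params[c]
            d = X - mu
            # Gaussian log-likelihood up to a class-independent constant
            scores.append(-0.5 * (np.einsum('ij,jk,ik->i', d, P, d) + logdet))
        return self.classes[np.argmax(scores, axis=0)]
```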
Linear Discriminant Classifier (LDC):
This classifier is the same as BDM except that the average of the covariance matrices individually calculated for each class is used for all of the classes. Since the Gaussian distributions fitted to the different classes have different mean vectors but the same covariance matrix in this case, the classes have identical probability density functions centered at different points in the feature space. Hence, the classes are linearly separated from each other, and the decision boundaries in the feature space are hyperplanes [57].
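The pooled-covariance variant can be sketched as follows (an illustrative NumPy version assuming equal class priors): because the covariance matrix is shared, the quadratic term of the Gaussian log-likelihood cancels across classes and the decision score is linear in the test vector.

```python
import numpy as np

def ldc_fit_predict(Xtr, ytr, Xte):
    """LDC: per-class means, one pooled covariance, linear decision scores."""
    classes = np.unique(ytr)
    mus = {c: Xtr[ytr == c].mean(axis=0) for c in classes}
    # average of the per-class sample covariance matrices
    pooled = np.mean([np.cov(Xtr[ytr == c], rowvar=False) for c in classes],
                     axis=0)
    P = np.linalg.inv(pooled + 1e-6 * np.eye(Xtr.shape[1]))  # regularized
    # shared covariance => x^T P x cancels; the score is linear in x
    scores = np.stack([Xte @ P @ mus[c] - 0.5 * mus[c] @ P @ mus[c]
                       for c in classes])
    return classes[np.argmax(scores, axis=0)]
```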
k-Nearest Neighbor (k-NN):
The training phase consists only of storing the training vectors with their class labels. In the classification phase, the class corresponding to the majority of the k training vectors that are closest to the test vector in terms of the Euclidean distance is selected [57]. The parameter k is chosen based on its performance among values ranging from 1 to 30.
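A minimal sketch of this classification rule, using brute-force Euclidean distances (illustrative only; ties in the vote are broken toward the smaller label by `argmax`):

```python
import numpy as np

def knn_predict(Xtr, ytr, Xte, k):
    """Majority vote among the k Euclidean-nearest training vectors."""
    d2 = ((Xte[:, None, :] - Xtr[None, :, :]) ** 2).sum(axis=-1)
    nn = np.argsort(d2, axis=1)[:, :k]           # indices of k nearest
    return np.array([np.bincount(ytr[row]).argmax() for row in nn])
```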
Random Forest (RF):
A random forest classifier is a combination of multiple decision trees [59]. In the training phase, each decision tree is trained by randomly and independently sampling the training data. Normalized information gain is used as the splitting criterion at each node. In the classification phase, the decisions of the trees are combined by majority voting. The number of decision trees is selected as 100 because we have observed that using a larger number of trees does not significantly improve the accuracy while increasing the computational cost considerably.
Orthogonal Matching Pursuit (OMP):
The training phase consists only of storing the training vectors with their class labels. In the classification phase, each test vector is represented as a linear combination of a very small subset of the training vectors with a bounded error; this is called the sparse representation. The vectors in the representation are selected iteratively by using the OMP algorithm [60], where an additional training vector is selected at each iteration. The algorithm terminates when the representation error drops below a preselected level. Then, a residual for each class is calculated as the representation error when the test vector is represented as a linear combination of the training vectors of only that class, and the class with the minimum residual error is selected.
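The decision rule can be sketched as below. This is an illustrative reconstruction: the atoms (training vectors) are assumed to be normalized to unit norm, and the error tolerance is an arbitrary placeholder, not the paper's actual stopping level.

```python
import numpy as np

def omp(D, x, tol):
    """Greedy OMP: pick the atom most correlated with the residual, refit the
    selected atoms by least squares, stop when the residual norm <= tol."""
    idx, coef, r = [], np.zeros(0), x.copy()
    while np.linalg.norm(r) > tol and len(idx) < D.shape[1]:
        k = int(np.argmax(np.abs(D.T @ r)))
        if k in idx:                       # no further progress possible
            break
        idx.append(k)
        coef, *_ = np.linalg.lstsq(D[:, idx], x, rcond=None)
        r = x - D[:, idx] @ coef
    return idx, coef

def src_classify(Xtr, ytr, x, tol=1e-3):
    """Class with the minimum residual when x is represented using only that
    class's (unit-normalized) training vectors."""
    best_c, best_r = None, np.inf
    for c in np.unique(ytr):
        D = Xtr[ytr == c].T
        D = D / np.linalg.norm(D, axis=0)  # unit-norm atoms (assumed)
        idx, coef = omp(D, x, tol)
        r = np.linalg.norm(x - D[:, idx] @ coef)
        if r < best_r:
            best_c, best_r = c, r
    return best_c
```

Classes whose training vectors span a subspace close to the test vector yield small residuals, which is why the minimum-residual rule works.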
To determine the accuracies of the classifiers, the leave-one-out (L1O) cross-validation technique is used [57]. In this type of cross validation, the feature vectors of a given subject are left out while the classifier is trained with the remaining subjects' feature vectors. The left-out subject's feature vectors are then used for testing (classification). This process is repeated for each subject. Thus, in our implementation, the dataset is partitioned into eight parts and there are 1140 feature vectors in each partition. L1O is highly affected by the variation in the data across the subjects and is hence more challenging than subject-unaware cross-validation techniques such as repeated random sub-sampling or multi-fold cross validation [61].
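The L1O procedure can be sketched as follows, with `fit_predict` standing in for any of the classifiers above (the function name and signature are illustrative):

```python
import numpy as np

def loso_accuracy(X, y, subject, fit_predict):
    """Leave-one-subject-out (L1O): train on all subjects but one, test on the
    held-out subject; repeat for every subject, returning per-fold accuracies."""
    accs = []
    for s in np.unique(subject):
        test = subject == s                      # held-out subject's vectors
        yhat = fit_predict(X[~test], y[~test], X[test])
        accs.append(float((yhat == y[test]).mean()))
    return accs
```

Averaging the returned list gives the accuracy figures reported below, and its spread gives the standard deviation shown in Figure 9.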
4.4. Comparative Evaluation Results
The activity recognition performance of the different data transformation techniques and classifiers is shown in Figure 9. In the figure, the lengths of the bars correspond to the classification accuracies, and the thin horizontal sticks indicate plus/minus one standard deviation about the accuracies averaged over the cross-validation iterations.
In the lower part of Figure 9, the accuracy values averaged over the seven classifiers are also provided for each approach and compared with the reference case, as well as with the proposed method. Referring to this part of the figure, the standard system that we take as reference, with fixed sensor orientations, provides an average accuracy of 87.2%. When the sensor units are randomly oriented, the accuracy drops by 31.8% on average with respect to the standard reference case. This shows that the standard system is not robust to incorrectly or differently oriented sensors. The existing methods for orientation invariance result in a more acceptable accuracy reduction compared to the reference case: the accuracy drop is 18.8% when the Euclidean norms of the tri-axial sensor sequences are taken, 12.5% when the sensor sequences are transformed to the Earth frame, 12.2% when the sensor sequences are represented along and perpendicular to the gravity vector, and 8.4% when the SVD-based transformation is applied.
Our approach that uses the sensor sequences together with differential quaternions, both with respect to the Earth frame, achieves an average accuracy of 82.5% over all activities with an average accuracy drop of only 4.7% compared to the reference case. Such a decrease in the accuracy is expected when the sensor units are allowed to be placed freely at arbitrary orientations because this flexibility entails the removal of fundamental information such as the direction of the gravity vector measured by the accelerometers and the direction of the Earth’s magnetic field detected by the magnetometers. Hence, the average accuracy drop of 4.7% is considered to be acceptable when such information related to the sensor unit orientations is removed inevitably.
In the lower part of Figure 9, we also provide the improvement achieved by each method compared to the random rotation case, which corresponds to the standard system using random sensor orientations. The method that we newly propose in this article performs the best among all the methods considered in this study when the sensor units are allowed to be placed at arbitrary orientations.
The activity recognition accuracy highly depends on the classifier. According to Figure 9, in almost all cases, the SVM classifier performs the best among the seven classifiers compared. SVM outperforms the other classifiers especially in the approaches targeting orientation invariance, where the classification problem is more challenging. The robustness of SVM under such non-ideal conditions is consistent with other studies [13]. Besides the SVM classifier, ANN and LDC also achieve high classification accuracy. Although reference [22] states that k-NN has been shown to perform remarkably well in activity recognition, it is not the most accurate classifier in our experiments.
To observe the recognition rates of the individual activities, a confusion matrix associated with the SVM classifier is provided in Table 1 for the proposed method. It is apparent that the proposed transformation highly misclassifies the stationary activities A1–A4. These activities contain stationary postures, namely sitting, standing, and two types of lying, which are misclassified probably because we remove the information about sensor orientation from the data. In particular, activity A1 (sitting) is mostly misclassified and confused with activities A3 (lying on back) and A7 (standing still in an elevator). The remaining stationary activities are also frequently misclassified. Among the 15 non-stationary activities, activities A10 and A11 (walking on a treadmill in flat and inclined positions, respectively) are confused with each other because of the similarity between the body movements in the two activities. Other misclassifications occur, although rarely, between activity pairs that have similarities. Activities A12 (running on a treadmill at a speed of 8 km/h) and A17 (rowing) are perfectly classified by SVM for the proposed method, probably because they are associated with unique body movements and do not resemble any of the other activities.
We present the classification performance separately for stationary and non-stationary activities in Figure 10. For each classifier and each approach, we calculate the accuracy values by averaging the accuracies of the stationary activities (A1–A4) and the non-stationary activities (A5–A19), respectively.
For stationary activities (see Figure 10a), an average accuracy of 81.2% is obtained for fixed sensor orientations. When the sensor units are oriented randomly, the average accuracy drops to 42.6%. The existing orientation-invariant methods exhibit accuracies between 31.7% and 62.2%, some being higher and some lower than the accuracy for random rotation; the Euclidean norm method performs particularly poorly in this case. The proposed method achieves an average accuracy of 66.8%, which is considerably higher than random rotation and all the existing orientation-invariant transformations. Although two of the existing transformations provide some improvement compared to the random rotation case, their accuracies are much lower than that of the standard reference system. Hence, removing the orientation information from the data makes it particularly difficult to classify stationary activities.
For non-stationary activities (see Figure 10b), the accuracy decreases from 88.8% to 58.8% on average when the sensor units are placed randomly and no transformation is applied. The existing orientation-invariant methods obtain accuracies ranging from 78.2% to 83.2%, which are comparable to the reference case with fixed sensor orientations. The method we propose obtains an average accuracy of 86.7%, which is higher than all the existing methods and only 2.1% lower than the reference case. This shows that when the sensor units are fixed to the body at arbitrary orientations, the proposed method can classify non-stationary activities with a performance similar to that obtained with fixed sensor unit orientations. In the last two rows of the confusion matrix provided in Table 1, the average accuracies of the stationary activities (A1–A4) and the non-stationary activities (A5–A19) are provided separately for the proposed method, again using the SVM classifier.
Referring to Figure 10a, we observe that the recognition rate of stationary activities highly depends on the classifier. On average, the best classifier is LDC, probably because the recognition of stationary activities is quite challenging and the LDC classifier separates the classes from each other linearly and smoothly in the feature space. For the proposed method, the OMP classifier performs much better than the remaining six classifiers. On the other hand, for non-stationary activities (see Figure 10b), the classifiers obtain comparable accuracy values, unlike the case for stationary activities. In this case, SVM is the most accurate classifier, both on average and for the proposed method.
4.5. Run Time Analysis
The average run times of the data transformation techniques per 5-s time segment are provided in Table 2. All the processing in this work was performed on a laptop with a quad-core Intel® i7-4720HQ processor at 2.6–3.6 GHz and 16 GB of RAM running 64-bit MATLAB® R2017b. The proposed method has an average run time of about 61 ms per 5-s time segment and can be executed in near real time since the run time is much shorter than the duration of the time segment.
The run times of the classifiers are presented in Table 3 for each of the seven data transformation techniques. Table 3a contains the total run times of the classifiers for an average cross-validation iteration, including the training phase and the classification of all the test feature vectors. We observe that k-NN, LDC, and BDM are much faster than the other classifiers for all of the data transformation techniques. Table 3b contains the average training time of the classifiers for a single cross-validation iteration. The k-NN and OMP classifiers only store the training feature vectors in the training phase; therefore, their training time is negligible. Among the remaining classifiers, training of BDM is the fastest. Table 3c contains the average classification time of a single test feature vector, extracted from a segment of 5 s duration. ANN and LDC are about an order of magnitude faster than the others in classification; the classification time of OMP is the largest. Note that, because of programming overheads, the total run times provided in Table 3a are greater than the training times (Table 3b) plus the per-vector classification times (Table 3c) multiplied by 1140 (the number of test feature vectors per L1O iteration).
This study is a proof of concept, providing a comparative analysis of the accuracies and run times of the proposed and existing methods as well as of state-of-the-art classifiers. Therefore, we have implemented these methods, along with the remaining parts of the activity recognition framework, on a laptop computer rather than on a mobile platform.
Given that the data transformation techniques and most of the classifiers have been implemented in MATLAB in this study, it is possible to further improve the efficiency of the algorithms by programming them in other languages such as C++, by implementing them on an FPGA platform, or by embedding them in wearable hardware. As such, our methodology can be handled by the limited resources of wearable systems. Alternatively, wirelessly transmitting the data acquired from the wearable devices to a cloud server would allow performing the activity recognition in the cloud [14]. Despite the latency issues that would arise in this case, this approach would provide additional flexibility and enable wearable applications to further benefit from the proposed methodology and the advantages of cloud computing.