1. Introduction
Human activity recognition (HAR) is a topic of broad and current interest for researchers across many applications, primarily those dealing with human-centric problems such as health [1], fitness [2], elderly care [3], surveillance-based security [4], or context-aware computing [5]. HAR essentially integrates sensing and reasoning in order to identify activities such as walking or sitting, which provides useful feedback regarding an individual’s behavior. For instance, in medical applications, patients with diabetes, neural, or heart problems are required to follow a well-defined exercise routine as part of their treatment and recovery process [6].
Over the past decades, significant developments in microelectronics have produced sensors with convenient characteristics (small size, low cost, and high computational power) that can be exploited in research areas like HAR, where extracting knowledge from the data acquired by these sensors can be very fruitful [7]. The two common modalities in this field are based on external and internal sensing. On the one hand, devices such as cameras can be placed at specific predetermined positions, where the detection of activities depends entirely on the interaction of the user with these devices. Videos and images are the main source of information in this case, and computer vision techniques are employed for decision making [8]. This method has numerous limitations. For example, if the user is not within range of the cameras, movements cannot be detected. Moreover, the installation and maintenance of vision equipment entail high costs. In addition, video processing algorithms are computationally expensive in both time and memory, which makes real-time HAR systems less practical. On the other hand, sensors can be attached directly to the user, which guarantees continuous data collection, as is the case for inertial and magnetic sensors. In inertial and magnetic sensor-based HAR, the designed system must be able to recognize human activities and postures using information acquired from accelerometers [9,10,11], gyroscopes, magnetometers, or their combination, i.e., HAR with inertial and magnetic measurement units (IMMUs) [12,13,14].
Given the tremendous growth in popularity of smartphones, tablets, and wearable devices, which are typically equipped with inertial and magnetic sensors, sensor-based HAR is particularly attractive: it offers innovative ways of understanding human behavior, as people hold their mobile devices or wear their smartwatches for most of the day.
Although such sensors can reliably measure the orientation and movement of body segments, recognizing specific patterns remains an open challenge. The most common approaches extract time- and/or frequency-domain features from the sensor data and then feed these features to a machine learning algorithm. Time-domain features are mainly statistical measures [15] (mean, standard deviation, root mean square, norm, histograms, etc.), while frequency-domain features are based on the Fourier transform. This feature extraction step can be inefficient, as loading databases with too many attributes can slow down the learning process and lead to a high computation cost (159 features in [16], for example).
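As an illustration of the conventional pipeline described above, the following minimal sketch computes a few time-domain statistics and one frequency-domain feature from a single signal window. The function names, the window length, the 50 Hz sampling rate, and the synthetic signal are assumptions chosen for the example; they are not taken from the cited studies.

```python
import cmath
import math


def time_domain_features(window):
    """Return mean, standard deviation and root mean square of one window."""
    n = len(window)
    mean = sum(window) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in window) / n)
    rms = math.sqrt(sum(x * x for x in window) / n)
    return {"mean": mean, "std": std, "rms": rms}


def dominant_frequency(window, fs):
    """Return the frequency (Hz) of the largest non-DC bin of a naive DFT."""
    n = len(window)
    best_bin, best_mag = 0, -1.0
    for k in range(1, n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(window))
        if abs(coeff) > best_mag:
            best_bin, best_mag = k, abs(coeff)
    return best_bin * fs / n


# Hypothetical 2 s window sampled at 50 Hz: a 2 Hz oscillation around 1.0
fs = 50.0
window = [1.0 + 0.5 * math.sin(2 * math.pi * 2.0 * i / fs) for i in range(100)]
features = time_domain_features(window)
features["dominant_freq"] = dominant_frequency(window, fs)
```

Repeating such computations for every axis of every sensor is what quickly inflates the number of attributes.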
Moreover, focusing on statistical features only can overshadow the physical significance of the data and thus lower the detection accuracy. Numerous studies have addressed this matter. In [17], it was shown that 112 features, extracted from the accelerometer and the gyroscope, are considered important; however, for specific applications they can be reduced to 19 features for the accelerometer and 23 for the gyroscope. For arm and hand side classifications, accelerometer features can be reduced to 4 or even 1. A similar study was conducted in [18]. In [19], a comparison was made between using angular velocity features and linear acceleration features separately and using their combination. Seven features (time- and frequency-domain features) were extracted from these measurements. The obtained results show the pertinence of each case (angular velocity or linear acceleration features) depending on the targeted event.
In this paper, the proposed approach uses a rather raw set of features instead of time- and frequency-domain features. It takes into account raw data from not only the gyroscope and the accelerometer but also the magnetometer. By combining them, we estimate the attitude, which then provides four new features (for the quaternion) or three new features (for Euler angles). In this way, we guarantee a small number of features that are not only rich in physical information but also applicable to any studied activity. We also assess the effect of the number of inertial sensors and their placement on the recognition of human postures and activities. We further show that using a lower number of features, with the quaternion (or Euler angles), in the recognition process leads to a lower computation time and to better accuracy for certain activities (going up and down stairs). As a machine learning approach, we use the k-nearest neighbors (KNN) algorithm, since it has been shown to achieve high classification accuracy in the literature [11,13,20]. More specifically, and as discussed in [21], an enhanced version of this algorithm, called subspace KNN, can be employed. The random subspace technique has been widely discussed in the literature [22,23] and has been shown to be effective when added to classic classifiers such as KNN.
The subspace KNN is fed with two different types of features in this paper. First, we work with raw data attributes only and assess whether data coming from the IMMU’s sensors, without any preprocessing, allow the subspace KNN to achieve efficient classification results. Second, we estimate the attitude (quaternion and Euler angles) of the human limb from the raw data of the IMMU’s sensors. Attitude estimation is a well-studied research area in navigation [24,25] but less so in HAR. The quaternion is then used as the feature set for the subspace KNN, which significantly improves the performance of the classifier. To the best of our knowledge, such features have not been used for this recognition problem, and they constitute one of the main contributions of this paper. We also discuss the comparisons that we conducted by varying the number and type of sensors and their possible body placements, and we draw several conclusions from them.
This paper is organized as follows. Section 2 presents the methodology we followed: the sensors used (and their measurements), the attitude estimation principle, the sensor placement and studied activities/postures, the data acquisition, and the classification methods. Section 3 presents a detailed discussion of the classification results for recognition with raw data and quaternion features, as well as a discussion of the computation time and accuracy of the proposed methods. We end the paper with some conclusions and future work in Section 4.
2. Methodology for HAR
2.1. Sensors and Raw Inertial and Magnetic Measurements
To achieve our goal related to HAR, we have at our disposal, for the experimental tests, a set of five wearable ‘Physilog’ modules from the Gait Up brand [26] (see Figure 1). Each module is a complete miniature IMMU equipped with a triad of a three-axis accelerometer, a three-axis gyroscope, and a three-axis magnetometer, built with micro-electro-mechanical systems (MEMS) technology. The raw data recorded from these sensors can be stored on the memory card that equips each module and then used in classification algorithms for further analysis and recognition. The five modules can be synchronized, which helps us analyze data from different human limbs and cross-check the results between them. For the ‘Physilog’ module, the raw data from the inertial and magnetic sensors is measured in the sensor’s coordinate system (or body coordinate system).
A three-axis accelerometer measures the specific force vector (the sum of linear acceleration and Earth’s gravity) and outputs its projection in the sensor’s coordinate system. A three-axis gyroscope measures the angular velocity vector of the sensor’s coordinate system, using the Coriolis effect to measure the angular rate. A three-axis magnetometer measures the direction and intensity of the magnetic field, in particular the Earth’s magnetic field vector, in the sensor’s coordinate system. The outputs of these sensors are usually corrupted by a noise vector, assumed to be white Gaussian with uncorrelated components.
2.2. Attitude Estimation Principle
In the ‘Physilog’ module, the raw data from the sensors involves two coordinate systems (see Figure 2): the sensor’s coordinate system and the inertial coordinate system (considered as the Earth’s coordinate system). The latter is defined according to the NED convention (north, east, down).
Then, we can define the rotation between these two coordinate systems as the attitude of the body segment. To determine the attitude adequately later in the experiments, we make sure that the principal axes of the IMMU (composed of the triad of sensors) coincide with those of the body inertia (human limb). The attitude of the body supporting the ‘Physilog’ module can be represented by a quaternion or by Euler angles. The quaternion is a hyper-complex number of rank 4 [27]. Euler angles are defined as a set of three angles (roll: rotation around the x-axis, pitch: rotation around the y-axis, yaw: rotation around the z-axis) [28].
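The relation between the two attitude representations can be sketched as follows. This is a minimal example that assumes a unit quaternion ordered as (w, x, y, z) and the aerospace ZYX (yaw, pitch, roll) rotation sequence; the paper does not state which convention it uses, so both choices are assumptions.

```python
import math


def quaternion_to_euler(q):
    """Convert a unit quaternion (w, x, y, z) to (roll, pitch, yaw) in radians.

    Assumes the ZYX (yaw-pitch-roll) rotation sequence.
    """
    w, x, y, z = q
    roll = math.atan2(2.0 * (w * x + y * z), 1.0 - 2.0 * (x * x + y * y))
    # Clamp to guard against rounding slightly outside [-1, 1]
    sin_pitch = max(-1.0, min(1.0, 2.0 * (w * y - z * x)))
    pitch = math.asin(sin_pitch)
    yaw = math.atan2(2.0 * (w * z + x * y), 1.0 - 2.0 * (y * y + z * z))
    return roll, pitch, yaw


# A rotation of 90 degrees about the z-axis (pure yaw)
half = math.radians(90.0) / 2.0
q_yaw = (math.cos(half), 0.0, 0.0, math.sin(half))
roll, pitch, yaw = quaternion_to_euler(q_yaw)
```

The quaternion yields four features per module and the Euler angles three, which matches the feature counts given in the introduction.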
The attitude estimation problem has received great attention in several application areas. Since the attitude is not directly measurable, it is reconstructed using estimation algorithms that merge measurements from several sensors, depending on the final application. This problem was originally formulated by Wahba [29] and consists in determining the optimal attitude by using at least two pairs of unit vectors measured in two different coordinate systems (the sensor and Earth ones in our case). Many solutions have been proposed for this problem. Some of the first methods are based on deterministic approaches: TRIAD [30], the QUaternion ESTimator (QUEST) [31], and the Singular Value Decomposition (SVD) method [32]. More recently, more efficient dynamic estimation methods have been proposed, such as Kalman filters (KF) [33,34,35] and observers [36,37]. A well-known survey of these methods can be found in [38].
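To make the deterministic family concrete, the sketch below implements the textbook TRIAD construction from two vector pairs. It is not the estimator used in this paper; the helper names and the numerical values of the gravity and magnetic field vectors are hypothetical.

```python
import math


def normalize(v):
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]


def triad_basis(v1, v2):
    """Build an orthonormal triad from two non-parallel vectors."""
    t1 = normalize(v1)
    t2 = normalize(cross(v1, v2))
    t3 = cross(t1, t2)
    return [t1, t2, t3]


def triad(body_1, body_2, ref_1, ref_2):
    """Return the 3x3 rotation matrix A such that body = A * ref."""
    tb = triad_basis(body_1, body_2)
    tr = triad_basis(ref_1, ref_2)
    return [[sum(tb[k][i] * tr[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


# Hypothetical reference vectors in the Earth (NED) frame
gravity_ref = [0.0, 0.0, 9.81]
magnetic_ref = [0.2, 0.0, 0.4]
# The same vectors seen by a sensor whose frame is rotated 90 degrees about z
gravity_body = [0.0, 0.0, 9.81]
magnetic_body = [0.0, -0.2, 0.4]
attitude = triad(gravity_body, magnetic_body, gravity_ref, magnetic_ref)
```

With noise-free inputs the construction recovers the rotation exactly; with real measurements it trusts the first vector pair more than the second, which is one reason dynamic estimators are often preferred.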
The use of inertial and magnetic sensors has grown in recent years on smartphones, tablets, etc., and a large number of dynamic estimation methods have been implemented on these connected objects. Since three-axis gyroscope data alone is not enough for attitude estimation, three-axis accelerometer and three-axis magnetometer data are added to obtain an absolute quaternion and to compensate for the gyroscope drift caused by bias. The essence of solving an attitude estimation problem with dynamic estimation methods lies in combining such inertial and magnetic sensor measurements in a relevant manner.
Figure 3 illustrates the general estimation scheme, where K represents the fusion gain between the data obtained from the accelerometer-magnetometer fusion and the data obtained from the gyroscope integration. This gain is calculated automatically via a specific equation in KFs, adjusted depending on sensor reliability in complementary filters, or calculated from a candidate Lyapunov function in observers.
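The role of the fusion gain can be illustrated with a deliberately simplified, single-angle complementary filter. This scalar sketch is an assumption made for readability: the paper estimates a full quaternion, and the gain value, bias, and signal values below are hypothetical.

```python
def complementary_filter(gyro_rates, accel_angles, dt, gain):
    """Fuse gyroscope rates (rad/s) with accelerometer-derived angles (rad).

    gain plays the role of K: 0 trusts the gyroscope integration only,
    1 trusts the accelerometer-derived angle only.
    """
    angle = accel_angles[0]  # initialize from the absolute measurement
    estimates = []
    for rate, measured in zip(gyro_rates, accel_angles):
        predicted = angle + rate * dt                      # gyroscope integration
        angle = predicted + gain * (measured - predicted)  # correction step
        estimates.append(angle)
    return estimates


# Hypothetical static limb at 0.1 rad; gyroscope with a constant 0.05 rad/s bias
dt = 1.0 / 50.0
n = 500
gyro = [0.05] * n
accel = [0.1] * n
fused = complementary_filter(gyro, accel, dt, gain=0.02)
gyro_only = complementary_filter(gyro, accel, dt, gain=0.0)
```

In this toy case the gyroscope-only estimate drifts steadily away from the true angle, while the fused estimate stays close to it, which is the drift compensation described above.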
Following this architecture, a typical IMMU can provide two vector observations expressed in two coordinate systems:
Acceleration in the sensor’s coordinate system, provided by the three-axis accelerometer, and its projection in the Earth’s coordinate system (where g is the gravity).
Earth’s magnetic field in the sensor’s coordinate system, provided by the magnetometer, and its projection in the Earth’s coordinate system, whose components can be obtained using the World Magnetic Model (WMM) [40].
The data fusion block produces a quaternion that updates the one estimated from the three-axis gyroscope data via the quaternion kinematic equation, in which the angular velocity data appears in quaternion form.
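For reference, the quaternion kinematic equation is commonly written in the attitude estimation literature as shown below. The symbols $q$, $\omega_q$, and $\otimes$ (quaternion product) are notation assumed here, since the original symbols are not reproduced in this text, and the authors’ exact form may differ.

```latex
\dot{q} = \frac{1}{2}\, q \otimes \omega_q,
\qquad
\omega_q = \begin{bmatrix} 0 & \omega_x & \omega_y & \omega_z \end{bmatrix}^{T}
```

Here $\omega_x$, $\omega_y$, and $\omega_z$ denote the three gyroscope outputs, so $\omega_q$ is the angular velocity written as a quaternion with zero scalar part.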
2.3. Sensors Placement and Studied Activities and Postures
Eight healthy subjects, four males and four females, aged between 19 and 46 years, participated voluntarily in this study. Each participant signed a written informed consent form before the measurements, and ethical approval was obtained. The characteristics of the volunteers are displayed in Table 1.
We provided each subject with the five synchronized ‘Physilog’ modules:
two modules placed on both feet,
two modules on the thighs, and
one module on the lower back.
Since experiments with the five ‘Physilog’ modules show that combining the right and left sides (feet and thighs) does not contribute significantly to the classification accuracy (0.03% improvement), we chose to present the results from three module locations: left foot (LF), left thigh (LT), and lower back (LB), as they are the most significant placements for the studied postures and activities. The choice between the right and left limb is arbitrary, since no differences were observed between them. The sensors were securely attached to the participant’s limbs using special straps provided by Gait Up. These straps prevent misalignment between the axes of the wearable sensors and those of the associated limb while recording the different protocols. We also consulted experts in biomechanics and compared the sensor outputs with a motion capture system, located in a dedicated room equipped with Vicon and OptiTrack cameras, to make sure that the sensors were fixed on the body in the most convenient way, with minimum alignment error.
The subjects were instructed to perform the activities and postures in their own way, without specific constraints; however, we asked them to follow certain protocols in which the order and the duration of the performed activities were specified. No restrictions were placed on the clothes or shoes worn by the participants (sneakers, boots, heels, etc.). Each subject carried out a test scenario composed of three different protocols performed separately, giving us a total of seven activities/postures to classify: standing, sitting, laying, leaning, walking, downstairs, and upstairs (see Figure 4).
The eight subjects were asked to follow a predefined order and duration for the proposed activities. This later enabled us to label the training data accurately, since the start and end times of each activity or posture were known. The three protocols are detailed in Table 2. Some of the performed activities or postures, such as ‘jumping jacks’, ‘wait’, and ‘turn’, were added for synchronization or labeling reasons. They play an important role in the detection of the seven targeted postures/activities, as they help us distinguish them from each other, especially when two activities performed successively are both static or both dynamic. For instance, during labeling, the jumping jack activity enables us to differentiate between static postures such as sitting and standing, which have very similar raw signals: the jumping jack performed between these two postures is highly dynamic, and thus it helps us detect the end of sitting and the start of standing (see the first protocol in Table 2). Similarly, the ‘turn + wait’ activity is performed to separate going upstairs from going downstairs (third protocol). This manipulation is done during the data preparation phase in order to obtain a clean and correctly labeled training dataset.
2.4. Data Acquisition and Preparation
The data acquisition was performed in a reproduced apartment environment over a period of about three hours in total. The apartment contained chairs, a sofa, a bed, a desk, some home appliances (TV, coffee machine, food mixer, etc.), a long hallway, stairs, and elevators, which enabled us to conduct all the studied activities in a very natural manner. No large magnetic perturbations were observed despite the fully realistic environment. A deeper study of the effect of high magnetic disturbances should be performed but is beyond the objectives of this paper. As discussed earlier, experiments were carried out on five possible locations, which are displayed in Figure 5. The sensors were configured through the ‘Physilog Research Toolkit’ (RTK) to set the range, the units, and the sampling frequency (50 Hz). To synchronize the recordings of the different modules, we identified one sensor as the master and the others as slaves, as recommended by the manufacturer [26]. It remains mandatory, though, to switch them on at the same time in order to obtain perfect synchronization, which is not easy to achieve every time and calls for a pre-treatment phase of the collected databases. As we applied three specific protocols for the seven activities, we know exactly when each activity should start and how long it should last. This enabled us to remove the samples that largely exceeded the expected number of samples for each activity (and for each IMMU) and to ensure the synchronization between the different IMMUs.
Raw data is recorded and stored on the IMMU’s memory card, then extracted using the RTK and organized in a ‘.csv’ file that is later converted into an Excel file for easier manipulation. As we chose the supervised machine learning framework, it was necessary to manually label the acquired raw data with the appropriate class (activity/posture). This phase is critical to the efficiency of the proposed approach, as it directly affects the training classes and can cause false recognition if some samples are mislabeled. Mislabeling may occur mainly because of the uneven number of samples between the collected databases, caused by delays and slightly unsynchronized recordings of the different sensors. A pre-treatment phase is therefore necessary to trim the beginnings and ends of the raw data in order to obtain synchronized databases for all sensors of each module. In addition, false labeling is likely to happen in the transition areas between two successive activities.
This is why we highlighted these transitions with the ‘jumping jack’ and ‘turn + wait’ activities, as discussed in Section 2.3. Since we are not interested in studying these transitions further, we eliminated them from the recordings.
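The labeling logic described in this section can be sketched as follows. The protocol durations below are hypothetical placeholders (they are not the contents of Table 2), the function name is illustrative, and the sketch only trims the end of a recording, whereas the actual pre-treatment also trims the beginning.

```python
FS_HZ = 50  # sampling frequency stated in the text

# Hypothetical protocol: (label, duration in seconds); not the actual Table 2
PROTOCOL = [("sitting", 4), ("jumping jacks", 2), ("standing", 4)]
# Segments used only for synchronization or labeling, removed afterwards
TRANSITION_LABELS = {"jumping jacks", "wait", "turn"}


def label_recording(samples, protocol, fs_hz=FS_HZ):
    """Trim a recording to the protocol length and label each kept sample."""
    expected = sum(duration for _, duration in protocol) * fs_hz
    trimmed = samples[:expected]  # drop samples beyond the protocol length
    labeled = []
    start = 0
    for label, duration in protocol:
        end = start + duration * fs_hz
        if label not in TRANSITION_LABELS:
            labeled.extend((s, label) for s in trimmed[start:end])
        start = end
    return labeled


# A recording that is 20 samples longer than the protocol expects
recording = list(range(520))
labeled = label_recording(recording, PROTOCOL)
```

Each remaining sample carries the class of the protocol step it falls in, and the transition segments no longer appear in the training set.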
2.5. Overview of the Proposed Approach for HAR
The proposed HAR methodology begins in the real world, where each wearable ‘Physilog’ module is affixed to a body segment of the person. First, the training step consists in collecting the raw data from the sensors (for each module), after sampling the signals, during a training scenario. In a first approach, this raw data is used directly as features to train the classifier (it forms the training set). We recall that the data is not preprocessed or filtered beforehand. In a second approach, the raw data is used to extract features (quaternion or Euler angles) through an attitude estimation algorithm (see Section 3.3). Second, the prediction step consists in using the trained classifier to assign new observations to their corresponding class. The overall method is illustrated in Figure 6.
The main goal of the classification process is to assign an object represented by a number of measurements (i.e., a feature vector) to one of a finite set of classes. To do so, a number of training samples are available for each class and are used to train the classifier. When new data is available, whether it is kept raw or transformed into attitude features, the classifier tries to predict its corresponding output using a learned function. This falls into the supervised learning category of pattern recognition. K-nearest neighbors (KNN) [41] is considered one of the simplest and most effective algorithms for achieving our objective. An advanced version of this algorithm, the subspace KNN, is used in this work; it combines the KNN algorithm with the random subspace technique. To estimate the effectiveness of the classifier, a validation technique needs to be adopted to test the accuracy of the recognition model when new data arrives.
To evaluate the efficiency of the proposed approach, we used leave-one-out cross-validation, which means that we train the model on n − 1 observations, validate it on the remaining observation, and repeat this operation n times. As we are working with eight subjects, the algorithm was executed eight times, where each time a different subject was used for testing and the other seven subjects for training. The accuracy results of the eight executions were very similar. For this reason, we display in this paper the results of one arbitrarily chosen execution.
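The subject-wise splitting described above can be sketched as a small generator. The function name and the toy dataset (three samples per subject) are assumptions for illustration only.

```python
def leave_one_subject_out(subject_ids):
    """Yield (held_out_subject, train_indices, test_indices) for each subject."""
    for held_out in sorted(set(subject_ids)):
        train = [i for i, s in enumerate(subject_ids) if s != held_out]
        test = [i for i, s in enumerate(subject_ids) if s == held_out]
        yield held_out, train, test


# Hypothetical dataset: 8 subjects with 3 samples each
subject_ids = [s for s in range(1, 9) for _ in range(3)]
folds = list(leave_one_subject_out(subject_ids))
```

Each fold trains on seven subjects and tests on the eighth, so no samples from the tested subject leak into the training set.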
2.5.1. K-Nearest Neighbors Algorithm
The KNN algorithm is a supervised learning algorithm that is instance-based and non-parametric. Instance-based means that the function is only approximated locally and that all computation is deferred until a prediction is required. In other words, there is no explicit training phase, and the training data points are not used to do any generalization. KNN is also non-parametric, which implies that it does not make any assumptions about the underlying data distribution. Thus, the model structure is determined from the data itself. The KNN algorithm is based on feature similarity: it identifies how closely the features of a new data point resemble those of the training set in order to assign that point to its corresponding class, as demonstrated in Figure 7. In this context, KNN performs a majority vote among the K instances most similar to a new unseen observation. This similarity is defined according to a distance calculated between two data points. One of the most used distance metrics is the Euclidean distance, which for two points $x$ and $y$ with $D$ features is given by $d(x, y) = \sqrt{\sum_{i=1}^{D} (x_i - y_i)^2}$.
Therefore, given a positive integer K, a new data point, and a similarity metric, KNN performs the following two steps: it computes the distance between the new point and every point of the training set and retains the K closest ones; it then assigns the new point to the class that is most represented among these K neighbors.
The choice of the value of K depends on the dataset. As a rule, the fewer neighbors we use (a small K), the more the model is subject to overfitting, since predictions follow the noise of individual training points. Using more neighbors (a larger K) makes the prediction more stable. However, if we use K = N neighbors, with N being the number of observations, we risk under-fitting: the model always predicts the majority class and therefore generalizes badly to observations that it has not seen yet. It is therefore necessary to select the optimal value of K for the given training set by running the classification algorithm several times with different values of K (from 1 to 30) until the best classification accuracy on new data (the testing set) is obtained. In our case, the optimal value of K was 5.
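A minimal KNN classifier with a simple search over K can be sketched as follows. The toy two-feature dataset, the class labels, and the candidate values of K are illustrative assumptions; the paper’s own experiments scan K from 1 to 30 on the recorded data.

```python
import math
from collections import Counter


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_x, train_y, query, k):
    """Classify query by majority vote among its k nearest training points."""
    order = sorted(range(len(train_x)),
                   key=lambda i: euclidean(train_x[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def accuracy(train_x, train_y, test_x, test_y, k):
    hits = sum(knn_predict(train_x, train_y, x, k) == y
               for x, y in zip(test_x, test_y))
    return hits / len(test_x)


# Hypothetical two-feature data with two well-separated classes
train_x = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
           [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]]
train_y = ["sitting"] * 3 + ["walking"] * 3
test_x = [[0.05, 0.1], [1.0, 0.95]]
test_y = ["sitting", "walking"]

# Evaluate several candidate values of K and keep the most accurate one
scores = {k: accuracy(train_x, train_y, test_x, test_y, k) for k in (1, 3, 5)}
best_k = max(scores, key=scores.get)
```

Note that all the work happens inside `knn_predict`, which reflects the absence of an explicit training phase.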
2.5.2. Subspace KNN Algorithm
As its name suggests, the subspace KNN is a method that combines the KNN algorithm described above with the random subspace (RS) technique. The RS algorithm [42] is an ensemble learning method that tries to reduce the correlation between different learners by training them on random samples of features, drawn with replacement (a feature can be selected more than once), instead of the entire feature set. This helps the individual learners avoid over-focusing on features that seem highly descriptive in the training set but are in fact less predictive for points outside the set.
The subspace KNN can be constructed using the following algorithm:
Let N be the length of the training set, D be the number of its features, and L be the number of individual models in the ensemble.
For each individual model l, we choose n_l (n_l < N) to be the number of input points for l. It is common to use a single value of n_l for all individual models.
For each individual model l, we create a training set by selecting d_l features from D with replacement, and we train the model.
Finally, we combine the outputs of the L individual models using the majority vote. This vote outputs the class that has been chosen the most across all the different subspaces. The advantage of this method is that it gives a different result for each subspace drawn from the main training set, from which the most recurrent one is then selected. This has proven beneficial for avoiding issues like overfitting [21,42].
To apply this algorithm, we first create L random samples, with replacement, of a given size n_l from the training set, and then we build a single KNN classifier for each sample, as shown in Figure 8. Next, each classifier votes on the class to which a new instance belongs, and the final prediction is determined through the majority vote.
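The ensemble described in this section can be sketched as follows. This is a minimal illustration under stated assumptions: every individual model uses all training points (the sampling of n_l input points is omitted), all models share the same subspace size, and the function names, toy data, and parameter values are hypothetical.

```python
import math
import random
from collections import Counter


def knn_predict(train_x, train_y, query, k):
    """Plain KNN: majority vote among the k nearest training points."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    order = sorted(range(len(train_x)), key=lambda i: dist(train_x[i], query))
    return Counter(train_y[i] for i in order[:k]).most_common(1)[0][0]


def project(point, feature_idx):
    """Keep only the selected features of a point."""
    return [point[j] for j in feature_idx]


def subspace_knn_predict(train_x, train_y, query, k,
                         n_models, n_features, seed=0):
    """Majority vote over KNN models trained on random feature subsets."""
    rng = random.Random(seed)
    total = len(train_x[0])
    votes = Counter()
    for _ in range(n_models):
        # Draw the feature subset with replacement, as described in the text
        idx = [rng.randrange(total) for _ in range(n_features)]
        sub_train = [project(p, idx) for p in train_x]
        votes[knn_predict(sub_train, train_y, project(query, idx), k)] += 1
    return votes.most_common(1)[0][0]


# Hypothetical four-feature data (e.g., one quaternion per sample)
train_x = [[0.0, 0.1, 0.0, 0.1], [0.1, 0.0, 0.1, 0.0], [0.2, 0.1, 0.1, 0.2],
           [1.0, 0.9, 1.0, 1.1], [0.9, 1.0, 1.1, 1.0], [1.1, 1.0, 0.9, 0.9]]
train_y = ["standing"] * 3 + ["walking"] * 3
prediction = subspace_knn_predict(train_x, train_y, [0.1, 0.0, 0.2, 0.1],
                                  k=3, n_models=5, n_features=2)
```

Using an odd number of models avoids ties in a two-class vote; with more classes, a tie-breaking rule would need to be chosen explicitly.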