Person Independent Recognition of Head Gestures from Parametrised and Raw Signals Recorded from Inertial Measurement Unit

Abstract: Numerous applications of human–machine interfaces, e.g. those dedicated to persons with disabilities, require contactless handling of devices or systems. The purpose of this research is to develop a hands-free, head-gesture-controlled interface that can support persons with disabilities in communicating with other people and devices, e.g. the paralyzed to signal messages or the visually impaired to handle travel aids. The hardware of the interface consists of a small stereovision rig with a built-in inertial measurement unit (IMU). The device is positioned on the user's forehead. Two approaches to recognizing head movements were considered. In the first approach, for various time window sizes of the signals recorded from a three-axis accelerometer and a three-axis gyroscope, statistical parameters were calculated, namely: average, minimum and maximum amplitude, standard deviation, kurtosis, correlation coefficient, and signal energy. In the second approach, the focus was on direct analysis of the signal samples recorded from the IMU. In both approaches, the accuracies of 16 different data classifiers in distinguishing the head movements pitch, roll, yaw, and immobility were evaluated. The recordings of head gestures were collected from 65 individuals. The best results on the testing data were obtained with the non-parametric approach, i.e. direct classification of unprocessed samples of IMU signals, using the Support Vector Machine (SVM) classifier (95% correct recognitions). Slightly worse results in this approach were obtained with the random forest classifier (93%). The achieved high recognition rates of the head gestures suggest that a person with a physical or sensory disability could efficiently communicate with other people or manage applications using simple head gesture sequences.
Unlike earlier studies in which IMUs were used to recognize types of whole-body physical activities, we studied how the length of the analysis time window affects the parametric representation of inertial signals used to recognize different types of head gestures.


Introduction
Human-System Interaction (HSI) is currently actively pursued as a separate research field dedicated to the development of new technologies facilitating human communication with systems [1]. Depending on the application, such interaction systems are also referred to as Human-Computer Interfaces (HCI) or more generally human-machine interfaces. The design and construction of such systems requires an interdisciplinary research approach and involves knowledge of sensory perception mechanisms in humans, cognitive processes, and information processing, as well as basics of ergonomics. A well-designed user interface often determines the usefulness of an entire system [2].
Designing interfaces accessible to persons with sensory and motor disabilities is a particularly challenging research issue [3,4]. It is necessary to develop innovative solutions that enable users with disabilities to communicate with such devices or systems. In our work, we show that time domain analysis of inertial signals can outperform parameter-based approaches and is less computationally demanding. The latter feature is particularly important for mobile implementations of the interface. The ultimate motivation for our research is to develop user-friendly and efficient electronic travel aids that will help the visually impaired retain orientation and mobility in unfamiliar environments [26,27].
The remainder of this paper is organized as follows. In Section 2, we describe the experimental methods used for data recording and pre-processing, and briefly review the applied data classifiers. In Section 3, we present the head gesture recognition results for the parametric and time domain representations of the IMU signals. Finally, in Section 4, we appraise the achieved results, describe the envisioned applications of our study, and point out the limitations of the presented work.

Materials and Methods
In our proof-of-concept approach, we used a DUO MLX device (see Figure 1b) equipped with a stereovision camera, a three-axis electronic gyroscope, a three-axis acceleration sensor, a thermometer, and a magnetometer [28]. The device has a small form factor (52 × 25 × 13 mm) and weighs 12.5 g. It is mounted rigidly on the user's forehead (see Figure 1a). The study was conducted with 65 persons. The participants were mainly second-year university students, 44 of whom were women. The trials were approved by the bioethics commission of the Medical University of Lodz (No. RNN/261/16/KE). All the trial participants were informed about the purpose of the study, the materials used, and their role in the trial sessions. After the DUO MLX device was fixed on their forehead, the users performed rehearsal head movements and were instructed to keep the movements within a comfort zone of their head positions. During data collection, each participant sat straight in a chair, did not change position, did not touch the device or shift its position on the forehead, and performed only the given motions (yaw, pitch, roll, and immobility). The aim of the study was to automatically classify three basic head gestures: roll (head movement "sideways"), pitch (head movement "up-down"), and yaw (head movement "left-right"), as well as immobility (head at rest). A single movement cycle lasted 2.5 s on average, with a standard deviation of 1.2 s. The recordings of the test and learning datasets from one individual took approx. 4 min each.
Sample plots of the signals recorded by the acceleration sensor and the gyroscope are shown in Figure 2. The recorded accelerations ax, ay, and az along axes Ox, Oy, and Oz respectively are plotted in Figure 2a and angular velocities ωα, ωβ, and ωγ (as shown in Figure 1b) recorded from the gyroscope are plotted in Figure 2b. All the signals are recorded at a sampling rate of 100 Hz.

Signal Recordings and Pre-Processing
Signal recordings from the 65 participants were grouped into the following datasets:

1. The learning dataset, recorded from 53 randomly selected participants, i.e. approx. 80% of all 65 trial participants. The learning signals were recorded in the following sequence: 10 s of immobility, 2.5 min of a specific motion type, 10 s of immobility; the gesture types performed by the participants were yaw, roll, pitch, and immobility. The collection of learning signals therefore consisted of 53 × 4 = 212 recordings.
2. Two types of testing datasets that were not used for training the classifiers:


- T_53_set, recorded for the 53 trial participants for whom the learning data were recorded;
- T_12_set, recorded for the remaining trial participants, i.e. 12 individuals who did not take part in the recordings of the learning datasets.

Further, for each of the two testing datasets, two different testing scenarios were applied:

- Testing scenario T1, for which the dataset was recorded in the sequence: 20 s of immobility, 40 s of yaw head movement, 20 s of immobility, 40 s of roll, 20 s of immobility, 40 s of pitch, 20 s of immobility; thus, two testing datasets were recorded, T1_53_set and T1_12_set, respectively.
- Testing scenario T2, for which the datasets were recorded according to the following procedure: head gesture sequences of roll, yaw, and pitch performed in random order for periods lasting from 5 to 20 s, with no immobility gaps between the gestures; thus, two further testing datasets were recorded, T2_53_set and T2_12_set, respectively.
The signals recorded from the IMU were processed according to the two following procedures:

1. Signal parameterization (SP): for each of the six recorded IMU signals, i.e. three signals from the accelerometer (ax, ay, az) and three signals from the gyroscope (ωα, ωβ, ωγ), the following features were extracted: (1) average, (2) minimum, (3) maximum, (4) standard deviation, (5) kurtosis, (6) correlation coefficients for pairs of signals from the accelerometer and pairs of signals from the gyroscope, and (7) signal energy. Thus, the total number of parameters derived from the signals was 6 signals × 7 parameters = 42 parameters.
2. Time domain representation (TDR): current samples of the signals recorded from the accelerometer (ax, ay, az) and gyroscope (ωα, ωβ, ωγ) were used directly as six-element training vectors for the classifiers.
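As an illustration, the SP feature extraction can be sketched as follows. This is a minimal reconstruction, not the authors' code; in particular, the exact pairing of channels for the correlation coefficient is our assumption (each channel is correlated with the next one within its accelerometer or gyroscope triple):

```python
import numpy as np
from scipy.stats import kurtosis

def extract_features(window):
    """Compute the 7 per-signal statistics for a window of IMU samples.

    window: array of shape (n_samples, 6), columns ax, ay, az, wa, wb, wg.
    Returns a flat 42-element feature vector (6 channels x 7 statistics).
    """
    acc, gyro = window[:, :3], window[:, 3:]
    feats = []
    for i in range(window.shape[1]):
        s = window[:, i]
        # Correlation partner: pair each channel with the next one inside
        # its own sensor triple (accelerometer or gyroscope). This exact
        # pairing is a simplifying assumption on our part.
        group = acc if i < 3 else gyro
        j = (i % 3 + 1) % 3
        corr = np.corrcoef(s, group[:, j])[0, 1]
        feats.extend([
            s.mean(),        # (1) average
            s.min(),         # (2) minimum
            s.max(),         # (3) maximum
            s.std(),         # (4) standard deviation
            kurtosis(s),     # (5) kurtosis
            corr,            # (6) correlation coefficient
            np.sum(s ** 2),  # (7) signal energy
        ])
    return np.asarray(feats)
```

Applied to a 0.1 s window of 10 samples at the 100 Hz sampling rate, this yields one 42-element vector of the kind used by the SP-trained classifiers.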
In order to eliminate user errors related to an excessively slow start of the selected type of movement, too rapid cessation of motion, or an incorrect initial movement, the learning datasets comprised the central approx. 90 s of each 150 s recording.
According to the described recording procedure of the learning datasets (collected from 53 individuals), the datasets listed above were built. The datasets are available for download at [29].

The Classifiers
For the purpose of classifying the head gestures on the basis of the recorded IMU signals, data classifiers generally recognized as very efficient were applied [30]. In particular, various architectures of decision trees and decision forests were used, since they provide insight into the decision process, i.e. these classifiers can decompose the classification task into decision rules over the input data features. The applied data classifiers were the following: (1) a decision tree, (2) a decision tree with a minimum number of samples per leaf equal to 5, (3) a random forest consisting of 10 decision trees, (4) a random forest consisting of 10 decision trees, each with a minimum number of samples per leaf equal to 5, (5) the k-nearest neighbors method (k-NN) for k ∈ {1, 3, 5, 7, 9, 11, 13, 15, 17, 19}, (6) Support Vector Machines (SVM) with a radial basis function kernel, and (7) the SVM with a third-degree polynomial kernel.
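Assuming an implementation in scikit-learn (which the paper reports using), the evaluated classifier bank might be instantiated as follows; the hyperparameter names follow scikit-learn's API, and this sketch lists one representative k-NN (k = 7) instead of all ten k values:

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Hypothetical reconstruction of the evaluated classifier bank,
# mirroring the configurations enumerated in the text.
classifiers = {
    "tree": DecisionTreeClassifier(),
    "tree_minleaf5": DecisionTreeClassifier(min_samples_leaf=5),
    "forest10": RandomForestClassifier(n_estimators=10),
    "forest10_minleaf5": RandomForestClassifier(n_estimators=10,
                                                min_samples_leaf=5),
    "knn7": KNeighborsClassifier(n_neighbors=7, metric="euclidean"),
    "svm_rbf": SVC(kernel="rbf"),
    "svm_poly3": SVC(kernel="poly", degree=3),
}
```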

Decision Trees and Random Forest
A decision tree is a supervised non-parametric classifier that builds decision rules by multistage division of the dataset into disjoint classes [31].
Different algorithms can be used for building decision trees. In this work, we employed the CART (Classification and Regression Trees) algorithm and used the so-called Gini index as a measure of class impurity in the tree nodes [25]. Classification trees can be combined into groups to form so-called random forests. The final classification of an input vector is performed by voting, i.e. the vector is assigned to the class for which it receives the largest number of votes from the individual decision trees. Random forests are efficient classifiers for very large datasets [32,33].

The k-Nearest Neighbor classifier
The k-Nearest Neighbor classifier (k-NN) is another non-parametric data classification technique. It is an instance-based learning algorithm that works on a simple scheme in which a data sample x is assigned to the class holding the majority of data prototypes among its k nearest neighbors [34]. The k value should be selected so as to retain the balance between overfitting and underfitting. We evaluated the performance of this classifier for k ∈ {1, 3, 5, 7, 9, 11, 13, 15, 17, 19} neighbors. However, we present only the best classification results, obtained for k = 7 for the SP training procedure, i.e. with IMU signal parametrization, and for k = 19 for the time domain analysis. In all the tested k-NN classifiers, the Euclidean distance was used as the distance metric.

Support Vector Machines (SVM)
The SVM classifier is particularly suitable for classifying high-dimensional datasets [35,36]. The basic optimization goal of this classifier is to maximize the margin, i.e. the between-class separation region. The margin is defined as the distance between the decision boundary (the separating hyperplane) and the nearest learning samples, called the support vectors; the aim is to obtain decision boundaries with the widest possible margins. In our implementation of the SVM classifier for the multiclass problem, the one-vs-rest classification scheme was adopted. We tested the performance of these classifiers for regularization parameter values C ∈ {0.1, 1.0, 5.0, 10.0} for the two types of kernels, an RBF kernel and a third-degree polynomial. The best classification results were obtained for C = 1.0 for the parametrized signals and for C = 10.0 for the time domain analysis. Thus, we present the classification results for the regularization parameter values for which the corresponding SVM classifiers performed best.
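A sketch of this setup in scikit-learn: the one-vs-rest wrapper around an SVC with the stated C grid. The use of `GridSearchCV` for the C sweep is our assumption about tooling, not something the paper specifies:

```python
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

def make_ovr_svm(kernel="rbf", degree=3):
    """One-vs-rest SVM with the C grid described in the text.

    Hypothetical harness: GridSearchCV selects the best C by internal
    cross-validation; "estimator__C" addresses C inside the wrapped SVC.
    """
    base = OneVsRestClassifier(SVC(kernel=kernel, degree=degree))
    return GridSearchCV(base, {"estimator__C": [0.1, 1.0, 5.0, 10.0]}, cv=3)
```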

Classifier Training and Results
The classifiers were trained according to the procedures described in Section 2.1. The main aim was to compare the performance of different classifiers for parametrized signals and raw signal samples recorded from the inertial sensors.
The 10-fold cross validation method was used to evaluate the classifiers' performance in the head gesture classification task. The training sets (for both the SP and TDR training procedure) were therefore randomly divided into 10 different subsets of equal cardinality. Each subset was sequentially used as the validation set while the other nine subsets served as a training set for the classifiers. Each classifier was cross validated on the same subsets, i.e. the division of the training and validation sets was made first and then each classifier was trained on identical data subsets. Tables 1 and 2 report the results of the 10-fold cross-validation for the SP and TDR training procedures, respectively. For the classifiers trained on the parametrized signals (training procedure SP), the random forests and the SVM with a third degree polynomial kernel excelled and achieved accuracies over 98%. The poorest classification results were obtained for the k-NN classifier, but even this classifier achieved accuracies exceeding 95%.
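The shared-fold protocol described above can be sketched as follows: fixing the `KFold` split once and reusing the same indices for every classifier guarantees that all models are compared on identical training and validation subsets (a minimal sketch, not the authors' code):

```python
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import KFold

def cross_validate_shared_folds(classifiers, X, y, n_splits=10, seed=0):
    """Evaluate every classifier on the SAME train/validation folds."""
    folds = list(KFold(n_splits=n_splits, shuffle=True,
                       random_state=seed).split(X))
    scores = {}
    for name, clf in classifiers.items():
        accs = []
        for train_idx, val_idx in folds:  # identical folds for all models
            model = clone(clf).fit(X[train_idx], y[train_idx])
            accs.append(model.score(X[val_idx], y[val_idx]))
        scores[name] = float(np.mean(accs))
    return scores
```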
For the classifiers trained directly on signal samples (training procedure TDR), all classifiers except the SVM achieved accuracies above 97%. Even so, the SVM classifier still yielded no fewer than 92% correct head gesture recognitions.
We evaluated the performance of the classifiers and checked whether the accuracies they yield are statistically different. Because we compare more than two classifiers and our data are not normally distributed, the Friedman test was used for this purpose [37]. For both training procedures, i.e. the SP (parametrization of IMU signals, generating 42-dimensional training vectors) and the TDR (raw IMU signal samples, generating six-dimensional training vectors), we rejected the null hypothesis, i.e. we concluded that there are statistically significant differences (p < 0.05) between the classifiers. We also compared the chosen seven classifiers (Tables 1 and 2) by conducting paired Wilcoxon signed-rank tests. For the SP training procedure, the null hypothesis was rejected for all classifier pairs. In the case of the TDR training procedure, there was no significant difference between the random forest and the k-NN for k = 19 (the null hypothesis was accepted); for the other classifier pairs under the TDR procedure, we rejected the null hypothesis, i.e. their performances differ in a statistically significant way.
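With SciPy, this test procedure can be sketched as below, assuming the per-fold accuracies of each classifier are available (one list per classifier, one entry per cross-validation fold); this is an illustrative harness, not the authors' script:

```python
from scipy.stats import friedmanchisquare, wilcoxon

def compare_classifiers(fold_accuracies):
    """fold_accuracies: dict mapping classifier name -> per-fold accuracies.

    Returns the Friedman p-value over all classifiers and a dict of
    pairwise Wilcoxon signed-rank p-values.
    """
    # Friedman test: one measurement column per classifier.
    _, p_friedman = friedmanchisquare(*fold_accuracies.values())
    names = list(fold_accuracies)
    pairwise = {}
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            a = fold_accuracies[names[i]]
            b = fold_accuracies[names[j]]
            # Paired test: the folds are identical for both classifiers.
            pairwise[(names[i], names[j])] = wilcoxon(a, b).pvalue
    return p_friedman, pairwise
```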

Results for Datasets Consisting of the Parametrised IMU Signals
For the parametrized signals (the SP training procedure), the classifier performances were evaluated for different widths T of the time window over which the statistical parameters of the signals were calculated. The results obtained for the combined test scenarios T1 and T2 are shown in Figure 4 for different time window sizes. In addition, Figure 5 compares the head gesture classification results obtained for the two testing datasets, i.e. the test data from trial participants whose data were used for training the classifiers and the test data from trial participants whose data were not used for training; the results shown in Figure 5 thus reflect the user-independent performance of the system. Interestingly (as illustrated in Figure 4), the best results were obtained for the shortest time window, i.e. the T = 0.1 s window containing 10 signal samples. Increasing the width of the time window negatively affected the classification accuracy. Note that for the test dataset, regardless of the width of the time window, the best classifier was the SVM with a third-degree polynomial kernel (accuracy better than 92%). The best classification accuracy for the k-NN classifier was obtained for k = 7 and was slightly worse than that achieved by the random forest classifier (both with an accuracy above 91%).
For the shortest time windows, the accuracies of individual classifiers tend to differ, whereas for larger time windows, the performance of all the classifiers converges to similar albeit poorer values. Figure 5 illustrates in more detail how the classifiers performed for the shortest time window (T = 0.1 s).

Results for Datasets Taken Directly from IMU Signal Samples
Subsequently, we evaluated the classifier performances for testing datasets drawn directly from the samples of IMU signals. The achieved accuracies on the test datasets are shown in Figure 6, which compares the classifiers' accuracy for the test data from participants whose recordings were used to train the classifiers (T_53_set) and for the test data from individuals who did not record training data (T_12_set); the test data were collected directly from the accelerometer and gyroscope (six-element vectors).
Thus, we can note that for the classifiers trained on the IMU signal samples, the best accuracies exceeding 95% were achieved for the SVM classifiers. Also, an important observation is that their performance did not depend on the test data type (T_53_set or T_12_set).

Person Independent Recognition of Head Gestures
Here we report on the performance of the classifiers for the test data recorded from the 12 trial participants whose IMU signals were not used for training the classifiers. In Figures 5 and 6, we compare the recognition rates of head gestures for the 12 new users with the classification results obtained for the test dataset recorded from the 53 trial participants. Figure 5 contains the results for the shortest time window T = 0.1 s, for which 42 statistical parameters were computed and used for training the classifiers, whereas Figure 6 shows the results for the classifiers trained on IMU signal samples, i.e. on six-element vectors.
For the person-independent test datasets, i.e. for the 12 participants who did not take part in training the classifiers, the best head movement recognition results for the 42-parameter vectors were obtained for the SVM and the random forests; all of these classifiers achieved an accuracy of about 90% or higher (Figure 5). Note also that for the person-independent test datasets, the results were inferior to those for the T_53_set test data (i.e. the person-dependent dataset), except for the SVM classifiers, which performed equally well on the two test datasets (for the TDR training procedure).
On the other hand, for the classifiers trained directly on the IMU signal samples, head gesture recognition accuracy exceeded 90% for all classifiers except the decision tree (see Figure 6). Interestingly, the accuracy of all classifiers except the SVM was better for T_12_set; the SVM, however, performed best on both the T_53_set and T_12_set test datasets.

Head Gesture Recognition Rates for T1 and T2 Testing Scenarios
We also verified how the classifiers' accuracy depends on the type of testing scenario, as defined in Section 2.1. For the causal system, the best results were obtained for testing scenario T1 (see Figure 7); the accuracy of all classifiers exceeded 90%. The results obtained for testing scenario T2 were distinctly inferior: for this scenario, the accuracy of the classifiers fell below 90%. Again, the best classifiers were the SVM and the random forests. For the signal samples recorded directly from the IMU (six-element vectors), almost all classifiers' accuracies were above 90%. As is evident in Figure 8, except for the SVM, the best results were obtained for test scenario T2. Note that for the SVM classifier, the recognition accuracy for both scenarios was almost identical and exceeded 95%. Figure 8. Comparison of the classifiers' accuracy obtained for test scenarios T1 and T2, i.e. for the test data recorded for individuals who did not record the training data (T_12_set); classifiers' accuracy for the test data collected directly from the accelerometer and gyroscope (six-element vectors).
We also verified which head gestures were most often confused with one another for the classifiers that yielded the best results, i.e. the SVM classifier with a third-degree polynomial kernel and regularization parameter C = 1.0 for the SP training procedure, and the SVM with an RBF kernel and C = 10.0 for the TDR training procedure. The results are presented as confusion matrices in Tables 3 and 5. These tables show the results obtained for the 12 trial participants who did not take part in training the classifiers.
As shown in Table 3, the highest error rate was reported for pitch (5.5% of the gestures recognized as immobility). A considerably high error rate was also noted for roll (4.6% of the gestures recognized as immobility). The best recognition rate was obtained for head immobility (92.6%). Interestingly, immobility also yielded the highest number of false positive recognitions. The sensitivity, specificity, and F1 scores of the head gesture recognition are presented in Table 4. Note that the F1 score is defined as the harmonic mean of the sensitivity and the positive predictive value; it is a good measure of overall classifier performance. Table 4. Statistical measures showing detection results of head gestures for the SVM classifier with a third-degree polynomial kernel and C = 1.0 (the classifier trained on parametrized IMU signals).
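For reference, per-class sensitivity, specificity, and F1 values of this kind can be derived from a confusion matrix as follows (an illustrative computation with made-up numbers, not the paper's actual results):

```python
import numpy as np

def per_class_metrics(cm):
    """Per-class sensitivity, specificity and F1 from a confusion matrix.

    cm: square matrix, rows = true classes, columns = predicted classes.
    """
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)                       # correctly recognized gestures
    fn = cm.sum(axis=1) - tp               # missed gestures of each class
    fp = cm.sum(axis=0) - tp               # other gestures taken for this class
    tn = cm.sum() - tp - fn - fp
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    ppv = tp / (tp + fp)                   # positive predictive value
    # F1 = harmonic mean of sensitivity and positive predictive value
    f1 = 2 * sensitivity * ppv / (sensitivity + ppv)
    return sensitivity, specificity, f1
```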
For the SVM classifier trained on IMU signal samples (Table 5), the distribution of errors was different: in this case, immobility was frequently falsely recognized as yaw or as pitch (4.3% and 3.9% error rates, respectively). Note that both the sensitivity and specificity rates in Table 6 (classifier trained on raw IMU signals) outperform the corresponding rates in Table 4 (classifier trained on parametrized IMU signals). Table 6. Statistical measures showing detection results of head gestures for the SVM classifier with an RBF kernel and C = 10.0 (the classifier trained on raw IMU signals).
The training times of the classifiers were also evaluated. The computations were performed on a Windows 10 64-bit PC with an Intel Core i5-7500 processor and 16 GB RAM. The scripts were written in Python in the Enthought Canopy environment using the scikit-learn (sklearn) module. The calculation times required for training the classifiers are presented in Table 7. Depending on the type of dataset (42-element or 6-element vectors), training datasets of different sizes were used; hence, the training times of the classifiers varied significantly. For the 42-element training vectors, the longest training time did not exceed 6 min. The training procedure was fastest for the k-NN classifier, whereas it took several minutes to train the SVM classifier.
For the 6-element training vectors (and the correspondingly larger training datasets), the shortest training times were obtained for the k-NN classifiers. Training the decision trees (and random forests) took tens of seconds, while the longest training times were required for the SVM classifiers (a few hours). Table 8 shows the time necessary for the trained classifiers to recognize a head gesture from a single vector. The computation times required for processing the data by the trained classifiers (on a PC with an Intel Core i5-7500 processor) vary from 1 ms for the decision tree classifier to 11 ms for the SVM classifier with an RBF kernel. It should be noted that for the parametrized approach, the statistical parameters must be computed first; for the shortest time window consisting of 10 signal samples, we need to wait 0.1 s before these parameters can be fed to the classifiers.
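A per-vector prediction latency of this kind can be measured with a simple harness such as the following (our measurement sketch, not the authors' benchmark code):

```python
import time
import numpy as np

def predict_latency_ms(clf, x, n_repeats=100):
    """Average per-call prediction latency, in milliseconds.

    clf: a trained classifier with a scikit-learn-style predict() method.
    x: a single feature vector (6 or 42 elements in the paper's setting).
    """
    x = np.atleast_2d(x)                 # predict() expects a 2D array
    t0 = time.perf_counter()
    for _ in range(n_repeats):
        clf.predict(x)
    return (time.perf_counter() - t0) / n_repeats * 1e3
```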

Discussion
Our primary motivation for the study was to build a hands-free interface for the visually impaired that would facilitate controlling an electronic travel aid developed in our project to help the visually disabled in independent mobility and travel [26]. The main role of the DUO MLX device and the image processing software is to recover the 3D structure of the environment from the captured stereovision images. The sequences of depth images are then sonified and acoustically presented to the visually impaired user. In [38], we showed that such a non-visual presentation of the environment can help the blind detect obstacles and find safe walking paths in their surroundings. One of the complications we encountered in the conducted trials is the handling of the device by blind users. The use of an additional remote control to manipulate the interface proved inconvenient and occupied the other hand of a user who was carrying a white cane. An example scenario of how the interface might be used is the following. The navigation system generates a series of audio-haptic signals to indicate obstacles or inform about points of interest. Such a system features numerous settings such as loudness, sensitivity range, sonification, and haptic activation schemes. The blind user, while standing still, can navigate the system menu with a series of head gestures and select the appropriate setting for the encountered environment or the navigation task at hand. System settings are confirmed by synthesized voice messages. It is important to note that a blind person carries a white cane that occupies one hand, and the proposed solution does not engage the other hand to control the device. Also, after consulting the visually impaired, we plan another application of the interface. The device can play the role of a remote control that aids the visually impaired in activities of daily living, e.g.
it can be used to control the radio/TV volume without the need to search for the remote control of such devices.
Compared with the study reported in [24], we obtained a comparable F1-score of 92% for the parametrized IMU signals. It should be noted, however, that in [24] and in [17], individual gestures were not distinguished, but rather whole-body activities, e.g. walking, running, and cycling; hence the need for longer data analysis windows: 5 s, 3 s, and 1 s in [24], and 5 s and 12 s in [17]. In our case, we recognize delicate head movements, so the shorter the time window, the better the motion recognition. In [17,18], the IMU sensors were attached to the wrist; in [24], two to five sensors were used and mounted on different parts of the body; while in [19], the IMU was mounted on the head, as in our work. An advantage of our study is that 65 people participated in the experiments, so our database is more diverse than e.g. that of [19], where only five people took part in the trials. None of the works [17–19,24] compared the recognition results of body movements for data derived from parametrized signals with those for time domain samples of IMU signals. We also tested the recognition rates of a large pool of different data classifiers, i.e. decision trees and forests (and their pruned versions), k-NN, and SVMs (with an RBF kernel and a third-degree polynomial kernel). Finally, we show that better classification results can be obtained if the classifiers are trained on raw IMU signals rather than on the parametrized signals.
We are aware of the limitations of our study. The trial participants were mainly young people recruited from among students; thus, the usability of the system for elderly people remains an open question. Secondly, the trial participants remained seated during data collection; for some system applications, this limitation should be relaxed. Also, we should underline that our choice of IMU signal parameters, e.g. the statistical parameters, was arbitrary, and one might hypothesize that a different set of parameters could be composed that would further improve the recognition performance of the head gestures.
Finally, the proposed system would need to be tested with target groups recruited from persons with physical or sensory disabilities.

Conclusions
In the presented study, we have compared two approaches to the classification of head gestures, i.e. training the classifiers on the parametrized signals, and training on direct signal samples recorded from the IMU.
The obtained results of head gesture recognition allow us to formulate the following conclusions:

1. Head movements (roll, yaw, and pitch) can be efficiently recognized on the basis of signal recordings from an IMU positioned on the user's forehead.
2. The data classifiers trained on IMU signal samples outperformed the classifiers trained on a set of statistical parameters derived from the IMU signals. These performance differences were confirmed by statistical tests: the Friedman test revealed that the classifiers yield statistically different results (p < 0.05), and the paired Wilcoxon signed-rank test confirmed statistically valid differences for most of the classifiers (the only exception being the lack of a significant difference between the random forest and the k-NN classifier for k = 19). Our explanation for the high head gesture recognition performance from raw IMU signals is that the six signal channels carry rich enough information for the classifiers to confidently recognize the head gestures. We also conclude that the inherent measurement noise (which occurs randomly) is averaged out during the training of the classifiers.
3. The SVM classifier outperformed the other classifiers in recognizing head gestures and, when trained on the IMU signal samples, achieved a recognition accuracy above 95%.
4. The recognition rates of head gestures for the person-independent test dataset, i.e. the data from the 12 persons whose recordings were not used in training the classifiers, are comparable to the recognition rates obtained for the data recorded by the individuals who also recorded the training data (see Figures 5 and 6). Interestingly, when the classifiers were trained on raw IMU signal samples, their recognition performance was generally better for the data recorded from participants who did not train the classifiers.
5. The proposed method is suitable for on-line implementations.
In Table 8, we show that the computation times required for detection of a gesture (on a PC with Intel Core i5 processor) by pre-trained classifiers do not exceed 11 ms.
In future work, we plan to apply the interface as a control system for home appliances. We will also consider the usefulness of such an interface as a potential rehabilitation aid for individuals with neck stiffness.