MGRA: Motion Gesture Recognition via Accelerometer

Accelerometers have been widely embedded in most current mobile devices, enabling easy and intuitive operations. This paper proposes a Motion Gesture Recognition system (MGRA) based on accelerometer data only, which is entirely implemented on mobile devices and can provide users with real-time interactions. A robust and unique feature set is enumerated through the time domain, the frequency domain and singular value decomposition analysis using our motion gesture set containing 11,110 traces. The best feature vector for classification is selected, taking both static and mobile scenarios into consideration. MGRA exploits support vector machine as the classifier with the best feature vector. Evaluations confirm that MGRA can accommodate a broad set of gesture variations within each class, including execution time, amplitude and non-gestural movement. Extensive evaluations confirm that MGRA achieves higher accuracy under both static and mobile scenarios and costs less computation time and energy on an LG Nexus 5 than previous methods.


Introduction
The Micro-electromechanical Systems (MEMS) based accelerometer is one of the most commonly-used sensors for users to capture the posture, as well as the motion of devices [1]. Extensive research has been carried out based on the accelerometer data of mobile devices, including phone placement recognition [2], knee joint angle measurement [3], indoor tracking [4] and physical activity recognition [5,6]. Research conducted so far still faces a challenging problem, which is not tackled effectively: signal drift or the intrinsic noise of MEMS-based accelerometers on commercial mobile devices.
Moreover, the accelerometer enables a mobile device to "sense" how it is physically manipulated by the user. As a result, a new type of interaction based on motion gestures performed by the user has been proposed, making eyes-free interaction possible without stopping movement. The objective of a motion gesture recognition system is to determine which gesture is intended by the user, which is a spatio-temporal pattern recognition problem. Besides the problem of accelerometer signal drift and intrinsic noise, motion gesture recognition systems confront three new challenges:
• Intuitively, one cannot perform the same gesture in exactly the same way twice. Motion gestures usually vary strongly in execution time and amplitude. Therefore, the gesture recognition system should take the motion variances of users into account.
• The gesture recognition system should provide on-the-move interaction under certain mobile scenarios, such as driving a car or jogging. Non-gestural user movements affect the acceleration signals, making gesture recognition more difficult.
• Training and classification of the motion gestures are expected to be executed entirely on the mobile devices. Therefore, the computation and energy costs need to be limited for such self-contained recognition systems.
Previous work on motion gesture recognition can be categorized into two types: template-based and model-based. Template-based approaches store some reference gestures beforehand for each class and match the test gesture with some similarity measurements, such as Euclidean distance [7]. uWave [8] applies Dynamic Time Warping (DTW) to evaluate the best alignment between gesture traces in order to tackle execution time variation. Model-based methods are based on the probabilistic interpretation of observations. Exploiting Hidden Markov Model (HMM), 6DMG [9] is generally robust to time and amplitude variations. The recognition accuracy of previous research, however, is somehow affected by non-gestural user movements, like sitting in a running vehicle, which will be shown in the Evaluation section. Furthermore, most previous works carry out the calculation on a nearby server instead of on mobile devices, which may involve privacy issues.
In order to solve all of the above issues, we try to answer one non-trivial question: what are the robust and unique features for gesture recognition hidden in the raw acceleration data? This paper is dedicated to extracting robust features from the raw acceleration data and exploits them to realize gesture recognition on mobile devices. The features should accommodate a broad set of gesture variations within each class, including execution time, amplitude and non-gestural motion (under certain mobile scenarios).
In our solution, we first collected 11,110 motion gesture traces on 13 gestures performed by eight subjects across four weeks, among which 2108 traces were collected under mobile scenarios. We then enumerate the feature set based on the time domain, the frequency domain and Singular Value Decomposition (SVD) analysis. The best feature vector of 27 items is selected under the guidance of mRMR [10], taking both static and mobile scenarios into consideration. We then implement our Motion Gesture Recognition system using Accelerometer data (MGRA) with the best feature vector, exploiting SVM as the classifier. The system is implemented on an LG Nexus 5 smartphone for the evaluations. MGRA is first evaluated through off-line analysis on 11,110 motion traces, comparing accuracy with uWave [8] and 6DMG [9]. The results demonstrate that MGRA achieves an average accuracy of 95.83% under static scenarios and 89.92% under mobile scenarios, both better than uWave and 6DMG. The computation and energy cost comparison on the LG Nexus 5 also confirms that MGRA outperforms uWave and 6DMG.
The major contributions are as follows:
• A comprehensive gesture set of 11,110 motion traces was collected, containing 13 gestures performed by eight subjects across four weeks, among which 2108 traces were collected under mobile scenarios. Based on this dataset, 34 statistical features are enumerated through the time domain, the frequency domain and SVD analysis, with a visualization of their impact on gesture classification.
• We exploit mRMR to determine the feature impact order on gesture classification for static and mobile scenarios, respectively. The best feature vector of 27 items is empirically chosen as the intersection of these two orders.
• The MGRA prototype is implemented with the best feature vector on the LG Nexus 5. We compare MGRA against the previous methods uWave and 6DMG on classification accuracy, computation and energy cost under both static and mobile scenarios. MGRA achieves the best performance on all metrics under both scenarios.
The rest of this paper is organized as follows. In Section 2, we introduce the technical background on motion gesture recognition. Section 3 illustrates our data collection process and our observation on execution time, amplitude and scenario variations based on our gesture sets. Details on feature enumeration are described in Section 4, and Section 5 presents the feature selection process. Section 6 gives a system overview of MGRA. Section 7 presents the comparison of the accuracy of MGRA to uWave and 6DMG on two gesture sets, both under static and mobile scenarios. It also shows the time and energy cost of MGRA, uWave and 6DMG on Android smartphones. We conclude our work in Section 8.

Related Work
This section reviews research efforts on accelerometer-based gesture recognition systems for mobile devices. The objective of a gesture recognition system is to classify the test gesture (which the user has just performed) into a certain class according to the training gesture set (which the user performed earlier).
Previous research can mainly be categorized into two types: template-based and model-based. Intuitively, some basic methods measure the distance between the test gesture and the template gestures of each class and select the class with the minimum distance as the result. Rubine [11] made use of a geometric distance measure on single-stroke gestures. Wobbrock et al. [7] exploited the Euclidean distance after uniformly resampling the test gesture to handle execution time variation.
To cope with sampling time variations, several methods based on Dynamic Time Warping (DTW) are presented. A similarity matrix is computed between the test gesture and the reference template with the optimal path, representing the best alignments between two series. Wilson et al. [12] applied DTW on the raw samples from the accelerometer and gyroscope for gesture recognition. uWave [8] first quantized the raw acceleration series into discrete values, then employed DTW for recognition. Akl and Valaee [13] exploited DTW after applying compressive sensing on raw accelerations. Nevertheless, the amplitude variation still affects the recognition accuracy for the aforementioned DTW-based methods.
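To make the alignment these DTW-based methods rely on concrete, here is a minimal sketch of the classic dynamic program (plain Python/NumPy; the function name is our own):

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic Time Warping distance between two 1-D series.

    Classic O(len(a) * len(b)) dynamic program; the optimal warping
    path aligns samples so that execution-time differences between
    two gesture traces are absorbed.
    """
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    return cost[n, m]
```

Note how a time-stretched copy of a series has distance zero: `dtw_distance([1, 2, 3], [1, 2, 2, 3])` aligns the repeated sample at no cost, which is exactly why DTW handles execution time variation (but not amplitude variation).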
Statistical methods, such as the widely-used Hidden Markov Model (HMM), are based on a probabilistic interpretation of gesture samples to model the gestural temporal trajectory. HMM-based methods are generally robust, as they rely on learning procedures over a large database, creating a model that accommodates variations within a gesture class. Each underlying state of an HMM has a particular kinematic meaning and describes a subset of this pattern, i.e., a segment of the motion. Schlömer et al. [14] leveraged the filtered raw data from the acceleration sensor embedded in the Wii remote and evaluated 5, 8 and 10 states, respectively, for motion model training. 6DMG [9] extracted 41 time domain features from the accelerometer and gyroscope samples of the Wii remote and exploited eight hidden states to build an HMM model with 10 training traces per class.
Support Vector Machine (SVM) is also extensively applied to motion gesture recognition. SVM-based methods usually offer lower computational requirements at classification time, making them preferable for real-time applications on mobile devices. For SVM-based methods, the gesture classification accuracy depends closely on the feature vector. Wu et al. [15] extracted the mean, energy and entropy in the frequency domain, and the standard deviation of the amplitude, as well as the correlation among the three axes, in the time domain. As the raw time series are divided into nine segments and each feature is extracted from every segment, the total feature set contains 135 items in [15]. In [16], the Haar transform was adopted in the feature extraction phase to produce descriptors modelling accelerometer data; that feature set contains 24 items.
In this paper, we also exploit SVM as the core of MGRA owing to its low computation cost on classification. Different from previous approaches, we focus on feature enumeration through not only the time domain and the frequency domain, but also SVD analysis. Then, we select the feature vector of 27 items based on mRMR, taking both static and mobile scenarios into consideration. We realize MGRA entirely on an Android smartphone, the LG Nexus 5.

Gesture Design and Data Collection
This section introduces our gesture collection phase and reveals some key observations on raw sampling data of motion gestures.

Gesture Collection
We developed a motion gesture collection application on the LG Nexus 5 with a sampling rate of 80 Hz. To free users from interactions with the touch screen, we redefined the function of a short press on the power button: the application starts recording accelerometer readings at the first press of the power button and stops at the second press. Between the two presses, the user performs the motion gesture.
Eight subjects, including undergraduates, graduates and faculty members, participated in data collection, with ages ranging from 21 to 37 (ethical approval for carrying out this experiment has been granted by the corresponding organization). Each subject was asked to perform gestures in his or her own convenient style. We did not constrain his or her gripping posture of the phone, or the scale and speed of the action. The subjects participated only when available. We asked them to perform each gesture no fewer than 20 times per collection period.
Since a necessary and sufficient number of single gestures is needed for phone command control, we define nine gestures as our classification target. These nine gestures are chosen as combinations of the uppercase letters "A", "B" and "C", as shown in Figure 1. The choice of these gestures is due to two reasons: (1) Each gesture shares one character with three other gestures, ensuring the difficulty of recognition. If these gestures can be classified with high accuracy, gestures formed from other character combinations customized by future end users will retain high recognition precision.
(2) The choice of spatial two-dimensional symbol gestures is consistent with a previous survey study [17]. Our own survey on motion gesture design among 101 freshmen also indicates that 96.8% of the 218 gestures created were combinations of English characters, Chinese characters and digits. The statistics of the designed gestures are shown in Table 1.
We further collected two English words, "lumos" and "nox" (magic spells from the Harry Potter novel series), and two Chinese characters to build the second gesture set. The gestures in this set share no common parts with each other, as shown in Figure 2. We name the gesture set in Figure 1 Confusion Set and the second in Figure 2 Easy Set. We thereby construct MGRA targeting Confusion Set and verify the recognition results with Easy Set to prove the gesture scalability of MGRA in the Evaluation section. Taking mobile situations into consideration, our subjects collected gesture traces not only under static scenarios, but also while sitting in a running car. After four weeks, we had collected 11,110 gesture traces, among which 2108 were performed under mobile scenarios. The numbers of traces for all of the aforementioned gestures are summarized in Table 2.

Observation on Gesture Traces
A motion gesture trace is described as a time series of acceleration measurements a = (a_1, a_2, ..., a_n), where a_i = [a_x(i), a_y(i), a_z(i)]^T is the vector of x, y and z acceleration components along the phone axes and n is the number of samples within the whole gesture. Figure 3 shows the raw acceleration series of three typical motion traces of gesture A performed by one subject. The first and second traces are chosen from static traces, while the third is selected from mobile traces. It is observable that the motion patterns usually vary strongly in execution time, amplitude and scenario type. For example, the difference in execution time is about 0.5 s between the two static gestures, and the maximum value on the X-axis of one static trace is about 3 m/s² higher than that of the other in Figure 3a. Moreover, the values of the mobile trace are commonly larger than those of the two static traces on all three axes. The same phenomena are observed for the other subjects. We deem these phenomena the time, amplitude and scenario variety of motion gestures and discuss them in detail in the following subsections.

Time Variety
We count the execution time of all motion gesture traces in Confusion Set. Figure 4a shows the execution time distribution for one subject performing gesture A, which resembles a Gaussian distribution. This is reasonable, because a person normally cannot control actions with sub-second precision, while whole gestures are finished in no more than 2 s. Figure 4b presents the box plot of the gesture traces in Confusion Set performed by the same subject. It shows that execution time variety exists among all gestures. We divide these nine gestures into three groups according to execution time: gestures AA, AB and BB have a longer execution time, gesture C has a shorter one, and the remaining gestures form the third group. Therefore, the time length alone can distinguish three gesture groups for this subject. This conclusion is consistent with the motion traces of the other subjects.

Amplitude Variety

We calculate the composite acceleration of the raw traces to estimate the overall strength a user applies when performing gestures. The composite acceleration comes from Equation (1), a_c(i) = sqrt(a_x(i)² + a_y(i)² + a_z(i)²), which captures the user behaviour as a whole. The mean composite acceleration of a gesture indicates the strength the subject used to perform it. Figure 5a shows the mean composite acceleration distribution of gesture AB performed by one subject under static scenarios. Similar to the time variation, the distribution also resembles a Gaussian. Figure 5b shows the box plot of the mean composite acceleration amplitude among all motion traces. There is amplitude variety for all gestures performed by this subject, even under static scenarios alone. According to Figure 5b, the nine gestures can be categorized into two groups: gestures AB, B, BB and BC are performed with relatively higher strength than the others. The variety in amplitude also exists for the other subjects.
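As a quick sketch, the composite acceleration of Equation (1) is simply the per-sample magnitude over the three axes (NumPy used for illustration; function names are our own):

```python
import numpy as np

def composite_acceleration(trace):
    """Composite acceleration of an n-by-3 trace, as in Equation (1):
    a_c(i) = sqrt(a_x(i)^2 + a_y(i)^2 + a_z(i)^2)."""
    trace = np.asarray(trace, dtype=float)
    return np.sqrt((trace ** 2).sum(axis=1))

def mean_strength(trace):
    """Mean composite acceleration: the overall strength of a gesture."""
    return composite_acceleration(trace).mean()
```

For example, a single sample [3, 4, 0] has composite acceleration 5, regardless of which axes the motion falls on, which is what makes this quantity a posture-independent strength estimate.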

Scenario Variety
There is a wide variety of mobile scenarios, which share no common feature except mobility. We collected gesture traces in a running car as an example of mobile scenarios. We drove the car on the university campus to collect mobile traces; the trajectory is shown in Figure 6a. We then compare the composite accelerations under mobile scenarios to the values under static scenarios, as shown in Figure 6b. The amplitude of the composite acceleration is higher under mobile scenarios than under static scenarios, mainly because the car contributes to the acceleration values when changing speed. Time, amplitude and scenario varieties have been observed in our gesture set, and they have a direct impact on recognition accuracy. Hence, we seek to extract robust and unique features from the raw acceleration series that can accommodate these three varieties.

Feature Enumeration
Feature extraction is a fundamental problem in pattern recognition. However, few reported works extract effective features and make a quantitative comparison of their quality for gesture recognition. This section illustrates our feature enumeration from acceleration data through the time domain, the frequency domain and SVD analysis.

Time Domain Features
The motion gestures differ from each other in their spatio-temporal trajectories, which can be reflected to some degree by time domain features. We extract the time domain features from the raw acceleration traces.
As discussed in Section 3.2.1, the gestures can be classified into three groups based only on execution time length. Therefore, we take the time length as our first feature, labelled as { f 1 }.
Due to the difference in spatial trajectories, the number of turns in an action may be used to distinguish different gestures. We use zero-crossing rates on the three columns of the raw traces to estimate changes in acceleration, reflecting some clues about the spatial trajectories. Figure 7a shows the zero-crossing rate on the X-axis of nine gestures performed by one subject. The nine gestures can be categorized into five groups according to this rate: {C}, {A, CC}, {AC, B}, {AA, AB, BC} and {BB}. We thereby treat the zero-crossing rates on the three axes as features {f2, f3, f4}. The composite acceleration of the raw traces, calculated as in Equation (1), captures the user behaviour as a whole. Here, we calculate the mean and standard deviation (std) of the composite acceleration, as well as of each column. The mean shows how much strength the user applies when performing certain gestures; the standard deviation reveals how the user controls his or her strength while performing a gesture. The mean and standard deviation can thus, to some extent, aid gesture classification. Hence, we label the eight items of mean and standard deviation as {f5, f6, ..., f12}.
We further calculate the maximal and minimal values on the three axes, respectively, and on the composite acceleration. Figure 8a shows the maximal and minimal acceleration on the Z-axis for nine gestures performed by one subject, with 40 trials per class. These values provide some boundaries for gesture recognition; e.g., Figure 8a shows that the traces of AC, BC and C can be distinguished by these two parameters. Therefore, we take these eight values as features {f13, f14, ..., f20}. The time complexity of extracting the time domain features is O(n).
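As an illustrative sketch, the time domain features f1-f20 described above might be extracted as follows (NumPy; the 80 Hz sampling rate from the collection phase is assumed, and function names are our own):

```python
import numpy as np

def zero_crossing_rate(x):
    """Fraction of consecutive sample pairs whose signs differ (f2-f4)."""
    x = np.asarray(x, dtype=float)
    return np.mean(np.sign(x[:-1]) != np.sign(x[1:]))

def time_domain_features(trace, dt=1.0 / 80):
    """Sketch of features f1..f20 from an n-by-3 acceleration trace.

    dt assumes the paper's 80 Hz sampling rate; the composite
    acceleration is the per-sample magnitude over the three axes.
    """
    trace = np.asarray(trace, dtype=float)
    comp = np.sqrt((trace ** 2).sum(axis=1))
    feats = [len(trace) * dt]                                     # f1: time length
    feats += [zero_crossing_rate(trace[:, k]) for k in range(3)]  # f2-f4
    cols = [trace[:, 0], trace[:, 1], trace[:, 2], comp]
    feats += [c.mean() for c in cols] + [c.std() for c in cols]   # f5-f12
    feats += [c.max() for c in cols] + [c.min() for c in cols]    # f13-f20
    return np.array(feats)
```

Every operation is a single pass over the n samples, consistent with the O(n) complexity stated above.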

Frequency Domain Features
We next extract features from the frequency domain by applying the Fast Fourier Transform (FFT) to the three columns of the raw traces. Previous research took all of the low-frequency components directly as features [18], but found the recognition results to be worse than simply calculating the correlation from the original time series.
Unlike previous approaches, we assume that people have an implicit frequency while performing certain gestures, and we try to locate it. We select the frequency with the largest energy, rather than the base or second frequency, to represent the frequency domain features; the second frequency always has significantly high energy for gestures containing repeated symbols, e.g., AA. To align motion traces of different time lengths, we take the period rather than the frequency as the feature. The frequency features therefore consist of the period and the energy of this dominant frequency. Figure 8b shows the frequency features on the X-axis of 40 traces per gesture performed by one subject. Some pairs or groups of gestures can be distinguished by these two parameters; for example, the traces of gestures AA, BC and C have no intersection with one another in Figure 8b. Similar results are observed on the other axes. Therefore, the period and energy features on the three axes are all adopted into our feature set as {f21, f22, ..., f26}. The time complexity of computing the frequency features is O(n log n).
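A minimal sketch of the dominant-period and energy extraction for one axis (f21-f26 repeat this per axis) might look as follows, assuming the 80 Hz sampling rate; the DC bin is excluded so that the gravity offset does not dominate:

```python
import numpy as np

def frequency_features(axis_samples, fs=80.0):
    """Dominant-period and energy features for one axis (sketch).

    The bin with the largest spectral energy (DC excluded) is taken,
    rather than the base frequency; its period, not its frequency,
    is used so that traces of different lengths align.
    """
    x = np.asarray(axis_samples, dtype=float)
    spectrum = np.abs(np.fft.rfft(x)) ** 2       # energy per frequency bin
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    k = 1 + np.argmax(spectrum[1:])              # skip the DC bin
    return 1.0 / freqs[k], spectrum[k]           # (period, energy)
```

For a pure 10 Hz sinusoid sampled at 80 Hz, this returns a period of 0.1 s, matching the implicit rhythm of the motion.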

SVD Features
During the data collection phase, we observed that the subjects tend to hold the phone in different postures when performing different gestures. Therefore, we aim to represent such posture differences. Singular Value Decomposition (SVD) provides a unique factorization of the form A = UΣV*. For motion traces, n is the sample number of the trace, U is an n × 3 matrix with orthonormal columns, Σ is a 3 × 3 diagonal matrix with non-negative real numbers on the diagonal and V* denotes the conjugate transpose of a 3 × 3 unitary matrix V. The diagonal entries σ_i are the singular values of A, listed in descending order. The complexity of SVD on a motion trace is O(n).
V* is the rotation matrix from the actual motion frame to the phone frame, indicating the gripping posture. As the gestures we study are 2D gestures, the first and second column vectors of V* are critical for phone posture estimation, labelled as {f27, f28, ..., f32}. Figure 9a shows V*_11 and V*_21 of one subject, where each gesture class contains 40 traces. The nine gestures can first be divided into two groups on parameter V*_11: {A, AA, AB, AC} and {B, BB, BC, C, CC}. This indicates that this subject uses different gripping postures when performing gestures starting with different characters. There is further discriminative power inside each group; for example, gestures BB and CC can be separated on parameter V*_21.

We then dig into the singular values in Σ, which represent the user's motion strength in three orthogonal directions when performing actions. Recall that even the same user cannot perform identical gestures with exactly the same strength, and our features should remain suitable under mobile scenarios; so we leverage a relative value, the σ-rate (σ_r), defined as σ_r(i) = σ_i / (σ_1 + σ_2 + σ_3). The σ-rate represents how the user relatively allocates his or her strength across orthogonal directions when performing a gesture. Figure 9b shows features σ_r(1) and σ_r(2) on 40 traces per gesture performed by one subject. In Figure 9b, gestures BC and C clearly do not intersect, showing that σ_r can provide clues for classifying gestures. Therefore, we add σ_r(1) and σ_r(2) to the feature set, labelled as {f33, f34}. As U contains the time series information, which has already been extracted by the time domain and frequency domain analysis, we leave U out of consideration.
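A sketch of the SVD features, assuming the σ-rate normalizes each singular value by the sum of all three (one consistent reading of "how the user allocates strength ... relatively", and the one under which σ_r(3) is redundant, explaining why only two items enter the feature set):

```python
import numpy as np

def svd_features(trace):
    """Posture and strength-allocation features (f27-f34 sketch).

    NumPy's svd returns A = U @ diag(sigma) @ vstar, where vstar is
    the paper's V*. The first two columns of V* describe the gripping
    posture for a 2-D gesture; the sigma-rate normalizes the singular
    values so absolute strength (amplitude variety) cancels out.
    """
    trace = np.asarray(trace, dtype=float)            # n-by-3 matrix A
    _, sigma, vstar = np.linalg.svd(trace, full_matrices=False)
    posture = vstar[:, :2].ravel()                    # f27-f32
    sigma_rate = sigma / sigma.sum()                  # f33, f34 use entries 0 and 1
    return posture, sigma_rate
```

For a trace whose motion lies entirely along one direction, σ_r(1) approaches 1 and the other rates vanish, regardless of how hard the gesture was performed.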
Altogether, the feature set is composed of 34 features, shown in Table 3. Though we only depict the classification impact of the features for one subject, similar results empirically hold for the other subjects.

Table 3. Feature set.

Feature Selection
Feature selection is another elementary problem for pattern classification systems. We select the best feature vector using the mRMR approach [10] with validation on Confusion Set. mRMR determines the feature order that minimizes redundancy and maximizes relevance, so as to minimize the classification error.
When adding an item to F, we delete this item from S and M to speed up the "∈" test. To keep Algorithm 1 easy to understand, we leave out the details of this speed-up trick.
The resulting F_in(x) is shown in Table 4. When executing Algorithm 1, there can be values of x for which no new feature is added to F. For example, when x = 4, F_in(4) = null, because F_in(1:4) = F_in(1:3) = {f12, f1, f31}. Meanwhile, F_in(x) may add two features together for some values of x, such as x = 22, as shown in Table 4. As illustrated in Algorithm 1, we add these two features according to their impact order under static scenarios: for x = 22, we first add feature f34 and then f8. After handling the intersection order, we delete the "null" entries, and all 34 features are sorted in F_in.
mRMR only provides the order F_in; we further verify the classification results of F_in(1:x) for x = 1, 2, ..., 34 in Section 7.1. Here, we report that the best classification result comes from F_in(1:27), i.e., the best feature vector contains 27 items: the time length; the zero-crossing rates on the X-axis and the Y-axis; the mean and standard deviation of each column; the standard deviation of the composite acceleration; the maximal and minimal accelerations on the three axes and the composite acceleration; the energy on the X-axis; the first and second column vectors of V*; and σ_r. The time complexity of calculating F_in(1:27) is O(n log n).
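The intersection-order construction of F_in can be sketched as follows (our own skeleton of Algorithm 1, without the speed-up trick; `static_order` and `mobile_order` stand for the two mRMR impact orders):

```python
def intersection_order(static_order, mobile_order):
    """Merge two mRMR impact orders into one list F_in (sketch).

    For each prefix length x, any feature appearing in the first x
    items of BOTH orders (and not yet selected) is appended; features
    joining at the same x follow the static-scenario order.
    """
    selected, result = set(), []
    for x in range(1, len(static_order) + 1):
        mobile_prefix = set(mobile_order[:x])
        new = [f for f in static_order[:x]       # static order breaks ties
               if f in mobile_prefix and f not in selected]
        for f in new:
            selected.add(f)
            result.append(f)
    return result
```

Note that some prefix lengths contribute nothing (the "null" entries of Table 4) while others contribute two features at once, exactly as described above.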

Design of MGRA
Given the constrained computing and storage resources of mobile devices, and being concerned with time consumption, we use a multi-class SVM as the core of gesture recognition. The design of MGRA is shown in Figure 10 and includes five major components: • Sensing: recording the acceleration data while a user performs a motion gesture between two presses of the phone's power button.
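The paper does not name an SVM implementation; as an illustrative sketch, scikit-learn's `SVC` with an RBF kernel could serve as the classifier core (feature scaling is added here as standard practice for RBF SVMs, not something the source specifies):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def train_mgra(feature_vectors, labels):
    """Train a multi-class RBF-kernel SVM on per-trace feature vectors.

    feature_vectors: one row per training trace (e.g., the 27-item
    best feature vector); labels: the gesture class of each trace.
    SVC handles the multi-class case internally (one-vs-one).
    """
    model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
    model.fit(np.asarray(feature_vectors, dtype=float), labels)
    return model
```

Classification of a new trace then reduces to extracting its feature vector and calling `model.predict`, which is what keeps the per-gesture cost low at run time.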

Evaluation
This section presents the results of the off-line analysis on the collected trace set and the online evaluation on the LG Nexus 5 smartphone. For the off-line analysis, we first determine the key parameters of MGRA, i.e., the feature vector and the training set number. We further compare two other SVM kernels and one different classifier, random forest, to confirm the choice of SVM with the RBF kernel. Then, we compare the classification accuracy of MGRA with two state-of-the-art methods: uWave [8] and 6DMG [9]. For the online evaluation, we compare the energy and computation time of MGRA, uWave and 6DMG.

Parameter Customization
Feature vector selection has a great impact on SVM classification. Section 5 only provides the feature impact order F_in; the length of the feature vector still needs to be determined. Meanwhile, a larger training trace number generally means better classification accuracy, but constructing a large training set places a heavy burden on end users.
Hence, we conduct a grid search for optimal values of these two parameters on the static traces, with the feature number n_f varying from 1 to 34 according to the order F_in and the training set number n_t varying from 1 to 20. For each combination of these two parameters, we train the SVM model with n_t traces per gesture class, randomly selected from Confusion Set under static scenarios for each subject. We only use static traces for training, as end users may wish to use MGRA under mobile scenarios but are unlikely to collect training traces while moving. After constructing the model, the rest of each subject's traces under static scenarios are used to test that subject's own classification model. This evaluation process is repeated five times for each combination. Figure 11 shows the average recognition accuracy on static traces among the eight subjects for each parameter combination. It confirms the tendency that a larger training trace number means better classification accuracy; however, once the number exceeds 10, the improvement is marginal. The maximum recognition accuracy under static scenarios is 96.24%, with 20 training traces per gesture class and 27 feature items. For the combination of 10 training traces and 27 features, the average recognition accuracy is 95.83%, less than 0.5% below the maximum. Hence, we use 10 traces per gesture class for training in MGRA. We further dig into the impact of different feature numbers on recognition while keeping n_t = 10. Table 5 shows the confusion matrices of one subject for feature numbers n_f = 1, 4, 11, 27. When only the standard deviation of the composite acceleration (f12) is used, the average classification accuracy is only 44.82%; for example, Table 5a shows that 55% of the traces of gesture AA are recognized as gesture AB, and the other 45% as gesture BB.
After adding the time feature (f1), posture V*_22 (f31) and σ_r(1) (f33), only 7.5% and 5% of gesture AA traces are recognized incorrectly as gestures AB and BB, respectively, as described in Table 5b. The average accuracy for four features increases to 83.31%.
After adding seven more features, Table 5c shows that only 2.5% of gesture AA traces are classified as gesture AB, and none as gesture BB. This shows that MGRA with 11 features correctly distinguishes between AA and BB. It also illustrates that gesture AA is more easily confused with AB than with BB, confirming that two gestures are harder to classify when they share common parts. The average recognition accuracy is 92.52% for 11 features.
For 27 features, the recognition error is small for each gesture class, as shown in Table 5d. Moreover, all error items lie on the intersection of two gestures sharing a common symbol, except that 2.4% of gesture A traces are recognized incorrectly as gesture B. The average recognition accuracy reaches 96.16% for this subject with the feature vector of 27 items.
Before determining the feature number, we should also consider the traces collected under mobile scenarios, where the acceleration of the car is added. For each parameter combination, we test the models constructed from static traces with the traces collected under mobile scenarios. One may ask: why not construct the SVM model from mobile traces separately? Because it is inconvenient in practice for users to collect a number of training gestures under mobile scenarios; even when our subjects collected training traces in a car, someone else had to drive. However, it is easy for a user to simply perform a gesture to invoke commands on the smartphone under mobile scenarios, compared to interacting with the touch screen.

Table 5. Confusion matrices of MGRA on one subject with n_t = 10 and different n_f (%).

Figure 12 depicts the classification results for the same subject as Figure 11, testing his static models with mobile traces. The highest accuracy of 91.34% appears when the training set number is 20 and the feature number is 27. For the combination of n_t = 10 and n_f = 27, the recognition accuracy is 89.92%. The confusion matrix for n_t = 10 and n_f = 27 tested with mobile traces is shown in Table 6. Most error items are also on the intersections of gestures sharing one common symbol, except that a small fraction of gesture B is classified as gestures A and CC, gesture BB as AA, and gestures C and CC as B. Comparing Table 6 to Table 5d, most error cells in Table 6 do not exist in Table 5d. This indicates that accuracy decreases when a classification model constructed from static traces is tested with mobile traces. However, the average recognition accuracy is 90.04%, still acceptable for this subject.
Taking the training sample number as 10, Figure 13 plots the average recognition accuracy over all subjects for every feature number value, under the two scenarios respectively. The recognition accuracy reaches its maximum at n_f = 27 for test traces under both static and mobile scenarios. Therefore, we choose F_in(1:27) as the feature vector of MGRA.
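The selection rule above can be sketched in a few lines. The accuracy tables below are illustrative placeholders, not the paper's measurements: the rule simply picks the feature count whose mean accuracy across the static and mobile test scenarios is highest.

```python
# Hypothetical sketch of the n_f selection rule: choose the feature count
# that maximizes the mean recognition accuracy across both scenarios.
def best_feature_count(static_acc, mobile_acc):
    """static_acc / mobile_acc map a candidate n_f to average accuracy (0..1)."""
    return max(static_acc, key=lambda nf: (static_acc[nf] + mobile_acc[nf]) / 2)

# Illustrative numbers only (the 27-feature entries echo the reported values).
static_acc = {20: 0.930, 27: 0.9616, 34: 0.955}
mobile_acc = {20: 0.860, 27: 0.8992, 34: 0.885}
print(best_feature_count(static_acc, mobile_acc))  # -> 27
```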

Comparison with SVM Kernels and Random Forest
We choose RBF as the SVM kernel for MGRA, an assumption motivated by the central limit theorem. To justify this choice, we further examine the Fisher and polynomial kernels under both static and mobile scenarios, with the feature vector fixed to F_in(1:27) for the comparison.
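For reference, the RBF kernel scores the similarity of two feature vectors as a Gaussian of their squared Euclidean distance. A minimal sketch (gamma is a hypothetical tuning parameter, not a value from the paper):

```python
import numpy as np

def rbf_kernel(x, y, gamma=0.1):
    # K(x, y) = exp(-gamma * ||x - y||^2): similarity decays smoothly
    # with the squared Euclidean distance between feature vectors.
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(np.exp(-gamma * np.sum((x - y) ** 2)))

# Identical vectors score 1.0; closer vectors always score higher.
print(rbf_kernel([1.0, 2.0], [1.0, 2.0]))  # -> 1.0
```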
Besides, previous research has demonstrated the classification performance of random forest on activity or gesture recognition [19]. Therefore, we also evaluate random forest as a classifier, applying all 34 features under both scenarios. Table 7 shows the confusion matrices of the Fisher kernel, the polynomial kernel and random forest under static scenarios for the same subject as Table 5d. Most errors again lie at the intersections of gestures sharing one common symbol, consistent with the error distribution in Table 5d. Table 7a also shows that the Fisher kernel achieves 100% accuracy on the repetition gestures AA, BB and CC; however, its accuracy on all other gestures is lower than in Table 5d.
Comparing the polynomial to the RBF kernel, the classification accuracy of gestures A, AA, AB, BB, BC and C exceeds 95% in both Tables 5d and 7b. The misclassification between gestures AC and B exceeds 7% for the polynomial kernel in Table 7b, which the RBF kernel corrects in Table 5d. Table 8 lists the average classification accuracy over all subjects for the three SVM kernels under static scenarios. The RBF kernel is the most accurate, and the other two are close to or above 90% on static traces.
However, accuracy decreases markedly when the model trained under static scenarios is applied to mobile traces for the Fisher and polynomial kernels, as shown in Table 9a,b. For gestures BC and B, the classification accuracy is no more than 60% for either kernel. Gesture BC is misclassified as AC and B with high percentages for both kernels, because these three gestures share common parts. For gesture B, 27.5% and 32.5% of traces are misclassified as gesture CC by the two kernels, respectively. These errors arise because the subject performs both gestures as two circles; the only difference is that gesture B consists of two vertical circles, while gesture CC consists of two horizontal circles. Referring back to Table 6, the RBF kernel misclassifies only 2.5% of gesture B as CC under mobile scenarios. The average classification accuracy over all subjects with the three kernels under mobile scenarios is also listed in Table 8. The Fisher and polynomial kernels are not robust to the scenario change; in contrast, the RBF kernel retains accuracy very close to 90%.
Applying random forest as the classifier, we obtain the confusion matrices for the same subject under both scenarios, listed in Tables 7c and 9c, respectively. The results show that random forest is likewise not robust to scenario change, and its average accuracy is lower under both scenarios than SVM with the polynomial and RBF kernels, as shown in Table 8. Examining the Variable Importance Measure (VIM) of random forest averaged over all subjects, Figure 14 shows that F_in(1:27) remain the most important features for random forest, except that F_in(28,29) are slightly more important than F_in(27) under static scenarios, and F_in(28) slightly outperforms F_in(27) under mobile scenarios. We further evaluate random forest on the feature set F_in(1:27). The average classification accuracy is 86.17% and 68.32% under static and mobile scenarios, respectively, a decrease of no more than 3.5% compared to the accuracy with all 34 features. This confirms that the feature selection results of mRMR and Algorithm 1 are independent of the classification method and scenario. After comparing the three SVM kernels and random forest, we confirm that SVM with the RBF kernel is the suitable classifier for MGRA.

Table 9. Confusion matrices of SVM kernels and classification methods, testing static models with mobile traces for one subject.
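The VIM inspection above can be sketched with scikit-learn's impurity-based importances. The data here is synthetic (only the first two of 34 columns carry signal), so the numbers are illustrative, not the paper's:

```python
# Illustrative sketch: train a random forest on a synthetic 34-feature
# dataset and read the Variable Importance Measure from
# feature_importances_ (impurity-based, normalized to sum to 1).
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n_samples, n_features = 300, 34
X = rng.normal(size=(n_samples, n_features))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # only features 0 and 1 matter

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
vim = clf.feature_importances_           # one importance score per feature
top = np.argsort(vim)[::-1][:2]          # indices of the two most important
```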

Accuracy Comparison with uWave and 6DMG
We compare MGRA, uWave [8] and 6DMG [9] on classification accuracy with both Confusion Set and Easy Set in this section.
uWave exploits DTW as its core and originally records only one gesture trace as the template, so its recognition accuracy depends directly on the choice of template. For a fair comparison, we let uWave make use of 10 training traces per gesture in two ways. The first method performs template selection over the 10 training traces of each gesture class, which can be treated as uWave's training process. The selection criterion is that the chosen trace has maximum similarity to the other nine traces, i.e., its average DTW distance to the other nine traces is minimum. We call this best-uWave. The second method compares the test gesture with all 10 traces per gesture class and computes the mean distance from the input gesture to each of the nine gesture classes. We call this method 10-uWave. 10-uWave has no training process, but at the cost of much longer classification time.
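To make the two variants concrete, here is a minimal sketch (not uWave's actual implementation) of classic dynamic-programming DTW and the best-uWave selection rule. For simplicity the sketch treats a trace as a 1-D series; real accelerometer traces carry 3-axis samples.

```python
def dtw(a, b):
    # Classic O(len(a) * len(b)) dynamic-programming DTW distance
    # between two 1-D series, with absolute difference as local cost.
    inf = float("inf")
    n, m = len(a), len(b)
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

def best_template(traces):
    # best-uWave rule: keep the trace with the minimum average DTW
    # distance to the remaining training traces of the same gesture.
    def avg_dist(t):
        others = [u for u in traces if u is not t]
        return sum(dtw(t, u) for u in others) / len(others)
    return min(traces, key=avg_dist)
```

10-uWave would instead keep all 10 traces per class and average the test trace's DTW distance to each, which multiplies classification cost by the template count.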
For 6DMG, we extract 41 time domain features from both the acceleration and gyroscope samples in the gesture traces. The number of hidden states is set to eight, chosen experimentally as the best value in the range [2, 10]. 6DMG uses 10 traces per gesture class for training.
We first compare on the test trace set from static scenarios. Table 10 shows the confusion matrices of best-uWave, 10-uWave and 6DMG for the same subject as in Table 5d. Comparing Table 5d to Table 10, most classification errors of MGRA persist under best-uWave, 10-uWave and 6DMG, with a few exceptions: best-uWave corrects MGRA's 2.4% error of recognizing gesture A as B; 6DMG corrects MGRA's errors of recognizing 2.5% of gesture AB as AA and 7.3% of gesture AC as BC; and both best-uWave and 10-uWave reduce MGRA's 7.3% error of recognizing gesture AC as BC to 4.9%. Conversely, MGRA corrects the majority of the errors made by best-uWave, 10-uWave and 6DMG. For example, the confusion cell of gesture CC recognized as BC is 7.5%, 7.5% and 5.0% for best-uWave, 10-uWave and 6DMG, respectively, in Table 10, and is eliminated entirely by MGRA. The average accuracy of best-uWave, 10-uWave and 6DMG for this subject under the static scenario is 89.79%, 90.35% and 91.44%, respectively, lower than MGRA's 96.14%.

We then compare performance on the test traces by the same subject under mobile scenarios, shown in Table 11. Comparing Table 6 to Table 11, most recognition errors of MGRA persist for best-uWave, 10-uWave and 6DMG, with only a small fraction reduced. Conversely, MGRA corrects or reduces most error items in Table 11. For example, the cell of gesture CC recognized as B is 23.1%, 23.1% and 5.1% for best-uWave, 10-uWave and 6DMG, respectively; referring back to Table 6, MGRA misclassifies only 2.6% of gesture CC as B. The reason gesture CC is misrecognized as B mirrors the reason gesture B is misrecognized as CC, discussed in Section 7.2. MGRA therefore clearly outperforms best-uWave, 10-uWave and 6DMG under mobile scenarios: the average recognition accuracy for this subject is 90.04% for MGRA, versus 72.50%, 73.3% and 72.72%, respectively.
We calculate the average accuracy over all subjects for the test traces under static and mobile scenarios separately and depict the results in Figure 15. MGRA not only achieves higher classification accuracy, but also performs more stably across gestures and scenarios. best-uWave, 10-uWave and 6DMG achieve high accuracy on static traces, but their accuracy drops by about 10% when tested with mobile traces. Moreover, for uWave and 6DMG the recognition accuracy is gesture dependent, especially under mobile scenarios.

Table 11. Confusion matrices of three classification methods testing static models with mobile traces for one subject: (a) best-uWave; (b) 10-uWave; (c) 6DMG.

Considering the impact of the gesture set on recognition, we further compare recognition accuracy on another gesture set, the Easy Set. The evaluation process is the same as on the Confusion Set. Table 12 shows the confusion matrices for one subject on the Easy Set. Table 12a shows that MGRA achieves higher accuracy on static traces from the Easy Set than the Confusion Set result in Table 5d: only 2.5% of gesture C is classified incorrectly as B, and 5% of gesture 美 is classified incorrectly as 游, for an average accuracy of 98.9%. Table 12b shows that MGRA also achieves a higher average accuracy of 95.7% on Easy Set traces under mobile scenarios. Recall that the features of MGRA were enumerated and selected entirely on the Confusion Set; the high classification accuracy on the Easy Set therefore confirms the gesture scalability of MGRA.
Here, we report the average accuracy over all subjects and all gestures in Table 13. All approaches improve their accuracy on the Easy Set compared to the Confusion Set. For MGRA, the recognition accuracy decreases only 5.48% and 2.87% from static to mobile test traces on the two gesture sets, respectively, whereas the other approaches drop by more than 10%. This confirms that MGRA adapts better to mobile scenarios than uWave and 6DMG. On the Easy Set, MGRA holds accuracy above 95% under both static and mobile scenarios. This indicates that if the end user puts some effort into gesture design, MGRA can achieve high recognition accuracy under both static and mobile scenarios. One question might be raised: why does 6DMG, which exploits readings from both the accelerometer and the gyroscope, lose to MGRA? There are basically two reasons. First, MGRA extracts features from the time domain, the frequency domain and SVD analysis, whereas 6DMG extracts features from the time domain only. Second, MGRA applies mRMR to determine the feature impact order under both static and mobile scenarios and finds the best intersection of the two orders; mRMR supports classification accuracy by selecting the features with the highest relevance to the target class and minimal redundancy.
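A greedy mRMR-style ranking can be sketched as follows. mRMR is usually defined with mutual information; as a cheap stand-in, this sketch scores relevance and redundancy with absolute Pearson correlation, and the toy data is illustrative only:

```python
import numpy as np

def mrmr_order(X, y, k):
    """Greedy mRMR-style ranking: pick features with high relevance to the
    label and low redundancy with already-selected features. Absolute
    Pearson correlation stands in for mutual information here."""
    n_feat = X.shape[1]
    rel = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(n_feat)])
    selected = [int(np.argmax(rel))]          # most relevant feature first
    while len(selected) < k:
        best_j, best_score = None, -np.inf
        for j in range(n_feat):
            if j in selected:
                continue
            # mean redundancy with the features chosen so far
            red = np.mean([abs(np.corrcoef(X[:, j], X[:, s])[0, 1])
                           for s in selected])
            if rel[j] - red > best_score:
                best_j, best_score = j, rel[j] - red
        selected.append(best_j)
    return selected

# Toy demo: feature 0 tracks the label; features 1 and 2 are pure noise.
rng = np.random.default_rng(0)
y = rng.normal(size=200)
X = np.column_stack([y + 0.01 * rng.normal(size=200),
                     rng.normal(size=200),
                     rng.normal(size=200)])
order = mrmr_order(X, y, 2)   # feature 0 is ranked first
```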

Online Evaluation
Energy consumption is one of the major concerns for smartphone applications [20], and real-time response is important for user-friendly interaction with mobile devices. Therefore, we conduct a cost comparison of MGRA, best-uWave, 10-uWave and 6DMG on the LG Nexus 5. We measure energy consumption with PowerTutor [21] and record the training and classification time of the four recognition methods. Table 14 shows the cost comparison. MGRA has the shortest classification time and the minimum energy cost. Moreover, the training time of MGRA is less than 1 min, because it extracts only 27 features and trains a multi-class SVM model. The training and classification time is much higher for best-uWave and 10-uWave, because both take DTW on the raw time series as their core. A raw series contains 200∼400 real values, far more than MGRA's 27 feature items, and DTW relies on dynamic programming with O(n^2) time complexity. 10-uWave takes much longer for classification than best-uWave, because the test gesture must be compared against all 10 templates of each gesture class, i.e., 90 gesture templates in total.
The training time and energy of 6DMG are much greater than those of MGRA, since 6DMG extracts 41 features per gesture trace and trains an HMM model. Its classification time and energy are also greater than MGRA's, because 6DMG requires both the accelerometer and the gyroscope sensors.

Conclusions
In this paper, we implement a motion gesture recognition system based only on accelerometer data, called MGRA. We extract 27 features and verify them on 11,110 waving traces collected from eight subjects. Using these features, MGRA employs SVM as the classifier and is realized entirely on mobile devices. We conduct extensive experiments comparing MGRA to previous state-of-the-art works, uWave and 6DMG. The results confirm that MGRA outperforms uWave and 6DMG in recognition accuracy, time and energy cost. Moreover, the gesture set scalability evaluation shows that MGRA can be applied effectively under both static and mobile scenarios if the gestures are designed to be distinctive.