1. Introduction
The hands are the most widely used body part for gesturing aside from the face and represent one of the most intuitive methods of enabling human-machine interaction [1,2,3]. Hand gesture recognition is a key element in a wide variety of domains including VR, assistive devices for disabled users, prosthetics, physical therapy devices, and computer peripherals both for general use and for disabled users [2,3,4,5]. Hand gestures are classified as either static or dynamic [3,6]. A static gesture is simply a stationary pose or position of the hand, while a dynamic gesture is composed of a series of sequential static gestures [3,6]. While there is growing interest in the use of dynamic gestures tracked across multiple frame inputs for uses such as sign language, single frame inputs are the building blocks that make up dynamic gestures.
Three major types of sensor systems used in gesture recognition are data gloves, electromyography, and video-based gesture recognition [3,4,5,6]. Data gloves may integrate magnetic systems, flex sensors, or inertial measurement unit data [5]. They have high accuracy in static gesture recognition and can provide rich positional data, but can be cumbersome and uncomfortable to wear over an extended period of time [2]. Electromyography (EMG) leaves the hands free but suffers from signal drift for a number of reasons, including sweat, fatigue, variation of muscle force, and shifting electrodes [3,6,7]. While video-based solutions leave the body unencumbered by wearables, their challenges include changes in lighting, difficulty segmenting the hands from background clutter, and occluded lines of sight [3,6].
In most studies reviewed, an application cued gestures and saved sensor data along with the ground truth label of the requested gesture [8,9,10,11,12,13]. This is an attractive option for real-world use given that home users are unlikely to have alternate sensors to provide ground truth labelling, but the assumption that subjects accurately follow the on-screen prompts does not always hold. Typically, some period of time is given to transition between gestures, which is flagged with a ‘Null’ label. This allows for three primary sources of labelling error: (i) users do not instantaneously transition between gestures, so many of the ‘Null’ samples will actually belong to gesture classes; (ii) user fatigue can lead to variation in the production of gestures, confounding a trained model [7,8,14,15]; (iii) users may inadvertently make incorrect gestures [8,16]. The solutions presented in this paper focus on the first two issues.
The ‘Null’ label is typically assigned to samples recorded during transitions between labeled gestures, and may also be assigned to hand positions that do not represent a known gesture class. Up to 75% of the samples in a given data set can be ‘Null’ gestures that do not correspond to a gesture category included in the model [17]. The majority of contemporary studies reviewed for this paper indicated that they discarded transient data at the beginning of, between, or at the end of gestures, or that transitions between gestures received some manual processing [8,9,10,11,12,15]. This can lead to unrealistically optimistic performance estimates.
It is commonly observed that gesture production is subject to variability over time [8,14,15]. If a user finds a gesture difficult or uncomfortable to make, their production of the gesture may degrade over time, or they may assume different wrist positions to ease gesture production [14,15]. These physical adaptations can lead to some gesture classes strongly resembling each other over multiple repetitions, as the user's behavior no longer matches the assigned label. Within image recognition, the question of how to learn from poorly labelled data has been explored using iterative training models and deep learning [18]. These solutions generally rely on deep learning models, and as indicated by Kim et al. this may require more training data than a user will have the patience to provide at the beginning of a use session [19].
Another often unaddressed challenge is producing robust inter-set results. Often only a single data set is used to train, test, and validate models [5]. While this can produce high accuracies, it often results in significant overtraining and models that fail to generalize to new data [12]. It has been observed in a variety of studies that the accuracy of inter-set validation, even on the same day, tends to be far lower than in-set validation [8,10,11,12]. This can in part be attributed to signal drift, sensors shifting on users' arms, fatigue, or even variations in user position leading to changes in sensor readings [3,4,7,9,10,12]. These changes violate a fundamental assumption in machine learning that data is stationary over time, i.e., that the mean and standard deviation can be known a priori from a sample and will not change over time [10]. The field of concept drift focuses on cases where the relationship between the underlying data features and the variable they predict changes over time. Most proposed solutions for concept drift require an assumption that the data is Gaussian in distribution (which may be incorrect in real-world gesture classification), a priori knowledge of classes so that error can be detected, or fresh labelled data [20,21]. Many of the proposed solutions depend on deep learning. While deep learning can achieve high accuracy and recognize large numbers of gestures, it requires a significant quantity of data and high computational power [4,7]. This may represent a challenge both for edge computing devices and for user patience.
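As a concrete illustration of a stationarity check that does not require the Gaussian assumption noted above, the sketch below compares a hypothetical sensor channel across two sessions with a two-sample Kolmogorov-Smirnov test. The channel values and the magnitude of the drift are invented for illustration and are not from the study's data:

```python
import numpy as np
from scipy import stats

# Simulated readings of one sensor channel from two sessions; in practice
# `session1` and `session2` would be the same feature captured in each session.
rng = np.random.default_rng(42)
session1 = rng.normal(loc=0.0, scale=1.0, size=500)
session2 = rng.normal(loc=0.6, scale=1.0, size=500)  # mean has drifted

# The Kolmogorov-Smirnov test is distribution-free: it compares empirical
# CDFs and so makes no Gaussian assumption about the feature values.
result = stats.ks_2samp(session1, session2)
drifted = result.pvalue < 0.01
```

A check like this only flags that the feature distribution moved between sessions; it does not by itself say which gesture classes were affected.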
While a typical goal in studies is to maximize the number of recognizable gestures encoded by a sensor, Jiang et al. note in their 2021 review on gesture recognition that ‘it is unrealistic and unnecessary to capture and classify every possible hand and finger pose, and instead defining a target hand gesture set can enable adequate performance for a given specific application’ [3]. If only a small number of gestures are needed, it may be reasonable to train on a larger number of gestures and tailor the set to the ones likely to be robustly recognized for a given user. This could represent a simpler solution which could be accomplished with a classical learning algorithm. Classical algorithms do not typically recognize as many gestures, but they make far lower demands in terms of quantity of data and computational power [2,3].
In this study we examine possible solutions to both problems described in this section. For the ‘Null’ label error issue, we use an unsupervised learner to find gesture transitions without manual intervention and a rules-based approach to repair labels on samples incorrectly placed into ‘Null’. To address the issue of inaccurate production of gestures, we leverage an averaged Confusion Matrix (CM), created from a time series train/test split using multiple repetitions of each gesture, to find gestures confused with each other on a per-user basis. We then consolidate the gestures available to tailor them to the individual subject. In both cases, inter-session validation is used to avoid overly optimistic performance assessments.
1.1. Similar and Related Works
Relatively few studies have investigated the problem of identifying modeled gestures within a set of samples containing ‘Null’ data and transitions between gestures. In 2015, Rossi et al. explored the problem of transitions between gestures in a six-gesture set using EMG sensors. They observed that over 66% of the errors in their study occurred during gesture transitions and proposed a two-stage model using Hidden Markov Models (HMM) as a first-stage classifier to identify samples representing transitions between gestures prior to classification using Support Vector Classification (SVC). This increased their accuracy on streamed gesture data by 12%, achieving a final accuracy of 84% on six gestures including the transition or ‘Null’ class [13]. A study by Taranta et al. in 2022 used computer vision to pick dynamic gestures out of streams of data which included gestures that did not fit any of their model's trained classes. They used algorithms to assign confidence that a gesture belonged in a given class and to reject classifications which fell below the assigned confidence, achieving a 93.3% overall accuracy in picking 10 dynamic gestures out of a stream of data. They reported that the confidence thresholds had to be tailored to a given user and gesture [22].
A greater number of authors have approached the problem of inter-session accuracy in recent years, though many dropped or hand-edited the transitional gestures, which may have led to overly optimistic assessments of the end results [8,10,11,12]. Palermo et al. collected data from 10 subjects using EMG sensors, repeating 7 gestures 12 times, twice a day for 5 days. They indicated that data relabeling was performed offline to correct for subjects making incorrect gestures and found that on inter-session tests accuracy decreased by 27.03% on average [8]. Leins et al. performed an 11-gesture study using Electrical Impedance Tomography on five subjects and found losses of accuracy as high as 27% for inter-session tests. The study indicates that no data was recorded for transitions between gestures. A small portion of data from the new session was used to recalibrate models, and inter-session accuracies as low as 19.55% were improved to 56.34% [12]. Asfour et al. collected pressure sensor data from nine subjects performing sixteen gestures in three sessions; one session was used for training and the others for validation. All of their sessions were recorded on the same day without removing sensors, there was a ‘brief break between sessions’, and gesture transition/‘Null’ samples were discarded. They implemented a feature engineering approach in which Fisher's Discriminant Analysis was used to transform features to a vector space maximizing the variance between classes and minimizing the variance within classes, optimizing models for inter-session validation and resulting in an average inter-session accuracy of 82% [11]. In their 2021 paper, Moin et al. conducted a real-time experiment with two participants using a 66-sensor EMG array to classify 13 gestures via an HD computing algorithm [10]. In-session, their algorithm achieved 97.12% accuracy on 13 gestures. Inter-session tests resulted in an average 11.89% loss of accuracy; injecting fresh data from the new session could bring their inter-session accuracy as high as 94.71%. However, they indicated that transitional periods between gestures required a 500 ms delay to rise to accuracies higher than 80% [10]. The only study that did not require fresh data to recalibrate was Asfour et al., and only Moin et al. included transitions between gestures.
1.2. Focus of This Work
In this pilot study we explore inter-session (IS) trial results utilizing a capacitive strap-based sensor from our previous work on wearable gesture recognition [23]. We collect two sets of labeled data consisting of 10 gesture classes and a ‘Null’ label for transitions between gestures from a convenience sample of eight subjects. We explore two different methods of processing gesture data collected from a user in order to correct for the limitations of computer-assigned ground truth.
The first method which we refer to as FILT is a two-stage algorithm. An unsupervised learner finds the points in ‘Null’ containing the actual transitions between gestures under the assumption that they are outliers. In the context of this experiment rules are then used to relabel samples prior to the transitions into the previous class, and post transition samples into the next class. This allows for more accurate data to be used to train a model and a more reasonable assessment of the model’s accuracy, and is possible because in the supervised training setting used in this experiment the class labels before and after ‘Null’ are known. In an online use case the FILT method could be used to discard samples indicating that a transition between gestures was occurring and only pass actual gesture classes to a second classifier for identification.
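As a rough illustration of this two-stage idea, the sketch below flags presumed transition samples inside each ‘Null’ span with an unsupervised outlier detector and relabels the remaining samples into the adjacent classes. The function name `repair_null_labels`, the choice of scikit-learn's `IsolationForest`, and the `contamination` setting are illustrative assumptions, not the study's exact implementation:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

def repair_null_labels(X, y, null_label="Null", contamination=0.1, seed=0):
    """Within each 'Null' span, treat transition samples as outliers; relabel
    samples before the transition into the previous class and samples after
    it into the next class. The outlier samples themselves stay 'Null'."""
    y = np.asarray(y, dtype=object).copy()
    n = len(y)
    i = 0
    while i < n:
        if y[i] != null_label:
            i += 1
            continue
        j = i
        while j < n and y[j] == null_label:
            j += 1                                  # [i, j) is one 'Null' span
        prev_cls = y[i - 1] if i > 0 else None
        next_cls = y[j] if j < n else None
        span = X[i:j]
        if prev_cls is not None and next_cls is not None and len(span) >= 3:
            iso = IsolationForest(contamination=contamination, random_state=seed)
            flags = iso.fit_predict(span)           # -1 marks presumed transition
            outliers = np.flatnonzero(flags == -1)
            if len(outliers):
                first, last = outliers[0], outliers[-1]
                y[i:i + first] = prev_cls           # pre-transition samples
                y[i + last + 1:j] = next_cls        # post-transition samples
        i = j
    return y
```

This matches the supervised training setting described above, where the classes before and after ‘Null’ are known; an online variant would instead discard the flagged transition samples.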
The second method is referred to as Time Series Consolidation (TSC). Here we make the assumption that due to limitations of a sensor system or individual user some gesture classes will not be distinguishable from each other between use sessions. The algorithm works by training and testing on different repetitions of a gesture in-session, generating a confusion matrix (CM) for each time batch of repetitions, and averaging the CMs when all reps have been tested. It uses the averaged CM to find classes that fail to achieve a set threshold of accuracy and merge them with the class they are most commonly confused with. This produces a smaller set of gesture classes tailored to the user and more likely to remain robust in inter-session tests. TSC has the advantage of allowing rules to be assigned to control which classes are retained and provide a more intuitive user interface.
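The consolidation step can be sketched as follows. The helper name `consolidate_classes`, the `thresh`/`g_min` parameters, and the merge bookkeeping are hypothetical names for illustration; the input is assumed to be a confusion matrix already averaged over the in-session repetition splits:

```python
import numpy as np

def consolidate_classes(avg_cm, labels, thresh=0.8, g_min=5, protected=()):
    """Merge classes whose per-class accuracy (CM diagonal / row sum) falls
    below `thresh` into the class they are most often confused with, keeping
    at least `g_min` classes and never merging away `protected` labels."""
    cm = np.array(avg_cm, dtype=float)
    labels = list(labels)
    merges = {}
    while len(labels) > g_min:
        acc = np.diag(cm) / cm.sum(axis=1)
        order = np.argsort(acc)                        # weakest class first
        worst = next((k for k in order if labels[k] not in protected), None)
        if worst is None or acc[worst] >= thresh:
            break
        row = cm[worst].copy()
        row[worst] = -1
        target = int(np.argmax(row))                   # most-confused partner
        merges[labels[worst]] = labels[target]
        cm[target] += cm[worst]                        # fold row into target...
        cm[:, target] += cm[:, worst]                  # ...then the column
        cm = np.delete(np.delete(cm, worst, axis=0), worst, axis=1)
        labels.pop(worst)
    return labels, merges
```

Because the merge map keeps the original label names, the retained classes can be reported back to the user in intuitive terms, which is the advantage over unsupervised clustering noted above.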
Finally, we explore combining the FILT method of correcting ‘Null’ labels via an unsupervised learner with the TSC method of merging classes via a confusion matrix to arrive at a set of gestures tailored to the limitations of the subject and/or the sensor system being used to recognize gestures. To complement fine-tuning the gesture set to the individual subject, we also fine-tune the feature set, feature normalization method, classification model, and hyper-parameters for the individual user.
3. Results
The effect of using the unsupervised learner to find likely transition points and applying the rules discussed in Section 2.3.3 can be seen in Figure 3. On average, in-session the unsupervised learner had a Precision of 0.9912 (S.D. 0.0057) and an AccStage1 of 98.77% (S.D. 1.37%). In inter-session trials, Precision declined slightly to 0.9857 (S.D. 0.0148) with an AccStage1 of 93.8% (S.D. 4.46%).
The mean and standard deviation of accuracy and classes retained within the four methods considered were: RAWSIS (Accuracy = 42.47% ± 3.83%; Classes Retained = 11.00 ± 0.00; Class Bad = 9.38 ± 1.19), FILTSIS (Accuracy = 61.98% ± 9.17%; Classes Retained = 11.00 ± 0.00; Class Bad = 6.88 ± 1.89), RAWTSC (Accuracy = 93.03% ± 4.96%; Classes Retained = 5.29 ± 0.46; Class Bad = 0.5 ± 0.46), and FILTTSC (Accuracy = 87.32% ± 7.51%; Classes Retained = 6.57 ± 0.92; Class Bad = 1 ± 0.95). The results of the Kruskal–Wallis test showed a significant difference in both mean accuracy (p < 0.0001) and actual classes found (p < 0.0001). As seen in Table 2, the differences between the SIS and TSC methods are statistically significant, but the differences between the two SIS methods and between the two FILT methods do not reach statistical significance.
Looking at the box and scatter plots in Figure 5, the unsupervised layer used in the FILT methods introduces significantly more heteroskedasticity than is seen in the raw data. TSC methods generally had better accuracy and less variance when trained on the raw data than when trained on filtered data.
The p values generated by running the Dunn test with Bonferroni-Holm correction in scikit_posthocs are shown in Table 2, with values indicating statistical significance (p < 0.05) underlined. Both RAWTSC and FILTTSC show statistically significant differences in accuracy compared to the baseline of RAWSIS. The p value comparing the accuracy of RAWSIS vs. RAWTSC (p = 0.000016) is over an order of magnitude smaller than that of RAWSIS vs. FILTTSC (p = 0.000692), and when compared to the accuracy of FILTSIS only RAWTSC shows a statistically significant difference (p = 0.012379). It is tempting to assume that RAWTSC is the best pipeline overall, but the difference in accuracy between RAWTSC and FILTTSC does not reach the threshold of statistical significance (p = 0.37916). A similar trend applies to classes retained.
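A minimal sketch of this statistical workflow is given below, using simulated per-subject accuracies drawn from the means and standard deviations reported above (eight subjects per pipeline). The real analysis used the measured values and `scikit_posthocs.posthoc_dunn` with Holm adjustment; a Mann-Whitney U follow-up with a manual step-down Holm correction is a SciPy-only stand-in here:

```python
import numpy as np
from scipy import stats

# Simulated per-subject accuracies (hypothetical draws, not the study's data).
rng = np.random.default_rng(0)
acc = {
    "RAWSIS":  rng.normal(0.4247, 0.0383, 8),
    "FILTSIS": rng.normal(0.6198, 0.0917, 8),
    "RAWTSC":  rng.normal(0.9303, 0.0496, 8),
    "FILTTSC": rng.normal(0.8732, 0.0751, 8),
}

# Omnibus Kruskal-Wallis test: do the four pipelines share a distribution?
h_stat, p_value = stats.kruskal(*acc.values())

# Pairwise follow-up with a step-down Holm correction of the raw p values.
names = list(acc)
pairs = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]]
raw_p = [stats.mannwhitneyu(acc[a], acc[b]).pvalue for a, b in pairs]
adj = {}
running_max = 0.0
for rank, idx in enumerate(np.argsort(raw_p)):
    running_max = max(running_max, (len(raw_p) - rank) * raw_p[idx])
    adj[pairs[idx]] = min(1.0, running_max)   # adjusted p, capped at 1
```

The omnibus test gates the pairwise comparisons: only if `p_value` is significant are the Holm-adjusted pairwise values worth interpreting.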
Table 3 contains a row for each machine learning pipeline along with a count of how many subjects achieved their highest accuracy with each of the scaling methods, classification algorithms, and feature sets. CapDelta and OptimCapDelta are not included in this table to save space, as they did not produce the best results in any trials.
Figure 6 shows confusion matrices for all four data processing methods applied to Data1 and Data2 from a sample subject. In order to compare performance on in-set (IN) vs. inter-set (IS) testing, the left-hand side of each subplot holds a confusion matrix for the results of applying a given pipeline to a train/test split taken from Data1. On the right-hand side, a confusion matrix shows the results of applying the same pipeline but training on Data1 and validating on Data2. In all cases except for RAW, the IN results for both 5-fold cross-validation and shuffle-split confusion matrices showed accuracy above 90%, but inter-set accuracy is highly variable across methods, ranging from 34.9% to 99.8%.
Rules were set to preserve both the ‘Neutral’ and ‘Fist’ gestures because they were assumed to be the most intuitively different gestures for most users to make.
Table 4 presents the gestures retained by a sample subject after the RAWTSC method was applied. There was significant variability between subjects in regard to which gestures were retained. The most commonly retained gestures were the ‘Point Two’ (point with index and middle finger) and ‘Chuck Grip’ gestures. Three of the subjects retained the ‘Point Middle’ gesture, but only two retained ‘Pinch’ or ‘Point Index’. Only one subject retained the ability to distinguish ‘Thumb Up’, and none of the subjects retained the ability to distinguish the ‘Spread’ or ‘Thumb Adduct’ gestures.
The gestures retained by users with the FILTTSC method can be seen in Table 5. Because the purpose of the FILT method is to eliminate the transitional gestures within ‘Null’ and relabel the rest into their adjacent categories, all users are considered to have a detectable ‘Null’ category. Six of eight users retained the ‘Point Index’ and ‘Chuck Grip’ gestures. While the ability to distinguish ‘Thumb Adduction’ was still limited to one subject, ‘Thumb Up’ and ‘Spread’ could now be distinguished in three of the subjects.
4. Discussion
As mentioned in the introduction, some assumptions were made when developing the learning pipelines. The underlying assumption of FILT is that the method used to assign ground truth, while common in papers, has a flaw baked into it that will lead to mislabeling of data. Correcting the labelling with a two-stage model should increase model accuracy both by improving the accuracy of the labels on samples used to train models and by ensuring that the test data has a lower proportion of mislabeled samples. The underlying assumption of the TSC model is that users are not actually reliable or consistent in gesture production, and that by observing which gestures lose accuracy over multiple repetitions in one set, we can find those gestures likely to be mistaken for each other in different sessions. The TSC method identifies classes that are likely to be inconsistently produced by a given user over time, leading to confusion with other similar classes. While an algorithm may overtrain on noise within a single session and achieve high accuracy on all classes, these classes are likely to have low accuracy in inter-session trials. By identifying and merging those classes we can increase inter-session accuracy and retain the more robustly recognized gestures for a given user. We can also control the labels assigned to classes and report those labels back to the user, resulting in more understandable and intuitive class labels than would be produced by a purely unsupervised learning method. Finally, we looked at combining both FILT and TSC in a single pipeline to see if using them together would result in better performance than either one alone. For comparison purposes, all resulting accuracies were compared to the baseline performance of RAWSIS.
4.1. Unsupervised Layer
The development of the FILT method was motivated by the observation that even in-session accuracy had an unacceptably low value and the most common point of confusion was with the ‘Null’ class. As discussed in the introduction, the software-labelled ground truth used in this and many other studies intrinsically creates mislabeled data [8,9,10,11,12,15]. When subjects are prompted to change gestures, the ground truth label is immediately switched to ‘Null’ and held for some period of time to allow transition (in this study, 3 s). A typical class transition follows this progression: (i) for some period at the beginning of the ‘Null’ class there is a lag while users remain in the previous gesture and parse the next requested gesture; (ii) a quick transition of the subject's hand position to the next requested gesture occurs; (iii) there is some period of time during which the user is now in the next requested class but the software has not updated the labelling. In both phase i and phase iii the software-provided ground truth labels will be erroneous. Examining the ‘Null’ class in Figure 4b shows that while these phases occur in each gesture transition, their durations are not consistent.
In fact, as can be seen in Figure 6a, even for in-session tests the majority of inter-class confusion involves the ‘Null’ class. It is a trivial (but tedious) matter in the lab to manually relabel or discard data, and this is commonly done in studies [8,9,10,11,12,15]. Doing so is not realistic for use outside a lab, nor is using alternate technologies to validate ground truth, as many home users will not possess alternate sensors. It makes sense intuitively that samples representing gesture transitions could be seen as outliers, given that they represent only a small portion of the samples in ‘Null’. So, outlier detection was explored as a method to identify the samples within ‘Null’ representing gesture transitions.
As can be seen in Figure 5a, once the FILTSIS method is applied the mean accuracy achieved improves significantly over RAWSIS. It can also be seen in Table 3 that only one subject benefited from the Caps feature set when the FILTSIS method was applied. This is likely because the rate of change on the sensors should be near zero once a gesture is attained and held. While filtering improved accuracy for the IS trials, it did not improve them enough that either of the SIS models would represent a useful gesture recognition interface, given that the portion of gesture classes retained with accuracies above 80% was very low for both SIS methods.
When the FILTTSC method was used, more classes were preserved than by the RAWTSC method, but a lower mean accuracy was achieved. While this does not appear to be a statistically significant difference, the variance in both accuracy and number of classes preserved is higher for the FILT methods in Figure 3a,b. The outliers in the FILT methods therefore have the effect of making the methods look more similar than they are. It is possible that this is a function of the rules used to relabel classes after detecting outliers, and that a more sophisticated ruleset could reduce the observed variability. It is also possible that a study with more subjects would reveal a statistically significant difference in performance between the FILT methods and RAWSIS. The FILTSIS method obtained an average inter-set accuracy of 61.98% while retaining 11 classes, which is somewhat lower than the 82.5% accuracy that Rossi et al. describe using their hybrid SVC-HMM model. However, their study was only performed on in-session data and limited itself to six gestures (including transitions) [13]. In a future study, it would be interesting to compare the results of the two methods directly on the same dataset.
4.2. Consolidation Algorithm
While the FILTSIS model had good accuracy when validated in-session, as seen in Figure 5, when inter-session validation was performed against samples from Data2 the models using it failed to generalize and displayed low accuracies. It was observed that for any given user there were sets of gestures commonly mistaken for each other, but that these confused gestures did not form a universal set across all users. The same pattern of confused gestures emerged when in-session tests were validated using inter-repetition train/test splits instead of shuffle splits. This led to the question of whether using multiple in-session, inter-repetition train/test splits and averaging the generated CMs could uncover gestures that had poor inter-repetition consistency, and whether tailoring models to the individual user by merging those gestures most likely to be confused for that user would lead to more robust results in inter-session validation.
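The inter-repetition splitting and CM averaging described here might be sketched as below. The helper name, the fixed 3-NN classifier, and the synthetic repetition layout are illustrative assumptions; the study tuned models per subject:

```python
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.neighbors import KNeighborsClassifier

def averaged_repetition_cm(X, y, rep_ids, labels):
    """Average confusion matrices over inter-repetition splits: train on all
    earlier repetitions, then test on each later repetition in turn."""
    cms = []
    for r in np.unique(rep_ids)[1:]:
        train, test = rep_ids < r, rep_ids == r
        clf = KNeighborsClassifier(n_neighbors=3).fit(X[train], y[train])
        cms.append(confusion_matrix(y[test], clf.predict(X[test]), labels=labels))
    return np.mean(cms, axis=0)
```

Classes that keep high diagonal values in the averaged CM are the ones with good inter-repetition consistency; off-diagonal mass indicates the confused pairs that the consolidation step would merge.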
The TSC method is of particular interest because it fills a gap in terms of handling unreliable data and provides an ability to learn which gestures are valid for a given user. Current studies typically focus on a fixed number of gestures rather than tailoring the gestures in a model to a user's capabilities [3]. Techniques that work with unreliable data typically use deep learning as a tool, but deep learning requires a significant amount of data to learn from and may require significant computational power [4,7]. The field of concept drift can address nonstationary data, but as discussed in the introduction, many contemporary studies on detecting gestures after sensor shift require fresh labeled data to learn from [20,21]. As discussed in Section 2.3.5, by leveraging the CM as a tool to consolidate confused gesture classes, the TSC method is able to learn likely areas of confusion without having fresh data injected into the model and can achieve good results using classical machine learning methods.
Typically, CMs are used as a visual tool to analyze where classification models break down, or to share results with colleagues. By instead using the CM as an iterative tool to seek ‘clusters’ of errors and the most common mistaken identities, the algorithm can preserve, to some extent, the labels initially provided by merging less distinct classes into the ones that were better recognized by the models. While unsupervised clustering can perform a similar function, it is less likely to do so in a way that preserves knowledge of class labels. The preservation of class labels should allow for a more intuitive user interface tailored to a specific user's needs. Our method recognizes that the number and/or composition of classes that a sensor can recognize on a given user cannot be known a priori, nor can one know the threshold accuracy needed to segment out the subset of useful gestures or the best learning pipeline. The advantage of the TSC method is that it does not require this knowledge upfront. Instead, a larger set of gestures is requested from the user, and a starting accuracy threshold (thresh) together with a minimum class count (gMIN = 5) sets a floor. The TSC method then seeks a subset of gestures that can be consistently recognized over several repetitions from the user. For users who were consistent in their motions, these starting conditions have the potential to retain more than five gestures, while also ensuring that the algorithm preserves at least five. This floor was set because five gestures would be theoretically sufficient to allow the sensor system to perform basic mouse functions. While for healthy users this is merely a convenient function, for users with physical disabilities being able to tailor a gesture set to their physical capabilities offers the potential to vastly improve their user experience.
The parameters of the study are somewhat comparable with Asfour et al.'s 2021 study. Similar to their study, only a short break was used between sets. While the Asfour study used 16 gestures vs. the 11 used in this study, they achieved a lower accuracy of 86.4% ± 8.6% in-session and 78.5% ± 11.00% inter-session [11]. While they retained all gestures, they explicitly state that they only used static data from gesture classes and did not include a ‘Null’ or transitional class in their modeling. It seems likely that doing so would result in losses of accuracy, but it would be of interest in the next stage of research to directly compare the results of applying their method to data collected on our subjects with the results obtained from the FILT and TSC methods. For all users, extrapolating the gestures likely to be performed consistently over time from a small dataset taken at the start of a use session, rather than requiring multiple batches of fresh labeled data, has the potential to improve the general user experience of wearable gesture recognition devices.
4.3. Combining TSC with Unsupervised Learning
All models had fairly good accuracy when validated using IN tests. However, as seen in Figure 5, when exposed to new samples from Data2 the models using RAWSIS and FILTSIS failed to generalize and displayed low accuracies. This can largely be attributed to the bias-versus-variance tradeoff; in other words, the models learned the noise in the dataset rather than generalizable patterns.
The highest mean accuracy and lowest variability (Figure 5) were achieved using the RAWTSC method, which also had the least variability in the usable number of classes. While the RAWTSC method retained fewer classes after merging, the classes it retained had far higher IS accuracy. This is attributed to the fact that most subjects did not consistently perform all 11 of the gestures used in the study. In some cases, this can be attributed to the physical capabilities of the individual subject; for example, one subject indicated that they suffered from arthritis, and another that they had mild carpal tunnel. Other users indicated that they had difficulty making specific single-finger gestures, or that they experienced mild fatigue over the course of the experiment, which could also lead to inconsistent production of gestures.
It was surprising that FILTTSC appeared to lead to lower mean accuracy and higher variance, yet did not reach the criterion for statistical significance. It is possible that the variance itself was the reason for the lack of statistical significance, as it can be seen that while six of the users had lower accuracies than the average values seen for RAWTSC, two of the users were high-accuracy outliers. While the difference in accuracy does not meet the criterion for statistical significance, the variance may point to further work being needed on the rules-based algorithm that accompanies the FILT method. While the single unsupervised model could detect outliers representing transitional gestures, it could not meaningfully detect incorrect gestures.
It was noted that for some subjects the accuracy for IN validation was at 100% for all gesture classes, but declined significantly for one or two gestures in IS validation. For almost all the subjects observed, there were instances in which the gesture they made was not the one prompted by the application. In order to test the robustness of the detection and consolidation algorithm in a more realistic use case, these mislabeled gestures were not removed. In cases where these mislabeled gestures were present in Data1 they would represent merging mislabeled data into a class; if present in Data2 they could account for some of the observed error. Training an unsupervised learner to recognize outliers in each gesture class could potentially do a better job of filtering out user error prior to relabeling and classification.
4.4. Scaling and Models
A glance at
Table 3 shows that the majority of the time the best results were achieved using unscaled data. However, in the case of RAWTSC, scaling via Standardization had a slight edge in performance over the other scaling methods. This is assumed to be due to two factors: (1) in many cases subjects made extra, unrequested motions (including adjusting their facemask in one notable instance), which could result in measured values with no relationship to the expected range of values during gestures; (2) the assumption that the data are stationary is violated by variations in position and in how the subjects made gestures between datasets. If only small differences existed between classes, the adjustment of feature values that occurred when standardization and normalization were applied could be enough to blur the boundaries between classes. The only case in which scaling via Standardization achieves better performance is the RAWTSC method, where Standardization appears twice as often as Normalization and only two subjects achieve their best performance with unscaled data. Because the TSC method acts to merge the classes most likely to be confused with each other, it is not surprising that scaling is able to improve results. It is, however, unexpected that TSC performs better on RAW data than on FILT data, and it is possible that a more sophisticated ruleset is needed for the transitional data. The current ruleset simply labels all samples between the previous class and the outlier as belonging to the previous class, and all samples between an outlier and the next class as belonging to the next class. This could inadvertently capture samples that do not belong in any class.
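The current ruleset described above can be sketched for a single time-ordered 'Null' window as follows; the function and variable names are illustrative, not taken from the study's code, and samples between multiple flagged outliers are treated here as ambiguous and dropped.

```python
def relabel_null_window(n_samples, outlier_positions, prev_label, next_label):
    """Relabel a time-ordered 'Null' transition window.

    Samples before the first flagged outlier inherit the previous
    gesture's label, samples after the last outlier inherit the next
    gesture's label, and flagged (or ambiguous mid-transition) samples
    are marked None so they can be dropped before training.
    """
    labels = []
    first, last = min(outlier_positions), max(outlier_positions)
    for i in range(n_samples):
        if i in outlier_positions:
            labels.append(None)        # flagged transition sample: drop
        elif i < first:
            labels.append(prev_label)  # still holding the previous gesture
        elif i > last:
            labels.append(next_label)  # already in the next gesture
        else:
            labels.append(None)        # between outliers: ambiguous, drop
    return labels
```

Note that this simple before/after rule is exactly what can capture samples belonging to no class: a sample relabeled as `prev_label` may in fact already be mid-transition if the detector missed the true start of the movement.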
Because the goal was to tailor the model to the user, no assumptions were made regarding which models would best fit a given user; instead, the f1 metric discussed earlier was generated from the results of IN validation and used to select the model presumed to have the best IS performance. Across all the methods explored, the best performance comes from KNN (though the number of neighbors needs to be tailored to a given subject by exploring the f1 scores on Data1). The performance of BRF improves slightly when using the FILT and TSC methods. This is likely because filtering and consolidating data leads to imbalances in class composition, which the imbalanced-learning model handles better than standard KNN. In all instances, SVC performs poorly and never achieves the best accuracy for any user or method. This may also be tied to the underlying lack of stationarity in the data.
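The per-user selection step can be sketched as a small search over candidate models scored by macro-f1 on a held-out in-session split of Data1. The candidate set below is illustrative: the study's BRF comes from the imbalanced-learn package, so a plain `RandomForestClassifier` stands in here to keep the sketch scikit-learn only.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

def select_model(X, y, seed=0):
    """Score candidate models by macro-f1 on an in-session split
    (a stand-in for the paper's IN validation) and return the best."""
    X_tr, X_val, y_tr, y_val = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=seed)
    candidates = {f"knn{k}": KNeighborsClassifier(n_neighbors=k)
                  for k in (1, 3, 5, 7)}   # neighbors tailored per subject
    candidates["svc"] = SVC()
    # stand-in for BalancedRandomForest (imbalanced-learn) from the study
    candidates["rf"] = RandomForestClassifier(random_state=seed)
    scores = {name: f1_score(y_val, m.fit(X_tr, y_tr).predict(X_val),
                             average="macro")
              for name, m in candidates.items()}
    best = max(scores, key=scores.get)
    return best, candidates[best], scores
```

The chosen model would then be retrained on all of Data1 and validated inter-session on Data2.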
Automatic feature reduction did not significantly improve results. In the RAW data, using the Caps feature set was often needed to achieve good results, possibly because it is easier to detect Null motions with the rate-of-change features. In filtered data sets, the extra data encoded in the Caps feature set tends not to make models more effective, as the Null motions have already been stripped by the unsupervised learning layer.
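If the Caps feature set is interpreted as the raw capacitance channels augmented with their per-sample rate of change (an assumption based on the description above, not a confirmed definition), the augmentation could look like the following sketch:

```python
import numpy as np

def add_rate_features(window):
    """Append per-channel first differences (rate of change) to a
    (time, channels) window of capacitance readings.

    The first row's rate is zero, since it has no preceding sample.
    """
    window = np.asarray(window, dtype=float)
    # prepend the first row so diff() keeps the same number of rows
    rates = np.diff(window, axis=0, prepend=window[:1])
    return np.hstack([window, rates])
```

During a Null transition the rate columns are large, while during a held static gesture they are near zero, which is consistent with rate features mainly helping to detect Null motions.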
4.5. Gestures Retained
There was no universal set of gestures retained, which points to the retained gestures being unique to a given subject rather than a set that could be created a priori and applied to all users. This is likely due to several factors. Firstly, different users likely had different physical capabilities in terms of which gestures they could continue to make distinctly over time. Secondly, the capacitive sensors measure the deformation of the forearm muscles as different gestures are made. While the volume of the muscles remains constant, the flexed and relaxed portions of muscle change the distance of the skin from the strap sensors at different cross sections along the arm. A subject’s forearm muscular development, and specifically the muscles that a subject engages for a particular gesture, are likely a function of the subject’s daily activities and hobbies. Finally, while there were variations in arm length, subcutaneous fat, and musculature among the users, the sensor sleeve itself was a ‘one size fits all’ configuration.
RAWSIS and FILTSIS preserved all gestures, but not at a high enough accuracy to be particularly useful. While the two TSC methods preserved five to seven classes per subject, there was significant variation in the classes preserved both by subject and by method. For all subjects, the rules settings ensured that ‘Null’, ‘Neutral’, and ‘Fist’ would be preserved. The rejection of ‘Null’ by the unsupervised learner in FILTTSC ensured that it would be recognized as an outlier, and in the RAWTSC method, ‘Null’ was distinct enough from the other gestures that the system preserved it. This is potentially useful in rejecting non-gesture motions. In general, when more fingers were involved in making a gesture, it was more likely to be retained as a unique gesture.
As seen in
Table 4, ‘Pinch’ and ‘Point Two’ were identifiable in half the subjects and otherwise, there was significant variability in which gestures could be recognized for the RAWTSC method. A common trend in RAWTSC was that ‘Spread’ and ‘Thumb Adduct’ were not recognizable for any subject, and that ‘Thumb Up’ was only recognizable on one of the subjects. It is tempting to attribute this to the fact that the majority of the muscles used to spread the fingers and move the thumb are intrinsic to the hand. However, as can be seen in
Table 5 a significant number of classes are recovered when the ‘Null’ class is eliminated by the unsupervised learner in the FILTTSC method.
The gesture ‘Point Index’ is retained for six subjects with reasonable accuracy. ‘Point Middle’ is retained for five subjects, albeit for two of the subjects it proves to be a very low accuracy class. The ‘Thumb Up’ and ‘Spread’ gestures can be recognized in three of the subjects and while ‘Thumb Up’ has good accuracy, ‘Spread’ never achieves an accuracy of > 60% for any of the subjects retaining it as a gesture. One of the users also retains the ‘Thumb Adduct’ gesture with high accuracy.
A possible reason for the challenge of recognizing the ‘Spread’ gesture and the various gestures involving the thumb is the location of the muscles used in these gestures. The muscles involved in spreading the fingers are located within the hand rather than the forearm, and many of the muscles responsible for thumb use are also located in the hand. The muscles extrinsic to the hand that control the thumb are relatively small and lie beneath other, larger muscles. It is likely that the difference between the gestures involving the thumb and the other gesture classes is not large enough to be distinguished in the SIS methods, and these gestures only become clear after some classes are consolidated. There does not seem to be a distinct pattern for weight, arm length, or gender with regard to which subjects retained these gesture classes. It should be noted, however, that ‘Spread’ and ‘Point Middle’ are often the lowest accuracy classes, and adding a rule to prevent them from being retained by the TSC method may improve the overall performance and usability of the system as a human-machine interface.
4.6. Limitations
This pilot study represents a first step towards validating both the FILT method of handling label inaccuracy within the ‘Null’ class and the TSC method of tailoring gestures to user capabilities. That said, it is a small pilot study with only eight users. We used a small number of subjects, and in order to ensure that we would not lose subjects to attrition, we collected both data sets in a single day. While several users self-reported carpal tunnel or arthritis as conditions that affected their hand mobility and comfort in making certain gestures, this is anecdotal rather than statistical information. A much larger number of subjects will be needed in future studies to determine whether patterns emerge related to physiological and anatomical variations. To determine whether tailoring gestures to users represents a significant improvement in gesture recognition interfaces for disabled users, the next study should also intentionally recruit from disabled populations.
In this study, an armrest was used to assist subjects in maintaining a consistent position of their wrist and arm throughout the data recording sessions. Many studies point to challenges in gesture recognition when multiple arm positions are added to the training and validation sets [
10,
15,
35]. Without including both static and active changes in wrist and elbow position, the robustness of the models created in this work to changes in user position is unknown. In the FILT methods especially, valid gestures may go unrecognized or may even be rejected by the first-stage model. Several studies also indicate that inter-day tests present challenges, due to changes in the placement and positioning of sensors, that may not be observed in single-day tests [
10,
15]. Further studies in which the subjects are instructed to maintain or make hand gesture poses while changing the wrist and elbow position will help determine how sensitive models are to changes in the subject position. Ideally, these data sessions should be recorded on multiple days to determine if the sensitivity to placement and body conditions between days is a confounding factor.
A metal hook sewn to an elastic strap was used in conjunction with an evenly spaced set of eyes to ensure that the straps would start with a consistent pressure on the subjects’ arm at rest in a ‘Neutral’ starting position. This introduces the possibility that as subjects move away from the position the model was trained in, variations in the pressure or position of the straps will be introduced and lead to model errors. One possible way to validate this would be to use pressure sensors. Pressure sensors have seen use in gesture recognition, but could also provide useful ground truth on changes in limb pressure against the capacitive straps used in this study [
3,
36].
The capacitive strap sensors used in this study only measure the change in capacitance caused by changes in the cross-sectional area of forearm muscles as they contract or relax. This limitation likely prevents gestures dependent on the intrinsic muscles of the hand, such as ‘Spread’ and ‘Thumb Adduct’, from being recognized effectively by this type of sensing system. While noise within a dataset may be falsely learned as an indicator of the ‘Spread’ gesture, such a cue does not generalize beyond that dataset. The next set of experiments should use a hybrid sensor approach to track gestures, such as thumb motions and the spreading of fingers, that are controlled by muscles intrinsic to the hand.
5. Summary
In this study, we investigated methods to mitigate two confounding factors in gesture recognition: (i) inaccurate ground truth labels for samples representing transitions between gestures, caused by software-based labeling, and (ii) inconsistent production of gestures by users. We used capacitive strap gesture recognition sensors and collected two data sets (Data1, Data2) from eight subjects, each containing five repetitions of ten gestures and a ‘Null’ category for gesture transitions. Data1 was used for training models and Data2 for validation in order to provide more realistic inter-session assessments of performance.
Two methods were developed to resolve these issues. FILT used an unsupervised learner to find outliers in the ‘Null’ class under the assumption that these outliers represented gesture transitions. Samples flagged as outliers were dropped, and the labels of the remaining samples in the ‘Null’ class were reassigned to the preceding or following class as appropriate. The relabeled samples were then used to train and validate models using a conventional shuffle split. TSC created a series of inter-repetition models and averaged their confusion matrices to find the gesture classes unlikely to be consistently produced by a given user. The labels for these classes were then merged and a conventional learning model was trained. In both cases, models were trained on Data1 and validated on Data2.
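The TSC consolidation step can be sketched as follows: row-normalize the averaged inter-repetition confusion matrix, then map each class whose recall (diagonal entry) falls below a threshold onto the class it is most often confused with. The threshold value, single-pass (non-iterative) merging, and function name here are illustrative assumptions, not the study's exact rules, which also force-preserved ‘Null’, ‘Neutral’, and ‘Fist’.

```python
import numpy as np

def consolidate_classes(mean_confusion, labels, min_recall=0.7):
    """Map weakly-recognized classes onto their most-confused neighbor.

    mean_confusion : square matrix averaged over inter-repetition models
    labels         : class names, in matrix row/column order
    Returns a dict mapping each original label to its merged label.
    """
    cm = np.asarray(mean_confusion, dtype=float)
    cm = cm / cm.sum(axis=1, keepdims=True)  # rows become recall profiles
    mapping = {}
    for i, label in enumerate(labels):
        if cm[i, i] >= min_recall:
            mapping[label] = label           # consistently produced: keep
        else:
            off = cm[i].copy()
            off[i] = -1.0                    # exclude the diagonal
            mapping[label] = labels[int(np.argmax(off))]
    return mapping
```

A training label array would then be rewritten through `mapping` before the final conventional model is trained, so classes a user cannot produce consistently are absorbed rather than left to generate errors.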
Four pipelines were tested, consisting of mixtures of these methods. RAWSIS used unmodified samples, and RAWFILT applied only the FILT method. RAWTSC applied only the TSC method, and FILTTSC applied the FILT method and then the TSC method. RAWSIS had the lowest average performance, achieving an average accuracy of 91.16% ± 6.70% for in-session validation on Data1 and 61.98% ± 9.17% for inter-session validation. The best performance was achieved by RAWTSC, with an average inter-session accuracy of 93.03% ± 4.96%.