# Real-Time Arm Gesture Recognition Using 3D Skeleton Joint Data

## Abstract

## 1. Introduction

## 2. Related Work

## 3. Arm Gesture Recognition

#### 3.1. The Microsoft Kinect SDK

#### 3.2. Gesture Recognition

## 4. Experimental Results

#### 4.1. Dataset

#### 4.2. Experiments

#### 4.3. Comparisons to the State-of-the-Art

## 5. Conclusions and Discussion

## Author Contributions

## Funding

## Conflicts of Interest

## References

**Figure 1.** (**a**) Extracted human skeleton 3D joints using the Kinect software development kit (SDK). (**b**) Visual representation of a given node $J$, its parent $J_{p}$ and its child $J_{c}$, the quantities $a, b, c, d$ used in Equations (2)–(6), and the reference point $(v_{x,i}^{(J_{p})},0,0)$.

**Figure 2.** Aligned RGB and skeletal images of (**a**) swipe-up and (**b**) swipe-in gestures, performed by the same user.

**Figure 3.** K-fold cross-validation results for all machine learning approaches and for several values of K.

**Figure 6.** Mean accuracy per gesture vs. number of users in training set, without the mischievous user.

**Figure 7.** Comparison of average time required for the classification of a sample in several architectures.

**Table 1.** Proposed features, extracted from the skeletal joints (features marked with ∗ are calculated using only HandLeft and/or HandRight).

Feature Name | Frames Involved | Equation |
---|---|---|
Spatial angle | $F_{2},F_{1}$ | $\arccos\frac{\mathbf{v}_{2}^{(J)}\cdot\mathbf{v}_{1}^{(J)}}{\lVert\mathbf{v}_{2}^{(J)}\rVert\cdot\lVert\mathbf{v}_{1}^{(J)}\rVert}$ |
Spatial angle | $F_{N},F_{N-1}$ | $\arccos\frac{\mathbf{v}_{N}^{(J)}\cdot\mathbf{v}_{N-1}^{(J)}}{\lVert\mathbf{v}_{N}^{(J)}\rVert\cdot\lVert\mathbf{v}_{N-1}^{(J)}\rVert}$ |
Spatial angle | $F_{N},F_{1}$ | $\arccos\frac{\mathbf{v}_{N}^{(J)}\cdot\mathbf{v}_{1}^{(J)}}{\lVert\mathbf{v}_{N}^{(J)}\rVert\cdot\lVert\mathbf{v}_{1}^{(J)}\rVert}$ |
Total vector angle | $F_{1},\dots,F_{N}$ | $\sum_{i=2}^{N}\arccos\left(\frac{\mathbf{v}_{i}^{(J)}\cdot\mathbf{v}_{i-1}^{(J)}}{\lVert\mathbf{v}_{i}^{(J)}\rVert\,\lVert\mathbf{v}_{i-1}^{(J)}\rVert}\right)$ |
Squared total vector angle | $F_{1},\dots,F_{N}$ | $\sum_{i=2}^{N}\arccos\left(\frac{\mathbf{v}_{i}^{(J)}\cdot\mathbf{v}_{i-1}^{(J)}}{\lVert\mathbf{v}_{i}^{(J)}\rVert\,\lVert\mathbf{v}_{i-1}^{(J)}\rVert}\right)^{2}$ |
Total vector displacement | $F_{N},F_{1}$ | $\lVert\mathbf{v}_{N}^{(J)}-\mathbf{v}_{1}^{(J)}\rVert$ |
Total displacement | $F_{1},\dots,F_{N}$ | $\sum_{i=2}^{N}\lVert\mathbf{v}_{i}^{(J)}-\mathbf{v}_{i-1}^{(J)}\rVert$ |
Maximum displacement | $F_{1},\dots,F_{N}$ | $\max_{i=2,\dots,N}\left(\lVert\mathbf{v}_{i}^{(J)}-\mathbf{v}_{i-1}^{(J)}\rVert\right)$ |
Bounding box diagonal length ${}^{\ast}$ | $F_{1},\dots,F_{N}$ | $\sqrt{a_{B(\mathcal{V}^{(\mathcal{J})})}^{2}+b_{B(\mathcal{V}^{(\mathcal{J})})}^{2}}$ |
Bounding box angle ${}^{\ast}$ | $F_{1},\dots,F_{N}$ | $\arctan\frac{b_{B(\mathcal{V}^{(\mathcal{J})})}}{a_{B(\mathcal{V}^{(\mathcal{J})})}}$ |

Symbol | Definition |
---|---|
$J$ | a given joint |
$J_{c},J_{p}$ | child/parent joint of $J$, respectively |
$F_{i}$ | a given video frame, $i=1,\dots,N$ |
$\mathbf{v}_{i}^{(J)}$ | vector of 3D coordinates of $J$ at $F_{i}$ |
$v_{x,i}^{(J)},v_{y,i}^{(J)},v_{z,i}^{(J)}$ | the 3D coordinates of $\mathbf{v}_{i}^{(J)}$ |
$\mathcal{J}$ | the set of all joints |
$\mathcal{V}^{(\mathcal{J})}$ | the set of all vectors $\mathbf{v}_{i}^{(J)},J\in\mathcal{J},i=1,2,\dots,N$ |
$B(\bullet)$ | a 3D bounding box of a set of vectors |
$a_{B(\bullet)},b_{B(\bullet)}$ | the lengths of the sides of $B(\bullet)$ |
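The features of Table 1 can be illustrated with a minimal sketch for a single joint trajectory. It is not the authors' implementation. The function names, the dictionary keys, and the zero-length guard are hypothetical. Two readings are assumptions: the squared total vector angle is taken as the sum of squared per-frame angles, and the bounding box sides $a$ and $b$ are taken as the extents along the x and y axes. Sums and the maximum run over consecutive frame pairs.

```python
import math

def _angle(u, v):
    """Angle in radians between two 3D vectors (0.0 if either has zero length)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0.0:
        return 0.0  # assumption: undefined angles are treated as zero
    # Clamp to [-1, 1] to protect acos from floating-point overshoot.
    return math.acos(max(-1.0, min(1.0, dot / norm)))

def _dist(u, v):
    """Euclidean distance between two 3D points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def joint_features(traj):
    """traj: list of (x, y, z) positions of one joint over frames F_1..F_N, N >= 2."""
    pairs = list(zip(traj[1:], traj[:-1]))          # (v_i, v_{i-1}) for i = 2..N
    angles = [_angle(cur, prev) for cur, prev in pairs]
    steps = [_dist(cur, prev) for cur, prev in pairs]
    # Assumption: a and b are the bounding-box extents along x and y.
    a = max(p[0] for p in traj) - min(p[0] for p in traj)
    b = max(p[1] for p in traj) - min(p[1] for p in traj)
    return {
        "spatial_angle_first": angles[0],            # F_2, F_1
        "spatial_angle_last": angles[-1],            # F_N, F_{N-1}
        "spatial_angle_ends": _angle(traj[-1], traj[0]),  # F_N, F_1
        "total_vector_angle": sum(angles),
        "squared_total_vector_angle": sum(t * t for t in angles),
        "total_vector_displacement": _dist(traj[-1], traj[0]),
        "total_displacement": sum(steps),
        "maximum_displacement": max(steps),
        "bbox_diagonal": math.hypot(a, b),
        "bbox_angle": math.atan2(b, a),              # equals arctan(b/a) for a > 0
    }
```

Using `math.atan2(b, a)` instead of `math.atan(b / a)` is a design choice of this sketch: it avoids a division by zero when the joint does not move along the first axis.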

**Table 3.** Optimal classifier parameters ($\alpha$: learning rate; $n$: number of neighbors; $e$: number of estimators; $s$: search algorithm; $d$: max depth; $m$: metric between points $p$ and $q$; $f$: max number of features; $r$: regularization parameter).

Classifier | Parameters |
---|---|
ABDT | $e=103$, $\alpha=621.6$ |
ABET | $e=82$, $\alpha=241.6$ |
DT | $d=48$, $f=49$ |
ET | $d=17$, $f=70$, $e=70$ |
KNN | $n=22$, $s=\mathrm{kd\_tree}$, $m=\sum_{i=1}^{n}\lvert p_{i}-q_{i}\rvert$ |
LSVM | $C=0.0091$ |
QDA | $r=0.88889$ |
RBFSVM | $C=44.445$, $\gamma=0.0001$ |
RF | $d=27$, $f=20$, $e=75$ |
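The KNN metric $m$ in Table 3 is the Manhattan (L1) distance. A minimal sketch of nearest-neighbor voting with that metric follows. It uses a brute-force search rather than the k-d tree listed in the table, and the function names and data layout are hypothetical.

```python
from collections import Counter

def manhattan(p, q):
    """Manhattan (L1) distance: sum of absolute coordinate differences."""
    return sum(abs(pi - qi) for pi, qi in zip(p, q))

def knn_predict(train, query, n_neighbors):
    """Majority vote among the n_neighbors training samples closest to query.

    train: list of (feature_vector, label) pairs.
    Brute-force search for illustration; not a k-d tree.
    """
    nearest = sorted(train, key=lambda item: manhattan(item[0], query))[:n_neighbors]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

A k-d tree changes only how the neighbors are found, not which neighbors are returned, so the brute-force version gives the same predictions on small data.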

**Table 4.** $F_{1}$ score for each gesture and mean $F_{1}$ score over all gestures in the leave-one-user-out experiment.

Gesture | User 1 | User 2 | User 3 | User 4 | User 5 | User 6 | User 7 | User 8 | User 9 | User 10 |
---|---|---|---|---|---|---|---|---|---|---|
LH-SwipeDown | 0.76 | 0.83 | 1.00 | 0.82 | 1.00 | 0.80 | 1.00 | 1.00 | 1.00 | 0.96 |
LH-SwipeIn | 0.38 | 0.92 | 0.84 | 1.00 | 1.00 | 0.92 | 1.00 | 1.00 | 1.00 | 1.00 |
LH-SwipeOut | 0.61 | 0.93 | 0.86 | 1.00 | 1.00 | 0.89 | 1.00 | 1.00 | 0.97 | 1.00 |
LH-SwipeUp | 0.69 | 0.90 | 1.00 | 0.84 | 1.00 | 0.83 | 1.00 | 1.00 | 0.97 | 0.96 |
RH-SwipeDown | 0.78 | 1.00 | 0.95 | - | 1.00 | 1.00 | 0.92 | 1.00 | 0.87 | 1.00 |
RH-SwipeIn | 0.64 | 1.00 | 0.67 | - | 1.00 | 1.00 | 1.00 | 1.00 | 0.89 | 0.96 |
RH-SwipeOut | 0.61 | 1.00 | 0.80 | - | 1.00 | 1.00 | 0.95 | 1.00 | 1.00 | 0.95 |
RH-SwipeUp | 0.40 | 1.00 | 0.95 | - | 1.00 | 1.00 | 1.00 | 0.96 | 1.00 | |
Average | 0.62 | 0.94 | 0.88 | 0.92 | 1.00 | 0.92 | 0.99 | 1.00 | 0.96 | 0.97 |
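A leave-one-user-out evaluation with per-gesture $F_{1}$ scores, of the kind reported in Table 4, could be organized as in the sketch below. It is illustrative only: the sample layout `(user_id, features, label)` and both function names are assumptions, and the classifier itself is left out.

```python
def f1_for_label(y_true, y_pred, label):
    """F1 score of one class: 2*TP / (2*TP + FP + FN).

    Returns None when the label occurs in neither the truth nor the predictions.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    if tp + fp + fn == 0:
        return None
    return 2 * tp / (2 * tp + fp + fn)

def leave_one_user_out(samples):
    """Yield (held_out_user, train, test) with one user's samples held out per fold.

    samples: list of (user_id, features, label) tuples (hypothetical layout).
    """
    users = sorted({user for user, _, _ in samples})
    for held_out in users:
        train = [s for s in samples if s[0] != held_out]
        test = [s for s in samples if s[0] == held_out]
        yield held_out, train, test
```

In each fold a classifier would be trained on `train`, applied to `test`, and scored per gesture with `f1_for_label`.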

**Table 5.** Comparisons to state-of-the-art research works using the MSR action dataset of [35]. Results denote accuracy (%).

Action set | Test I [35] | Test I Our | Test II [35] | Test II Our | Test III [35] | Test III Our | [4] | Avg. [35] | Avg. Our |
---|---|---|---|---|---|---|---|---|---|
AS1 | 89.50 | 85.36 | 93.30 | 91.39 | 72.90 | 89.28 | 93.50 | 85.23 | 88.68 |
AS2 | 89.00 | 72.90 | 92.90 | 84.40 | 71.90 | 73.20 | 52.00 | 84.60 | 76.84 |
AS3 | 96.30 | 93.69 | 96.30 | 98.81 | 79.20 | 97.47 | 95.40 | 90.60 | 96.66 |
Avg. | 91.60 | 83.98 | 94.17 | 91.53 | 74.67 | 86.65 | 80.30 | 84.24 | 87.39 |

**Table 6.** Comparison to state-of-the-art research works using the dataset of [7].

**Table 7.** Comparison to the state-of-the-art research work of [36].

**Table 8.** Comparison to state-of-the-art research work of [18], using our own dataset.

Metric | [18] | Our |
---|---|---|
Acc. (%) | 91.0 | 96.0 |

**Table 9.** Reported results of the state-of-the-art (listed alphabetically). The 2nd column (Approach) summarizes the features and learning approach used. In the 3rd column (Gestures), gestures relevant to those of our work are in bold. The 4th column (Acc.(s.)) presents the best accuracy achieved on the dataset used, with the number of subjects in parentheses.

Ref. | Approach | Gestures | Acc.(s.) | Comments/Drawbacks |
---|---|---|---|---|
[11] | 2D projected joint trajectories, rules and HMMs | Swipe(L,R), Circle, Hand raise, Push | 95.4 (5) | heuristic, not scalable rules, different features for different kinds of moves |
[1] | 3D joints, rules, SVM/DT | Neutral, T-shape, T-shape tilt/pointing(L,R) | 95.0 (3) | uses an exemplar gesture to avoid segmentation |
[7] | Norm. 3D joints, Weig. DTW | Push Up, Pull Down, Swipe | 96.7 (n/a) | very limited evaluation |
[12] | Head/hands detection, GHMM | Up/Down/Left/Stretch(L, R, B), Fold(B) | 98.0 (n/a) | relies on head/hands detection |
[13] | clustered joints, HMM | Come, Go, Sit, Rise, Wave(L) | 85.0 (2) | very limited evaluation, fails at higher speeds |
[10] | HMM, DTW | Circle, Elongation, Punch, Swim, Swipe(L, R), Smash | 96.0 (4) | limited evaluation |
[2] | Differences to reference joint, KNN | Swipe(L, R, B), Push(L, R, B), Clapping in/out | 97.2 (20) | sensitive to temporal misalignments |
[3] | 3D joints, velocities, ANN | Swipe(L,R), Push Up(L,R), Pull Down(L,R), Wave(L,R) | 95.6 (n/a) | not scalable for gestures that use both hands |
[4] | Pose sequences, Decision Forests | Open Arms, Turn Next/Prev. Page, Raise/Lower Right Arm, Good Bye, Jap. Greeting, Put Hands Up Front/Laterally | 91.5 (10) | pose modeling requires extra effort, limited to gestures composed of distinctive key poses |
[8] | 3D joints, feature weighted DTW | Jumping, Bending, Clapping, Greeting, Noting | 68.0 (10) | detected begin/end of gestures |
[9] | 3D joints, DTW and KNN | Swipe(L, R), Push Up(L, R), Pull Down(L, R), Wave(L, R) | 89.4 (n/a) | relies on heuristically determined parameters |
[5] | 4D quaternions, SVM | Swipe(L, R), Clap, Waving, Draw circle/tick | 98.9 (5) | limited evaluation |
[14] | Head/hands detection, kinematic constraints, GMM | Punch(L, R), Clap, Wave(L, R), Dumbbell Curls(L, R) | 93.1 (5) | relies on head/hands detection |
[15] | Motion and HOG features of hands, hierarchical HMMs | Swipe(L, R), Circle, Wave, Point, Palm Forward, Grab | 66.0 (10) | below average performance on continuous gestures |
our | novel set of features, ET | Swipe Up/Down/In/Out(L,R) | 95.0 (10) | scalable, does not use heuristics |

