Human Action Recognition Using Bone Pair Descriptor and Distance Descriptor

The paper presents a method for the recognition of human actions based on skeletal data. A novel Bone Pair Descriptor is proposed, which encodes the angular relations between pairs of bones. Its features are combined with those of the Distance Descriptor, previously used for hand posture recognition, which describes the relationships between the distances of skeletal joints. Five different time series classification methods are tested, and a selection of features, input joints, and bones is performed. The experiments are conducted using person-independent validation tests and a challenging, publicly available dataset of human actions. The proposed method is compared with other approaches found in the literature, achieving relatively good results.


Introduction
Automatic human action recognition is an important research topic in machine vision and learning. It allows us to understand the intentions of a human, which can be useful information in video surveillance, detection of aggressive behavior, and human-computer and human-robot interaction. The development of low-cost devices, such as the Microsoft Kinect sensor, has caused increased interest in recognition methods based on depth data, such as point clouds, depth maps, and skeletons [1][2][3][4][5][6]. Skeletal data consists of the 3D coordinates of characteristic points of the human body. The main advantage of this type of data, compared to color/gray images, point clouds, and depth maps, is its size. A typical human action recorded as a sequence of 50 skeletons has a size of 4 KB, whereas a similar image or depth map sequence can take up to several dozen MB. This makes a huge difference in the case of large datasets used for classifier training. Although significant progress has been made in human action recognition, the existing algorithms are still far from perfect, especially if the person performing the actions is not present in the training dataset.
In this paper, the problem of human action recognition based on skeletal data is tackled. Our approach is based on the Distance Descriptor and the novel Bone Pair Descriptor, which is a modification of a method previously developed for static hand posture recognition. The experimental tests are performed using five classifiers and different configurations of features. The main contributions of this paper are as follows.
1. The development of the Bone Pair Descriptor.
2. The application of the Distance Descriptor to the human action recognition problem.
3. The original experiments with the selection of joints, bones, and features.
The paper is organized as follows. Related works are discussed in Section 2. The proposed descriptors used for human action recognition are presented in Section 3. The dataset, classifiers, hardware, and the performed experiments are characterized in Section 4. Section 5 contains the conclusions and plans for future work related to this subject.

Related Work
In recent years, several reviews on vision-based human action recognition have been published, whose authors attempt to categorize the different techniques.
Most often, the methods are divided into solutions based on handcrafted features and deep learning approaches. Features are calculated from color images, depth maps, skeletons, or by combining multiple modalities.
Many algorithms use 3D data, most often in the form of depth maps. Depth maps facilitate the segmentation of the human silhouette in the case of a complex and heterogeneous background. In some approaches, spatial data is projected onto three orthogonal planes, corresponding to the front, side, and top views [7][8][9][10][11][12]. Depth motion maps (DMM), in various variants, are used in [8,10,11]. There are also solutions in which descriptors are built based on normals to the surface spanned over a three-dimensional point cloud [13,14]. In turn, the works in [15,16] describe algorithms based on the detection of points or regions of interest in a depth image.
Solutions using skeletal data can be classified into trajectory-based and pose-based [17]. The first group includes works in which multivariate time series, obtained from the space-time trajectories of joints, are recognized [18,19]. In the second group, features describing the relationships between the skeleton elements that determine a specific pose are employed. The works in [1,2,20,21] use joint locations, angles between them, or more complex relationships linking body parts.
The solutions proposed in [22][23][24] combine the skeletal data with local features extracted from depth images in the neighborhood of the projected joints. In [25], skeletal data were used together with histogram of oriented gradients (HOG) descriptors calculated for regions of interest defined in RGB and depth images.
Many methods exist for coding space-time data into two-dimensional images, which are then processed by convolutional networks. Several versions of chronological spatial-temporal images have been proposed. In these solutions, the columns represent the coded spatial configuration of the joints [27][28][29] or features derived from them [5,30,31,32], and the rows correspond to the time domain. Other coding methods involve DMM calculated for three orthogonal planes [7] or motion history images derived from RGB data [33] or skeletal data [4,34].
In [35][36][37][38], recurrent neural networks (RNN) and long short-term memory (LSTM) networks in various variants are used. The extension of the CNN in the time domain is proposed in [39,40]. The paper [41] presents a hierarchical approach consisting of the decomposition of complex actions into simple ones. Hybrid solutions, with different variants of CNN and LSTM networks connected sequentially, are used in [42][43][44][45][46]. In [47], a long-term recurrent convolutional network (LRCN), with jointly trained convolutional (spatial) and recurrent (temporal) parts, is used.
Depending on their complexity, human activities can be classified into gestures, actions, interactions, and group activities. Accordingly, in [48], methods were divided into single-layered and hierarchical. The former, designed to recognize simpler actions, use image sequences directly. In hierarchical approaches, complex activities are identified as combinations of simpler ones called subevents.
The problem remains open and challenging. We propose a pose-based approach to human action recognition that uses only skeletal data and descriptors combining the angular and positional relations of joints and the bones corresponding to them.

Proposed Method
Our approach to the classification of human actions is based on two descriptors: the Distance Descriptor and the Bone Pair Descriptor, which are characterized in the following subsections.

Distance Descriptor
The Distance Descriptor (DD) was first introduced in [49] for the recognition of static hand postures. It encodes the relationships between the mutual distances of skeletal joints. The calculation of DD requires only the 3D coordinates of the joints; no vectors or any other information are used. The descriptor can be computed for N joints by the following algorithm.

1. For each joint P_i, i = 1, ..., N:
(a) calculate the distances (Euclidean or city block) between P_i and the other joints P_j, j ≠ i;
(b) sort the joints P_j by the calculated distances, from the closest to the farthest;
(c) assign consecutive integers a_ij to the sorted joints P_j, starting from 1.
2. For each pair of joints P_i and P_j, i < j, calculate the feature f_ij = a_ij + a_ji.
3. Normalize the feature values to the interval [0, 1] by dividing each of them by 2(N − 1), the maximum possible value of f_ij.
Step 2 is performed not only to reduce the number of features. After this step, for each joint P_i, the descriptor reflects not only which of the remaining joints P_j are its closest neighbors but also which of the joints P_j consider P_i to be their closest neighbor.
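For illustration, the following is a minimal Python sketch of the algorithm as reconstructed above. The function name, tie-breaking behavior, and feature ordering are our assumptions; the authors' Matlab implementation may differ in such details.

```python
import numpy as np

def distance_descriptor(joints, metric="euclidean"):
    """Minimal sketch of the Distance Descriptor for one skeleton frame.

    joints: (N, 3) array of 3D joint coordinates.
    Returns N*(N-1)/2 features normalized to [0, 1].
    """
    n = len(joints)
    diff = joints[:, None, :] - joints[None, :, :]
    if metric == "euclidean":
        dist = np.linalg.norm(diff, axis=-1)
    else:  # city block
        dist = np.abs(diff).sum(axis=-1)

    # a[i, j]: rank of joint j in the closest-to-farthest ordering
    # of the neighbors of joint i (ranks start from 1).
    a = np.zeros((n, n), dtype=int)
    for i in range(n):
        neighbors = np.delete(np.arange(n), i)   # all joints except P_i
        order = np.argsort(dist[i, neighbors])   # sort by distance to P_i
        for rank, j in enumerate(neighbors[order], start=1):
            a[i, j] = rank

    # Symmetric pairwise features f_ij = a_ij + a_ji for i < j,
    # normalized by the maximum possible value 2(N - 1).
    iu, ju = np.triu_indices(n, k=1)
    return (a[iu, ju] + a[ju, iu]) / (2 * (n - 1))
```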
The calculation of DD for the whole skeleton is time-consuming and not very effective in terms of classification accuracy. Therefore, an input set of joints should be selected from the whole skeleton.

Bone Pair Descriptor
The Bone Pair Descriptor (BPD) encodes the angular relations between particular pairs of bones. It is based on the Point Pair Descriptor (PPD), first introduced in [50] for the recognition of static hand postures. PPD uses 3D joint coordinates, vectors pointed by fingers, and surface normals. BPD is a modification of PPD that uses bones as vectors, which allows human actions to be described based only on skeletal data, without surface normals. BPD can be calculated as follows. Let P_c be the central joint of the skeleton, b_c the central vector associated with the joint P_c, P_i the i-th non-central joint, and b_i the vector associated with that joint (Figure 1). The vectors b_c and b_i coincide with a bone or a part of the spine.
For the i-th non-central joint, three angular features are calculated:

α_i = arccos(v_i · b_i), (1)
φ_i = arccos(u · d_i), (2)
Θ_i = atan2(w_i · b_i, u · b_i), taken in the range [0, 2π), (3)

where d_i = (P_i − P_c)/‖P_i − P_c‖ is the unit vector from the central joint to the i-th non-central joint, and the vectors u, v_i, and w_i define the Darboux frame [52]:

u = b_c, v_i = (d_i × u)/‖d_i × u‖, w_i = u × v_i,

with · denoting the scalar product and × denoting the vector product (the vectors b_c and b_i are normalized to unit length). Let N be the number of non-central joints. The Bone Pair Descriptor consists of 3N features, calculated for each non-central joint using Formulas (1)-(3). Finally, the features are normalized to the interval [0, 1]. For this purpose, each feature is divided by its maximum possible value: π for the features α and φ, and 2π for the feature Θ.
BPD requires selecting, from the whole skeleton, the central joint P_c, the non-central joints P_i, and the joints determining the vectors (bones) b_c and b_i.
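The following Python sketch computes the three BPD features for a single pair (P_c, b_c), (P_i, b_i), assuming the Darboux-frame reconstruction given above; all names are illustrative and the original Matlab implementation may differ in details.

```python
import numpy as np

def bone_pair_features(p_c, b_c, p_i, b_i):
    """Sketch of the three BPD angles for one non-central joint.

    p_c, p_i: 3D coordinates of the central and non-central joints.
    b_c, b_i: bone vectors associated with those joints.
    Returns (alpha, phi, theta) normalized to [0, 1].
    """
    u = b_c / np.linalg.norm(b_c)
    t = b_i / np.linalg.norm(b_i)
    # Unit vector from the central joint to the non-central joint.
    d = (p_i - p_c) / np.linalg.norm(p_i - p_c)

    # Darboux frame (degenerate, collinear cases are ignored here).
    v = np.cross(d, u)
    v = v / np.linalg.norm(v)
    w = np.cross(u, v)

    alpha = np.arccos(np.clip(np.dot(v, t), -1.0, 1.0))
    phi = np.arccos(np.clip(np.dot(u, d), -1.0, 1.0))
    theta = np.arctan2(np.dot(w, t), np.dot(u, t)) % (2 * np.pi)

    # Normalize by the maximum possible values: pi, pi, and 2*pi.
    return np.array([alpha / np.pi, phi / np.pi, theta / (2 * np.pi)])
```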
The Matlab scripts for Distance Descriptor and Bone Pair Descriptor can be downloaded from our website [53].

Dataset, Classifiers and Hardware
The experiments were performed using the UTD-MHAD dataset [3], recorded with a Microsoft Kinect sensor. It contains 27 actions performed by 8 subjects (4 women and 4 men). Each subject repeated each action 3 or 4 times, which gives 861 sequences in total. The average action length is 68 frames. UTD-MHAD is a challenging dataset because of the large number of action classes and the similarities between some of them.
The 27 action classes of the UTD-MHAD dataset are listed in [3]. We used only skeletal data for action recognition. The whole skeleton of the UTD-MHAD dataset consists of 20 joints, as presented in Figure 2 (left image). The subsets of joints used as input for the Distance Descriptor and the Bone Pair Descriptor were selected experimentally: we tested different configurations, and the chosen subsets yielded the best results in terms of recognition rate and computation time.
For the evaluation of our method, we used the protocol suggested by the authors of the UTD-MHAD dataset [3]: subjects 1, 3, 5, and 7 were used as the training set, and subjects 2, 4, 6, and 8 as the testing set.
In our experiments, we used five different classifiers: (1) dynamic time warping with the Euclidean distance (DTW-euc), (2) dynamic time warping with the city block distance (DTW-cb), (3) a fully convolutional network (FCN), (4) a bidirectional long short-term memory network (BiLSTM), and (5) LogDet divergence-based metric learning with triplet constraints (LDMLT). DTW-euc and DTW-cb are classic methods for time series classification that use dynamic programming to calculate the distance between two nonlinearly aligned sequences. BiLSTM [55] and FCN [56] are well-known deep learning methods that have been successfully used, e.g., in speech recognition [57] and networking [58]. LDMLT is a relatively new method based on the Mahalanobis distance learned using so-called triplet constraints [59]. The output of DTW-euc, DTW-cb, and LDMLT is not a class label but a distance between two given sequences (each testing sequence has to be compared with each training sequence). Therefore, a K-nearest neighbors classifier has to be applied, which searches for the class represented by the majority of the K nearest neighbors. Unless stated otherwise, we set K to 1.
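As an illustration of this distance-based classification scheme, the following Python sketch implements classic DTW with a K-nearest neighbors vote. It is a generic textbook formulation under our own naming, not the exact implementation used in the experiments.

```python
import numpy as np

def dtw_distance(a, b, metric="euclidean"):
    """Classic DTW between two feature sequences a (m, d) and b (n, d)."""
    if metric == "euclidean":
        cost = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    else:  # city block
        cost = np.abs(a[:, None, :] - b[None, :, :]).sum(axis=-1)
    m, n = cost.shape
    acc = np.full((m + 1, n + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            # Accumulated cost of the best warping path to cell (i, j).
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    return acc[m, n]

def knn_dtw_classify(test_seq, train_seqs, train_labels, k=1):
    """Compare the test sequence with every training sequence, then vote."""
    dists = [dtw_distance(test_seq, s) for s in train_seqs]
    nearest = np.argsort(dists)[:k]
    votes = [train_labels[i] for i in nearest]
    return max(set(votes), key=votes.count)  # majority of K neighbors
```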
The experiments were performed using Matlab R2019a software on a PC with Intel Core i7-4710HQ, 2.5 GHz CPU, and 16 GB RAM.

Experimental Results
We started our experiments with a comparison of the classifiers. In Table 1, we present the parameter configurations and recognition rates for each classifier. The parameters were chosen experimentally, i.e., by changing their values and observing whether the recognition rate improved. The LDMLT classifier yielded the best result, with a large advantage over the other methods. Therefore, we used this classifier in the subsequent experiments.
We evaluated our method using feature vectors consisting of DD features alone, BPD features alone, and the concatenation of DD and BPD features. The results for various numbers of neighbors K are shown in Table 2. The highest recognition rate of DD alone is 86.3%, while BPD alone achieved 87.2%. The combination of DD and BPD features led to an improvement of more than 5 percentage points, yielding 92.6% accuracy for K = 2. This result confirms that the positional information of DD and the angular information of BPD complement each other, improving the overall recognition rate.
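For clarity, the following sketch shows how the per-frame DD and BPD features could be concatenated into the multivariate time series fed to the classifiers, reusing the two sketches above. The argument conventions (joint indices, bone index pairs) are our assumptions, not the paper's exact configuration.

```python
import numpy as np

def action_feature_sequence(seq, dd_joints, pc, bc_pair, noncentral):
    """Per-frame concatenation of DD and BPD features for one action.

    seq: (T, 20, 3) UTD-MHAD skeleton sequence.
    dd_joints: indices of the joints fed to the Distance Descriptor.
    pc: index of the central joint; bc_pair: joint indices defining b_c.
    noncentral: list of (joint index, (joint indices defining b_i)).
    Assumes distance_descriptor and bone_pair_features from the
    sketches above are in scope.
    """
    frames = []
    for joints in seq:
        dd = distance_descriptor(joints[dd_joints])
        b_c = joints[bc_pair[1]] - joints[bc_pair[0]]
        bpd = np.concatenate([
            bone_pair_features(joints[pc], b_c, joints[pi],
                               joints[b1] - joints[b0])
            for pi, (b0, b1) in noncentral
        ])
        frames.append(np.concatenate([dd, bpd]))
    return np.stack(frames)  # (T, features) input for the classifiers
```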
As the Bone Pair Descriptor consists of three different features (calculated for each non-central bone and the central bone), we tried removing one or two of them from the feature vectors. Θ turned out to be the least effective, and its removal did not affect the highest recognition rate, which remained 92.6% for K = 2. Using two BPD features instead of three results in less time-consuming feature extraction, learning, and classification; therefore, DD + BPD (α, φ) can be considered the best configuration. We also tried calculating DD with the city block distance instead of the Euclidean distance; however, the results were very similar (slightly lower).
To analyze which actions are most often misclassified, we calculated the confusion matrix for the best configuration: DD + BPD (α, φ). The matrix is presented in Figure 3. As one can see, many actions were recognized perfectly. There is only one action, "right hand knock on door", with a recognition rate below 50%. It was confused with "right hand wave" and "right arm throw" four times and with "right hand catch an object" twice. Another common misclassification is confusing "jogging in place" with "walking in place". Moreover, "right hand draw circle (counterclockwise)" was confused with "right arm swipe to the left" and "draw triangle" four times. All of these mistakes occur for actions that, in the case of some subjects, are difficult to differentiate even for a human. The actions that are visually very different from each other were recognized with almost 100% accuracy.
In Table 3, we present a comparison of our approach with other existing methods that use only the skeletal data of the UTD-MHAD dataset. The proposed algorithm outperforms five of the listed methods, almost equaling the best method found, the Bayesian Hierarchical Dynamic Model (HDM) [6].

Table 3. Comparison of the proposed method with other existing methods using the UTD-MHAD dataset.

Method                                   Recognition Rate [%]
Label Consistent K-SVD [2,4]             76.2
Covariance Joint Descriptors [1,4]       85.6
Optical Spectra-based CNN [34]           87
Joint Trajectory Maps [4]                87.9
Joint Distance Maps [5]                  88.1
Our method (DD + BPD)                    92.6
Bayesian HDM [6]                         92.8

The average time of extracting features from an action (using the best configuration: DD + BPD (α, φ)) is ~600 ms, and the average classification time using LDMLT is ~300 ms. Therefore, the average recognition time of a single action is below 1 s. The total training time is about 550 s. Most of the works presenting methods that use only skeletal data (see Table 3) do not report the computational times of their algorithms. The only exception is Hou et al. [34], whose Optical Spectra-based CNN achieved a very short recognition time of about 40 ms on a PC with an Intel Core i7-4790, 4 GHz CPU, and an NVIDIA TITAN X GPU (some parts of the code were run on the GPU). However, the total training time of this method is about 880 s, which is relatively long. It is also worth noting that Optical Spectra-based CNN achieved a recognition rate 5.6 percentage points lower than our method, despite being much faster in terms of recognition time.

Conclusions
In this paper, we proposed a recognition method for human actions using skeletal data. Two descriptors, originally intended for hand posture recognition, were successfully adapted for the classification of body skeleton sequences. One of these descriptors, BPD, was modified to replace surface normals and vectors pointed by fingers with vectors representing bones. Configurations of joints and bones, used as input data for the descriptors, were selected. The experimental tests were performed using five classifiers and different configurations of features. Our method achieved a high recognition rate compared to other existing methods, which confirms its usefulness.
The proposed method does not require specific lighting conditions, a specific background, or any special outfit or equipment, e.g., inertial sensors or gloves. Moreover, it is fast enough to run in real time. However, the implementation can be optimized in terms of recognition time, which can be a subject for future work. Future work may also include combining skeletal data with depth maps/point clouds and color/gray images to develop more effective feature vectors. Such a combination could improve distinguishing between actions like "knock on door", "hand wave", and "arm throw", as features based on point clouds and color images, unlike skeletal descriptors, are able to capture hand shapes. Another future study topic may be the expansion of training datasets by generating additional action sequences based on the existing data.
Author Contributions: Conceptualization and methodology, D.W. and T.K.; software, D.W. and T.K.; validation, experiments design, and discussion of the results, D.W.; writing-original draft preparation, and review and editing, D.W. and T.K. All authors have read and agreed to the published version of the manuscript.
Funding: This project is financed by the Minister of Science and Higher Education of the Republic of Poland within the "Regional Initiative of Excellence" program for years 2019-2022. Project number: 027/RID/2018/19, amount granted: 11 999 900 PLN.

Conflicts of Interest:
The authors declare no conflicts of interest.