# Skeleton-Based Human Action Recognition through Third-Order Tensor Representation and Spatio-Temporal Analysis

## Abstract


## 1. Introduction

## 2. Materials and Methods

#### 2.1. Modeling of Human Actions through Mode-n Singular Values

#### 2.2. Feature Fusion through Discriminant Correlation Analysis and Classification

## 3. Results

Two evaluation protocols were adopted for the MSRC-12 dataset: firstly, a fixed split in which 37% of the data was used for training and 63% for testing (Table 2—MSRC-12 ^{1}), and secondly a modality-based “leave-persons-out” protocol as proposed in [27] (Table 2—MSRC-12 ^{2}). In the second case, for each of the instruction modalities of the MSRC-12 dataset, the minimum subject subset containing all the gestures was kept as the test set and all the remaining subjects were used for training.
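The “leave-persons-out” split described above can be sketched as a greedy set-cover heuristic: repeatedly pick the subject whose recordings cover the most still-uncovered gestures until every gesture appears in the test set. This is an illustrative reading of the protocol, not the authors' code; the function and variable names are hypothetical.

```python
def leave_persons_out_split(subject_gestures):
    """subject_gestures: dict mapping subject id -> set of gesture labels.
    Returns (train_subjects, test_subjects), where the test set is a small
    subject subset jointly covering every gesture (greedy set cover)."""
    all_gestures = set().union(*subject_gestures.values())
    uncovered, test_subjects = set(all_gestures), []
    while uncovered:
        # pick the subject covering the most still-uncovered gestures
        best = max(subject_gestures,
                   key=lambda s: len(subject_gestures[s] & uncovered))
        test_subjects.append(best)
        uncovered -= subject_gestures[best]
    train_subjects = [s for s in subject_gestures if s not in test_subjects]
    return train_subjects, test_subjects
```

Note that a greedy cover is not guaranteed to be the true minimum subset, but it is the usual practical approximation for this kind of split.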

#### 3.1. Defining the Number of MSVs

#### 3.2. Contribution of Different Feature Representations and Fusion Results

#### 3.3. Comparison with State-of-the-Art Approaches

## 4. Discussion and Conclusions

## Author Contributions

## Funding

## Conflicts of Interest

## References

1. Han, F.; Reily, B.; Hoff, W.; Zhang, H. Space-time representation of people based on 3D skeletal data: A review. *Comput. Vis. Image Underst.* **2017**, 158, 85–105.
2. Lokare, N.; Zhong, B.; Lobaton, E. Activity-Aware Physiological Response Prediction Using Wearable Sensors. *Inventions* **2017**, 2, 32.
3. Ramanathan, M.; Yau, W.Y.; Teoh, E.K. Human action recognition with video data: Research and evaluation challenges. *IEEE Trans. Hum. Mach. Syst.* **2014**, 44, 650–663.
4. Han, J.; Shao, L.; Xu, D.; Shotton, J. Enhanced computer vision with Microsoft Kinect sensor: A review. *IEEE Trans. Cybern.* **2013**, 43, 1318–1334.
5. Ngo, T.T.; Makihara, Y.; Nagahara, H.; Mukaigawa, Y.; Yagi, Y. Similar gait action recognition using an inertial sensor. *Pattern Recognit.* **2015**, 48, 1289–1301.
6. Chen, C.; Jafari, R.; Kehtarnavaz, N. A survey of depth and inertial sensor fusion for human action recognition. *Multimed. Tools Appl.* **2017**, 76, 4405–4425.
7. Kim, E.; Helal, S.; Cook, D. Human activity recognition and pattern discovery. *IEEE Pervasive Comput.* **2010**, 9, 48.
8. Rabiner, L.R.; Juang, B.H. An introduction to hidden Markov models. *IEEE ASSP Mag.* **1986**, 3, 4–16.
9. Hidden Markov Model. Available online: https://www.cs.ubc.ca/~murphyk/Software/HMM/hmm.html (accessed on 30 May 2016).
10. Liu, K.; Chen, C.; Jafari, R.; Kehtarnavaz, N. Fusion of inertial and depth sensor data for robust hand gesture recognition. *IEEE Sens. J.* **2014**, 14, 1898–1903.
11. Ofli, F.; Chaudhry, R.; Kurillo, G.; Vidal, R.; Bajcsy, R. Berkeley MHAD: A comprehensive multimodal human action database. In Proceedings of the 2013 IEEE Workshop on Applications of Computer Vision (WACV), Tampa, FL, USA, 15–17 January 2013; pp. 53–60.
12. Kosmopoulos, D.I.; Doulamis, N.D.; Voulodimos, A.S. Bayesian filter-based behavior recognition in workflows allowing for user feedback. *Comput. Vis. Image Underst.* **2012**, 116, 422–434.
13. Xia, L.; Chen, C.C.; Aggarwal, J.K. View invariant human action recognition using histograms of 3D joints. In Proceedings of the 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, Providence, RI, USA, 16–21 June 2012; pp. 20–27.
14. Conditional Random Field. Available online: https://www.cs.ubc.ca/~murphyk/Software/CRF/crf.html (accessed on 30 May 2016).
15. Zhou, L.; Li, W.; Zhang, Y.; Ogunbona, P.; Nguyen, D.T.; Zhang, H. Discriminative key pose extraction using extended LC-KSVD for action recognition. In Proceedings of the 2014 International Conference on Digital Image Computing: Techniques and Applications (DICTA), Wollongong, Australia, 25–27 November 2014; pp. 1–8.
16. Sharaf, A.; Torki, M.; Hussein, M.E.; El-Saban, M. Real-time multi-scale action detection from 3D skeleton data. In Proceedings of the 2015 IEEE Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 5–9 January 2015; pp. 998–1005.
17. Meshry, M.; Hussein, M.E.; Torki, M. Linear-time online action detection from 3D skeletal data using bags of gesturelets. In Proceedings of the 2016 IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Placid, NY, USA, 7–9 March 2016; pp. 1–9.
18. Patrona, F.; Chatzitofis, A.; Zarpalas, D.; Daras, P. Motion analysis: Action detection, recognition and evaluation based on motion capture data. *Pattern Recognit.* **2018**, 76, 612–622.
19. Du, Y.; Wang, W.; Wang, L. Hierarchical recurrent neural network for skeleton based action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 1110–1118.
20. Hou, Y.; Li, Z.; Wang, P.; Li, W. Skeleton optical spectra-based action recognition using convolutional neural networks. *IEEE Trans. Circuits Syst. Video Technol.* **2018**, 28, 807–811.
21. Bilen, H.; Fernando, B.; Gavves, E.; Vedaldi, A. Action recognition with dynamic image networks. *IEEE Trans. Pattern Anal. Mach. Intell.* **2017**, 40, 12.
22. Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. *IEEE Trans. Pattern Anal. Mach. Intell.* **2018**, 40, 834–848.
23. Jain, A.; Zamir, A.R.; Savarese, S.; Saxena, A. Deep learning on spatio-temporal graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 5308–5317.
24. Shi, Z.; Kim, T.K. Learning and refining of privileged information-based RNNs for action recognition from depth sequences. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017.
25. Zhu, W.; Lan, C.; Xing, J.; Zeng, W.; Li, Y.; Shen, L.; Xie, X. Co-occurrence feature learning for skeleton based action recognition using regularized deep LSTM networks. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, Phoenix, AZ, USA, 12–17 February 2016.
26. Liu, J.; Shahroudy, A.; Xu, D.; Kot, A.C.; Wang, G. Skeleton-based action recognition using spatio-temporal LSTM network with trust gates. *IEEE Trans. Pattern Anal. Mach. Intell.* **2018**, 40, 3007–3021.
27. Konstantinidis, D.; Dimitropoulos, K.; Daras, P. Skeleton-based action recognition based on deep learning and Grassmannian pyramids. In Proceedings of the 2018 26th European Signal Processing Conference (EUSIPCO), Rome, Italy, 3–7 September 2018; pp. 2045–2049.
28. Ke, Q.; Bennamoun, M.; An, S.; Sohel, F.; Boussaid, F. A new representation of skeleton sequences for 3D action recognition. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 4570–4579.
29. Kim, T.K.; Wong, S.F.; Cipolla, R. Tensor canonical correlation analysis for action classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Minneapolis, MN, USA, 17–22 June 2007; pp. 1–8.
30. Vasilescu, M.A.O.; Terzopoulos, D. Multilinear analysis of image ensembles: TensorFaces. In Proceedings of the 7th European Conference on Computer Vision, Copenhagen, Denmark, 28–31 May 2002; pp. 447–460.
31. Koniusz, P.; Cherian, A.; Porikli, F. Tensor representations via kernel linearization for action recognition from 3D skeletons. In Proceedings of the 14th European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 11–14 October 2016; pp. 37–53.
32. Dimitropoulos, K.; Barmpoutis, P.; Kitsikidis, A.; Grammalidis, N. Classification of multidimensional time-evolving data using histograms of Grassmannian points. *IEEE Trans. Circuits Syst. Video Technol.* **2018**, 28, 892–905.
33. Dimitropoulos, K.; Barmpoutis, P.; Kitsikidis, A.; Grammalidis, N. Extracting dynamics from multi-dimensional time-evolving data using a bag of higher-order linear dynamical systems. In Proceedings of the International Conference on Computer Vision Theory and Applications, Rome, Italy, 27–29 February 2016; pp. 683–688.
34. Halko, N.; Martinsson, P.G.; Tropp, J.A. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. *SIAM Rev.* **2011**, 53, 217–288.
35. Hackbusch, W.; Uschmajew, A. On the interconnection between the higher-order singular values of real tensors. *Numer. Math.* **2017**, 135, 875–894.
36. Padhy, S.; Dandapat, S. Third-order tensor based analysis of multilead ECG for classification of myocardial infarction. *Biomed. Signal Process. Control* **2017**, 31, 71–78.
37. Haghighat, M.; Abdel-Mottaleb, M.; Alhalabi, W. Discriminant correlation analysis for feature level fusion with application to multimodal biometrics. In Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, China, 20–25 March 2016; pp. 1866–1870.
38. Haghighat, M.; Abdel-Mottaleb, M.; Alhalabi, W. Discriminant correlation analysis: Real-time feature level fusion for multimodal biometric recognition. *IEEE Trans. Inf. Forensics Secur.* **2016**, 11, 1984–1996.
39. Oniga, S.; Suto, J. Human activity recognition using neural networks. In Proceedings of the 2014 15th International Carpathian Control Conference (ICCC), Velke Karlovice, Czech Republic, 28–30 May 2014; pp. 403–406.
40. Bloom, V.; Makris, D.; Argyriou, V. G3D: A gaming action dataset and real time action recognition evaluation framework. In Proceedings of the 2012 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Providence, RI, USA, 16–21 June 2012; pp. 7–12.
41. Microsoft Research Cambridge-12 Kinect Gesture Data Set. Available online: https://www.microsoft.com/en-us/download/details.aspx?id=52283 (accessed on 14 January 2019).
42. Ten Holt, G.A.; Reinders, M.J.; Hendriks, E.A. Multi-dimensional dynamic time warping for gesture recognition. In Proceedings of the Thirteenth Annual Conference of the Advanced School for Computing and Imaging, Heijen, The Netherlands, 13–15 June 2007; Volume 300, p. 1.
43. Deep Neural Network. Available online: http://www.mathworks.com/matlabcentral/fileexchange/42853-deep-neural-network (accessed on 30 May 2016).
44. Jhuang, H.; Gall, J.; Zuffi, S.; Schmid, C.; Black, M.J. Towards understanding action recognition. In Proceedings of the IEEE International Conference on Computer Vision, Sydney, Australia, 1–8 December 2013; pp. 3192–3199.
45. Ionescu, C.; Papava, D.; Olaru, V.; Sminchisescu, C. Human3.6M: Large scale datasets and predictive methods for 3D human sensing in natural environments. *IEEE Trans. Pattern Anal. Mach. Intell.* **2014**, 36, 1325–1339.

**Figure 2.** Tensor representation of the skeleton and skeletal joints in 3D and 2D (xz-projection) graphs, respectively, for the extraction of spatial correlations (**a**,**b**) and of time-evolving correlations (**c**). Representation of the skeletal joints (**a**) in the initial position, (**b**) in the right-kick position, and (**c**) over the period from the initial position to the right kick.

**Figure 3.** Distributions of the three mode singular values (SVs) for the different human actions of the CERTH database [32]. The first two columns show the proposed spatial descriptors for $t=20$ and $t=40$; the third column shows the proposed temporal descriptor. The blue line corresponds to the mode-1 SVs, the black line to the mode-2 SVs, and the red line to the mode-3 SVs.

**Figure 4.** Action recognition rates of the (**a**) spatial descriptor and (**b**) temporal descriptor using the first five to fifteen SVs, applying the proposed method to the CERTH dataset.

**Figure 5.**Contribution of different feature representations and fusion results through discriminant correlation analysis for CERTH, G3D, and MSRC-12 databases.
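A hedged sketch of the between-class scatter step of discriminant correlation analysis (DCA) used for the fusion above, following the symbols of the paper: $S_{bds}=\Phi\Phi^{T}$, where the columns of $\Phi$ are $\sqrt{n_i}\,(\overline{ds}_i-\overline{ds})$, the small matrix $\Phi^{T}\Phi$ is eigen-decomposed into $P$ and $\hat{\Lambda}$, and a transformation $W_{bds}$ is formed that unitizes $S_{bds}$. This is not the authors' implementation; names and the exact eigenvalue scaling are illustrative, chosen here so that $W^{T} S_b W$ is numerically the identity.

```python
import numpy as np

def dca_between_class_whitening(X, labels):
    """X: (d, N) feature matrix; labels: (N,) class labels.
    Returns W such that W.T @ S_b @ W is (approximately) the identity,
    with S_b the between-class scatter matrix."""
    classes = np.unique(labels)
    mu = X.mean(axis=1, keepdims=True)
    # columns of Phi: sqrt(n_c) * (class mean - overall mean), so S_b = Phi @ Phi.T
    cols = [np.sqrt((labels == c).sum()) *
            (X[:, labels == c].mean(axis=1) - mu.ravel()) for c in classes]
    Phi = np.column_stack(cols)
    # eigen-decompose the small c x c matrix Phi.T @ Phi (same nonzero spectrum as S_b)
    lam, P = np.linalg.eigh(Phi.T @ Phi)
    keep = lam > 1e-10 * lam.max()        # rank(S_b) <= c - 1: drop null directions
    W = Phi @ P[:, keep] / lam[keep]      # each column scaled by its eigenvalue
    return W
```

Working in the $c\times c$ matrix $\Phi^{T}\Phi$ instead of the $d\times d$ scatter matrix is the usual trick when the feature dimension $d$ is much larger than the number of classes $c$.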

^{1} 37% of the MSRC-12 dataset was used for training and 63% for testing.

^{2} The “leave-persons-out” protocol was used.

| Symbol | Definition |
|---|---|
| $Y$ | observed frame data |
| $S$ | core tensor |
| $Y_{(n)}$ | mode-$n$ unfolding |
| $U$ | orthogonal matrix |
| $\sigma^{(n)}$ | mode-$n$ singular values |
| $D_{s}$ | spatial representation of the three mode MSVs |
| $D_{t}$ | temporal representation of the three mode MSVs |
| $f$ | number of spatial and temporal descriptors for each human action sequence |
| $D_{S}$ | feature vector (term-frequency histogram) for spatial analysis |
| $D_{T}$ | feature vector (term-frequency histogram) for temporal analysis |
| $ds_{ij}$ | spatial feature vector corresponding to the $j$th sample of the $i$th class |
| $\overline{ds}_{i}$ | mean of the $ds_{ij}$ vectors in the $i$th class |
| $\overline{ds}$ | mean of the whole feature set |
| $S_{bds}$ | between-class scatter matrix |
| $P$ | matrix of orthogonal eigenvectors |
| $\hat{\Lambda}$ | diagonal matrix of real, non-negative eigenvalues |
| $W_{bds}$ | transformation that unitizes $S_{bds}$ |
| $D_{S}^{\prime}$ | reduced-dimensionality feature vector for spatial analysis |
| $D_{T}^{\prime}$ | reduced-dimensionality feature vector for temporal analysis |
| $W_{DS}$ | transformation matrix for the spatial feature vectors |
| $W_{DT}$ | transformation matrix for the temporal feature vectors |
| $\acute{D}_{S}$ | transformed spatial feature |
| $\acute{D}_{T}$ | transformed temporal feature |
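As an illustration of the tensor quantities defined above, the mode-$n$ singular values $\sigma^{(n)}$ of a third-order tensor $Y$ are the singular values of each mode-$n$ unfolding $Y_{(n)}$. A minimal NumPy sketch (not the authors' implementation; function names are illustrative):

```python
import numpy as np

def mode_n_unfold(Y, n):
    """Mode-n unfolding Y_(n): mode n becomes the row dimension."""
    return np.moveaxis(Y, n, 0).reshape(Y.shape[n], -1)

def mode_singular_values(Y):
    """Return [sigma^(1), sigma^(2), sigma^(3)] for a third-order tensor Y."""
    return [np.linalg.svd(mode_n_unfold(Y, n), compute_uv=False)
            for n in range(Y.ndim)]
```

For every mode, the squared mode-$n$ singular values sum to the squared Frobenius norm of $Y$, which is a quick sanity check on the unfolding.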

| Method | CERTH | G3D | MSRC-12 ^{1} | MSRC-12 ^{2} |
|---|---|---|---|---|
| Dynamic Time Warping | 87.5% | 57% | 48.12% | - |
| Hidden Markov Model | 96.25% | 77.4% | 76.2% | - |
| Restricted Boltzmann Machine | 97.1% | 84% | 79.8% | - |
| Conditional Random Fields | 97.91% | 69.25% | 67.95% | - |
| Histograms of Grassmannian Points | 98.61% | 90.75% | 80.15% | - |
| Multi-Scale Action Detection | - | - | - | 63.9% |
| Bags of Gesturelets | - | - | - | 87.1% |
| Extended Gesturelets | - | - | - | 91.2% |
| LSTM and Grassmannian Pyramids | - | 92.38% | - | 94.6% |
| Proposed | 100% | 92.6% | 83.1% | 92.8% |

© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Barmpoutis, P.; Stathaki, T.; Camarinopoulos, S.
Skeleton-Based Human Action Recognition through Third-Order Tensor Representation and Spatio-Temporal Analysis. *Inventions* **2019**, *4*, 9.
https://doi.org/10.3390/inventions4010009
