Symmetry
  • Technical Note
  • Open Access

Published: 21 April 2015

Motion Key-Frame Extraction by Using Optimized t-Stochastic Neighbor Embedding

Yi Yao, Qiang Zhang, Dongsheng Zhou and Rui Liu

Key Laboratory of Advanced Design and Intelligent Computing, Dalian University, Ministry of Education, Dalian 116622, China

* Author to whom correspondence should be addressed.

Abstract

Key-frame extraction technology has been widely used in the field of human motion synthesis, and efficient and accurate key-frame extraction methods can improve the accuracy of motion synthesis. In this paper, we use an optimized t-Stochastic Neighbor Embedding (t-SNE for short) algorithm to reduce the dimensionality of the motion data and, on this basis, extract the key-frames. The experimental results show that this method outperforms existing methods on the same experimental data.

1. Introduction

Intelligent algorithms for synthesizing human motion capture data are useful in many fields, such as entertainment, education and training simulators. At the same time, more and more human motion data are becoming freely available to the public, such as the CMU database []. These open data enable researchers to synthesize human motions with some simple applications. In order to make full use of these existing data, many methods of human motion synthesis have been proposed. The motion graph is a typical method for synthesizing human motion data, and key-frame extraction has become almost fundamental to reprocessing such existing motion data. Figure 1 shows an example of key-frames.

For example, the motion graph emerged as a very promising technique for automatically synthesizing human motion data []. A motion graph is built from many motion capture clips, with transition motions added between similar frames in these clips. Once constructed, a path through the graph represents a multi-behavior motion. In order to construct the motion graph, most researchers extract key-frames and calculate the similarity between them; key-frame extraction is thus fundamental to motion graph construction. Standard methods calculate the similarity between all motion frames, which in theory becomes increasingly time-consuming and inefficient as the number of motion frames grows. Some researchers have therefore aimed to present more efficient solutions. For example, Krüger et al. [] introduced a method for fast local and global similarity searches in large motion capture databases, using a spatial data structure (a kd-tree) to restrict the search to the k most similar poses in a k-nearest-neighbor (knn) search; a sketch of such a search is given below. On the other hand, if the inter-frame similarity is calculated over entire motion sequences, frames that satisfy the similarity threshold may not respect physical correctness. For example, when synthesizing a motion from walking to running, the intermediate states of the two movements are similar, and the transition frames that satisfy the threshold may occur while the feet are off the ground. In this case, the transition is placed at a physically implausible pose, and the synthesized motion is seriously distorted. To prevent this problem, we need to extract key-frames and calculate similarity based on those motions.
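To make the kd-tree idea concrete, here is a minimal sketch of a k-nearest-pose search, assuming each frame is flattened to a pose vector. This is an illustration in the spirit of Krüger et al., not their implementation (which uses specially designed pose features); the array shapes and variable names are our assumptions.

```python
import numpy as np
from scipy.spatial import cKDTree

# Hypothetical motion database: each row is one frame as a 96-D pose vector.
poses = np.random.rand(10000, 96)

tree = cKDTree(poses)        # spatial index over all frames

query = poses[42]            # an arbitrary query pose
k = 16                       # number of similar poses to retrieve
dists, idx = tree.query(query, k=k)   # k-nearest-neighbor search
print(idx)                   # frame indices of the k most similar poses
```

Restricting the comparison to these k candidates avoids computing the similarity between all pairs of frames.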

In this paper, we use an optimized manifold learning method based on t-SNE, first presented in [], to effectively extract key-frames. This method has the following advantages. Firstly, we use parameters of the standard t-SNE algorithm optimized for human motion data. Secondly, key-frames are selected according to low-dimensional motion posture characteristic curves, which are mapped from the high-dimensional motion data. In this way, the method not only ensures reliable and accurate calculations but also improves the efficiency of motion graph construction.

The experimental results show that the optimized t-SNE nonlinear manifold learning extracts key-frames faster and more accurately than previous methods based on PCA or ISOMAP []. In order to demonstrate the validity of the algorithm, we analyzed and contrasted the results in comparative experiments.

3. Algorithm Description

Stochastic Neighbor Embedding (SNE) describes the similarity of data point $x_i$ to data point $x_j$ with the conditional probability $p_{ij}$ []:

$$p_{ij} = \frac{\exp(-d_{ij}^2)}{\sum_{k \neq i} \exp(-d_{ik}^2)}$$

where $d_{ij}^2$ can be regarded as a scaled squared Euclidean distance between the two high-dimensional points $x_i$ and $x_j$:

$$d_{ij}^2 = \frac{\| x_i - x_j \|^2}{2\sigma_i^2}$$

where $\sigma_i$ is the variance of the Gaussian centered on $x_i$. For the details of SNE, see [].
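For concreteness, the two formulas above translate directly into the following NumPy sketch. The vectorized form and fixed per-point bandwidths are our own simplifications; in practice each σi is found by a binary search to match a target perplexity.

```python
import numpy as np

def sne_affinities(X, sigma):
    """Conditional probabilities p_{ij} from the SNE formulas above.

    X     : (N, D) array of high-dimensional points.
    sigma : (N,) array of per-point Gaussian bandwidths.
    """
    # pairwise squared Euclidean distances ||x_i - x_j||^2
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    d2 = sq_dists / (2.0 * sigma[:, None] ** 2)   # d_ij^2 as defined above
    np.fill_diagonal(d2, np.inf)                  # exclude the k = i terms
    P = np.exp(-d2)
    P /= P.sum(axis=1, keepdims=True)             # normalize over k != i
    return P
```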

Although SNE constructs reasonable embeddings and yields good visualizations, its cost function is difficult to optimize. For this reason, van der Maaten and Hinton presented a new method called t-Distributed Stochastic Neighbor Embedding (t-SNE for short) that improves on this algorithm [,,].

The standard t-SNE algorithm takes an N × D matrix as its data set, where the rows correspond to the N instances and the columns to the D dimensions. If labels are specified, the code plots the intermediate solution every ten iterations; the labels are used only for visualizing these intermediate solutions, since t-SNE is an unsupervised dimensionality reduction technique. The dimensionality of the visualization constructed by t-SNE can be specified through the parameter no_dims, which the standard t-SNE implementation sets to 2 [].
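As a concrete illustration of this interface (a sketch, not the authors' Matlab code), the following uses scikit-learn's TSNE on an N × D matrix; the parameter names are scikit-learn's, with n_components playing the role of no_dims.

```python
import numpy as np
from sklearn.manifold import TSNE

X = np.random.rand(500, 96)   # hypothetical N x D data matrix

# n_components corresponds to no_dims; the standard setting is 2.
embedding = TSNE(n_components=2, perplexity=30.0).fit_transform(X)
print(embedding.shape)        # (500, 2)
```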

In this paper, we use the optimized t-SNE algorithm to extract key-frames. The technique is a variation of Stochastic Neighbor Embedding that is much easier to optimize and produces significantly better visualizations by reducing the tendency to crowd points together in the center of the map. Firstly, we project the high-dimensional human motion data to a low-dimensional space using the optimized t-SNE algorithm. The data sets we used are N × D matrices, where the N rows correspond to the frames of human motion data and the columns to the D dimensions (D = 96 by default). We also set the perplexity of the Gaussian distributions employed in the high-dimensional space to 30, to fit human motion data. In this way, we obtain an N × 3 matrix that specifies the coordinates of the low-dimensional data points. Secondly, a characteristic curve is drawn from the low-dimensional data set as a reflection of the high-dimensional motion data. The three-dimensional embedding reflects the direction in which the human motion evolves, but extracting key-frames requires the shifting trends of the motion. Therefore, we obtain the characteristic curve by interpolating the low-dimensional data; the result of the interpolation reflects these shifting trends. The inflection points of the characteristic curve and its intersections with the x-axis represent the key-frame poses of the high-dimensional human motion, and these points form the selection criterion of the algorithm. Thirdly, the key-frames of the motion segment are determined from the characteristic curve: we take a key-frame at each inflection point of the curve and at each intersection of the curve with the x-axis. A sketch of this selection step is given below.
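The paper does not spell out the exact curve construction, so the sketch below is one plausible reading under stated assumptions: the characteristic curve is a spline interpolation of one coordinate of the N × 3 embedding over frame time, and key-frames are taken at its x-axis crossings and inflection points. The function name and sampling density are illustrative.

```python
import numpy as np
from scipy.interpolate import CubicSpline

def keyframes_from_curve(y):
    """Frame indices at x-axis crossings and inflection points of curve y."""
    t = np.arange(len(y))
    spline = CubicSpline(t, y)                  # interpolate the low-dim data
    dense_t = np.linspace(0, len(y) - 1, 10 * len(y))
    vals = spline(dense_t)                      # characteristic curve
    curv = spline(dense_t, 2)                   # second derivative

    # x-axis intersections: sign changes of the curve itself
    cross = dense_t[np.where(np.diff(np.sign(vals)) != 0)[0]]
    # inflection points: sign changes of the second derivative
    inflect = dense_t[np.where(np.diff(np.sign(curv)) != 0)[0]]

    return np.unique(np.round(np.concatenate([cross, inflect])).astype(int))

# e.g., with `embedding` an N x 3 t-SNE output:
# keyframes = keyframes_from_curve(embedding[:, 0])
```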

4. Experiments and Results Analysis

In this paper, we used the CMU Graphics Lab Motion Capture Database [] to test our algorithm. We selected three clips: walking, running and jumping. Among these data, walking and running are regular motions, while jumping is a chaotic motion; in this way, we could demonstrate the versatility of our algorithm. The motion data are in the BVH format (developed by Biovision), sampled at 120 frames/second, with 31 joints represented in 96 dimensions per frame. The root joint is represented by a 3D translation vector and a 3D rotation vector: the translation determines the current position of the skeleton, while the rotation determines its overall orientation. Each of the other joints is represented by a 3D rotation vector describing its orientation in the local coordinates of its parent joint, and its position can be calculated from this orientation vector and the position of its parent, as sketched below. The data used in this paper thus contain the 3D translation vector of the root joint and the 3D rotation vectors of all 31 joints. The software environment was the Matlab 7.12 platform on Windows 7 Ultimate; the hardware environment was an Intel Core i3-2100 3.10 GHz CPU with 2.00 GB RAM.
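The joint representation just described implies a standard forward-kinematics pass. The following minimal sketch illustrates it; the rotation parameterization (rotation matrices rather than BVH Euler channels) and all names are illustrative assumptions, not the BVH parser itself.

```python
import numpy as np

def forward_kinematics(parents, offsets, local_rots, root_pos):
    """
    parents    : list where parents[i] is the parent index of joint i
                 (parents[0] == -1 for the root).
    offsets    : (J, 3) bone offsets in the parent's local frame.
    local_rots : (J, 3, 3) local rotation matrices per joint.
    root_pos   : (3,) world translation of the root joint.
    Returns (J, 3) world positions of all joints.
    """
    J = len(parents)
    world_rot = np.empty((J, 3, 3))
    world_pos = np.empty((J, 3))
    world_rot[0] = local_rots[0]      # root orientation
    world_pos[0] = root_pos           # root translation
    for j in range(1, J):
        p = parents[j]
        # accumulate orientation down the chain, then place the joint
        world_rot[j] = world_rot[p] @ local_rots[j]
        world_pos[j] = world_pos[p] + world_rot[p] @ offsets[j]
    return world_pos
```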

4.1. Comparison with Different Manifold Learning Methods

In order to verify the effectiveness of the optimized t-SNE method for our algorithm, we compared it with a linear global dimensionality reduction method, PCA (Principal Component Analysis), and with the nonlinear dimensionality reduction method ISOMAP. We used three representative data sets of motion sequences: walking, running and jumping. In terms of subject matter, the figures fall into five categories: characteristic curves (Figures 2–10), manifolds (Figures 11–19), key-frames (Figures 20–23), comparative data 1 (Figure 24) and comparative data 2 (Figure 25).

In the first group of experiments, we used the optimized t-SNE algorithm. As shown in Figure 2, the low-dimensional characteristic curves of walking produced by optimized t-SNE clearly represent the motion characteristics of the high-dimensional space. The low-dimensional feature curve changes as the degrees of freedom of the body movement change, and the inflection points of the characteristic curve (red dots) mark the key-frames where the human motion posture changes. Figure 11 shows the low-dimensional manifold of walking. One can see that the low-dimensional manifold produced by optimized t-SNE is smoother than those in Figures 12 and 13, which were drawn with the PCA and ISOMAP algorithms, respectively. That is to say, the PCA and ISOMAP characteristic curves could not clearly represent the characteristics of the high-dimensional human motion. The key-frames extracted by t-SNE, PCA and ISOMAP are shown in Figure 20 (to make the result clearer, the motion sequence is stretched by a fixed ratio). From the comparison, it is easy to see that the key-frames extracted by t-SNE accurately represent the human motion features. Consequently, the experiments show that our method of using optimized t-SNE to reduce the dimensionality of the original data is valid.

Similar tests were completed on the data sets of running and jumping, and similar results were found in the second and third groups of experiments. From Figures 9–22, the same conclusions as in the first group of experiments could be drawn.

Furthermore, the proposed approach was applied to a longer motion sequence containing 4592 frames (sequence 13–29 in the CMU motion capture database). In this experiment, 188 key-frames were extracted from the motion sequence; due to space limitations, only the first 100 key-frames are exhibited in Figure 23. Overall, the experiments show that the optimized t-SNE algorithm obtains accurate key-frames for different kinds of motion data, ranging from simple regular motions to complex irregular motion sequences.

4.2. Comparison of Different Key-Frame Extraction Algorithms

In this section, we apply our approach and other methods (uniform sampling, curve saliency and the quaternion distance) to different kinds of motion sequences.

From Table 1 and Figure 24, we find that the compression ratio of our method stays around 5%.

Then, we compute the mean absolute error with the following formula:

$$E = \frac{\sum_{n} \left( F(n) - F'(n) \right)^2}{N}$$

Here, F(n) is the original motion data and F′(n) is the motion data reconstructed from the key-frames by linear interpolation. N is the number of frames multiplied by 96.

We applied the four methods to extract the same number of key-frames from the same motion type and then reconstructed the motion sequence through linear interpolation. As shown in Table 2 and Figure 25, the mean absolute errors of our approach are almost always lower than those of the other methods at the same compression ratio. In short, tests on a large amount of data show that our approach works well.
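To make this evaluation concrete, here is a minimal sketch of the error measure above: it reconstructs each of the 96 channels by linear interpolation between the key-frames and averages the squared deviations. The function name and array shapes are assumptions for illustration.

```python
import numpy as np

def reconstruction_error(F, key_idx):
    """E for an (n_frames, 96) motion F and sorted key-frame indices."""
    n_frames, dims = F.shape
    t = np.arange(n_frames)
    # linear interpolation of every channel between the key-frames
    F_rec = np.column_stack(
        [np.interp(t, key_idx, F[key_idx, d]) for d in range(dims)]
    )
    N = n_frames * dims                     # as defined in the formula
    return np.sum((F - F_rec) ** 2) / N
```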

5. Conclusions

In this paper, we used an optimized t-SNE algorithm to process motion data and then extract key-frames, and we verified the validity of the algorithm by comparing it with two widely used algorithms in experiments. The contributions of this paper are mainly as follows: an optimized t-SNE algorithm was applied to reduce the dimensionality of human motion data, and the motion data processed by this method are advantageous for key-frame extraction. The method uses the characteristic curve of the low-dimensional human motion data and can extract key-frames dynamically. Moreover, it is not only highly effective for simple regular motion data, but also equally effective for complex irregular motion sequences.

Some problems remain, such as occasionally missing key-frames when processing complex chaotic motion data. In future work, we aim to make the algorithm more intelligent and efficient by addressing these issues, for example by using other dimensionality reduction algorithms or improving the extraction method.

Acknowledgments

This work is supported by National Natural Science Foundation of China (Nos. 61370141, 61300015), Natural Science Foundation of Liaoning Province (No. 2013020007), the Program for New Jinzhou District Science and Technology Research (No. 2013-GX1-015, KJCX-ZTPY-2014-0012).

Author Contributions

Yi Yao drafted this manuscript; Qiang Zhang, Dongsheng Zhou and Rui Liu contributed to the direction, content and rewriting, and also revised the manuscript.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Zhao, L.; Safonova, A. Achieving good connectivity in motion graphs, Proceedings of ACM SIGGRAPH/Eurographics Symposium on Computer Animation, Dublin, Ireland, 7–9 July 2008; pp. 127–136.
  2. Kovar, L.; Gleicher, M.; Pighin, F. Motion graphs. ACM Trans. Graph. 2002, 21, 473–482. [Google Scholar]
  3. Lee, J.; Chai, J.; Reitsma, P.S.A.; Hodgins, J.K.; Pollard, N.S. Interactive control of avatars animated with human motion data. ACM Trans. Graph. 2002, 21, 491–500. [Google Scholar]
  4. Pullen, K.; Bregler, C. Motion capture assisted animation: Texturing and synthesis. ACM Trans. Graph. 2002, 21, 501–508. [Google Scholar]
  5. Rahim, R.A.; Suaib, N.M.; Bade, A. Motion Graph for Character Animation: Design Considerations. Int. Conf. Comput. Technol. Dev. 2009, 2, 435–439. [Google Scholar]
  6. Sakamoto, Y.; Kuriyama, S.; Kaneko, T. Motion map: Image-based retrieval and segmentation of motion data, Proceedings of the 2004 ACM SIGGRAPH/Eurographics Symposium on Computer Animation, Grenoble, France, 27–29 August 2004; pp. 259–266.
  7. Van der Maaten, L. Learning a Parametric Embedding by Preserving Local Structure, Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics (AI-STATS), Clearwater Beach, FL, USA, 16–18 April 2009; 5, pp. 384–391.
  8. Krüger, B.; Tautges, J.; Weber, A.; Zinke, A. Fast local and global similarity searches in large motion capture databases, Proceedings of the 2010 ACM SIGGRAPH/Eurographics Symposium on Computer Animation, Madrid, Spain, 2–4 July 2010; pp. 1–10.
  9. Van der Maaten, L.; Hinton, G. Visualizing Data using t-SNE. J. Mach. Learn. Res. 2008, 9, 2579–2605. [Google Scholar]
  10. Tenenbaum, J.B.; Silva, V.D.; Langford, J.C. A global geometric framework for nonlinear dimensionality reduction. Science 2000, 290, 2319–2323. [Google Scholar]
  11. Liu, Y.Y.; Zhang, Q.; Wei, X.P.; Zhou, C.J. Key-frames extraction based on motion capture data, Proceedings of Conference on Intelligent CAD and Digital Entertainment, Dalian, China, 22–24 July 2008.
  12. Park, S.I.; Shin, H.J.; Shin, S.Y. On-line locomotion generation based on motion blending, Proceedings of the 2002 ACM SIGGRAPH/Eurographics Symposium on Computer Animation, San Antonio, TX, USA, 21–22 July 2002; pp. 105–111.
  13. Van der Maaten, L.J.P.; Postma, E.O.; Van den Herik, H.J. Dimensionality Reduction: A Comparative Review. J. Mach. Learn. Res. 2009, 10, 66–71. [Google Scholar]
  14. Xiao, J.; Zhuang, Y.T.; Wu, F. Getting distinct movements from motion capture data, Proceedings of the International Conference on Computer Animation and Social Agents, Geneva, Switzerland, 5–7 July 2006; pp. 33–42.
  15. Seward, A.E.; Bodenheimer, B. Using nonlinear dimensionality reduction in 3D Figure animation, Proceedings of the 43rd Annual Southeast Regional Conference, Kennesaw, GA, USA, 18–20 March 2005; 2, pp. 388–392.
  16. Lee, C.S.; Elgammal, A. Human motion synthesis by motion manifold learning and motion primitive segmentation, Proceedings of 4th International Conference on Articulated Motion and Deformable Objects, Mallorca, Spain, 11–14 July 2006; pp. 464–473.
  17. Kovar, L.; Michael, G. Automated extraction and parameterization of motions in large data sets. ACM Trans. Graph. 2004, 23, 559–568. [Google Scholar]
  18. Barbič, J.; Safonova, A.; Pan, J.Y.; Faloutsos, C.; Hodgins, J.K.; Pollard, N.S. Segmenting motion capture data into distinct behaviors, Proceedings of the 2004 Graphics Interface Conference, London, ON, Canada, 17–19 May 2004; pp. 185–194.
  19. Zhou, F.; De la Torre, F.; Hodgins, J.K. Hierarchical aligned cluster analysis for temporal clustering of human motion. IEEE Trans. Pattern Anal. Mach. Intell. 2013, 35, 582–596. [Google Scholar]
  20. Vögele, A.; Krüger, B.; Klein, R. Efficient Unsupervised Temporal Segmentation of Human Motion, Proceedings of ACM SIGGRAPH/Eurographics Symposium on Computer Animation, Copenhagen, Denmark, 21–23 July 2014; pp. 167–176.
  21. Müller, M.; Röder, T.; Clausen, M. Efficient content-based retrieval of motion capture data. ACM Trans. Graph. 2005, 24, 677–685. [Google Scholar]
  22. Müller, M.; Röder, T. Motion templates for automatic classification and retrieval of motion capture data, Proceedings of the 2006 ACM SIGGRAPH/Eurographics Symposium on Computer Animation, Vienna, Austria, 2–4 September 2006; pp. 137–146.
  23. Müller, M.; Baak, A.; Seidel, H.P. Efficient and robust annotation of motion capture data, Proceedings of the 2009 ACM SIGGRAPH/Eurographics Symposium on Computer Animation, New Orleans, LA, USA, 1–2 August 2009; pp. 17–26.
  24. Hinton, G.E.; Roweis, S.T. Stochastic Neighbor Embedding. Adv. Neural Inf. Process. Syst. 2002, 16, 833–840. [Google Scholar]
  25. Van der Maaten, L.; Hinton, G. Visualizing Non-Metric Similarities in Multiple Maps. Mach. Learn. 2012, 87, 33–55. [Google Scholar]
  26. Van der Maaten, L.; Weinberger, K. Stochastic Triplet Embedding, Proceedings of IEEE International Workshop on Machine Learning for Signal Processing (MLSP), Santander, Spain, 23–26 September 2012; pp. 1–6.
  27. Van der Maaten, L.; Welling, M.; Saul, L.K. Hidden-Unit Conditional Random Fields, Proceedings of the International Conference on Artificial Intelligence & Statistics (AI-STATS), Fort Lauderdale, FL, USA, 11–13 April 2011; 15, pp. 479–488.
  28. Van Der Maaten, L. Fast Optimization for t-SNE, Proceedings of Neural Information Processing Systems (NIPS) 2010 Workshop on Challenges in Data Visualization, Whistler, BC, Canada, 10–11 December 2010; p. 100.
  29. CMU Graphics Lab Motion Capture Database. Available online: http://mocap.cs.cmu.edu/ (accessed on 14 April 2015).
