Choreographic Pattern Analysis from Heterogeneous Motion Capture Systems Using Dynamic Time Warping

The UNESCO Convention for the Safeguarding of the Intangible Cultural Heritage (ICH) highlights that intangible elements of cultural heritage are as important as tangible ones. One of the most important domains of ICH is folkloric dance. A dance choreography is a time-varying 3D process (4D modelling), which includes dynamic co-interactions among different actors, emotional and style attributes, and supplementary elements, such as music tempo and costumes. Presently, research focuses on the use of depth acquisition sensors to handle kinesiology issues. Skeleton data extracted in real time contain a significant amount of information (data and metadata), allowing for various choreography-based analytics. In this paper, a trajectory interpretation method for Greek folkloric dances is presented. We focus on matching trajectory patterns that exist in a choreographic database to new ones originating from different sensor types, such as VICON and Kinect II. A Dynamic Time Warping (DTW) algorithm is then used to find similarities/dissimilarities among the choreographic trajectories. The goal is to evaluate the performance of the low-cost Kinect II sensor for dance choreography against the accurate but high-cost VICON system. Experiments on real-life dances show the effectiveness of the proposed DTW methodology and the ability of Kinect II to localize dances in 3D space.


Introduction
Intangible Cultural Heritage (ICH) is a prominent element of people's cultural identity as well as a significant factor for growth and sustainability [1]. The expression of identity through Intangible Cultural Heritage takes many forms, among which folkloric dances hold a central position [2]. Analyzing choreographic sequences is essentially a multidimensional modelling problem, given that both temporal and spatial factors should be taken into account. Research pertaining to ICH preservation that focuses on the time element has been published in the literature [3][4][5][6]. Typical preservation acts include digitization, modelling, and documentation.
Another important factor in preserving any type of performing art is the development of an interactive framework that enhances the learning procedure of folklore dances. Recent advances in depth sensors, which have led to the development of low-cost 3D capturing systems such as Microsoft Kinect [7] or Intel RealSense [8], permit easy capturing of human skeleton joints in 3D space, which are then analyzed to extract dance kinematics [9]. The preservation of folk dances can be facilitated by modern Information and Communication Technologies by leveraging recent developments in a variety of areas, such as storage, image and video processing, machine learning, cloud computing, crowdsourcing, and automatic semantic annotation, to name a few [10].
Nevertheless, the digitization and modelling of the information remains the most valuable task. Owing to the tremendous growth of motion-capturing systems, depth cameras are a popular solution employed in many cases because of their reliability, cost-effectiveness, and usability, despite their limited range. Kinect is one of the most recognizable sensors in this category and, in the choreography context, can be used for recording sequences of points in 3D space for body joints at certain moments in time. Several recent research papers make use of such sensors for dance analysis, for example educational dance applications using sensors and gaming technologies [11], trajectory interpretation [12], advanced skeletal joints tracking [13], action or activity recognition [14][15][16][17][18][19][20], key pose identification [21] and key pose analysis [22]. Apart from Kinect, another popular motion capture system is VICON, which is significantly more sophisticated and accurate [9,23,24].
In [25], a comparison between the abilities of the Kinect and VICON for gait analysis is introduced in the orthopedic and neurologic field. In [26], the authors focus on the precision of the Kinect and the VICON motion-capturing systems, creating an application for rehabilitation treatments. In [27], the authors report that the Kinect was able to accurately measure the timing of clinically relevant movements in people with Parkinson's disease. In contrast to the linear regression-based approaches that have been carried out in the bio-medical field [25][26][27] regarding the similarities/dissimilarities and the precision of the adopted motion-capturing system, in this work we follow a Dynamic Time Warping (DTW) approach in the kinesiology field. Moreover, the aforementioned approaches pertain to simple movement sequences, i.e., knee flexion and extension or hip flexion and extension, whereas our proposed choreographic dataset includes more complex movements that combine the variations of several joints (see Table 1).
In [28], the authors introduce a motion classification framework using DTW. That work uses the DTW algorithm to classify motion sequences using a minimum set of bones (7 body joints). In contrast, our proposed framework uses 25 body joints and analyzes the motion sequences using the DTW and Move-Split-Merge algorithms. In [29], the authors propose an algorithm for 3D motion recognition which allows extensions of DTW with multiple sensors (view-point-weighted, fully weighted and motion-weighted) and can be employed in a variety of settings. The DTW algorithm has also been adopted to extract kinesiology details from video sequences. In [30], the authors propose a video human motion recognition approach, which uses DTW to match motion projections in non-linear manifold space. In [31], the authors present a technique for motion pattern and action recognition, which employs DTW to match motion projections in Isomap non-linear manifold space. Our proposed framework focuses on the similarity assessment of folkloric dances using data from heterogeneous sources, i.e., data from high-cost devices such as VICON and low-cost devices such as Kinect II, using predefined choreographic sequences. The research outcomes target the underlying relationships among dances captured using the VICON and Kinect systems (see Table 2). VICON is a high-cost motion-capturing system, which exploits markers attached to dancers' joints to extract motion variations and the trajectory of a choreography. The VICON motion-capturing system requires (i) a room properly equipped with cameras and trackers, (ii) experienced staff to manage the VICON devices, and (iii) a pre-capturing procedure, which is obligatory to calibrate the whole system. On the other hand, Kinect II is a low-cost depth sensor, which requires no markers to extract the depth and the human skeleton joints. This makes Kinect II applicable to non-professional users (everybody) in any environment (everywhere) and at any time. However, the captured trajectories are not as accurate as the ones extracted by the VICON system.
Consequently, the Kinect II device can be used as an in-home learning tool for most dance choreographies by simple (non-experienced) users. This paper relates dance motion trajectories captured by the accurate VICON system and the less accurate Kinect II system. A Dynamic Time Warping (DTW) methodology is adopted to find similarities/dissimilarities between the two devices, taking the dance motion trajectory derived from the VICON system as the accurate reference. The DTW algorithm can localize the dance step patterns that cannot be accurately represented by the Kinect system and the patterns that Kinect can sufficiently describe.
The contribution of this work can be summarized as follows. First, we present a comparative study of trajectory similarity estimation approaches on data obtained by two types of sensors, using a complex dataset with challenging choreographic sequences, where joint movements are often varied and unstructured. Furthermore, the conducted experiments indicate that if significant levels of precision are ensured during initial data collection, design, development, and fine-tuning of the system, then low-cost and widely popular motion-capturing sensors, such as Kinect II, suffice to provide a smooth and integrated experience on the user end, which would allow for relevant educational or entertainment applications to be adopted at scale.
The remainder of this paper is organized as follows: Section 2 describes the proposed methodology; Section 3 presents the Dynamic Time Warping framework; Section 4 provides an experimental evaluation of the proposed methods and Section 5 concludes the paper with a summary of findings.

The Proposed Methodology
In this work we investigate the possibility of using skeleton data points as reference points for the identification of dance choreographies. The reference data originate from professional motion capture equipment. These instances are compared against corresponding skeletal data recorded using low-cost sensors. The proposed approach consists of the following steps: (a) data capturing using a high-end motion capture system, (b) feature extraction, (c) descriptive frames selection for the database creation, (d) data capturing using low-cost sensors, (e) extraction of corresponding body joints and (f) similarity assessment among the dance patterns between high-end and low-cost sensors.
The idea of spatial-temporal information management [24,32] is applied, so that recorded dance sequences are summarized to a sequence of keyframes. This is achieved by employing an iterative clustering scheme that imposes time constraints. The proposed data managing scheme reduces the dance sequence to a few keyframes, which are selected using density-based clustering in predefined time-related subsets. It is important to note that noise or tempo variations do not affect the proposed approach. Given a set of keyframe sequences for different dances, a comparison is performed among them. The sequences are signals containing information on the position and rotation of the dancer's joints. Signal similarity is computed using the correlation measure. Consequently, variations of the same dance should be easily identified, owing to their high similarity scores. Figure 1 depicts a block diagram of the proposed methodology.
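The correlation-based comparison of keyframe signals can be illustrated with a minimal sketch. It assumes that the two keyframe sequences have the same number of keyframes and that a single joint coordinate is compared; the function name and the sample values are hypothetical and are not taken from the recorded dataset.

```python
import math

def pearson_correlation(a, b):
    """Pearson correlation between two equal-length 1D signals."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("signals must have the same length (>= 2)")
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a constant signal carries no shape information
    return cov / math.sqrt(var_a * var_b)

# Hypothetical keyframe signals: one joint coordinate at five keyframes.
dance_a = [0.0, 0.4, 0.9, 0.5, 0.1]
dance_a_variation = [0.1, 0.5, 1.0, 0.6, 0.2]  # same shape, constant offset
dance_b = [0.9, 0.5, 0.0, 0.4, 0.8]            # mirrored shape

print(pearson_correlation(dance_a, dance_a_variation))
print(pearson_correlation(dance_a, dance_b))
```

A variation of the same dance that differs only by an offset yields a correlation close to 1, whereas a differently shaped signal yields a low or negative value.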

Data Capturing
Two types of sensors are used for the feature extraction: Kinect II and VICON. Despite their differences, both sensors provide similar information for many of the body joints. Therefore, any comparisons based on leg joints such as knees, ankles, and hips are feasible with minor preprocessing steps, e.g., frame rate reduction. Table 2 summarizes different aspects of the VICON and Kinect II sensors. As we can observe, Kinect II is of low cost but also of low accuracy. On the other hand, VICON is of high accuracy but entails high-cost sensing. For this reason, the VICON data are used as the reference trajectory in our case.
Figure 2 shows a snapshot of the proposed Kinect II architecture, while Figure 3 depicts the architecture of the VICON components. These shots were obtained during one of our experiments conducted at the premises of the Sports Science department of the Aristotle University of Thessaloniki. In these premises, different dances have been recorded as described in Section 4. Figure 4 shows a snapshot of the conducted dance experiment.

Kinect Sensor
In this work, a subset of human joints is considered for the dance analysis. Figure 5 depicts every frame as obtained by Kinect. Kinect permits capturing human motion variation as 3D data [15,32], combining the skeletal tracking data of multiple sensors to address the issue of occlusions [33], hence making skeletal tracking more robust. The Kinect II sensor by Microsoft includes: (a) a depth sensor, (b) an RGB camera and (c) a four-microphone array, and can deliver a full body 3D motion capture [7].
At first, the position and rotation values for each frame $I_i$, $i = 1, \ldots, t$, of a dance are extracted. Each choreographic sequence is represented by a matrix $D_i$ of size $b \times k \times t$, where $b$ is the number of body joints (i.e., 6, namely $J_{12}$, $J_{13}$, $J_{14}$, $J_{16}$), $k$ is the number of feature vector values (i.e., 3), and $t$ is the sequence length. Please note that for every joint we have the following feature values: three coordinates and four rotations, and additionally two binary indicators denoting whether the values derive from measurement or estimation.

VICON Motion-Capturing System
The motion capture area of VICON is surrounded by several high-resolution cameras with LED strobe light rings (see Figure 4). A setup of the VICON workstation is illustrated in Figure 3. Reflective markers facilitate the recording of the moving subject by the cameras (Figure 6a), while signal collection is controlled by the Data station controls (Figure 3b). Signals are then passed to the VICON workstation (Figure 3b), equipped with specialized software for the collection, filtering, and processing of raw data. Two-dimensional data from the cameras are processed and combined so that the three-dimensional motion can be reconstructed.

Database Creation
Table 1 shows the dances recorded in our experiments along with a short description of each. As can be seen from Table 1, six different Greek folkloric dances have been recorded, each with a different beat duration and different characteristics. Each dance was executed by different dancers, including men and women, so that we can evaluate different properties and characteristics of each dance. Table 3 depicts the different durations of these dances across three different dancers.
Table 3. The considered dances and their variations along with the length of each sequence for each of the three dancers. These dances were recorded using Kinect II.
We assume here that any dance can be represented by a subset of frames, namely keyframes [34]. The keyframes are the most representative postures of each dance. Random sampling over k-means clusters [21] is an improvement over random selection, since similar instances are likely to be clustered together. In this way, a random sample from each cluster can be considered to give adequate information regarding the data set. The final dataset is a cell array, which contains all the recorded dances for each of the dancers.
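The idea of random sampling over k-means clusters can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: frames are taken to be flat feature vectors, a plain Lloyd-style k-means is written out by hand, and all function names and parameters are hypothetical.

```python
import random

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def assign(frames, centers):
    """Label each frame with the index of its nearest center."""
    return [min(range(len(centers)),
                key=lambda c: squared_distance(f, centers[c]))
            for f in frames]

def kmeans_labels(frames, k, iterations=20, seed=0):
    """Plain k-means; initial centers are k distinct frames chosen at random."""
    rng = random.Random(seed)
    centers = [list(f) for f in rng.sample(frames, k)]
    for _ in range(iterations):
        labels = assign(frames, centers)
        for c in range(k):
            members = [f for f, lab in zip(frames, labels) if lab == c]
            if members:  # keep the previous center if a cluster is empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign(frames, centers)

def sample_keyframes(frames, k, seed=0):
    """Return sorted frame indices, one drawn at random per non-empty cluster."""
    labels = kmeans_labels(frames, k, seed=seed)
    rng = random.Random(seed)
    keyframes = []
    for c in range(k):
        members = [i for i, lab in enumerate(labels) if lab == c]
        if members:
            keyframes.append(rng.choice(members))
    return sorted(keyframes)

# Hypothetical 2D "postures": two clearly separated groups of frames.
frames = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
          [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]]
print(sample_keyframes(frames, k=2))
```

Because similar postures fall into the same cluster, drawing one frame per cluster covers each group of postures once, which plain random selection does not guarantee.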

Dynamic Time Warping
Dynamic Time Warping [35] calculates an optimal match between two temporal sequences. The matching path generated by DTW is based on linear matching, but must satisfy specific conditions, namely continuity, the boundary condition, and monotonicity. In the following, a brief description of the matching between curve points is provided. If $N_1$ and $N_2$ are the numbers of points in two curves, then the i-th point of curve 1 and the j-th point of curve 2 match if: It should be mentioned that each point can match with at most one point of the other curve. The boundary condition forces a match between the first points of the curves and a match between the last points of the curves. The continuity condition decides how much the matching can differ from the linear matching. The aforementioned condition is the heart of DTW. We formulate the aforementioned assumption as follows: if, during the matching process, it is concluded that the i-th point of the first curve should match with the j-th point of the second curve, then it is not possible (i) that any point of the former with an index greater than i matches with a point of the latter with an index smaller than j, or (ii) that any point of the former with an index smaller than i matches with a point of the latter with an index greater than j.
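The conditions above can be made concrete with a short sketch of the classic dynamic-programming formulation of DTW. It is an illustration only: it assumes 1D sequences and an absolute-difference local cost, and the function name is hypothetical. The boundary condition is enforced by starting at the first pair and ending at the last pair, while monotonicity and continuity follow from allowing only steps that advance one or both indices by one.

```python
def dtw_distance(x, y, cost=lambda a, b: abs(a - b)):
    """Return (total cost, warping path) for two non-empty sequences.

    The path is a list of 0-based index pairs (i, j).
    """
    n, m = len(x), len(y)
    inf = float("inf")
    # acc[i][j]: minimal accumulated cost aligning x[:i] with y[:j].
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i][j] = cost(x[i - 1], y[j - 1]) + min(
                acc[i - 1][j - 1],  # advance both sequences
                acc[i - 1][j],      # advance x only
                acc[i][j - 1],      # advance y only
            )
    # Backtrack from the last pair to the first pair (boundary condition).
    i, j = n, m
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        _, i, j = min(
            (acc[i - 1][j - 1], i - 1, j - 1),
            (acc[i - 1][j], i - 1, j),
            (acc[i][j - 1], i, j - 1),
        )
        path.append((i - 1, j - 1))
    path.reverse()
    return acc[n][m], path

# The same movement performed at two different tempos (hypothetical values).
fast = [0, 1, 2, 3, 2, 1]
slow = [0, 0, 1, 2, 3, 3, 2, 1]
distance, path = dtw_distance(fast, slow)
print(distance, path)
```

In this toy example the slower sequence only repeats values of the faster one, so the warping absorbs the tempo difference and the accumulated cost is zero.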

Kinect II Evaluation Using DTW
In our proposed methodology, the reference sequences are those derived by the VICON motion-capturing system. Each choreographic sequence obtained by the low-cost Kinect II sensor is contrasted with the corresponding VICON sequence. Our aim is to determine the similarities/dissimilarities between the choreographic sequences of each dance using the DTW algorithm [35]. Each choreographic sequence is treated as a curve with its own characteristics (e.g., duration, length), and the proposed framework quantifies the similarities/dissimilarities between the curves produced by the heterogeneous motion-capturing systems. Every index of one choreographic sequence is matched with one (or more) indices of the other sequence for each dance. Figure 7 depicts the time alignment between two independent signals; in our framework the signals are obtained by the motion-capturing systems. Let X denote the sequences of the Kinect sensor and Y the sequences of the VICON system. X and Y contain the kinesiology features (body joint variations) of each dancer, forming a motion database for the heterogeneous capturing systems. To compare the features, we define a local cost measure describing the similarity/dissimilarity of each pair of features. The cost matrix is defined as $P \in \mathbb{R}^{N \times M}$, with $P(n, m) = p(x_n, y_m)$. An (N, M) dynamic warping path $p = (p_1, \ldots, p_s)$ determines an alignment between the X and Y vectors by assigning the element $x_{n_s}$ of X to the element $y_{m_s}$ of Y. The vectors X and Y are denoted as follows: $X = (x_1, \ldots, x_N)$ and $Y = (y_1, \ldots, y_M)$. In the following, we create a space defined by F. Then $x_n, y_m \in F$ for $n \in [1:N]$ and $m \in [1:M]$. In our framework, X and Y are the features obtained by the motion-capturing systems, describing every joint of the dancer's body. Because the motion-capturing systems are heterogeneous, we must define a local coordinate system. Figure 8 depicts the transformation from the global coordinate system to a local system for each motion-capturing system, which is simultaneously a type of range fix that takes into consideration body parameters such as limb length. Under the aforementioned constraints, we denote as $C^G_k$ the k-th out of the M joints obtained by the VICON system and as $I^G_l$ the l-th out of the L = 25 joints obtained by the Kinect II sensor, respectively. Variables $x^G_i$, $y^G_i$ and $z^G_i$ indicate the coordinates of the respective i-th joint with regard to a reference point set by the VICON architecture (in our case the center of the square surface). We have acquired the aforementioned joints after applying density-based filtering on the entirety of the detected joints to eliminate noise introduced during the acquisition procedure. The main difficulty in directly processing the extracted joints $C^G_k$, $k = 1, 2, \ldots, M$, is the coordinate system. Thus, we need to transform them from the VICON coordinate system to a local coordinate system, the center of which is the center of mass of the dancer. We follow the same procedure for the Kinect II architecture. This is obtained through the application of Equation (5) on the joint coordinates $J^G_k$, where $H_{cm}$ denotes the dancer's center of mass with regard to the coordinate system, expressed as: and we recall that M and L refer to the total number of joints extracted by the VICON and Kinect capturing systems, respectively. Let us denote as $p(X, Y) = p(C^L_k, I^L_l)$ the total cost of a warping path p between $C^L_k$ and $I^L_l$. The DTW distance between $C^L_k$ and $I^L_l$ is defined as follows:
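The change from the global to a dancer-centred coordinate system can be sketched as below. The sketch rests on an explicit assumption that is not confirmed by the text: the center of mass is approximated by the plain mean of the joint positions in a frame. The function names are hypothetical, and the local cost shown (sum of joint-wise Euclidean distances) is only one possible choice. In the standard formulation, the DTW distance is then the minimum total cost over all admissible warping paths computed with such a local cost.

```python
import math

def to_local(frame):
    """Shift one frame so that its joint centroid becomes the origin.

    frame: list of (x, y, z) joint positions in global coordinates.
    Assumption: the center of mass is approximated by the joint mean.
    """
    n = len(frame)
    centroid = tuple(sum(joint[a] for joint in frame) / n for a in range(3))
    return [tuple(joint[a] - centroid[a] for a in range(3)) for joint in frame]

def frame_cost(frame_a, frame_b):
    """Local cost between two frames: sum of joint-wise Euclidean distances.

    Assumes both frames list the same joints in the same order.
    """
    return sum(math.dist(p, q) for p, q in zip(frame_a, frame_b))

# Hypothetical posture recorded at two different places in the capture area.
posture = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (1.0, 3.0, 0.0)]
moved = [(x + 10.0, y - 5.0, z + 2.0) for x, y, z in posture]

print(frame_cost(posture, moved))                      # large: positions differ
print(frame_cost(to_local(posture), to_local(moved)))  # near zero after centring
```

Centring removes the dancer's position in the room from the comparison, so that only the posture itself contributes to the cost.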

Kinect II Evaluation Using Move-Split-Merge
Motivated by the superiority of DTW for motion analysis shown in previous works, e.g., against SVM [30] or approaches based on Locally Linear Embedding (LLE), Locality Preserving Projections (LPP) and LLP-HMM [31], we adopt DTW as our main reference algorithm. Moreover, we conduct further comparative experiments to also evaluate against a recent technique called Move-Split-Merge [36]. The Move-Split-Merge distance resembles other distance-based approaches, in which similarities/dissimilarities are computed by applying a series of operations that transform a "source" series into a "target" series. The Move-Split-Merge algorithm uses three fundamental operations as building blocks. The Move operation is equivalent to a replacement operation, in which one value substitutes another. Split inserts an identical copy of a value immediately after its first instance, while Merge erases a value if it directly follows an identical value. Let $X = (x_1, \ldots, x_m)$ be a finite motion sequence of real numbers $x_i$. The Move operation and its cost, as well as the Split and Merge operations, are defined as follows:
$\mathrm{Move}_{i,v}(X) = (x_1, \ldots, x_{i-1}, x_i + v, x_{i+1}, \ldots, x_m)$
$\mathrm{Cost}(\mathrm{Move}_{i,v}) = |v|$
$\mathrm{Split}_i(X) = (x_1, \ldots, x_{i-1}, x_i, x_i, x_{i+1}, \ldots, x_m)$
$\mathrm{Merge}_i(X) = (x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_m)$ (15)
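A compact sketch of the dynamic-programming computation of the Move-Split-Merge distance follows. It reproduces the recurrence of [36] as we understand it, with a constant cost c charged for every Split or Merge; the function names and the default value of c are illustrative assumptions, not settings taken from the experiments.

```python
def msm_step_cost(new, prev, other, c):
    """Cost of reaching value `new` by a Split/Merge step.

    If `new` lies between its neighbour `prev` and the value `other` of the
    opposite series, only the Split/Merge cost c is charged; otherwise an
    additional Move towards the nearer of the two is needed.
    """
    if prev <= new <= other or prev >= new >= other:
        return c
    return c + min(abs(new - prev), abs(new - other))

def msm_distance(x, y, c=0.1):
    """Move-Split-Merge distance between two non-empty numeric sequences."""
    m, n = len(x), len(y)
    cost = [[0.0] * n for _ in range(m)]
    cost[0][0] = abs(x[0] - y[0])
    for i in range(1, m):
        cost[i][0] = cost[i - 1][0] + msm_step_cost(x[i], x[i - 1], y[0], c)
    for j in range(1, n):
        cost[0][j] = cost[0][j - 1] + msm_step_cost(y[j], x[0], y[j - 1], c)
    for i in range(1, m):
        for j in range(1, n):
            cost[i][j] = min(
                cost[i - 1][j - 1] + abs(x[i] - y[j]),                      # Move
                cost[i - 1][j] + msm_step_cost(x[i], x[i - 1], y[j], c),    # Merge in x
                cost[i][j - 1] + msm_step_cost(y[j], x[i], y[j - 1], c),    # Split in x
            )
    return cost[m - 1][n - 1]

# A single Split turns (1, 2, 3) into (1, 2, 2, 3), so the distance equals c.
print(msm_distance([1, 2, 3], [1, 2, 2, 3], c=0.1))
```

Unlike DTW, repeating a value is not free here: every Split or Merge is charged c, so the parameter controls how strongly tempo differences are penalized.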

Experimental Results
In our study, to capture the dancers' movement variations, we employ a multi-faceted motion capture system including one Kinect II depth sensor, the i-Treasures Game Design (ITGD) module created in the context of the i-Treasures project [1] and the VICON motion-capturing system. The ITGD module makes it possible to record and annotate mocap data acquired by a Kinect sensor. The employed algorithms were implemented in MATLAB. A variety of Greek folk dances with varying levels of complexity have been recorded. Three dancers (two men and one woman) each performed every dance twice: once in a straight line and once in a semi-circular curved line. Figures 9 and 10 depict the most representative postures of the Syrtos at 2 beats and the Enteka dance, respectively. Figure 11 depicts the main choreographic steps of the Enteka dance and Figure 12 those of the Syrtos dance at 3 beats. Figure 13 depicts the most representative postures of the Syrtos at 3 beats dance.

Dataset Description
The dataset comprises six different folklore dances. For the Kinect capturing process, we use a single Kinect II sensor placed in front of the dancer. Every dance is described by a set of consecutive image frames. Every frame $I_i$, $i = 1, \ldots, l$, has a corresponding extensible mark-up language (XML) file with positions, rotations and confidence scores for 25 joints on the body (see Figure 5), in addition to timestamps. In Table 1, a brief description of the dances is provided [24]. After a series of processing steps, a skeleton from the VICON system is represented as shown in Figure 6a. In the discussed setting, ten Bonita B3 cameras were used. The capturing space was a square of 6.75 m width, and the square's center constitutes the origin of the VICON coordinate system. We used a calibration wand with markers to optimize the calibration procedure. The dancers' movements were captured using 35 markers at fixed positions on their bodies.

Similarity Analysis
Similarity analysis amounts to a dance matching problem. Specifically, given a set of frames from multiple body joints captured using the Kinect, we try to identify the most closely related trajectories in the choreographic database. Assume that we have n experienced dancers in the database. Each time a new user performs a dance, the algorithm calculates the similarity scores between the newly recorded dance and the existing dances in the database. Then, for each of the n experienced dancers, we obtain the top 3 closest trajectories, given a distance metric. Thus, we have a total of n times 3 dance suggestions. In this study, we have 3 experienced dancers, and hence 9 dance suggestions every time. The similarity score (i.e., DTW or MSM) is then used to rank the results. Performance analysis focuses on how accurately the system matches the recorded dance.
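The suggestion step can be sketched as follows, assuming that the distances (DTW or MSM) between the query recording and every database entry have already been computed. The data layout, the function names and the distance values are hypothetical; the counting function reflects our reading of the Top3/Top6/Top9 scores used in the evaluation.

```python
def suggest(distances, per_dancer=3):
    """Return the ranked list of suggestions for one query recording.

    distances: dict mapping (dancer, dance) -> distance to the query.
    For each dancer the `per_dancer` closest dances are kept, and the
    pooled suggestions are then ranked by distance.
    """
    by_dancer = {}
    for (dancer, dance), d in distances.items():
        by_dancer.setdefault(dancer, []).append((d, dance))
    suggestions = []
    for dancer, scored in by_dancer.items():
        for d, dance in sorted(scored)[:per_dancer]:
            suggestions.append((d, dancer, dance))
    return sorted(suggestions)

def top_x_matches(suggestions, true_dance, x):
    """Number of the x closest suggestions that name the performed dance."""
    return sum(1 for _, _, dance in suggestions[:x] if dance == true_dance)

# Hypothetical distances from a Makedonikos query to four dances of three dancers.
distances = {
    ("D1", "Makedonikos"): 1.0, ("D1", "Enteka"): 4.0,
    ("D1", "Syrtos2"): 5.0, ("D1", "Syrtos3"): 6.0,
    ("D2", "Makedonikos"): 1.5, ("D2", "Enteka"): 3.0,
    ("D2", "Syrtos2"): 5.5, ("D2", "Syrtos3"): 7.0,
    ("D3", "Makedonikos"): 4.5, ("D3", "Enteka"): 2.0,
    ("D3", "Syrtos2"): 6.5, ("D3", "Syrtos3"): 8.0,
}
ranked = suggest(distances)
print([top_x_matches(ranked, "Makedonikos", x) for x in (3, 6, 9)])
```

With three dancers in the database, at most three of the nine suggestions can name the performed dance, which is why the best achievable score in any TopX category is 3.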
At first, we asked the dancer to execute a specific choreography. Since VICON's frame rate is 4 times greater than that of Kinect, we sub-sample the VICON data at a ratio of 1 to 4, so that its frame rate matches that of the Kinect. Then, we run the similarity tests against the existing entries in the database. Despite the variations in the trajectories, we expect that the movement itself will be similar among dancers. Thus, the similarity analysis has a solid base. Figures 14 and 15 illustrate the left foot joint movement on the floor for two different dancers. As we observe, the choreographic pattern of each dance is extracted, indicating not only the kinesiology variation of the dancers' joints but also the music tempo. The main patterns appear the same, despite the variations in descriptive characteristics (e.g., length and height). The matching performance of the proposed approach is displayed in Figure 16. The results illustrate the number of matches, for a specific recorded dance, to the existing dances in the database. There are three performance classes, denoted as Top3, Top6, and Top9. The numbers 3, 6, and 9 indicate the number of the closest matched dances (from the database to the one currently performed). Recall that we have three professional dancers and each of them performed the same six dances. Thus, the highest possible score in category TopX is 3.
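The 1-to-4 sub-sampling mentioned at the start of this paragraph amounts to keeping every fourth VICON frame. A minimal sketch, with a hypothetical function name and a toy sequence in place of real frames:

```python
def subsample(frames, ratio=4, offset=0):
    """Keep every `ratio`-th frame, starting at index `offset`."""
    return frames[offset::ratio]

# Twelve hypothetical high-rate frames, represented by their indices.
vicon_frames = list(range(12))
print(subsample(vicon_frames))
```

Plain decimation is the simplest option; it assumes that both recordings start at the same instant, which in practice may require an additional alignment step.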
The results indicate that the suggested methodology managed to correctly match all the investigated dances at least once, despite their complexity, as explained in [37]. Figure 16 provides further insights into the similarity between the VICON and the Kinect II sensors. The x axis depicts the name of each dance (see Table 3) and the y axis the number of matches according to the choreographic database. For example, the Top9 score of Makedonikos in circular trajectory (MakCirc) indicates that among the nine closest trajectory patterns, we have 3 matches with the Makedonikos dance captured by Kinect II, one per dancer in the choreographic database. Consequently, the Makedonikos dance captured by the VICON system was matched to the Makedonikos dance captured by Kinect II; to an extent, most of the choreographies were successfully matched by defining a score using the DTW or MSM algorithms, despite the differences in the employed motion capture technologies.

Conclusions
In this paper, we explore the feasibility of pattern matching between heterogeneous motion-capturing systems. The case study focuses on northern Greek folklore dances, which, although complex and with several variations and particularities in their patterns, are characterized by elements of structure, contrary to chaotic versions of movement trajectories (e.g., [38]) in which similar explorations are far more difficult to perform. In this work, a two-step process is adopted. The first step uses Kinect II sensors, which provide the feature values of the dancer's skeleton, and a database is created. The second step involves the comparison of the trajectories in this database with a second database, created using VICON. The employed algorithms calculate similarity scores. According to these scores, the algorithm provides a similar dance suggestion for each of the dancers in the choreographic database. The obtained results suggest that low-cost sensors such as Kinect II can be used in the context of dance-related educational or entertainment applications, at least as part of the end-user side. Such a setup would, however, require a detailed and highly accurate dataset for training and development of the system, captured by a high-precision system such as VICON. The conducted experiments indicate that if significant levels of precision are ensured during initial data collection, design, development, and fine-tuning of the system, then low-cost and widely popular motion-capturing sensors suffice to provide a smooth and integrated experience on the user end, which would allow for relevant educational or entertainment applications to be adopted at scale. Nevertheless, the proposed approach would not be appropriate for tasks that require great precision and accuracy in the measurement of movement and the positioning of individual joints, such as medical or rehabilitation applications.

Figure 4. A snapshot from the experiments conducted at the Aristotle University of Thessaloniki for capturing the folklore dances.

Figure 5. A list of body joints captured by Kinect. For each joint, position and rotation values are stored in XML format (source: https://vvvv.org/documentation/kinect).

Figure 6. (a) VICON body joints capturing capabilities. (b) Placement of markers on the dancer.

Figure 7. Time alignment of two choreographic sequences. Aligned points are depicted by the arrows.

Figure 8. VICON global coordinate system being transformed to a local coordinate system. Its center is the center of mass of the dancer [21]. This allows for compensation of the dancer's spatial positioning.
In Figure 13, each choreographic posture is illustrated with a different color, indicating the most representative frames that summarize the whole choreographic sequence and provide the kinesiology patterns.

Figure 12. An instance that illustrates seven frames from the Syrtos at 3 beats dance.

Figure 13. The most representative postures of the Syrtos at 3 beats dance. Each posture is depicted with a different color indicating the keyframes of the folklore dance. These postures summarize the whole choreographic sequence.

Figure 14. The coordinates of the trajectory of the left foot joint, which shows the rhythm of the dance performed by dancer 1.

Figure 15. The coordinates of the trajectory of the left foot joint, which shows the rhythm of the dance performed by dancer 2.

Figure 16. Performance illustration for the matching process.

Table 1. Greek folklore dances and the main choreographic steps.

Table 2. Summary of VICON and Kinect II sensing.

Motion Capture System Cost Accuracy Calibration Camera Resolution
Figure 1. Block diagram of the proposed methodology.