Unsupervised 3D Motion Summarization Using Stacked Auto-Encoders

Abstract: In this paper, a deep stacked auto-encoder (SAE) scheme followed by a hierarchical Sparse Modeling for Representative Selection (SMRS) algorithm is proposed to summarize dance video sequences recorded using the VICON motion capturing system. The SAE's main task is to reduce the redundant information embedded in the raw data and thus improve summarization performance. This becomes apparent when two dancers perform simultaneously and severe errors are encountered in the dancers' joint points due to occlusions in 3D space. Four summarization algorithms are applied to extract the key frames: density-based, Kennard Stone, conventional SMRS and its hierarchical scheme, called H-SMRS. Experiments have been carried out on real-life sequences of Greek traditional dances, and the results have been compared against ground truth data selected by dance experts. The results indicate that H-SMRS, applied after the SAE information reduction module, extracts key frames that deviate in time by less than 0.3 s from the ones selected by the experts, with a standard deviation of 0.18 s. Thus, the proposed scheme can effectively represent the content of the dance sequence.


Introduction
One interesting procedure for video visual analysis is video content summarization, a technique which has received wide research interest in recent years due to its wide application spectrum. The scope of a video summarization algorithm is to find out a set of the most representative key-frames of a video sequence, taking into consideration salient events and actions on video content so as to form a short but meaningful synopsis [1]. The existing video summarization techniques abstract the input data using three different approaches [2]. The first is the so-called representative key-frame selection that creates video summaries through a collection of representative key frames [3]. The key subshot-oriented approach selects the representative subshots of key-frames to form the video synopsis [4]. Finally, the key object detection method decomposes the whole video sequence into several single frames, each revealing representative objects in a given video sequence [5].
In the context of performing arts, such as dance sequences, variations of human body signals and gestures are essential elements describing a storyline or choreography in a symbolic way [6]. One important aspect of the analysis is the extraction of the choreographic motifs, since these elements provide a fine summarization of the semantic information encoded in the overall storyline [7][8][9].
Automatic summarization of choreographic sequences is an important issue in computer graphics research for the following reasons. First, labelling procedures are time-consuming and occasionally require feedback from experts, since motion capturing data are often unlabelled. Second, spatio-temporal analysis demands the reduction of 3D motion data and thus the automatic definition of all important features in a dance sequence. Third, the implementation of advanced classification algorithms, based for example on deep learning neural network structures [10], requires a large amount of labelled training data. Therefore, unsupervised summarization methods are necessary for producing representative training samples, especially when large amounts of video content are available.
The recent achievements of deep machine learning [10] have proven very effective for visual recognition, especially in the context of motion primitive identification or object detection and recognition on benchmark datasets [11]. The main advance of deep learning over traditional shallow learning approaches is that the former can automatically extract a set of optimal features for classification (pre-training) by deeply processing raw visual content and analysing it on a discriminatory basis. In contrast, traditional shallow learning methods rely on hand-crafted image descriptors, which are application sensitive.
However, few works deal with identifying 3D moving subjects and extracting motion primitives from dance sequences to create a summarized representation of a choreography. In general, video summarization of motion content relies on methods that receive as inputs 3D skeleton data, captured by motion capturing systems (e.g., Kinect, OptiTrack, VICON) and representing the choreographic primitives of a dancer's performance. In particular, the capturing system extracts the 3D coordinates of salient human joints, measured in a global coordinate system, and video summarization is then carried out by processing these (x, y, z) data instead of the raw image pixels. Usually, representational models are applied to summarize a dance, such as the Sparse Modeling for Representative Selection (SMRS) algorithm [12] or its hierarchical implementation [6]. However, since there is great redundancy in both space and time (many frames exhibit similar characteristics), these methods fail to effectively represent dance video sequences, especially when multiple actors (dancers) are performing.
To address the aforementioned difficulties, we introduce a novel unsupervised summarization scheme for dance sequences. Our method first exploits a stacked auto-encoder (SAE) mechanism, followed by representational algorithms for key frame extraction. The purpose of the SAE is to compress the raw captured inputs (which contain a significant amount of redundant information in both space and time) in such a way that an optimal reconstruction can be achieved from the compressed data. That is, the encoded data (i.e., the compressed ones) are reconstructed so as to optimally represent the raw input signals [13]. Data compression can also be achieved using approaches other than SAEs. The wavelet transform is one of these approaches [14]. It can be applied to identify the salient features and reduce redundancy/irrelevancy in a deterministic process using a time-frequency decomposition. This yields adequate results, depending on the selection of the mother wavelet. However, highly non-linear schemes, like neural networks, can be more effective, especially when the statistical properties of the signal change dynamically [15,16]. SAEs are a deep example of such a highly non-linear compression scheme which, through an unsupervised training phase, can learn all important properties of the dance, efficiently handling variations in spatial and temporal redundancy.
The 3D skeletal coordinates, obtained using the VICON motion capturing interface, are used to represent the data sequence. The 3D motion coordinates are propagated into a stacked encoder whose main purpose is to produce a compressed signal of low redundancy that can optimally characterize the dance sequence. Then, representational algorithms, such as the hierarchical SMRS, are applied to perform the final summarization. This way, performance is improved, since summaries are extracted from a compressed signal instead of the redundant, high-dimensional input data.
Previous works [6,8,17] implemented summarization techniques to extract the synopsis of choreographic sequences. Our work exploits the reduction of the redundant raw input data to create a fine-grained representation. This is achieved by refining the input data using SAEs, so that any redundant information is discarded. Such an approach is very important, especially when multiple dancers are present in the dance sequences, unlike the previous works, which focus on the performance of a single dancer. The presence of multiple dancers makes the analysis much more complicated due to (i) joint occlusions (some joints of one dancer are not visible since they are occluded by the other dancers in 3D space) and (ii) merging of some joints of different dancers. Although the VICON motion capturing system can extract the labels of the passive markers with respect to the dancers, in our setup we have not considered these labels, making the problem more challenging. Figure 1 shows an example of the geometric challenges that the presence of two dancers poses to our analysis. Looking at these two dancers, the right hand of the left dancer overlaps with the left hand of the right dancer. Another example is depicted in Figure 2. Looking at the fifth and sixth frames of the sequence, one can notice that only one dense body (dancer) appears to execute the choreography (fourth row), while the RGB content shows that there are two dancers (third row). Thus, the application of conventional video summarization algorithms will fail. All these bottlenecks, that is, (i) overlapping of the skeletal joints and (ii) redundancy of the raw input data, are addressed in this paper through a combined SAE scheme followed by a hierarchical implementation of SMRS.

Figure 2. These visual sequences depict the motion capturing process. 3D skeletal data are obtained by the VICON motion capturing system (second and fourth row), shown with the respective RGB content (first and third row). This figure refers to the Makedonikos dance sequence, executed by two dancers simultaneously.

This article is organized as follows. In Section 2, a description of the current state of the art is given, along with the contribution of this paper (see Section 2.1). Section 3 gives an overview of the proposed summarization workflow, which combines an SAE scheme with sampling algorithms. The adopted SAE structure is discussed in Section 4. Section 5 presents the hierarchical sparse modelling representative selection algorithm, called H-SMRS. In Section 6, experiments are carried out using real-life dance sequences, and objective criteria are proposed for a comparative evaluation of different summarization methods. Finally, Section 7 draws the conclusions of this paper.

Related Works
In general, video summarization techniques are distinguished into the following main categories [2]: (a) representative frame-based selection, (b) key-frame subshot detection and (c) key object detection algorithms. The representative frame-based selection focuses on identifying a series of discontinuous frames to comprise a synopsis that represents the whole video content as much as possible. In this context, References [18,19] propose different key frame extraction methods based on visual descriptors. On the other hand, Reference [3] performs the key frame extraction using only the temporal variations of a video sequence.
In the context of key-subshot detection, References [4,20] can be considered. These approaches extract short-time segments of a video as meaningful representations of its visual content. In Reference [21], an unsupervised video summarization algorithm is introduced that uses title-based image search results to find visually similar shots. In Reference [22], the authors introduce a video summarization technique that decomposes the whole sequence into key objects. This representative selection problem is formulated as a sparse dictionary selection problem. Finally, Reference [23] identifies key-frames using local interest point descriptions and a repeatability graph. The selection of key frames is performed through graph clustering based on the modularity principle.
In choreographic context, video summarization techniques use motion variations of spatio-temporal data in order to define the most representative key frames of the dance sequence. An example of this category is Reference [6] that applies the SMRS algorithm under a hierarchical modification for video dance summary. This method captures the variations or movements of each human action in different subspaces, which allow them to be represented as sequences of transitions from one subspace to another. This work is valid only for a single dancer while its performance when multiple dancers are present severely deteriorates. In Reference [24], the problem of learning motion primitives as one of temporal clustering is addressed deriving an unsupervised hierarchical bottom-up framework called hierarchical aligned cluster analysis (HACA).
HACA defines a partition of a given multidimensional time series into disjoint segments such that each segment belongs to one of the clusters. HACA combines kernel k-means. In Reference [25], a robust method for detecting and tracking human poses in videos is presented, which matches video trajectories to a 3D motion capture model. The main novelty of this work resides in computing the correspondences between video and motion capture data. Reference [26] detects local minima in the temporal variation of the motion speed. The analysis is performed by applying a low-pass filter to a one-dimensional motion speed data stream. In Reference [17], the authors have incorporated an unsupervised clustering method for extracting key frames from choreographies. In the same direction, classification of the motion primitives of a dance using Long Short-Term Memory (LSTM) structures has been proposed in Reference [27].
In Reference [7], motion motifs and motion signatures are introduced as a succinct but descriptive representation of motion sequences. First, the motion sequences are decomposed into short-term movements called motion words, and then the words are clustered in a high-dimensional feature space to find motion patterns. To this end, a deep learning architecture is exploited to embed the motion words into features. In Reference [28], an exploratory search system for large collections of motion capture data is presented. The system provides an overview of human poses in a hierarchical dendrogram visualization that represents the result of a clustering procedure. A node-link diagram enables the user to analyze human poses as nodes, where each node shows a collection of similar pose instances.

Our Contribution
As previously stated, the main limitation of the aforementioned methods is that they apply the representational algorithms for dance summarization directly to the raw captured data, which contain a significant amount of redundancy. Therefore, their performance deteriorates, especially for long dance video sequences. The redundancy problem is even more evident when multiple dancers are present in a choreography, since the interactions among them may cause considerable confusion as far as the extracted key-frames are concerned. To address these issues, we introduce an SAE scheme prior to the representational sampling algorithms to reduce redundancy and, therefore, increase the dance summarization performance.
The paper compares the summarization performance of four sampling algorithms, all applied to the data projected by the SAE scheme. The results on real-world dance sequences, each performed by two dancers, indicate that the proposed SAE-based redundancy reduction scheme can yield an effective representation of the dance sequences, which on average deviates by less than 0.30 s from the key-frames selected by dance experts (ground truth data), with a standard deviation of about 0.18 s.
Figure 3 presents the main architecture of the proposed unsupervised approach for dance summarization. Initially, kinematic attributes such as velocity and acceleration are extracted from the (x, y, z) coordinates of each skeletal joint of a dancer [8]. Then, the enhanced 3D motion primitives are forwarded into a stacked auto-encoder whose main purpose is to compress (encode) the raw motion capture attributes into low-dimensional representations. Encoding is performed in such a way that the decoder is able to optimally reconstruct the raw input signals from the compressed ones, significantly reducing spatio-temporal redundancy [10,13]. The final module of the proposed architecture is the unsupervised representational algorithm that extracts the most important key-frames of the dance sequence. The representational algorithm receives the low-dimensional compressed data as inputs instead of the highly redundant (in both space and time) raw signals, improving the overall summarization performance.

Physics-Based Attributes of 3D Motion Primitives
In the following, let us denote as $J^G_k(t) = (x^G_k(t), y^G_k(t), z^G_k(t))$ the k-th joint out of the M extracted by the VICON architecture for each dancer at the t-th frame of the dance sequence. In our case M = 40, that is, 40 joints are extracted per human dancer. Variables $x^G_k(t)$, $y^G_k(t)$ and $z^G_k(t)$ indicate the coordinates of the k-th joint at the t-th frame with respect to a reference point set by the VICON architecture (in our case, the center of the square surface). These joints have been obtained after applying density-based filtering to all the detected joints to remove noise from the acquisition process (see the third paragraph of Section 6). This noise becomes apparent when multiple dancers are performing in the choreography.
The main problem in directly processing the extracted joints $J^G_k(t)$ is that they refer to the VICON coordinate system, which does not reflect the dancer's position in 3D space. For this reason, we first compute the center of mass of each dancer, and then the coordinates of $J^G_k(t)$ are transformed to a local coordinate system whose origin coincides with the center of mass of the dancer,
$$J^L_k(t) = J^G_k(t) - C_{cm}(t),$$
where $C_{cm}(t)$ denotes the center of mass of a dancer. As far as the kinematic attributes are concerned, the velocity and the acceleration are taken into account. In particular, the velocity is given as $u_k(t) = dJ^L_k(t)/dt$, while the acceleration is given as $\gamma_k(t) = du_k(t)/dt$ for each detected human joint. Since velocity and acceleration are given through a derivative formula, their calculation does not depend on the choice of local or global coordinate system, and thus they are independent of a global translation. Alternatively, we could use the dancers' global velocity along with small local velocities of the joints to improve the feature analysis, but in this paper we prefer to concentrate on simpler features. Gathering all these features together, a vector is constructed as $(J^L_k(t), u_k(t), \gamma_k(t))$. In the aforementioned notation, we focus on only one dancer and thus omit the indices describing the dancers for clarity. Figure 2 shows the human joints extracted, both on the RGB content (the first and third row of Figure 2) and on a plane depicting the movement of the dancers in space (second and fourth row of Figure 2). Since two dancers execute the choreography, severe occlusions and merges are encountered, mainly due to the 3D geometry of the dancers. This is the case, for example, in the fifth and sixth frames of Figure 2, where, by observing the frame content, only one dancer appears to perform.
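The local-coordinate transform and the kinematic attributes described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the center of mass is taken as the unweighted mean of a dancer's joints, the time derivatives are approximated by forward finite differences with a hypothetical frame period `dt`, and all function names are illustrative.

```python
def center_of_mass(frame):
    """Unweighted mean of the joint positions in one frame (an assumption)."""
    m = len(frame)
    return tuple(sum(joint[a] for joint in frame) / m for a in range(3))


def to_local(frames):
    """Shift every joint by the per-frame center of mass: J^L = J^G - C_cm."""
    local = []
    for frame in frames:
        c = center_of_mass(frame)
        local.append([tuple(joint[a] - c[a] for a in range(3)) for joint in frame])
    return local


def derivative(series, dt):
    """Forward finite difference over time; the last frame reuses the previous estimate."""
    diffs = [
        [tuple((cur[a] - prev[a]) / dt for a in range(3)) for prev, cur in zip(f0, f1)]
        for f0, f1 in zip(series, series[1:])
    ]
    return diffs + [diffs[-1]]


def kinematic_features(frames, dt):
    """Per frame and joint, return the triple (J^L, u, gamma). Needs at least two frames."""
    local = to_local(frames)
    velocity = derivative(local, dt)
    acceleration = derivative(velocity, dt)
    return [list(zip(l, v, a)) for l, v, a in zip(local, velocity, acceleration)]


# Toy example: one dancer with two joints over three frames (hypothetical values).
frames = [
    [(0, 0, 0), (2, 0, 0)],
    [(0, 0, 0), (4, 0, 0)],
    [(0, 0, 0), (6, 0, 0)],
]
features = kinematic_features(frames, dt=1.0)
```

Each entry `features[t][k]` holds the local position, velocity and acceleration of joint k at frame t, which corresponds to the five scalar groups per joint (three coordinates plus velocity and acceleration) mentioned in the experiments.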

The Proposed Stacked Auto-Encoder (SAE) Module for Dimensionality Reduction
The core idea of our SAE representation is to capture the main patterns of the raw input data by discarding any redundant information, that is, any outliers in the data samples that are not well explained by that representation. The learning process is simply described as minimizing a loss function over a training set. Since no separate desired outputs are required, the whole process is unsupervised; that is, the desired outputs are the same as the inputs. The final result is a low-dimensional representation of the input data. Thus, an SAE works similarly to Principal Component Analysis (PCA), but within a non-linear framework. Figure 4 depicts the proposed SAE approach for input data dimensionality reduction. In Section 4, we analyze the SAE structure adopted in this article in more detail.

Unsupervised Representational Sampling Algorithms
The last step of the proposed unsupervised video summarization algorithm employs traditional representational methods, such as the hierarchical SMRS [6], SMRS [12], K-OPTICS and Kennard Stone [29], to perform the final dance sequence summarization. K-OPTICS combines density-based and centroid-based approaches [17,30]. The idea is implemented as a two-step process. First, the available data are clustered using a centroid-based approach, for example, k-means. Then, within each cluster, a density-based approach, namely OPTICS, is run. The Kennard Stone (KenStone) algorithm is applied in order to generate a training set when no standard experimental design can be implemented. All samples are considered as candidates for the training set, and the selected candidates are chosen sequentially.
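The sequential selection performed by Kennard Stone can be sketched as below. The text does not spell out the selection rule, so this sketch assumes the standard max-min formulation: the two mutually most distant samples seed the set, and each subsequent pick is the candidate whose distance to its nearest already-selected sample is largest. The function name and the use of Euclidean distance are illustrative choices.

```python
import math


def kennard_stone(samples, k):
    """Return the indices of k samples chosen by the Kennard-Stone max-min rule."""
    n = len(samples)
    if not 2 <= k <= n:
        raise ValueError("k must satisfy 2 <= k <= len(samples)")

    # Pairwise Euclidean distances between all candidate samples.
    dist = [[math.dist(a, b) for b in samples] for a in samples]

    # Seed the selection with the two mutually most distant samples.
    first, second = max(
        ((a, b) for a in range(n) for b in range(a + 1, n)),
        key=lambda pair: dist[pair[0]][pair[1]],
    )
    selected = [first, second]

    # Repeatedly add the candidate farthest from its nearest selected sample.
    while len(selected) < k:
        remaining = [c for c in range(n) if c not in selected]
        best = max(remaining, key=lambda c: min(dist[c][s] for s in selected))
        selected.append(best)
    return selected


# Toy 1-D example: the extremes are picked first, then the most isolated point.
picked = kennard_stone([(0.0,), (1.0,), (5.0,), (10.0,)], k=3)
print(picked)  # [0, 3, 2]
```

In the summarization setting, each sample would be the compressed feature vector of one frame, and the returned indices would be the candidate key frames.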
Sparse Modelling for Representative Selection (SMRS) estimates correlations among different frames to extract the key ones. The principle of this scheme is to make the coefficient matrix as sparse as possible so as to achieve reconstruction of the whole dance sequence only from few data samples, that is the representative ones. In our recent work [6], a hierarchical implementation of the SMRS, called H-SMRS has been introduced. This hierarchical approach extracts a set of representative frames using the compressed input data under a hierarchical manner to take into account dance content complexity and fluctuations.

The Proposed SAE Scheme for Dance Sequence Summarization
The structure of the proposed SAE is depicted in Figure 4. As can be observed, an SAE has two modes of operation: the encoding mode and the decoding mode. The goal of training is to minimize a loss function, say L(·), under a mean square error criterion. In particular, if x are the input data, then the loss function is expressed as L(x, g(β(x))). In this notation, β(·) is the overall non-linear function of the SAE encoder, whereas g(·) denotes the non-linear function of the decoder. Therefore, the composition g(β(x)) denotes encoding followed by decoding.
In our particular implementation, three hidden layers are used for the encoding phase. As we move deeper into the encoding hidden layers, the number of neurons per hidden layer is reduced. This forces the encoder to compress the input signals into lower-dimensional transformed versions of them. The input signals $x_k \in \mathbb{R}^n$ of the encoder are the kinematics-driven attributes of the 3D skeletal joint points (see Section 3.1). Variable n denotes the dimension of the input signal, that is, it is equal to the number of frames of the dance sequence N, by the number of joints per dancer M, by the number of dancers D. That is, $n = N \cdot M \cdot D$. In our case, we focus on two dancers and on 40 joints per dancer; thus, M = 40 and D = 2. In addition, N depends on the length of the dance sequence. In the current notation, we have omitted the dependence of the feature vector $x_k$ on time t for simplicity.
The input $x_k$ drives the first hidden layer, which generates a lower-dimensional transformed version of it. In particular, the output of the first hidden layer, $h^{(1)}_k \in \mathbb{R}^{m^{(1)}}$, is given by
$$h^{(1)}_k = f(W_1^T x_k + b_1),$$
where $W_1$ is the encoding weight matrix, $b_1$ is the corresponding bias vector and f(·) is the sigmoid vector-valued function. Variable $m^{(1)}$ denotes the dimension of the output signal of the first hidden layer. It holds that $m^{(1)} \ll n$, so as to yield a compressed version of the input signal $x_k$. In a similar way, the second hidden layer transforms the hidden signals of the first layer (that is, $h^{(1)}_k \in \mathbb{R}^{m^{(1)}}$) into a representation of further reduced dimensionality, $h^{(2)}_k \in \mathbb{R}^{m^{(2)}}$. The new output is given as
$$h^{(2)}_k = f(W_2^T h^{(1)}_k + b_2),$$
where $W_2$ is the weight matrix of the second hidden layer, $b_2$ is the respective bias and f(·) is again the sigmoid vector-valued function. It holds that $m^{(2)} \ll m^{(1)}$, so that a further compression is carried out. In the same way, the output of the second hidden layer, $h^{(2)}_k$, is propagated to the third hidden layer to produce a new reduced version $h^{(3)}_k \in \mathbb{R}^{m^{(3)}}$ of the input signal, with a much lower dimension $m^{(3)} \ll m^{(2)}$. The parameters of the SAE, that is, the matrices $W_i^T$ as well as the biases $b_i$, are obtained through a training procedure that minimizes a least squares loss function L(·). The unsupervised operation of the SAE is to generate output signals that are as close as possible to the input signals $x_k$. This is achieved by minimizing the following loss function:
$$L = \frac{1}{Q} \sum_{k=1}^{Q} \| x_k - \hat{x}_k \|^2,$$
where $\hat{x}_k$ denotes the approximate version of the input signal $x_k$ as generated by the encoder-decoder. This means that $\hat{x}_k = g(\beta(x_k))$. Training is performed over a set of Q samples of the same form as $x_k$. Dropout is used to reduce overfitting during training. Overfitting arises when the training dataset is small, which results in low accuracy on the test dataset. Dropout randomly deactivates neurons of the hidden layers during training. Technically, dropout is achieved by setting the output of some hidden neurons to 0, so that these neurons do not take part in forward propagation.
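The encoder mapping and the reconstruction loss described above can be illustrated with the following sketch. It is a forward pass only and makes several assumptions that are not in the source: the intermediate layer widths (200 and 100) are hypothetical, the weights are random rather than trained, the decoder, back-propagation and dropout are omitted, and the loss is averaged over samples. The input width of 400 and the code width of 48 are the values reported in the experiments.

```python
import math
import random


def sigmoid(z):
    """Logistic sigmoid f(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def layer(weights_t, bias, x):
    """One hidden layer, f(W^T x + b); `weights_t` stores W^T, one row per output neuron."""
    return [
        sigmoid(sum(w * v for w, v in zip(row, x)) + b)
        for row, b in zip(weights_t, bias)
    ]


def encode(layers, x):
    """Apply the stacked encoder layers in order and return the final compressed code."""
    h = x
    for weights_t, bias in layers:
        h = layer(weights_t, bias, h)
    return h


def mse_loss(inputs, reconstructions):
    """Mean over samples of the squared Euclidean reconstruction error."""
    total = sum(
        sum((a - b) ** 2 for a, b in zip(x, x_hat))
        for x, x_hat in zip(inputs, reconstructions)
    )
    return total / len(inputs)


def random_layer(n_in, n_out, rng):
    """Small random weights and zero biases (untrained; for shape illustration only)."""
    weights_t = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights_t, [0.0] * n_out


rng = random.Random(0)
sizes = [400, 200, 100, 48]  # 400 and 48 come from the text; 200 and 100 are hypothetical
encoder = [random_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]
code = encode(encoder, [0.5] * 400)
print(len(code))  # 48
```

In practice such a network would be trained with a deep learning framework; the point here is only that each layer narrows the representation and that the summarization algorithms later operate on `code` rather than on the raw 400-dimensional vector.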

The Hierarchical-Sparse Modelling Representative Selection
A hierarchical implementation of the Sparse Modelling Representative Selection (SMRS) algorithm, called H-SMRS [6], is adopted in this paper for key-frame extraction. H-SMRS is applied to the compressed transformed signals $h^{(n)}_k$ produced by the encoding mode of the SAE, unlike our previous works, where this algorithm was applied directly to the 3D attributes. This way, we discard redundant information existing in the data samples, a process which is very important, especially when multiple humans are dancing in a sequence.
The proposed hierarchical scheme is based on the Sparse Modelling for Representative Selection (SMRS) algorithm [12], which reconstructs the N total frames of the dance sequence from K representatives. The optimization is carried out using the Alternating Direction Method of Multipliers (ADMM) [31]. This method consists of iterative steps that take the Lagrange multipliers into consideration.
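For reference, the SMRS optimization problem is commonly written in the following form in the sparse representative selection literature [12]. The notation is an assumption, since the objective is not stated here: $Y$ collects the N frame feature vectors as columns, $C$ is the N × N coefficient matrix with rows $c^i$, $\lambda > 0$ is a regularization weight and $q > 1$ selects the row norm.

```latex
\min_{C} \; \lambda \, \|C\|_{1,q} + \tfrac{1}{2} \, \|Y - YC\|_F^2
\quad \text{subject to} \quad \mathbf{1}^\top C = \mathbf{1}^\top,
\qquad \text{where} \quad \|C\|_{1,q} = \sum_{i=1}^{N} \|c^{i}\|_q .
```

Under this formulation, the mixed norm drives whole rows of $C$ to zero, and the frames whose rows remain non-zero are the K representatives from which all N frames are reconstructed.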
The traditional SMRS algorithm is sensitive to temporal redundancies. Therefore, it fails to model the temporal dependencies of a choreography. To overcome this difficulty, we introduced in Reference [6] a hierarchical decomposition scheme for the SMRS algorithm, which first detects time intervals on which further decomposition takes place, so as to create hierarchies of key frame representatives. Thus, hierarchical SMRS segments the initial feature space into suitable sub-spaces that better model the choreography. The proposed H-SMRS is able to efficiently describe more complicated choreographic patterns, since the feature fluctuation within a sub-time interval (sub-space) is smaller than the fluctuation over the entire feature space of the sequence. Figure 5 presents an example of the hierarchical decomposition framework (H-SMRS). At the first layer, three representatives are extracted to model the whole video sequence (marked in green). Therefore, the initial video sequence is decomposed into four sub-sequences (intervals), since the first and the last frame are also considered as representatives. Then, we assume that the third of the four video sub-sequences, that is, the interval ∆τ(1, 2), is further decomposed. Here, ∆τ(1, 2) denotes the interval with index 2 at layer 1. The SMRS algorithm is therefore applied within the interval ∆τ(1, 2) to extract the representatives that best fit the frames of this interval. In this example, two representatives are identified, marked in blue at layer 1. Therefore, the video segment of ∆τ(1, 2) is further decomposed into three more sub-segments. This procedure is applied iteratively until the decomposition criterion indicates that no further decomposition is required.

[Figure 5: example of the H-SMRS hierarchical decomposition. Labels: layer l = 0, layer l = 1; legend: representatives, video sub-segment time instances.]

Experimental Results
In this section, we present several experiments to demonstrate the performance of the proposed unsupervised 3D motion summarization framework based on a stacked auto-encoder used to reduce the redundant information. The proposed stacked auto-encoder scheme is evaluated over three different dance sequences (see Section 6.2). Each choreographic sequence is executed by two humans, dancing simultaneously. We present several experiments to demonstrate (i) the encoding capabilities and (ii) the similarity of the automatically selected frames against the ground-truth.
As input data, we use the features presented in Section 3.1. That is, for each human joint we extract the relative coordinates and its kinematics, giving 5 elements (3 for the joint coordinates and 2 for the velocity and acceleration). We recall that we have 40 joints per human dancer. Thus, the total feature space is of dimension 400 (40 joints × 2 dancers × 5 elements per joint).
Due to the presence of two dancers in the sequences, the data are severely noisy. To remove the noise, we first pre-process the data to exclude frames that appear to be noisily represented. This is accomplished by simply thresholding the differences of the joint coordinates among a few consecutive frames. If this difference is greater than a threshold, a severe change has occurred between successive frames, revealing an error in the 3D data encoding, since a dancer (and thus his/her joint coordinates) cannot move far within the grid space between frames of a choreographic performance. Having cleaned the captured data of potentially noisy inputs, we then feed the features into the proposed SAE scheme to obtain a compressed input signal from which redundant information has been discarded.
Once the stacked auto-encoder (see Equation (4)) is trained, we retain the encoder part and project the feature values onto a latent space of lower dimension. In our experiments, we keep only 48 out of the 400 feature dimensions. This number was selected after several experiments, since it gives acceptable performance while keeping the dimension as low as possible. A set of summarization approaches is then applied, including the adopted unsupervised representational algorithms, along with other prominent methods such as K-OPTICS and Kennard Stone [29]. The last step of the analysis involves the calculation of similarity scores and of the time divergence between the summarized frames and a set of key-frames selected by expert users in traditional dances (ground truth data sets). The former is calculated from the correlation scores between each frame of the original dance sequence and all the frames provided by the sampling method. A higher score indicates a better match. Time divergence is simply calculated as the difference in frames, which is equivalent to the difference in time (seconds). In this case, the lower the difference, the better the summarization performs.
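The frame-exclusion step can be sketched as follows. The text only states that coordinate differences among consecutive frames are thresholded, so the details here are assumptions: the first frame is taken to be clean, each frame is compared with the last frame that was kept, the difference is the largest Euclidean displacement over all joints, and the function name and threshold value are hypothetical.

```python
import math


def drop_noisy_frames(frames, threshold):
    """Return the indices of the frames kept after thresholding joint displacements."""
    kept = [0]  # assumption: the first frame is a clean reference
    for t in range(1, len(frames)):
        reference = frames[kept[-1]]
        # Largest displacement of any joint relative to the last kept frame.
        jump = max(math.dist(p, q) for p, q in zip(reference, frames[t]))
        if jump <= threshold:
            kept.append(t)
    return kept


# Toy example with one joint: the third frame jumps implausibly far and is dropped.
frames = [[(0.0, 0.0, 0.0)], [(0.1, 0.0, 0.0)], [(5.0, 0.0, 0.0)], [(0.2, 0.0, 0.0)]]
print(drop_noisy_frames(frames, threshold=1.0))  # [0, 1, 3]
```

Comparing against the last kept frame, rather than the immediately preceding one, prevents a single corrupted frame from also causing the next valid frame to be rejected.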

The Acquisition Module
The acquisition module adopted for modelling the dancers' motion trajectories in 3D space is based on the VICON Motion Capturing System. In our implementation, ten Bonita B3 cameras are used, running the Nexus 1.8.5.61009 h software. The movement area is 6.75 m². The origin of the VICON coordinate system is the centre of the square surface. A wand with markers is used to calibrate the ten cameras. Each dancer's body is measured by attaching passive markers to it at fixed positions. After attaching all the markers, the height, weight and other specific anthropometric characteristics of the users are measured (see Figure 6). The data sets contain three recordings of Greek folklore dances, each performed simultaneously by two professionals. We chose male and female expert dancers since, for those particular dances, the choreographic performance of men and women is slightly different. Specifically, men dance in a proud and imperious manner, while women dance in a modest and humble manner. In contrast, dance style differences among professionals of the same gender are slight and mainly due to the personality of the dancer and how she/he executes the predefined choreography.

Dataset Description
Three dance sequences have been recorded using the VICON motion capturing platform [32]. These sequences correspond to three different performances (dances), each executed simultaneously by two dancers (one male and one female). The recording took place at the School of Physical Education and Sport Science of the University of Thessaly in Trikala, Greece, in January 2019. All sequences are Greek traditional folkloric dances, selected by experts in traditional dance from the Schools of Sport Science of the Universities of Thessaloniki and Thessaly in Greece. The selection covers (i) different types of complexity in the main dance patterns, (ii) circular performances of the dance, (iii) different styles and (iv) different rhythmical tempos. All dancers are professional actors, and each dance was executed twice per actor so as to record different paths of the same choreography. Figure 6 presents a photo of the environment used for the acquisition of the dance sequences using the VICON motion interface.
In Table 1, we also present a brief description of the dances along with their main steps. These steps have been defined by the dance experts who designed the whole choreography, and refer to the main variations of the dance as acquired through the VICON capturing module. Thus, the main steps of the dance in Table 1 do not refer to the steps of the choreography as taught to a trainee dancer, but to the main "activities" of the dance as captured by the digitization unit. Figure 7). Then, the main patterns of the dance stop and the choreography starts from scratch.

Sirtos (3-Beat)
A Greek folklore dance in a slow rhythm performed by both women and men.

Evaluation Metrics
As stated above, the ground truth data have been created by experts in Greek traditional dances, affiliated with the Schools of Sport Science of the University of Thessaloniki and the University of Thessaly in Greece. The ground truth data consist of a set of desired key frames specified by the experts. Let us denote by $g_l$, $l = 1, 2, \dots, L$, the key frames selected by the experts, where $L$ is the number of representative frames they indicated, and by $G = \{g_1, \dots, g_L\}$ the set containing all these frames. Let us also denote by $r_k$, $k = 1, 2, \dots, K$, the representative frames extracted by any summarization algorithm, and by $R = \{r_1, \dots, r_K\}$ the respective set of all $K$ extracted representatives. The indices $l$ and $k$ are actually the frame instances of the ground truth key frames and of the frames extracted by a summarization algorithm, respectively. Thus, one objective criterion for evaluating a summarization scheme is to find, for each of the $K$ frames extracted by an algorithm, the time instance (i.e., the frame index) of the closest expert-selected frame, and then to take the frame index difference between the ideal (expert-selected) frame and the extracted one. In other words,
$$\hat{l}(k) = \arg\min_{l = 1, \dots, L} |l - k|,$$
where $\hat{l}(k)$ is the optimal frame index returned over all $L$ selected frames in $G$ for an examined extracted frame in $R$, say the $k$-th. Note that different extracted key frames $r_{k_1}$, $r_{k_2}$ with $k_1 \neq k_2$ may yield the same selected frame $g_{\hat{l}(k)}$, meaning that some of the $L$ selected frames may not correspond to any of the $K$ extracted key frames. The absolute difference $|\hat{l}(k) - k|$ then describes how close the $k$-th representative frame (extracted by a summarization algorithm) is to the closest ground truth one.
In particular,
$$\mu = \frac{1}{K} \sum_{k=1}^{K} |\hat{l}(k) - k|, \qquad \mu_{\max} = \max_{k = 1, \dots, K} |\hat{l}(k) - k|,$$
where $\mu$ is the average time instance deviation among all $K$ extracted representatives and $\mu_{\max}$ is the maximum deviation (worst case) among all $K$ extracted frames. Another criterion is to estimate how well all frames of a dance sequence can be reconstructed (represented) by the key frames. In our case, this is done by calculating the correlation coefficient of the feature vector of each frame of the dance sequence, $x_i$, $i = 1, \dots, N$, against all representative frames $r_k$, $k = 1, \dots, K$:
$$\rho_i^{\max} = \max_{k = 1, \dots, K} \rho(x_i, r_k),$$
where $\rho(\cdot)$ refers to the correlation coefficient of two vectors. The higher the value of $\rho$, the better the matching of that particular feature vector to a key frame. Thus, by taking the maximum value over all representative frames $r_k$ selected by a summarization algorithm, we estimate the best relation of any frame of the dance sequence to the extracted representatives. If this correlation is high, the extracted key frames can represent all frames of the sequence well. Conversely, a small maximum correlation for some frames means that these frames cannot be reliably reconstructed from the key representatives.
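The two evaluation criteria can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions rather than the evaluation code used in the experiments: the function names are hypothetical, key frames are given as frame instances (frame numbers), the 120 fps default follows the frame rate reported for the capture system, and $\rho$ is assumed to be the Pearson correlation coefficient.

```python
import numpy as np

def time_deviation(extracted, ground_truth, fps=120.0):
    """Return (mu, mu_max) in seconds.

    extracted    : frame instances of the K key frames chosen by an algorithm
    ground_truth : frame instances of the L key frames chosen by the experts
    """
    r = np.asarray(extracted, dtype=float)
    g = np.asarray(ground_truth, dtype=float)
    # Absolute frame difference for every (extracted, expert) pair; shape (K, L).
    diffs = np.abs(r[:, None] - g[None, :])
    # Distance of each extracted frame to its closest expert-selected frame.
    nearest = diffs.min(axis=1)
    return nearest.mean() / fps, nearest.max() / fps

def max_correlation(frames, representatives):
    """Return rho_i^max for every frame: the highest Pearson correlation
    (assumed definition of rho) between that frame and any representative.
    Constant feature vectors have zero variance and would yield NaN."""
    X = np.asarray(frames, dtype=float)           # shape (N, D)
    R = np.asarray(representatives, dtype=float)  # shape (K, D)
    Xc = X - X.mean(axis=1, keepdims=True)
    Rc = R - R.mean(axis=1, keepdims=True)
    Xn = Xc / np.linalg.norm(Xc, axis=1, keepdims=True)
    Rn = Rc / np.linalg.norm(Rc, axis=1, keepdims=True)
    corr = Xn @ Rn.T                              # pairwise correlations, shape (N, K)
    return corr.max(axis=1)

# Toy example: two extracted key frames against three expert-selected ones.
mu, mu_max = time_deviation([100, 400], [110, 340, 900])
print(mu, mu_max)
```

In the toy example, the closest expert frames lie 10 and 60 frames away, so the average deviation is 35 frames (about 0.29 s) and the worst case is 60 frames (0.5 s).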

Dance Summarization Experiments
In this sub-section, we present results of the different summarization algorithms on the above-mentioned dance sequences. In particular, Figure 8 shows the results obtained on the Syrtos (2 beat) dance sequence, which consists of more than 5000 frames, using K-OPTICS as the summarization algorithm. More specifically, we extract 32 key representatives using the K-OPTICS algorithm and then calculate the maximum correlation score $\rho_i^{\max}$ of each frame of the Syrtos (2 beat) sequence against the 32 extracted key frames [see Equation (5)]. As shown in Figure 8, the average $\rho_i^{\max}$ over all 5000 frames (that is, for all $i = 1, \dots, N$) is 0.5 with a variance of 0.25, which is a relatively low score. However, as stated previously, some frames of the dance sequence have been erroneously encoded, mainly due to the simultaneous presence of two dancers in the choreography and the dense occlusions this causes. Thus, if we refine the dance sequence by excluding the frames whose joint coordinates differ between two consecutive frames by more than a threshold (in our case, a rate of change greater than 20% in a joint's coordinates for more than 20% of the joints), the correlation score improves significantly. In particular, the average $\rho_i^{\max}$ then becomes more than 0.6, indicating a good summarization ability. Additionally, the majority of the excluded frames, shown as purple crosses in Figure 8, lie below the average similarity score. This outcome suggests that the rules applied for removing corrupted frames are adequate for the problem at hand. Figure 9 illustrates the summarization performance when the Kennard Stone sampling algorithm is applied to the Syrtos (3 beat) dance sequence.
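The rule for excluding corrupted frames (a rate of change above 20% for more than 20% of the joints between two consecutive frames) can be sketched as follows. The text does not define the rate of change precisely, so this sketch assumes it is the Euclidean displacement of a joint divided by the norm of its previous position; the function name and the array layout `(frames, joints, 3)` are likewise hypothetical.

```python
import numpy as np

def corrupted_frame_mask(joints, change_thr=0.20, joint_frac_thr=0.20, eps=1e-8):
    """Flag frames in which too many joints jump relative to the previous frame.

    joints : array of shape (N, J, 3) holding the 3D joint coordinates per frame
    Returns a boolean array of length N; True marks a frame to be excluded.
    The first frame has no predecessor and is never flagged.
    """
    joints = np.asarray(joints, dtype=float)
    # Displacement of every joint between consecutive frames; shape (N-1, J).
    disp = np.linalg.norm(joints[1:] - joints[:-1], axis=2)
    # Assumed normalization: magnitude of the joint position in the previous frame.
    scale = np.linalg.norm(joints[:-1], axis=2) + eps
    rate = disp / scale
    # Fraction of joints whose rate of change exceeds the threshold, per frame.
    frac = (rate > change_thr).mean(axis=1)
    mask = np.zeros(len(joints), dtype=bool)
    mask[1:] = frac > joint_frac_thr
    return mask

# Toy sequence: 4 frames, 5 joints; two joints (40%) jump at frame 2 and stay there.
toy = np.ones((4, 5, 3))
toy[2:, :2, :] = 2.0
mask = corrupted_frame_mask(toy)
clean = toy[~mask]   # frames kept for the correlation analysis
print(mask)
```

Because the test compares consecutive frames, only the frame where the jump occurs is flagged; a frame that jumps back afterwards would be flagged as well.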
Again, as in Figure 8, the non-corrupted frames in Figure 9 achieve a high average similarity score, close to 0.67, indicating that the summarized sequence can adequately describe (correlate with) most of the originally captured frames. The fluctuations are also limited and appear around frame 1500. Table 2 summarizes the maximum correlation coefficient scores before and after the exclusion of the corrupted frames for all three dances and the four examined sampling algorithms. The correlation scores obtained are about 0.6, revealing a satisfactory performance of the key frames as representatives of the whole dance sequence variation. In this table, the highest correlation values are shown in bold. Figure 10 shows the average difference in frames (time instances) between the frames selected by a specific sampling approach (i.e., a summarization algorithm) and the experts' selected frames (ground truth) for a particular dance, that is, the criterion $\mu$ of Section 6.3. Since the frame rate of the system is 120 fps, a value of 50 indicates that the sampling approach generates frames less than half a second earlier or later than the experts' selection. The impact of using raw versus encoded data is also assessed. The results indicate that SMRS-based approaches perform better than the other summarization schemes, for both raw and encoded data, when we have a single-dancer sequence.
Figure 10. Data input type summarization impact when two dancers performed simultaneously for all the examined dance sequences.
In Figure 10, we also compare the performance of the four summarization methods, that is, K-OPTICS, Kennard Stone, SMRS, and the proposed hierarchical SMRS (H-SMRS). As we can observe, H-SMRS gives the best performance for all dances, with a deviation of around 50 frames (approximately 0.42 s at 120 fps), when encoded frames are used as inputs. The H-SMRS scheme also provides much better performance for the Syrtos (3b) dance, which seems to be more complicated than the other two dances, resulting in higher time deviations for the rest of the samplers. It is also worth mentioning the complex effect of coupling different features and samplers. For example, for Syrtos (2b) the input type does not significantly affect the performance across the four samplers. Table 3 shows the average time deviation between the key frames extracted by the four summarization algorithms and the ground truth data, that is, the value $\mu$, measured here in seconds rather than in frame index differences for clarity. As observed, the best performance is given by the H-SMRS algorithm when the SAE scheme is used. In particular, the highest deviation of H-SMRS occurs for Syrtos (3b) and equals 0.26 s on average, which is in fact a very small value. Similar deviations of 0.23 s and 0.24 s are observed for the other two dances. In the same table, we also present the standard deviation of the time shift with respect to the ground truth data, to show how these values vary. Again, H-SMRS yields the smallest standard deviation, about 0.18 s using the SAE, revealing its robustness compared with the other summarization algorithms.
Table 3. Average time shift between the summarization outcomes and the experts' annotations, with and without the Stacked Auto-Encoder (SAE)-based data compression scheme.

In the same table, we also report the results obtained without the SAE scheme. All summarization approaches except the Kennard Stone algorithm provide better results when the SAE-based compression framework is adopted: both the average time shift and the standard deviation with respect to the experts' annotated frames improve. For the Kennard Stone algorithm, and only for two out of the three dances, the performance remains approximately the same regardless of whether the SAE is used. Table 4 shows how much the average time shift between the four examined summarization algorithms and the ground truth data improves when the SAE-based compression scheme is applied to the raw 3D data, for the Syrtos (3b) dance sequence. The results are given for two different executions of the dance, one with a single dancer and one with two dancers. In the two-dancer performance, the improvement ratio is much greater than in the single-dancer one. Moreover, the combination of H-SMRS with the SAE scheme exhibits a large improvement, which reaches 81.80%.
Table 4. The improvement ratio of the adopted summarization algorithms with and without the SAE framework for the Syrtos (3b) dance sequence. Two different performances of the dance are considered: one with a single dancer and one with two dancers.
Figure 11 provides further insight into the similarity between the key frames extracted by the summarization algorithms and some user-annotated (selected) key frames, allowing a visual judgement of how close they are to the ground truth. The results show five basic postures from the Makedonikos dance. For each of the four summarization approaches, we select the frame closest to the user-annotated reference posture. As observed, the H-SMRS selections are closer to the experts' defined key frames than those of the K-OPTICS, SMRS, and Kennard Stone approaches.
Figure 11. A visual representation of the key frames extracted by the four summarization algorithms versus the ground truth ones for the Makedonikos dance.
Figure 12 demonstrates the encoding capabilities of the adopted SAE scheme. Recall that 400 values have been reduced to 48 and then reconstructed using the SAE. As shown, the representation of the decompressed data [see Figure 12a] is close to the original skeletal data [see Figure 12b] and maintains the two body postures and the general body form despite the high compression (we retain only 48 of the 400 values). However, the positions of the upper limbs' joints have been drawn towards the body core. A better representation could be feasible by increasing the number of training epochs, which, due to the limited number of training samples (that is, dance frames), does not significantly affect the training time. Another important criterion is how much the results vary (fluctuate) around the average values depicted in Figure 10. This is illustrated in Table 3, where the standard deviation of the average time shift is given. In Table 5, we additionally present the minimum (best) and the maximum (worst) deviation [that is, $\mu_{\max}$ of Equation (4)] for all three dances. As we can see, $\mu_{\max}$ reaches 0.72 s for the most difficult dance, Makedonikos, in the case of H-SMRS. For the other two dances, the worst (maximum) deviation of H-SMRS is about 0.5 s, which is much smaller than that of the other summarization schemes and indicates an excellent summarization performance. Regarding the minimum deviation, all the summarization schemes yield excellent performance, meaning that the best results obtained are very satisfactory.
Table 5. The minimum (best) and maximum (worst) time deviation ($\mu_{\max}$) between the key frames extracted by a summarization algorithm and the ground truth data. The comparison is carried out for four summarization algorithms (K-OPTICS, Kennard Stone, SMRS and H-SMRS) and for the three dances. The values are in seconds.


Conclusions
In this paper, we propose a deep stacked auto-encoder scheme followed by a hierarchical Sparse Modelling for Representative Selection (H-SMRS) summarization algorithm for producing accurate synopses of dance sequences. The sequences have been recorded with a motion capturing framework such as VICON, which produces the 3D joint points of the dancers. The originality of this paper lies in the fact that our recorded dance sequences involve two dancers performing simultaneously, which causes severe errors in capturing the humans' joints in 3D coordinate space. Thus, we adopt a stacked auto-encoder (SAE) scheme to reduce the redundant information of the 3D joint points and thereby improve summarization performance compared with applying the summarization algorithms directly to the raw captured data.
Regarding summarization, this paper compares the results of four key frame extraction algorithms: the K-OPTICS scheme, the Kennard Stone algorithm, the conventional SMRS and its hierarchical variant, H-SMRS. Our approach has been evaluated on three real-world dance sequences, each executed by two dancers. The results show that H-SMRS outperforms the other three algorithms for all the examined dance sequences. More specifically, its average time deviation from the ground truth frames annotated by dance experts is less than 0.3 s. Even in its worst case, H-SMRS yields a time deviation of at most 0.72 s, which is an excellent result. The proposed SAE approach also reduces the time required to execute the summarization algorithms compared with applying them directly to the raw data. In this way, summarization becomes applicable to many engineering scenarios.