Due to the presence of two dancers in the sequences, the captured data contain severe noise. To remove it, we first pre-process the data to exclude frames that appear to be noisily represented. This is accomplished by thresholding the differences of the joint coordinates among a few consecutive frames. A difference greater than the threshold means that successive frames differ severely, which reveals an error in the 3D data encoding: a dancer (and thus his/her joint coordinates) cannot move far within the grid space from one frame to the next during a choreographic performance. Having removed the potentially noisy inputs from the captured data, we then feed the features into the proposed SAE scheme to get a compressed input signal in which redundant information is discarded.
Once the stacked auto-encoder (see Equation (4)) is trained, we keep the encoder part and project the feature values onto a latent space of lower dimension. In our experiments, we keep only 48 of the 400 feature dimensions. This number was selected after several experiments, since it gives acceptable performance while keeping the dimension as low as possible. A set of summarization approaches is then applied, including the adopted unsupervised representational algorithms, along with other prominent methods such as K-OPTICS and Kennard Stone [29]. The last step of the analysis involves the calculation of similarity scores and of the time divergence between the summarized frames and a set of key frames selected by expert users in traditional dances (ground truth data sets). The former is calculated as the correlation score between each frame of the original dance sequence and all the frames provided by the sampling method. A higher score indicates a better match. Time divergence is calculated as the difference in frame indices, which is equivalent to a difference in time (seconds). In this case, the lower the difference, the better the summarization performs.
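The projection step can be illustrated with a minimal NumPy sketch of an encoder forward pass. It only shows how 400-dimensional frame features map to a 48-dimensional code; the hidden width of 128, the sigmoid activation, and the random (untrained) weights are assumptions made for illustration and are not taken from the trained SAE.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def encode(features, layers):
    """Pass feature vectors through the encoder half of a stacked auto-encoder.

    features: array of shape (n_frames, 400)
    layers:   list of (W, b) pairs, one per encoder layer
    """
    h = np.asarray(features, dtype=float)
    for W, b in layers:
        h = sigmoid(h @ W + b)
    return h

# Hypothetical, untrained weights; in practice they come from SAE training.
rng = np.random.default_rng(0)
sizes = [400, 128, 48]  # 128 is an assumed hidden width
layers = [(rng.normal(scale=0.05, size=(m, n)), np.zeros(n))
          for m, n in zip(sizes[:-1], sizes[1:])]

frames = rng.normal(size=(10, 400))   # 10 synthetic frames
latent = encode(frames, layers)
print(latent.shape)  # (10, 48)
```

The summarization algorithms would then operate on `latent` instead of the raw 400-dimensional features.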
6.3. Evaluation Metrics
As stated above, the ground truth data have been created by experts of Greek traditional dances. These experts are affiliated with the schools of sport science of the University of Thessaloniki and the University of Thessaly in Greece. The ground truth data include a set of desired key frames, as specified by the experts. Let us denote as $g_{i_l}$ the key frames selected by the experts, with $l = 1, \dots, L$, where $L$ is the number of representative frames indicated by the experts. We also denote as $G$ the set containing all these selected frames, that is, $G = \{g_{i_1}, \dots, g_{i_L}\}$. Let us also denote as $r_{j_k}$, $k = 1, \dots, K$, the representative frames extracted by any summarization algorithm and as $R = \{r_{j_1}, \dots, r_{j_K}\}$ the respective set containing all $K$ extracted representatives. Indices $i_l$ and $j_k$ are the frame instances of the ground truth key frames and of the ones extracted by a summarization algorithm, respectively. Thus, one objective criterion for evaluating the performance of a summarization scheme is to find, for each of the $K$ frames extracted by an algorithm, the time instance (i.e., the frame index) of the experts’ selected frame that is closest to it, and then take the frame index difference between the ideal (experts’ selected) frame and the extracted one. In other words,

$$\hat{l}(k) = \arg\min_{l \in \{1, \dots, L\}} \left| i_l - j_k \right|,$$

where $\hat{l}(k)$ is the optimal frame index returned over all $L$ selected frames in $G$ for an examined extracted frame in $R$, say the $k$-th. Note that different extracted key frames $r_{j_k}$ and $r_{j_m}$ with $k \neq m$ may yield the same selected frame $g_{i_{\hat{l}(k)}}$, meaning that some of the $L$ selected frames may not correspond to any of the $K$ extracted key frames. Then, the absolute difference $d_k = | i_{\hat{l}(k)} - j_k |$ describes how close the $k$-th representative frame (extracted by a summarization algorithm) is to the closest ground truth one. In particular,

$$D_{avg} = \frac{1}{K} \sum_{k=1}^{K} d_k, \qquad D_{max} = \max_{k = 1, \dots, K} d_k,$$

where $D_{avg}$ is the average time instance deviation among all $K$ extracted representatives and $D_{max}$ is the maximum deviation (worst case) among all $K$ extracted frames.
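The time deviation criterion described above can be sketched in a few lines of NumPy. The function name, argument names, and the example indices are hypothetical; the 120 fps default follows the frame rate reported for the capturing system.

```python
import numpy as np

def time_deviation(gt_idx, ext_idx, fps=120.0):
    """Deviation of extracted key frames from the closest expert-selected ones.

    gt_idx:  frame indices of the L ground truth key frames
    ext_idx: frame indices of the K frames extracted by a summarizer
    Returns (nearest, avg_frames, max_frames, avg_seconds).
    """
    gt = np.asarray(gt_idx, dtype=float)
    ext = np.asarray(ext_idx, dtype=float)
    # Absolute index difference for every (extracted, ground truth) pair: (K, L)
    diffs = np.abs(ext[:, None] - gt[None, :])
    nearest = diffs.argmin(axis=1)   # closest ground truth frame per extracted frame
    d = diffs.min(axis=1)            # per-frame deviation, in frames
    return nearest, d.mean(), d.max(), d.mean() / fps

# Illustrative indices only (not taken from the data set).
nearest, avg_f, max_f, avg_s = time_deviation([100, 400, 900], [90, 130, 1000])
print(nearest.tolist(), max_f)  # [0, 0, 2] 100.0
```

In this example two extracted frames map to the same ground truth frame, and the ground truth frame at index 400 is matched by none, which is the situation the text points out.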
Another criterion is to estimate how well all frames of a dance sequence can be reconstructed (represented) by the key frames. In our case, this is done by calculating the correlation coefficient of the feature vector $\mathbf{f}_n$ of each frame $n$ of the dance sequence against the feature vectors $\mathbf{f}_r$ of all representative frames $r \in R$:

$$\rho_n^{\max} = \max_{r \in R} \rho(\mathbf{f}_n, \mathbf{f}_r),$$

where $\rho(\cdot, \cdot)$ refers to the correlation coefficient of two vectors. The higher the value $\rho(\mathbf{f}_n, \mathbf{f}_r)$ is, the better the matching of that particular frame to a key frame. Thus, by taking the maximum value over all representative frames in $R$, as set by a summarization algorithm, we estimate the best relation of any frame of the dance sequence to the extracted representatives. If this correlation is high, then the extracted key frames represent the whole frame sequence well. Instead, a small maximum correlation for some frames means that these cannot be reliably reconstructed from the key representatives.
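A minimal NumPy sketch of this second criterion follows. It assumes the Pearson correlation coefficient between feature vectors and uses synthetic data; the function and variable names are illustrative only.

```python
import numpy as np

def max_correlation(frames, key_frames):
    """Best Pearson correlation of every frame with any key frame.

    frames:     array (N, D), one feature vector per frame
    key_frames: array (K, D), feature vectors of the extracted key frames
    Returns an array of shape (N,). Constant feature vectors would
    produce a division by zero and are assumed not to occur.
    """
    f = np.asarray(frames, dtype=float)
    r = np.asarray(key_frames, dtype=float)
    # Center and normalize each vector so that a dot product equals
    # the correlation coefficient.
    f = f - f.mean(axis=1, keepdims=True)
    r = r - r.mean(axis=1, keepdims=True)
    f /= np.linalg.norm(f, axis=1, keepdims=True)
    r /= np.linalg.norm(r, axis=1, keepdims=True)
    corr = f @ r.T               # (N, K) correlation matrix
    return corr.max(axis=1)      # maximum over the key frames

rng = np.random.default_rng(1)
frames = rng.normal(size=(20, 8))      # synthetic sequence
key_frames = frames[[3, 11]]           # pretend frames 3 and 11 were extracted
scores = max_correlation(frames, key_frames)
print(scores.mean())
```

Averaging `scores` over the sequence gives a single number per dance and sampler, which is how the correlation results are summarized in the experiments.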
6.4. Dance Summarization Experiments
In this sub-section, we present results of different summarization algorithms on the above-mentioned dance sequences. In particular, Figure 8 shows the results obtained on the Syrtos (2 beat) dance sequence, consisting of more than 5000 frames, using K-OPTICS as the summarization algorithm. More specifically, we extract 32 key representatives using the K-OPTICS algorithm and then calculate the maximum correlation score for each frame of the Syrtos (2 beat) dance sequence against the 32 extracted key frames [see Equation (5)]. As shown in Figure 8, the average maximum correlation score over all 5000 frames is 0.5 with a variance of 0.25, which is a relatively low score. However, as stated previously, some frames of the dance sequence have been erroneously encoded, mainly due to the simultaneous presence of two dancers in the choreography and the dense occlusions this causes. Thus, if we refine the dance sequence by excluding the frames whose joint coordinates differ from those of the preceding frame by more than a threshold (in our case, a rate of change of more than 20% in a joint’s coordinates, for more than 20% of the joints), the correlation score improves significantly. In particular, the average maximum correlation score then exceeds 0.6, indicating a good summarization ability. Additionally, the majority of the excluded frames, shown as purple crosses in Figure 8, lie below the average similarity score. Such an outcome suggests that the applied rules for the removal of corrupted frames are adequate for the problem at hand.
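The exclusion rule can be sketched as follows. The text does not define the "rate of change" precisely, so this sketch assumes it is the displacement of a joint between consecutive frames divided by the norm of its previous position; the function name and the array layout `(frames, joints, 3)` are also assumptions.

```python
import numpy as np

def corrupted_mask(joints, rate_thr=0.20, joint_frac_thr=0.20, eps=1e-8):
    """Flag frames whose joints jump too much relative to the previous frame.

    joints: array (T, J, 3) with the 3D coordinates of J joints over T frames
    Returns a boolean array (T,), True where a frame is flagged as corrupted.
    """
    joints = np.asarray(joints, dtype=float)
    prev, curr = joints[:-1], joints[1:]
    # Assumed definition: displacement relative to the previous position norm.
    change = (np.linalg.norm(curr - prev, axis=2)
              / (np.linalg.norm(prev, axis=2) + eps))      # (T-1, J)
    frac = (change > rate_thr).mean(axis=1)                # share of jumping joints
    mask = np.zeros(len(joints), dtype=bool)               # first frame is never flagged
    mask[1:] = frac > joint_frac_thr
    return mask

# Synthetic example: 5 frames, 10 joints, 3 joints jump in frame 2 only.
seq = np.ones((5, 10, 3))
seq[2, :3] = 2.0
print(corrupted_mask(seq).tolist())  # [False, False, True, True, False]
```

Note that this simple consecutive-frame test also flags the frame right after a corrupted one, because the jump back to the correct pose is equally large; comparing against the last accepted frame instead would avoid that.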
Figure 9 illustrates the summarization performance when the Kennard Stone sampling algorithm is applied to the Syrtos (3 beat) dance sequence. Again, as in Figure 8, the non-corrupted frames achieve a high average similarity score, close to 0.67, indicating that the summarized sequence can adequately describe (correlate with) most of the originally captured frames. The fluctuations are also limited and appear around frame 1500.
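For readers unfamiliar with Kennard Stone sampling, the following NumPy sketch shows the standard max-min selection: start from the two most distant samples, then repeatedly add the sample whose distance to its nearest already selected sample is largest. Euclidean distance and the function name are assumptions; the experiments may use a different implementation.

```python
import numpy as np

def kennard_stone(X, k):
    """Select k (>= 2) representative rows of X with the Kennard Stone rule."""
    X = np.asarray(X, dtype=float)
    # Pairwise Euclidean distances via the Gram matrix; this needs O(N^2)
    # memory (about 200 MB of float64 for N = 5000 frames).
    sq = (X ** 2).sum(axis=1)
    d = np.sqrt(np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0))
    # Seed with the two mutually most distant samples.
    i, j = np.unravel_index(np.argmax(d), d.shape)
    selected = [int(i), int(j)]
    # Distance of every sample to its nearest selected sample.
    min_d = np.minimum(d[i], d[j])
    while len(selected) < k:
        min_d[selected] = -1.0          # never pick a sample twice
        nxt = int(np.argmax(min_d))     # farthest from the selected set
        selected.append(nxt)
        min_d = np.minimum(min_d, d[nxt])
    return selected

points = np.array([[0.0], [1.0], [2.0], [10.0]])
print(kennard_stone(points, 3))  # [0, 3, 2]
```

Applied to a dance sequence, each row of `X` would be the (raw or SAE-encoded) feature vector of one frame, and the returned indices are the key frames.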
Table 2 summarizes the maximum correlation coefficient scores before and after the exclusion of the corrupted frames for all three dances and the four examined sampling algorithms. The correlation scores obtained are about 0.6, revealing a satisfactory performance of the key frames as representatives of the whole dance sequence variation. The highest correlation values are shown in bold.
Figure 10 shows the average differences in frames (time instances) between a frame selected using a specific sampling approach (i.e., a summarization algorithm) and the experts’ selected frames (ground truth), for a particular dance; that is, the average deviation criterion of Section 6.3. Since the frame rate of the system is 120 fps, a value of 50 indicates that the sampling approach generates frames less than half a second earlier/later than the experts’ selection. The impact of using raw versus encoded data is also assessed. Results indicate that SMRS-based approaches perform better than the other summarization schemes, for both raw and encoded data, when we have a single-dancer sequence.
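As a worked check of the conversion from frames to seconds at the stated frame rate (the symbols $\Delta t$ and $\Delta n$ are introduced here only for this illustration):

$$\Delta t = \frac{\Delta n}{120\ \text{frames/s}}, \qquad \Delta n = 50\ \text{frames} \;\Rightarrow\; \Delta t = \frac{50}{120}\ \text{s} \approx 0.417\ \text{s} < 0.5\ \text{s}.$$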
In this figure, we also compare the performance of the four summarization methods, that is, K-OPTICS, Kennard Stone, SMRS, and the proposed hierarchical SMRS (H-SMRS). As we can observe from Figure 10, H-SMRS gives the best performance for all dances, with a deviation of around 50 frames (approximately 0.42 s) when encoded frames are used as inputs. The H-SMRS scheme also provides much better performance for the Syrtos (3b) dance, which seems to be more complicated than the other two dances, resulting in higher time deviations for the rest of the samplers. It is also worth mentioning the complex effect of coupling different features and samplers. For example, for Syrtos (2b), the input type does not significantly affect the performance of any of the four samplers.
Table 3 shows the average time deviation between the key frames extracted by the four summarization algorithms and the ground truth data, that is, the average deviation of Section 6.3, measured, however, in seconds rather than in frame index differences for clarity. As observed, the best performance is given by the H-SMRS algorithm when the SAE scheme is used. In particular, the highest deviation of H-SMRS occurs for Syrtos (3b) and equals 0.26 s on average, which is in fact a very small deviation. Similar deviations of 0.23 s and 0.24 s are also noticed for the other two dances. In the same table, we also present the standard deviation of the time shift to the ground truth data to show how these values vary. Again, H-SMRS yields the smallest standard deviation, about 0.18 s using the SAE, revealing its robustness against the other compared summarization algorithms.
In the same table, we also report the results without using the SAE scheme. All summarization approaches, except the Kennard Stone algorithm, provide better results when the SAE-based compression framework is adopted: both the average time shift and its standard deviation with respect to the experts’ annotated frames improve. For the Kennard Stone algorithm, and only for two out of the three dances, the performance remains approximately the same regardless of whether the SAE is used.
Table 4 shows how much the average time shift between the four examined summarization algorithms and the ground truth data improves when the SAE-based compression scheme is applied to the raw 3D data, in the case of the Syrtos (3b) dance sequence. The results are reported for two different executions of the dance, one with a single dancer and one with two dancers. The improvement ratio is much greater for the two-dancer performance than for the single-dancer one. Moreover, the adoption of H-SMRS combined with the SAE scheme exhibits a great improvement, which reaches 81.80%.
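The text does not state how the improvement ratio is computed. A common choice, assumed here, is the relative reduction of the average time shift when moving from raw to SAE-encoded inputs. The function name and the numbers in the example are hypothetical and are not taken from Table 4.

```python
def improvement_ratio(shift_raw, shift_encoded):
    """Relative reduction (in %) of the average time shift.

    shift_raw:     average time shift using raw 3D data (seconds)
    shift_encoded: average time shift using SAE-encoded data (seconds)
    Assumed definition: 100 * (raw - encoded) / raw.
    """
    if shift_raw <= 0:
        raise ValueError("shift_raw must be positive")
    return 100.0 * (shift_raw - shift_encoded) / shift_raw

# Hypothetical values for illustration only.
print(improvement_ratio(1.0, 0.25))  # 75.0
```

Under this definition a positive value means that the encoded features bring the extracted key frames closer to the experts' selection, and a negative value means the opposite.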
Figure 11 provides further insight into the similarity between the key frames extracted by the summarization algorithms and some user-annotated (selected) key frames. This allows us to visually judge the similarity between the key frames extracted by the summarization algorithms and the ground truth ones. The results show five basic postures from the Makedonikos dance. For each of the four summarization approaches, we select the frame closest to the user-annotated reference posture. As observed, the H-SMRS selections are closer to the experts’ defined key frames than those of the K-OPTICS, SMRS, and Kennard Stone approaches.
Figure 12 demonstrates the encoding capabilities of the adopted SAE scheme. Recall that 400 values have been reduced to 48 and then reconstructed using the SAE. As shown, the decompressed data [see Figure 12a] are close to the original skeletal data [see Figure 12b] and maintain the two body postures and the general body form despite the high compression (we retain only 48 of the 400 values). However, the positions of the upper limbs’ joints have been gathered towards the body core. A better representation could be feasible by increasing the number of training epochs, which, due to the limited number of training samples (that is, dance frames), does not significantly affect the training time.
Another important criterion is how much the results vary (fluctuate) around the average values depicted in Figure 10. This is also illustrated in Table 3, where the standard deviation of the average time shift is given. In Table 5, we additionally present the minimum (best) and the maximum (worst) performance [the latter being the maximum deviation of Equation (4)] for all three dances. As we can see, the maximum deviation reaches 0.72 s for the most difficult Makedonikos dance in the case of H-SMRS. For the other two dances, the worst (maximum) deviation of H-SMRS is about 0.5 s, indicating an excellent summarization performance; this deviation is much smaller than that of the other summarization schemes. Regarding the minimum difference, all the summarization schemes yield excellent performance. This means that the best results obtained are very satisfactory.