Article

An Efficient Immersive Self-Training System for Hip-Hop Dance Performance with Automatic Evaluation Features

Graduate School of Informatics, Nagoya University, Nagoya 464-8603, Japan
* Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(14), 5981; https://doi.org/10.3390/app14145981
Submission received: 24 April 2024 / Revised: 1 June 2024 / Accepted: 17 June 2024 / Published: 9 July 2024


Featured Application

Virtual Reality Simulation and Training for Dance Performance Improvement.

Abstract

As a significant form of physical expression, dance demands ongoing training for skill enhancement, particularly in expressiveness. However, such training often faces restrictions related to location and time. Moreover, the evaluation of dance performance tends to be subjective, which necessitates the development of effective training methods and objective evaluation techniques. In this research, we introduce a self-training system for dance that employs VR technology to create an immersive training environment that facilitates a comprehensive understanding of three-dimensional dance movements. Furthermore, the system incorporates markerless motion capture technology to accurately record dancers’ movements in real time and translate them into a VR avatar. Additionally, the use of deep learning enables multi-perspective dance performance assessment, providing feedback to users to aid their repetitive practice. To enable deep learning-based dance evaluations, we established a dataset that incorporates data from beginner-level dances along with expert evaluations of those dances. The dataset was collected from practitioners in a dance studio setting, with a total of four cameras used to record the dances. Expert annotations were obtained from various perspectives to provide a comprehensive evaluation. This study also proposes three unique automatic evaluation models. A comparative analysis of the models, particularly contrastive learning (and autoencoder)-based representation learning and a reference-guided model (where a model dancer’s performance serves as a reference), revealed that the reference-guided model achieved superior accuracy. The proposed method was able to predict dance performance ratings with an accuracy of approximately ±1 point on a 10-point scale, compared to ratings by professional coaches. Our findings open up novel possibilities for future dance training and evaluation systems.

1. Introduction

Dance, a globally recognized medium of expressive movement, requires ongoing training and precise evaluations to enhance performance and expression. In this study, “dance performance” refers to expressions that have the characteristics of dance, such as the power and magnitude of body movements. Modern technology, including motion capture and VR, has revolutionized traditional training methodologies, yielding more effective and realistic training experiences. For example, in learning folk dance, it has been shown that subjects who received effective feedback achieved better performance by using a VR application to mimic the movements of a professional dancer [1]. Also, in learning salsa dancing, it has been reported that an interactive application in the form of a VR game improved learning effectiveness through visual and tactile feedback [2].
VR technology, through its ability to provide a virtual environment, is predicted to revolutionize various types of training, from disaster simulations to vocational courses. For example, studies conducted on six projects (medical training, industrial training, safety instruction training, disaster management training, procedural skills training, and dance training) [3] compared the effectiveness of learning in virtual and real environments and concluded that using VR is more effective than traditional “watching videos” learning. VR is also being rapidly adopted in construction engineering education and training, with a wide range of applications reported, including health and safety training and training for equipment operation tasks [4]. In dance training, VR enables dancers to practice within immersive environments by simulating diverse settings such as authentic stages or dance studios. To utilize VR more efficiently for dance, we have developed a more streamlined and practical motion capture system using two cameras. This system accurately captures the holistic movement of dance in real time in a three-dimensional format, allowing for the capture of body movements in real time within a VR space. The realistic visual feedback provided by the VR environment is projected to enhance physical sensation and expression in dance.
Accurate feedback is vital for supporting dance practice. However, the evaluation of dance is largely subjective due to a lack of consistent evaluation standards in an art form characterized by diverse movement patterns. Most existing dance practice support systems, such as Dance Commune (https://dancecommune.com/), rely on postural congruency as feedback, which is distant from expert evaluation and guidance. Therefore, with the aim of facilitating more comprehensive and objective evaluations, we have created a new dataset for automatic dance performance evaluation, and we propose an automatic evaluation model using deep learning.
There are many dance games that do not use VR, but these games basically only use controllers to evaluate timing, not body movement. Dance games in VR also exist, but their training effects have not yet been demonstrated. However, previous studies have shown that the use of VR can increase the immersion of the lesson and make it easier to understand the movements in three dimensions. The proposed system would also benefit from the ability to confirm one’s own movements from a third-person perspective. The research question of this study is whether the effectiveness of VR can be objectively demonstrated for training that focuses mainly on physical movements, such as dance.
We examine the architecture and parameters of the automatic dance evaluation model and establish a baseline in this dataset. We compare feature extraction methods based on Laban Movement Analysis (LMA) with representation learning using supervised contrastive learning. Additionally, we consider a reference-guided model that uses the performance of a model dancer as a reference. The objective is to identify the most effective learning model. Our proposed evaluation framework can be applied to other sports besides hip-hop dancing, such as golf, where the golf swing can be evaluated via comparison to a role model.

2. Related Work

In this section, we examine prior research pertinent to our system. We first discuss previous dance evaluations that bear direct relevance to our system. Subsequently, we delineate past research on training systems utilizing virtual spaces and motion-tracking techniques. Finally, we cover contrastive learning, the deep learning technique employed in this study.

2.1. Dance Motion Analysis and Evaluation

This review of related work begins with an exploration of LMA, a critical framework in conventional dance performance evaluation. This study applies a technique that employs LMA features as a standard for comparison and contrasts it with the method we propose. Bernstein et al. [5] described LMA as a comprehensive and effective method for observing, describing, notating, and interpreting human movement. Their study served as an important groundwork, as it presented a technique for automatically recognizing Laban features using skeletal information obtained from Kinect. Hachimura et al. [6] used dance movement data obtained with motion capture technology to investigate the principal effort and shape components of LMA. They established corresponding bodily feature values and, by observing temporal variations in those values, identified body movements corresponding to LMA components. These were then compared with evaluations from LMA experts. The present study employed a random forest model that leverages their LMA features as the baseline.
Further extending the use of LMA, Aristidou et al. [7] innovatively proposed a VR system for the evaluation of ethnic dances. The system introduced a motion analysis and comparison framework that employs LMA to compare user performance with a 3D avatar performing ethnic dance and provide feedback. Meanwhile, H. J. Kim et al. [8] studied the perceptual properties of the LMA effort element, exploring its potential to support applications that create or perceive expressive movement. Complementing this, Ajili et al. [9] proposed a new descriptor vector for expressive motion based on LMA, aiming to encapsulate human motion and emotion in a compact and informative representation. Deep learning methodologies for dance performance evaluation have also been investigated. S. Wang et al. [10] proposed a dance emotion recognition method, which combines convolutional neural networks (CNNs) and long short-term memory (LSTM) networks, with LMA serving as the foundation. On a different note, Lei et al. [11] crafted a new algorithm designed to extract the correspondence between music and dance movements, demonstrating the interplay between different forms of art.
Pushing the boundary further, Zhai [12] designed a dance motion recognition algorithm based on attribute mining, an approach that considers the dancers’ attribute features to capture more expressive motion features. Jin et al. [13] took a specific aspect of dance (i.e., the “stops”) and proposed a detection and visualization system to support remote dance practice.
Adding a technological touch to rhythm assessment, Dias Pereira Dos Santos et al. [14] proposed a machine learning method that harnesses motion sensor data to assist in evaluating rhythm skills in ballroom dancing. The study effectively combined coaches’ expertise with machine learning’s quantitative precision.
Several studies have targeted training methods based on dance performance evaluation. Davis et al. [15] provided strategies to boost learning efficiency in dance through a virtual training system based on a behavioral skills training framework. Meanwhile, Choi et al. [16] proposed a smartphone application for dance self-learning, introducing a novel method for evaluating various dance poses.
Using the power of visualization, Guo et al. [17] presented DanceVis, a system for visualizing group performance in cheerleading and dance, with the aim of enhancing training efficiency. Krasnow and Chatfield [18] contributed to the assessment tools field by proposing a new measurement tool, the PCEM, for assessing qualitative aspects of dance performance.
Recognizing the importance of video processing, Guo et al. [19] proposed a method for analyzing dance videos to measure physical coordination in online cheerleading and dance training. Gupta et al. [20] developed a learning game that combines dance and gamified learning, showing how technology can make dance learning more engaging and effective.
Esaki and Nagao [21] presented a VR dance training system similar to the one in this study. Their evaluation model was trained on data from four novice college student dancers, with machine learning structured as a classification problem. Instead of providing individual scores for each of the four performance aspects, it identified the lowest-scoring aspect. The present study expanded the dataset, increasing the evaluation categories to 6 and collecting data from 18 dance practitioners of various ages. This enriched dataset mirrors diverse performances and actual dance studio environments, thereby improving data reliability. Unlike their approach of identifying the weakest aspect, our model provides individual scores for each category, enabling detailed evaluation and feedback. Furthermore, the number of keypoints for pose estimation has been increased. Esaki and Nagao used 17 keypoints, whereas this study increased them to 33 to enhance the capture of the dancer’s detailed movements.

2.2. Training Support in Virtual Space

Our study developed an immersive virtual environment for dance training. Previous studies have explored similar concepts. Chan et al. [22] created a dance training system with 3D VR visualization and motion capture. Iqbal and Sidhu [23] used AR technology to improve dance skill acquisition and retention, finding that its ease of use and usefulness increased user acceptance despite some challenges. These studies helped improve the user experience in the VR environments we built.
Kico et al. [1] examined a VR application for folk dance learning, where they used an avatar of a professional dancer and optical motion capture for feedback and evaluation. D. Li et al. [24] found a 30% increase in learning efficiency when applying VR in physical education and sports training. These studies are consistent with the objectives of our VR-based dance training system, and we were able to incorporate their findings. Senecal et al. [2] developed an interactive VR game for salsa dance training with features like a virtual partner and visual and haptic feedback. P. Wang et al. [4] highlighted the growing adoption of VR in construction engineering education, which influenced our approach to developing a VR-based system. Xie et al. [25] discussed VR training’s challenges, solutions, and evaluation methods, providing insights that will be adopted in our future research. Taken together, these studies helped us to improve the effectiveness and user experience of our system. Ahir et al. [26] summarized VR’s development and potential for enabling group practice in sports at home. Lastly, Magar and Suk [3] recognized VR’s potential in various training fields, including dance, emphasizing pre-training, self-learning, and feedback. Izard et al. [27] constructed a VR simulator for scoliosis surgery, showing the potential of 3D recording in education. These studies touched on advanced visualization techniques and were helpful to us in considering the creation of highly immersive dance training environments and future directions.

2.3. Motion Capture

To enhance training immersion and enable efficient automated dance performance evaluation, this research, building upon key studies in pose estimation using cameras, introduces a system that uses 3D motion tracking to map a user’s body movements to an avatar in a virtual space.
Mediapipe Pose [28] is a comprehensive open-source library dedicated to pose estimation. It is capable of identifying 33 keypoints from images and videos, offering real-time responses with impressive precision. Owing to its capability to swiftly detect essential dance elements like ankle movements and hand orientations, we chose Mediapipe Pose as the 2D pose estimation tool for our research. Grishchenko et al. [29] and Bazarevsky et al. [30] developed methods for real-time pose estimation from images, which were utilized in this study. F. Zhang et al. [31] proposed a decoding method for heat maps, the basis for skeletal detection, emphasizing its importance for performance. Jiang et al. [32] introduced a real-time multi-person pose estimation framework that tackles common challenges in 2D pose estimation. ViTPose, developed by Y. Xu et al. [33], employs a vision transformer for efficient pose estimation, considering model simplicity and scalability. ZoomNet, presented by L. Xu et al. [34], is a unified network model designed for pose estimation, with a focus on the hierarchical human body structure and scale variations.
Several studies have addressed 3D motion capture with multiple views. Y. Zhang et al. [35] developed a real-time algorithm for multiple-person motion capture using multi-view video input. Desmarais et al. [36] reviewed various methods and their structures for human pose estimation, while Kanko et al. [37] and L. Chen et al. [38] proposed alternative approaches for human gait kinematics and multi-person 3D poses. Zeng et al. [39] and He et al. [40] proposed frameworks that improve efficiency and handle limitations like occlusion and oblique viewing angles. J.-W. Kim et al. [41] and Iskakov et al. [42] introduced methods for 3D pose estimation based on a humanoid model and a learnable triangulation method, respectively. Zhu et al. [43] proposed a pre-training phase to handle noisy 2D observations, and Y. Zhang et al. [44] presented a lightweight motion capture system using sparse multi-view cameras. Malleson et al. [45] introduced a real-time motion capture system combining multiple cameras and IMUs, and Tu et al. [46] presented an approach that aggregates features from all camera views in a common 3D space for robust pose estimation.
Various studies have focused on datasets for motion-tracking learning models. Tsuchida et al. [47] developed a large, unique street dance database with 13,939 videos from 40 dancers across ten genres, underscoring the lack of attention to dance information processing. R. Li et al. [48] enhanced the AIST Dance Video Database with 2D and 3D keypoints and proposed a cross-modal transformer network, which significantly benefited this study’s data collection. Ionescu et al. [49] constructed the Human3.6M dataset, which comprises 3.6 million poses from typical human activities, making it invaluable for training human sensing systems and evaluating pose estimation models. Calabrese et al. [50] introduced DHP19, a benchmark dataset for human pose estimation using event-based cameras, demonstrating the applicability of CNNs in this context.

2.4. Contrastive Learning

For automatic dance evaluation, this study employs a deep learning approach known as contrastive learning, a form of representation learning effective even with smaller datasets.
T. Chen et al. [51] simplified self-supervised contrast learning via a framework named SimCLR, underscoring the importance of data augmentation. This was useful in our study due to the small dataset we used. X. Chen and He [52] proposed SimSiam, improving Siamese networks for unsupervised representation learning. J. Li et al. [53] put forward Prototypical Contrastive Learning (PCL) and ProtoNCE loss, which outperform other contrastive learning methods, especially in low-resource transition learning. Singh et al. [54] applied contrastive learning to video recognition, which showed better performance than semi-supervised image recognition methods. Khosla et al. [55] explored supervised contrastive learning. They proposed a method to segregate clusters from different classes in the embedding space to improve accuracy and stability in hyperparameter settings. Their supervised contrastive learning was used in this study’s automatic dance evaluation model.

3. Immersive Dance Training System

This section begins by presenting an overview of the immersive self-training dance system, followed by an explanation of the camera-based motion capture mechanism. Lastly, it details the dance feedback provided by the system.

3.1. System Overview

This study developed a VR-based self-training system for dance that enables students to practice within a home setting without an instructor while simulating the feel of a real lesson. The system aids dance practitioners in understanding 3D movements and allows them to practice in an environment that mimics actual production settings, such as a stage. The VR Dance Training System consists of a VR application that provides dance training, a motion capture system that performs real-time full-body motion capture, and an evaluation server that performs automated dance evaluations. The hardware consists of a VIVE XR Elite headset, a PC for motion capture, two USB webcams, and a router for Wi-Fi connectivity (Figure 1). It is also capable of pass-through, hand tracking, and simple gesture recognition. The dance training system was developed as a standalone VR application using the Unity game engine. The PC used for motion capture was a MinisForum NAB6 with USB connectivity for multiple webcams. The motion capture PC and headset are connected to the same network via Wi-Fi, and data are sent and received in both directions using WebSocket communication (SimpleWebSocketServer Version 0.1.2 downloadable at https://pypi.org/project/simple-socket-server/ accessed on 10 June 2024). For dance evaluation, a dedicated server was built and designed to send dance motion data via HTTP and return the evaluation results to the system.
The workflow of the system is depicted in Figure 2. The system is divided into three main phases: practice, evaluation, and reflection. The user materializes as an avatar within a VR dance studio, selects a lesson to practice from the VR panel, and transitions to the practice scene where a reference model avatar demonstrates a model dance. The user watches the model dance and imitates the moves step by step to learn the choreography and practice the dance. All the user’s dances are automatically recorded and can be replayed at any time. In addition, pressing the “Evaluate” button will take the user to the evaluation scene, where the last dance practiced will be evaluated. The dance is assessed using the automatic evaluation model, which will be described in a subsequent section. In the reflection phase, the user can replay their dance moves and look back at their performance. The system visualizes the evaluation results, allowing the user to recognize points to improve and gain insights into their dance performance. Overall, this iterative process helps the user improve their dance skills effectively through continuous practice, evaluation, and reflection.
The developed system is depicted in Figure 3. Within the virtual environment, an avatar serving as a model instructor is presented, allowing the user to engage in a dance routine alongside this avatar coach.
In the practice scene, it is possible to play, stop, and rewind the tune. In addition, the playback start and end positions can be specified, allowing the user to concentrate on a specific part of the choreography for repetitive practice. Playback speed can be adjusted from 0.5× to 1.5×, allowing users to practice choreography in slow motion. The avatar is displayed in front of the user’s left by default, but it can be moved to any position, allowing the user to observe it in three dimensions from any direction. Users can record and replay their own dances, which can then be replayed side by side with the model avatar for comparison as shown in Figure 4.
In the evaluation scene, the user performs the dance from the beginning to the end of the tune, and the system automatically evaluates the performance. In the review scene, the system feeds the evaluation results back to the user. As shown in Figure 5, the user compares his/her own dance with the model’s dance using the replay function and recognizes areas for improvement based on the objective evaluation. This allows the user to continue practicing while being aware of the movements that need to be improved.

3.2. 3D Motion Capture Using Two Cameras

This system employs cameras to capture the user’s movements. Specifically, as depicted in Figure 6, two webcams connected to a PC are used to record the user, with posture estimation performed using the Mediapipe Pose model [28] (Mediapipe Version 0.9.0.1 downloadable at https://pypi.org/project/mediapipe/ accessed on 10 June 2024). This open-source software can detect 33 keypoints in real time from images and videos, including crucial dance elements like ankle movements and hand directions. The 3D coordinates are calculated from the 2D keypoint coordinates obtained from each viewpoint and the corresponding camera projection matrices (computed from the intrinsic and extrinsic parameters of each camera). The calculation is performed by the singular value decomposition (SVD) method, which algebraically estimates the optimal 3D coordinates from the two views [42]. The 3D pose data are transmitted to Unity via UDP communication and are reflected in the VR avatar using Forward Kinematics.
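To make the triangulation step concrete, the following sketch shows how a single 3D keypoint can be recovered from its two 2D detections and the two cameras' projection matrices using the standard direct linear transformation (DLT) solved with SVD. The function names and the assumption that the 33 Mediapipe keypoints have already been detected in both views are illustrative; this is not the system's actual source code.

```python
import numpy as np

def triangulate_point(p1, p2, P1, P2):
    """Estimate a 3D point from two 2D observations via DLT + SVD.

    p1, p2 : (x, y) pixel coordinates of the same keypoint in camera 1 and 2.
    P1, P2 : 3x4 projection matrices of the two calibrated cameras.
    """
    A = np.array([
        p1[0] * P1[2] - P1[0],
        p1[1] * P1[2] - P1[1],
        p2[0] * P2[2] - P2[0],
        p2[1] * P2[2] - P2[1],
    ])
    # The least-squares solution is the right singular vector associated
    # with the smallest singular value.
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]  # convert from homogeneous to 3D coordinates

def triangulate_pose(keypoints_cam1, keypoints_cam2, P1, P2):
    """Triangulate all 33 Mediapipe keypoints for one frame.

    keypoints_cam1, keypoints_cam2 : (33, 2) arrays of 2D detections.
    """
    return np.array([
        triangulate_point(k1, k2, P1, P2)
        for k1, k2 in zip(keypoints_cam1, keypoints_cam2)
    ])
```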

3.3. Extrinsic Camera Calibration

Calibrating external camera parameters is vital for 3D multi-view analysis. Traditional methods, which often use 2D features like chessboard corners, struggle with wide camera spacings. Our approach uses human skeletal data from Mediapipe to automatically adjust camera settings as a user moves in a predefined space. This movement allows the camera to collect data without over- or under-collecting. We also introduce room calibration for global positioning, where the user’s initial pose and the height of their VR headset are used for scale transformation.
However, this approach has a limitation: it is highly sensitive to the user’s initial pose. Because the frontal plane is determined based on the orientation of the body, it contains a large amount of prediction error, which can introduce instability and affect the accuracy of the calibration. To mitigate this issue, we have developed a method that utilizes the movement trajectory of the VR headset for a more robust estimation that is less dependent on the user’s initial pose. Specifically, this new method synchronizes the headset’s trajectory data with the ‘nose’ keypoint coordinates obtained from pose estimation to achieve a more accurate and stable calibration.
In contrast to prior research [56,57] that utilized human skeletons as oriented point clouds rather than 2D or 3D corresponding points, our method avoids time-consuming bundle adjustments often required for system optimization. This is particularly beneficial in our case, where the system comprises only two cameras, making the benefits of bundle adjustment marginal at best. Although our method necessitates specific user movements for calibration, it enables rapid and straightforward calibration without the computational overhead of bundle adjustment.
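As a rough illustration of the trajectory-based room calibration described above, the sketch below aligns the reconstructed 'nose' trajectory with the headset trajectory using a Umeyama-style similarity transform (scale, rotation, and translation). It assumes the two trajectories have already been time-synchronized and resampled to equal length; the function is a simplified stand-in, not the system's implementation.

```python
import numpy as np

def similarity_transform(src, dst):
    """Find scale s, rotation R, translation t so that s * R @ x + t maps src onto dst.

    src : (N, 3) reconstructed nose positions (camera coordinate frame).
    dst : (N, 3) headset positions (VR room coordinate frame), time-synced with src.
    """
    mu_src, mu_dst = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - mu_src, dst - mu_dst

    # Kabsch/Umeyama: SVD of the cross-covariance matrix.
    U, S, Vt = np.linalg.svd(dst_c.T @ src_c)
    d = np.sign(np.linalg.det(U @ Vt))
    D = np.diag([1.0, 1.0, d])            # guard against reflections
    R = U @ D @ Vt
    s = (S * np.diag(D)).sum() / (src_c ** 2).sum()
    t = mu_dst - s * R @ mu_src
    return s, R, t

# Applying the transform maps any reconstructed keypoint into room space:
#   X_room = s * R @ X_camera + t
```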
Table 1 compares our technique with earlier studies with respect to various evaluation metrics. These metrics include the Riemannian metric $E_R$ for rotation matrices, the root mean square error $E_t$ for translation vectors, and the root mean square error $E_{3D}$ for the accuracy of reconstructed 3D joint positions. $N_c$ is the number of cameras used. Our dual-camera approach not only delivers high real-time performance but also comes close to achieving the highest accuracy when compared to more complex methods that utilize five cameras and bundle adjustment.
The environment in which camera calibration is performed is as follows. Automatic calibration is performed by walking behind the avatar that appears in the MR space. If the estimated camera position is significantly off, the position can be adjusted manually as shown in Figure 7.

3.4. Automatic Evaluation and Feedback

When conducting an evaluation of a dance, determining the appropriate evaluation items is crucial. The evaluation items should be coherent and independent of each other, not merely numerous. Furthermore, the content must be easily understandable to the users being evaluated. The evaluation items were determined based on the findings of previous studies and the opinions of experts obtained through data collection.
The Laban body movement analysis theory proposed four Effort attributes—weight, space, time, and flow—to capture movement intensity, rhythm, direction, and continuity. In a previous study on emotion recognition and interaction in dance physical expression [58], factor analysis was applied to the impressions received from dance and three elements were defined: dynamics (intense change), expansiveness (spatial extent), and stability (balance of movement and harmony of expression). These factors capture the expressive diversity of dance and play an important role in the selection of evaluation items.
The data collection included advice given by a dance expert who has been dancing for over ten years and is an instructor at a dance school, and the most frequent advice given during instruction was tabulated (Table 2). For example, the advice ‘dance as hard as you can because you are nervous and your body is not moving’ was related to strength and dynamics, while the advice ‘dance bigger overall’ was related to the size and spatial characteristics of the movement. These pieces of advice are considered to reflect important perspectives in the teaching of dance in practice.
Based on the above, six evaluation items were determined, as shown in Table 3.
This study proposes an automatic evaluation model for dance based on deep learning. Automatic evaluation is defined here as a problem of predicting the expert’s evaluation of features related to expression such as dynamics, sharpness, scalability, timing, accuracy, and stability, where the prediction is based on the input of time-series information on the position, rotation, and speed of the major joint points of the human body as motion data. The automatic evaluation model was trained in two steps. First, an encoder was trained to convert motion data into features. Two methods were used to learn the encoder: supervised contrast learning and autoencoder. Then, the evaluation model was trained to predict the expert’s evaluation using the obtained features as input.
Following the automatic evaluation of the dance, users receive feedback to aid them in practicing their dances while being mindful of areas requiring improvement. The feedback comes in two forms: 3D feedback via dance replays and visual feedback through graphs.
Dance replay feedback allows users to review their dance and compare it to a model dance. This facilitates the identification of specific movement areas for improvement, which can be applied to subsequent practice.
Graphical feedback first presents the evaluation results on a radar chart. The evaluation is performed for each one-second segment, but the visualization results show the average of the evaluation values for the entire dance. This offers insights into the perspectives to concentrate on during practice. Moreover, it allows users to visually track their progress by comparing past results. Along with this, behavioral coaching is provided to give feedback on specific actions, for instance, “Your movements are becoming smaller; let’s extend your arms and legs a bit further and make larger moves.” Examples of these types of feedback are illustrated in Figure 8. The system provides user feedback by displaying the assessment outcomes from the automated evaluation model in a radar chart, as detailed below. Along with this, it offers pre-prepared personalized suggestions based on these evaluation results, equipping users with actionable insights for their subsequent practice sessions. In addition, the number of times advice was given for each item is recorded, and if the evaluation values are close, the item for which less advice was given is prioritized. This is because feedback on the same thing over and over again is considered to be less effective. In order to provide feedback tailored to the user’s characteristics, the amount of change in the advice and its related evaluation value are recorded, and the advice with a greater amount of change is judged to be effective by that user. Then, it is adjusted for each user so that such advice is preferentially selected thereafter. Furthermore, the evaluation results are stored independently for each recording, empowering users to track their progress over time by comparing their current performance with previous evaluations.
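One way to implement the advice-selection logic described above is sketched below. The data structures, the tie margin, and the function name are assumptions made for illustration; the actual system may organize this differently.

```python
def select_advice(scores, advice_counts, advice_effect, tie_margin=0.5):
    """Pick the evaluation item to give advice on.

    scores        : dict item -> latest evaluation value (1-10).
    advice_counts : dict item -> how many times advice was already given.
    advice_effect : dict item -> average score change after past advice
                    (larger = the advice seemed to work for this user).
    """
    # Start from the lowest-scoring items.
    ordered = sorted(scores, key=scores.get)
    worst = ordered[0]
    candidates = [i for i in ordered if scores[i] - scores[worst] <= tie_margin]

    # Among near-tied items, prefer the one advised less often,
    # then the one whose past advice produced the largest improvement.
    return min(candidates,
               key=lambda i: (advice_counts.get(i, 0), -advice_effect.get(i, 0.0)))

# Example usage with hypothetical values:
scores = {"dynamics": 4, "sharpness": 4, "scalability": 7,
          "timing": 6, "accuracy": 8, "stability": 5}
advice_counts = {"dynamics": 3, "sharpness": 1}
advice_effect = {"dynamics": 0.5, "sharpness": 1.2}
print(select_advice(scores, advice_counts, advice_effect))  # -> "sharpness"
```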
As depicted in the radar chart, our system assesses dance performance across six categories, each of which critically informs the overall evaluation of a dance performance.

4. Automatic Evaluation of Dance Performance

This section elucidates the process of automatic dance performance evaluation. It first outlines the procedure for generating the dataset needed for model training and then describes the annotation system used in creating it. Subsequently, it details the input data, preprocessing techniques, teacher data, model architecture, and the methods employed for model training and evaluation.
In our method, as shown in Figure 9, the user’s dance movements are embedded by the motion encoders. The encoded user motions are then compared to the expert’s movements as a reference. Each evaluation model, corresponding to the six aforementioned metrics, assesses different aspects of dance performance. The evaluation results are used to query the advice database, generating specific feedback. Finally, the system provides the user with detailed feedback and a radar chart visualizing their performance across evaluated criteria.

4.1. Dataset

Data were gathered from 20 students, ranging from elementary school pupils to university students, and 1 professional dance coach, all of whom were involved in dance studio classes (although the statistical power of this sample is low). Subjects were gathered from a wide range of age and skill groups. The data collection took place following a one-hour lesson dedicated to learning the choreography, which implied that the students were still working on enhancing their individual expressions. The recorded dance was a single type of dance performed to a hip-hop tune at 161 counts per minute and lasting for 40 counts (approximately 20 s). Each of the 20 subjects and the instructor, who acted as a reference model, performed and recorded the dance twice.
The dataset for this study consists mainly of video data of dances, 3D motion data, and evaluation annotation data for those dances. After the recording, the coach provided time for verbal advice to the student, which was also recorded on video. The content of this feedback was used to build a database of advice for use in this system.

4.1.1. Video Data

The dance was recorded utilizing cameras positioned at each corner of the room. Employing four cameras ensures the capture of each dance move from diverse perspectives. All cameras were synchronized, and they documented the dance movements at a rate of 60 frames per second.

4.1.2. Motion Data

Pose estimation was conducted for all four videos captured using the Mediapipe Pose model [28]. The SVD method was used to calculate the 3D coordinates from the 2D coordinates and camera parameters acquired from each viewpoint. This method estimates the optimal 3D coordinates from each viewpoint in an algebraic fashion [42]. The 3D pose estimation process incorporates camera parameters, such as the projection matrix, which are also included in the publicly available dataset.

4.1.3. Evaluation Data

A professional dance instructor provided evaluation annotations to the dances. The instructor rated the complete dance performance on a scale of 1 to 10, considering six aspects: dynamics, sharpness, scalability, timing, accuracy, and stability. The evaluation utilized the annotation system described in the following subsection.
The dataset content is summarized in Table 4.

4.2. Dance Annotation System

For dance annotation, a web-based application was developed to evaluate dance performances (Figure 10). This application enables users to assess each performance item while viewing a dance video and to save the outcomes. Evaluations are segmented according to pairs of playback start and end positions of the video (referred to as dance segments), permitting users to clearly demarcate the portion of the dance that is being assessed.
On the evaluation screen, the annotator can play videos and assess the dance performance. The annotator then watches the entire dance video and evaluates all dance segments in it. The “Next” and “Previous” buttons enable the annotator to navigate between the dance videos to be evaluated. Clicking the “Evaluate” button located in the upper right corner brings up the evaluation panel, where scores ranging from 1 to 10 can be selected for each item. The evaluation results are stored on the server but can also be cached locally using the “Download” button.
The application provides the user the ability to customize the video list and dance segments and freely define the labels and rating options for the evaluation items. The user can adapt these features to evaluate other dance genres in the future using different metrics. The user can also specify the ID of each video along with its corresponding URL. With this application, dance performances can be efficiently assessed, thereby promoting improvement in each performance.

4.3. Evaluation Model Architecture

The evaluation model takes motion data as the input and predicts an expert’s evaluation score of 1–10. The model was designed as a class classification problem in which each evaluation item is trained separately, each score of 1–10 is treated as a different class, and the model predicts which class it belongs to. The architecture of the evaluation model utilized the learned encoder model described in Section 4.4 and Section 4.5 for feature extraction. The encoder extracts features from the dance motion data, performs further feature extraction using MLP on the resulting representation vector, and finally predicts the classification probability for each class through the softmax layer. The architecture of the evaluation model is shown in Figure 11.
The final score is calculated by a weighted average of the classification probabilities into each class output by the evaluation model.
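The sketch below shows, in PyTorch, how such a head could be built: an MLP over the encoder representation, a softmax layer over the ten score classes, and the final score computed as the probability-weighted average of the classes. The hidden-layer size and the frozen-encoder assumption are illustrative, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class EvaluationHead(nn.Module):
    """MLP head that turns an encoder representation into a 1-10 score."""
    def __init__(self, feature_dim, num_classes=10, hidden_dim=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feature_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, features):
        logits = self.mlp(features)
        probs = torch.softmax(logits, dim=-1)          # classification probabilities
        scores = torch.arange(1, 11, dtype=probs.dtype, device=probs.device)
        expected_score = (probs * scores).sum(dim=-1)  # weighted average over classes
        return probs, expected_score
```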

4.4. Encoding Model with Autoencoder

When an autoencoder [59] (see Appendix A) is applied to dance feature extraction, it should be possible to obtain more essential representations by capturing features of dance movement such as speed, rhythm, and continuity of movement. These features are important elements in analyzing and evaluating the quality of dance.
The motion data that serve as the input and output of the autoencoder are the poses (joint positions and rotations) at each point in time, arranged in chronological order. The time series length was set to four counts (approximately 1.5 s). The encoder was designed to first compress the pose information at each time point and convert it into spatial features, and then capture temporal motion features such as motion and continuity over the entire frame.
GCN [60] (see Appendix B) can be applied to data with a skeleton structure, such as the human skeleton, to analyze the relationship (edges) between body parts (nodes) and joints. For example, it is possible to analyze the relationship between multiple parts of the body, such as the interaction between the movements of the hands and feet, or the relationship between the central part of the body and the movements of the extremities.
In this study, graph convolution was introduced as part of the encoder model to capture spatial relationships between joint points. Bidirectional LSTM (BiLSTM) was used to capture time series features; BiLSTM is expected to capture time-series data in the context of forward and backward motion and effectively extract temporal features, especially continuity and rhythm of motion. The output of the encoder is the last cell of the LSTM and the state of the hidden layer. The decoder takes the output (compressed feature representation) from the encoder and reconstructs the original time-series data. To apply the encoder process in reverse, the decoder consists of a TimeDistributed layer that transforms the input data into a time series, an LSTM layer that restores temporal features, and an MLP layer that restores the location information of joint points at each point in time. The TimeDistributed layer uses the LSTM state output by the encoder as the initial state to recover the spatial features of the pose at each point in time. Finally, through the MLP layer, the goal is to produce an output that is as close as possible to the original input data. The model architecture of the autoencoder is shown in Figure 12.
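A simplified PyTorch sketch of this encoder-decoder structure is given below. For brevity, the per-frame graph convolution is replaced by a plain linear layer and the layer sizes are assumptions; the intent is only to illustrate the spatial-then-temporal encoding and the TimeDistributed-style decoding described above.

```python
import torch
import torch.nn as nn

class MotionAutoencoder(nn.Module):
    """Simplified sketch of the motion autoencoder (graph conv replaced by a linear layer)."""
    def __init__(self, pose_dim, spatial_dim=128, hidden_dim=256):
        super().__init__()
        # Encoder: per-frame spatial features, then a temporal BiLSTM.
        self.spatial = nn.Linear(pose_dim, spatial_dim)        # stands in for graph conv
        self.enc_lstm = nn.LSTM(spatial_dim, hidden_dim,
                                batch_first=True, bidirectional=True)
        # Decoder: per-frame LSTM followed by an MLP that restores joint positions.
        self.dec_lstm = nn.LSTM(2 * hidden_dim, 2 * hidden_dim, batch_first=True)
        self.dec_mlp = nn.Sequential(
            nn.Linear(2 * hidden_dim, spatial_dim), nn.ReLU(),
            nn.Linear(spatial_dim, pose_dim),
        )

    def encode(self, x):                       # x: (batch, frames, pose_dim)
        h = torch.relu(self.spatial(x))
        out, _ = self.enc_lstm(h)
        return out[:, -1]                      # last time step as the representation

    def decode(self, z, num_frames):
        # Repeat the latent vector for every frame (TimeDistributed-style expansion).
        z_seq = z.unsqueeze(1).repeat(1, num_frames, 1)
        out, _ = self.dec_lstm(z_seq)
        return self.dec_mlp(out)

    def forward(self, x):
        z = self.encode(x)
        return self.decode(z, x.size(1))

# Training minimizes the reconstruction error between output and input motion:
#   loss = nn.functional.mse_loss(model(batch), batch)
```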
Mean squared error (MSE) was used as the loss function to train the autoencoder model and mean absolute error (MAE) was used as the evaluation metric. The MSE and MAE measure the error between the reconstructed output and the original input data and evaluate how accurately the model can reconstruct the data. Other methods exist to calculate losses that account for errors in a time series. Among them, Dynamic Time Warping (DTW) is useful in certain contexts not related to rhythm (e.g., pose and balance). However, it can lose information in contexts that include time elements such as strength and timing. DTW was not used in this study because we wanted to use a common encoder.

4.5. Encoding Model with Contrastive Learning

Supervised contrastive learning is a learning approach that leverages pre-labeled data to identify similarities and differences. In this approach, relationships between data are based on pre-defined labels, and positive and negative examples are defined based on the labels. The model learns the feature space so that data are differentiated on the feature space according to these labels.
By applying supervised contrastive learning to the learning of another dance motion encoder, this study aims to capture subtle differences more accurately in dance expression and style as features. Specifically, we use the evaluation scores in the six items annotated for dance performance to train a generic encoder by assuming that data with similar evaluation scores are similar to each other. Score values were standardized by item in order to homogenize the differences in scores between each rated item.
Specifically, the following equations are used for calculation.
$z_i^j = \dfrac{s_i^j - \mu_i}{\sigma_i}$  (1)
$\mathbf{z}^j = \left( z_1^j, z_2^j, \ldots, z_6^j \right)$  (2)
$d_E\!\left(\mathbf{z}^j, \mathbf{z}^k\right) = \sqrt{\sum_{i=1}^{6} \left( z_i^j - z_i^k \right)^2}$  (3)
$d_M\!\left(\mathbf{z}^j, \mathbf{z}^k\right) = \sum_{i=1}^{6} \left| z_i^j - z_i^k \right|$  (4)
where $i$ represents the evaluation item, $s_i^j$ is the score of datum $j$ for item $i$, $\mu_i$ is the mean of the score distribution for item $i$, and $\sigma_i$ is its standard deviation. Equation (1) calculates the standardized score $z_i^j$ for each item, and Equation (2) collects the six standardized scores into a vector $\mathbf{z}^j$. Equations (3) and (4) are then used to calculate the Euclidean and Manhattan distances between datum $j$ and datum $k$.
These scores are treated as vectors, and the distances between data are calculated as the distance between these vectors. To calculate the distance between vectors, Euclidean and Manhattan distances were used in comparative experiments.
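A minimal NumPy sketch of Equations (1), (3), and (4) follows; the array shapes are assumptions made for illustration.

```python
import numpy as np

def standardize_scores(scores):
    """Equation (1): standardize each of the six item scores across the dataset.

    scores : (num_samples, 6) array of expert ratings.
    """
    mu = scores.mean(axis=0)
    sigma = scores.std(axis=0)
    return (scores - mu) / sigma          # (num_samples, 6) array of z-scores

def euclidean_distance(z_j, z_k):
    """Equation (3)."""
    return np.sqrt(((z_j - z_k) ** 2).sum())

def manhattan_distance(z_j, z_k):
    """Equation (4)."""
    return np.abs(z_j - z_k).sum()

# Pairs with a small distance are treated as similar (positive) pairs and pairs
# with a large distance as dissimilar (negative) pairs when building batches
# for supervised contrastive learning.
```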
In this study, a two-layer BiLSTM was used to extract features from motion data. An MLP (multilayer perceptron) was also applied to the LSTM output for further feature extraction, which helps to capture nonlinearities in the data and identify more complex relationships and patterns. When training encoders, a model with an embedding layer added to the output layer was also used. This layer was not used as the output of the encoder but only to compute the distances between data during contrastive training. Figure 13 shows the model architecture for contrastive learning in this study.
The reference-guided model uses two inputs: the motion data of the person being evaluated and the motion data of a reference model. In this model, an encoder is applied to each of the two inputs to extract features, and the resulting expression vectors are compared to predict the evaluation score (Figure 14). This method mimics the process in which the evaluator compares the motions of the model and the evaluated person and allows quantitative analysis of the closeness of the expression to the model without relying on the dance choreography.
In order to compare the expression vectors of the rater and the exemplar, it is necessary to integrate the two vectors, and in this study, three methods—subtract, concatenate, and multiply—were used to compare their accuracy. For the new feature vectors obtained by this method, the MLP is used to predict the evaluation scores.
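The following PyTorch sketch illustrates this reference-guided comparison: the user's and the reference dancer's encoded features are fused by subtraction, concatenation, or element-wise multiplication, and an MLP predicts the score class. Layer sizes and names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ReferenceGuidedHead(nn.Module):
    """Compares the user's representation with the reference dancer's.

    `mode` selects how the two encoder outputs are fused before the MLP:
    'subtract', 'concat', or 'multiply' (the three variants compared in the experiments).
    """
    def __init__(self, feature_dim, num_classes=10, mode="subtract"):
        super().__init__()
        self.mode = mode
        in_dim = 2 * feature_dim if mode == "concat" else feature_dim
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, 128), nn.ReLU(),
            nn.Linear(128, num_classes),
        )

    def forward(self, user_feat, ref_feat):
        if self.mode == "subtract":
            fused = user_feat - ref_feat
        elif self.mode == "multiply":
            fused = user_feat * ref_feat
        else:  # "concat"
            fused = torch.cat([user_feat, ref_feat], dim=-1)
        return self.mlp(fused)   # class logits over the 1-10 scores
```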
The encoder’s parameters are adjusted to minimize a loss that reflects these similarity relationships between data. Specifically, in contrastive learning, loss functions such as contrastive loss and triplet loss are predominantly employed.
As shown in the equation below, contrastive loss receives two sets of data as input. The model is trained to decrease the distance between the expression vectors for positive examples (similar data) and increase the distance for negative examples (dissimilar data).
$\mathrm{Contrastive\ Loss} = y \cdot d^2 + (1 - y) \cdot \max(0,\ \alpha - d)^2$
For positive example pairs (y = 1), the distance d is trained to approach 0. For negative example pairs (y = 0), the distance d is trained to be at least a certain margin α away.
Conversely, triplet loss is a generalization of contrastive loss that considers three sets of data: reference data (anchor), similar data (positive example), and dissimilar data (negative example). The objective is to reduce the distance between the anchor and the positive example while increasing the distance between the anchor and the negative example. The formula for triplet loss is as follows.
$\mathrm{Triplet\ Loss} = \max(0,\ d_p - d_n + \alpha)$
In triplet loss, the distance $d_n$ between the anchor and the negative example is learned to be larger than the distance $d_p$ between the anchor and the positive example by a certain margin $\alpha$ or more.
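For reference, minimal PyTorch versions of these two losses could look as follows; the margin value is arbitrary, and PyTorch also provides nn.TripletMarginLoss as a ready-made alternative.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(z1, z2, y, margin=1.0):
    """Contrastive loss: y = 1 for similar pairs, y = 0 for dissimilar pairs."""
    d = F.pairwise_distance(z1, z2)
    return (y * d.pow(2) + (1 - y) * F.relu(margin - d).pow(2)).mean()

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Pull the anchor toward the positive and push it away from the negative."""
    d_pos = F.pairwise_distance(anchor, positive)
    d_neg = F.pairwise_distance(anchor, negative)
    return F.relu(d_pos - d_neg + margin).mean()
```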
To evaluate the performance of the model with contrastive learning applied, an evaluation metric called SwapError was employed. This metric is used to quantitatively assess how effectively the model discriminates between similarities and differences. Specifically, we compare the distances between the feature vectors output by the model for all pairs in the dataset. If the distance between an anchor and a positive example is greater than the distance between the anchor and a negative example, the triplet is considered a “swap”; SwapError is defined as the rate at which these swaps occur, and the closer the value is to zero, the better the model performance.
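A small NumPy sketch of this metric, assuming the triplets have already been formed into arrays of embedding vectors, is shown below.

```python
import numpy as np

def swap_error(anchors, positives, negatives):
    """Fraction of triplets in which the positive ends up farther than the negative.

    Each argument is an (N, d) array of embedding vectors for N triplets.
    """
    d_pos = np.linalg.norm(anchors - positives, axis=1)
    d_neg = np.linalg.norm(anchors - negatives, axis=1)
    swaps = (d_pos > d_neg).sum()
    return swaps / len(anchors)
```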

4.6. Label Distribution Learning and Evaluation Measure

Label distribution learning (LDL) [61] is a branch of machine learning that is particularly suited for learning in situations where multiple labels are involved. While traditional class classification problems assign a single label to each piece of data, LDL can assign multiple labels to each piece of data as a distribution, rather than a single label, making it possible to learn the probability distribution of labels. LDL can also be used when classes are not independent and there is an ordering among classes; LDL is effective when different emotions or features may coexist or when multiple features exist simultaneously, and is expected to have many applications, including sentiment analysis and multi-label classification.
In this study, dance performance evaluation was defined as a 1 to 10 score-prediction problem. In the traditional class classification problem, these scores are treated as independent categories, but in reality, there is an ordinal relationship between the scores. For example, scores of 6 and 7 are closely related to each other, while scores of 1 and 10 exhibit very different characteristics. Using LDL allows score prediction to be treated not as a simple class classification problem, but rather in a way that takes into account the ordinal relationship between each score.
In conventional class classification, a one-hot vector is used as the correct answer label such that the correct answer class is 1 and all other classes are 0. On the other hand, LDL assumes a normal distribution for the output and creates the correct answer label so that the probability of the score’s being the correct answer is the highest. In this study, the standard deviation of the normal distribution was set to 1 to create the correct answer label. The left side of Figure 15 represents the one-hot vector for a correct answer score of 3, while the right side represents the transformed correct answer label assuming a normal distribution to apply LDL.
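The sketch below shows one way to build such a Gaussian soft label for a correct score of 3 with a standard deviation of 1, as in Figure 15; the printed values are approximate.

```python
import numpy as np

def soft_label(true_score, num_classes=10, sigma=1.0):
    """Turn a 1-10 score into a normalized Gaussian label distribution."""
    classes = np.arange(1, num_classes + 1)
    dist = np.exp(-((classes - true_score) ** 2) / (2 * sigma ** 2))
    return dist / dist.sum()

print(np.round(soft_label(3), 3))
# -> approximately [0.054 0.243 0.401 0.243 0.054 0.004 0. 0. 0. 0.]
```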
We used the quadratic weighted kappa [62] as the evaluation index for the rating model. The kappa score is a statistical index used to measure the degree of agreement among raters; the weighted kappa score allows different weights to be assigned depending on the degree of disagreement between categories, which is particularly useful for ordered categories. For example, if the correct answer is 1 point, a wrong prediction of 10 points can be treated as showing a smaller degree of agreement than a wrong prediction of 2 points. In the quadratic weighted kappa, the square of the difference between ratings is used as the weight, so that larger disagreements are penalized more heavily. The kappa score ranges from −1 to 1 and is generally interpreted as shown in Table 5.
The kappa score is especially useful in situations where multiple raters are involved or where subjective evaluations are required. The kappa score provides a quantitative assessment of the reliability and consistency of the assessment.
Therefore, this study employed the quadratic weighted kappa as the evaluation metric for the score prediction problem from 1 to 10. This evaluation metric evaluates the degree of agreement between the predicted and actual scores, taking into account the ordinal relationship among the scores. For example, if the predicted score exactly matches the actual score, the highest agreement score of 1.0 is obtained, but the agreement decreases as the distance between scores increases. The quadratic weighted kappa also gives a heavier penalty for larger score discrepancies because the weighting is based on the square of the distance between scores (Table 6). This allows for an evaluation that is sensitive not only to the accuracy of score predictions but also to minute differences in predictions.
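In practice, this metric can be computed with scikit-learn's cohen_kappa_score using quadratic weights; the score values below are made-up examples.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical predicted vs. expert scores on a 1-10 scale.
y_true = [3, 5, 7, 6, 8, 4]
y_pred = [3, 6, 7, 5, 9, 4]

qwk = cohen_kappa_score(y_true, y_pred, weights="quadratic")
print(f"quadratic weighted kappa = {qwk:.3f}")
```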

5. Experimental Results

5.1. LMA and Statistical Machine Learning

Laban Movement Analysis (LMA) is particularly useful for analyzing dance performance. By describing a dancer’s movements in detail and translating them into specific parameters, it is possible to quantitatively assess performance characteristics; LMA parameters quantify the quality and characteristics of the movements, which can then be used as a basis for analysis in machine learning models.
LMA has four main attributes: the effort attribute for movement quality, the shape attribute for movement shape, the space attribute for movement direction, and the body attribute for body parts. Effort represents movement dynamics and emotional expression. For example, performing the same movement forcefully, quickly, or fluidly changes the emotion or intent of the movement and creates a different emotional impression on the observer. Effort comprises weight, space, time, and flow, which are elements used to capture the intensity, rhythm, direction, and continuity of movement. Shape captures changes in body posture and shape. Space represents how movement is oriented in space, including the direction, path, and spatial relationships of movement. Body indicates which body part a particular movement is performed by.
A method for extracting dance features based on LMA was proposed in a previous study, Analysis and Evaluation of Dancing Movement Based on LMA [2]. In the present study, we implemented a method to compute each of the following four effort attributes based on that previous study.
Weight is calculated using velocity information from the center of the body (hips) and extremities (ankles and wrists). Time is calculated from the acceleration information of the dancer’s movements. Space is calculated from the dancer’s gaze direction (face direction) and body direction (hip direction), and its relative angle is used as the feature value. This allows the spatial development and directionality of the dance to be analyzed. Flow assumes the entire body as a rectangular body and uses the amount of change as a feature quantity, indicating how much it varies along the left–right, up–down, and front–back directions. This allows the continuity and fluidity of movement to be evaluated.
The above LMA-based features are expressed as time-series data, which makes it possible to capture temporal changes in movement. As statistics, the mean, variance, maximum, and minimum values are calculated, and they are used as the final feature values. This statistical approach enables a more comprehensive evaluation of dance by capturing its consistency and variability.
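The sketch below illustrates, in simplified form, how such effort features and their summary statistics might be computed from 3D joint trajectories. The joint names, direction vectors, and exact formulas are illustrative assumptions rather than a reproduction of the referenced implementation.

```python
import numpy as np

def lma_effort_features(positions, fps=60):
    """Rough sketch of LMA-style effort features from 3D joint trajectories.

    positions : dict of joint/direction name -> (frames, 3) array.
    """
    def speed(traj):
        return np.linalg.norm(np.diff(traj, axis=0), axis=1) * fps

    # Weight: per-frame mean speed of the body centre (hips) and extremities.
    joints = ("hips", "left_wrist", "right_wrist", "left_ankle", "right_ankle")
    weight = np.mean([speed(positions[j]) for j in joints], axis=0)

    # Time: per-frame acceleration magnitude (change in speed).
    time_feat = np.abs(np.diff(weight)) * fps

    # Space: relative angle between face direction and hip direction per frame.
    face, hips_dir = positions["face_dir"], positions["hip_dir"]
    cos = np.clip(np.sum(face * hips_dir, axis=1) /
                  (np.linalg.norm(face, axis=1) * np.linalg.norm(hips_dir, axis=1)),
                  -1.0, 1.0)
    space = np.degrees(np.arccos(cos))

    # Flow: change in the extent of a bounding box around the body over time.
    all_joints = np.stack([positions[j] for j in joints], axis=1)   # (frames, 5, 3)
    extent = all_joints.max(axis=1) - all_joints.min(axis=1)        # (frames, 3)
    flow = np.abs(np.diff(extent, axis=0)).sum(axis=1)

    # Mean, variance, maximum, and minimum are used as the final feature values.
    def stats(x):
        return [np.mean(x), np.var(x), np.max(x), np.min(x)]
    return np.array(stats(weight) + stats(time_feat) + stats(space) + stats(flow))
```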
Statistical machine learning is a method of machine learning that exploits the statistical properties of data to learn patterns to make predictions and decisions. Random forest, one of the statistical machine learning models, is a widely used ensemble learning method that combines many decision trees.
In the random forest training process, a number of decision trees are trained on their respective subsets of data. Each decision tree classifies the data using a subset of randomly selected features. The final prediction is determined by majority voting on the results of these decision trees.
In the following, the evaluation of dance by LMA is implemented using random forest and used for comparison.
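A minimal sketch of this baseline with scikit-learn is shown below, using random placeholder data in place of the actual LMA feature vectors and expert scores.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import cohen_kappa_score

# Placeholder data standing in for LMA feature vectors and expert scores (1-10).
rng = np.random.default_rng(0)
X = rng.normal(size=(80, 16))
y = rng.integers(1, 11, size=80)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)
pred = model.predict(X_test)
print("quadratic weighted kappa:",
      cohen_kappa_score(y_test, pred, weights="quadratic"))
```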

5.2. Encoder Training and Results Using Autoencoder

The encoder model was trained using the autoencoder described in Section 4.4. For the hyperparameters, Optuna was used to search for the optimal parameters. However, the number of LSTM units was set to 512, the same number as in contrastive learning, because the larger the number of LSTM units, the better the accuracy of the model. Table 7 shows the details of the parameters and their values.
The learning transition of the autoencoder is shown in Figure 16.
The resulting error was MAE = 0.027, corresponding to an average joint position error of approximately 3 cm. Figure 17 shows an example of motion reconstruction using the autoencoder; blue indicates the correct joint positions and orange the reconstructed skeleton.
Next, the trained encoders were used to train the reference-guided dance evaluation model described in Section 4.5. The hyperparameters and architecture were the same as those of the evaluation model with contrastive learning, also described in Section 4.5. Table 8 shows the results of training as a class classification problem for each evaluation item. The baseline combines LMA, a rule-based motion feature extraction method, with random forest (RF), a statistical machine learning method; its accuracy was compared with that of a model using the autoencoder (AE) for feature extraction.

5.3. Encoder Training and Results Using Contrastive Learning

In deep learning, it is common to divide the dataset into three parts: training data, validation data, and test data. During training, the training and validation data are used to search for optimal parameters, and the test data are used to evaluate the model. The dataset contains dance data collected from 20 dancers, each with 40 counts. In this study, the data from 17 dancers were used as training and validation data, and the data from the remaining three dancers were used as test data. These three test dancers were selected by dividing the dancers into three groups with low, medium, and high scores and randomly choosing one dancer from each group. This method allows us to validate the robustness of the evaluation model for unknown dancers.
The encoder model was trained using the supervised contrastive learning technique described in Section 4.5. The hyperparameters were searched for optimal parameters in approximately 50 trials using Optuna (https://optuna.org/), which performs a parameter search based on Bayesian optimization. Table 9 shows the details of the parameters and their values.
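The sketch below shows the general shape of such an Optuna search; the searched parameters and the stand-in training routine are assumptions for illustration, not the study's actual search space.

```python
import optuna

def train_and_validate(params):
    # Stand-in for actually training the encoder; returns a dummy loss here.
    return (params["learning_rate"] - 1e-3) ** 2 + params["dropout"] * 0.01

def objective(trial):
    """Sample hyperparameters, train the encoder, and return the validation loss."""
    params = {
        "learning_rate": trial.suggest_float("learning_rate", 1e-5, 1e-2, log=True),
        "batch_size": trial.suggest_categorical("batch_size", [16, 32, 64]),
        "mlp_units": trial.suggest_categorical("mlp_units", [64, 128, 256]),
        "dropout": trial.suggest_float("dropout", 0.0, 0.5),
    }
    return train_and_validate(params)

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=50)
print(study.best_params)
```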
As described in Section 4.5, two different methods were used to calculate the distances: Euclidean and Manhattan distances were used for learning. The learning curves and results are shown in Figure 18 and Table 10, respectively.
From Table 10, we can see that the model using the Euclidean distance has a lower swap error and is therefore superior. A swap error of 0.039 corresponds to approximately four incorrect orderings per 100 compared pairs, which is close to a practically usable level.
Next, the trained encoder was used to train the reference-guided dance evaluation model described in Section 4.5, with the hyperparameters likewise optimized using Optuna. Table 11 lists the parameters and their selected values.
For the dance evaluation model, LMA with random forest (RF) served as the baseline, and its performance was compared with that of models using the contrastive learning (CL) encoder for feature extraction. Table 12 shows the results of training the model as a classification problem for each evaluation item, comparing the kappa scores on the test data.
For all items, contrastive learning with the Euclidean distance achieved the highest accuracy.

5.4. Learning and Results of the Evaluation Model

The encoder models described in Section 4.4 and Section 4.5 each extract features from motion data. The feature vectors obtained from these encoders were combined to train the evaluation model: in the combined model (CL + AE), the feature vectors from both encoders are combined and used as the input to the evaluation model. LMA with random forest (RF) served as the baseline, and it was compared against a model using the contrastive learning encoder for feature extraction (CL), a model using the autoencoder (AE), and a model using both (CL + AE). In addition to the kappa score and quadratic weighted kappa described in Section 4.6, the correlation coefficient and mean absolute error (MAE) were used as evaluation measures. The results for each evaluation item are shown in Table 13, Table 14, Table 15, Table 16, Table 17 and Table 18.
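For reference, these evaluation measures can be computed as in the sketch below (scikit-learn and NumPy; the function name is ours, not from the paper):

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score, mean_absolute_error

def score_report(y_true, y_pred):
    # y_true, y_pred: integer scores on the 10-point scale for one item.
    return {
        "kappa": cohen_kappa_score(y_true, y_pred),
        "quadratic_weighted_kappa": cohen_kappa_score(y_true, y_pred, weights="quadratic"),
        "correlation": np.corrcoef(y_true, y_pred)[0, 1],
        "mae": mean_absolute_error(y_true, y_pred),
    }
```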
The results show that the model combining contrastive learning and the autoencoder (CL + AE) achieved an average improvement of more than 10 points in quadratic weighted kappa. On the other evaluation metrics as well, CL + AE generally achieved the highest accuracy. In addition, the accuracies for “scalability” and “timing” were noticeably lower than for the other items.
Figure 19 and Figure 20 show the prediction results for the test data. Figure 19 shows confusion matrices of the “dynamics” predictions for the three test dancers, with the vertical axis representing the correct score and the horizontal axis the predicted score. The left side of the figure shows the baseline (LMA + RF) and the right side the proposed method (CL + AE); higher values near the diagonal indicate higher prediction accuracy. Compared to the baseline, the proposed method is more accurate and shows less variation in its predictions, indicating superior accuracy and consistency. Figure 20 shows radar charts of the predictions for all evaluation items for the same three dancers, with blue representing the correct score and orange the predicted score; the closer the two lines, the more accurate the prediction. These radar charts provide a visual indication of the models’ performance in assessing dance, and the overall size of each chart reflects the dancer’s overall performance rating. Each value in Figure 20 is the average over the entire sequence of per-second evaluations computed against the model dance. In most charts, the blue and orange lines are fairly consistent, indicating that the model’s predictions are close to the correct scores.
These results indicate that the automatic evaluation model can be applied to unknown dancers and can produce reliable evaluations.
Next, we examined the reference-guided model proposed in this study, which enables relative evaluation by comparing a performance with a model (reference) dance. To test its robustness to unknown choreography, the first 32 counts of each dancer’s dance were used for training and the remaining eight counts for testing. Table 19 shows the classification accuracies of the evaluation model without a reference (no reference) and of the reference-guided model.
The results show that the reference-guided model achieved the best accuracy for all items. In particular, the kappa score improved by more than 5 points for “sharpness”, “accuracy”, and “stability”. This is likely because referring to the model’s motion data enables a more specific evaluation of performance.
Figure 21 and Figure 22 show the prediction results for the test data. Figure 21 shows confusion matrices of the “dynamics” predictions for all 20 dancers when the last count of the test data is used as input. As before, the left side shows the baseline (LMA + RF) and the right side the proposed method (CL + AE); the proposed method is again more accurate and shows less variation in its predictions, indicating superior accuracy and consistency. Figure 22 shows radar charts of the predictions for all evaluation items for 12 of the dancers’ test data. In many of the charts, the blue and orange lines are fairly consistent, indicating that the model’s predictions are close to the correct scores.
These results indicate the usefulness of the reference-guided model, in which a model dance is input as a reference, in the automatic evaluation of dance performance. However, the dataset used in this study was not large, and future work is needed to collect data on different types of dances and choreographies to further validate the versatility of the model.
The indicators described above are sufficient for evaluating each item individually. However, since the purpose of this evaluation model is to select effective feedback items, the model can also be assessed from other perspectives. In this study, we evaluated the model’s overall performance by computing precision and recall in a ranking-retrieval sense. Specifically, evaluation items whose score fell below the average of the item scores were defined as “items requiring improvement (IRIs)”, and the IRIs derived from the predicted scores were compared with those derived from the correct scores. Precision indicates the percentage of predicted IRIs that match the correct IRIs, and recall indicates the percentage of correct IRIs that were identified as IRIs by the prediction. In addition, accuracy and the F1 score were calculated: accuracy indicates how many of the item-level predictions were correct overall, and the F1 score is the harmonic mean of precision and recall. Table 20 shows the results.
In addition, the percentage of cases in which the item with the lowest predicted score was included in the correct IRIs was calculated and is shown in Table 21. This indicator checks whether the item predicted to be weakest is among the actual IRIs and thus evaluates how accurately the prediction identifies the most critical area for improvement.
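Under this reading of the definitions, the sketch below (our own function and variable names) computes the IRI agreement metrics and the lowest-score check for one dancer:

```python
import numpy as np

def iri_metrics(true_scores, pred_scores):
    # true_scores, pred_scores: arrays holding the six per-item scores for
    # one dancer. An item is an "item requiring improvement" (IRI) when it
    # falls below the mean of that rating's item scores.
    true_iri = true_scores < true_scores.mean()
    pred_iri = pred_scores < pred_scores.mean()

    tp = np.sum(true_iri & pred_iri)
    precision = tp / max(pred_iri.sum(), 1)
    recall = tp / max(true_iri.sum(), 1)
    accuracy = np.mean(true_iri == pred_iri)
    f1 = 2 * precision * recall / max(precision + recall, 1e-9)

    # Lowest-score check: is the item predicted to be weakest an actual IRI?
    lowest_hit = bool(true_iri[np.argmin(pred_scores)])
    return precision, recall, accuracy, f1, lowest_hit
```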
These results confirm that the proposed method is better able to select effective feedback items than the baseline. Furthermore, the lowest-score accuracy, i.e., the percentage of cases in which the item with the lowest predicted score is included in the actual IRIs, reaches 0.679 for the CL + AE model compared with 0.500 for the baseline. The CL + AE model is thus particularly good at identifying the most important aspects that need improvement.

6. Concluding Remarks

This paper explored several key aspects of immersive dance training systems, including camera-based motion capture and automatic dance performance evaluation. The dance training system we propose employs two cameras for real-time body tracking to enhance the system’s ease of use and practicality.
Moreover, we created a dataset specifically for the automatic evaluation of dance performance. This expert-annotated dataset served as the foundation for the training and evaluation of our automatic dance evaluation model. Of all the models tested, the reference-guided model was found to deliver the highest accuracy. This model emulates the process by which an evaluator compares a dancer’s movement to a standard model, a critical component in assessing dance performance.
Looking forward, our research intends to assess the effectiveness and suitability of feedback provided by the automatic evaluation model, using both quantitative and qualitative perspectives via subject experiments. We will also explore features and enhancements aimed at increasing the system’s precision and user-friendliness, all in an effort to offer superior dance training experiences. Furthermore, we will examine scalability to accommodate a wide range of dance styles and choreographies.

7. Limitations and Future Work

7.1. Limitations

Because only predetermined choreographies were evaluated in this study, choreographies modified by the dancers could not be assessed. In addition, because the system is designed for self-training at home, it covers only a limited capture area and does not support practice of more expansive movements on a stage. Another limitation is that the system supports only a single user and cannot track multiple people. Furthermore, the current feedback mode does not adequately take into account the level and characteristics of individual users, leaving room for further personalization. Although the current study is limited to hip-hop, the proposed evaluation framework could be applied to other dance genres. Finally, the annotations are based on six abstracted items, but more finely segmented labels would be needed for a more detailed analysis.

7.2. Future Work

Our future research will refine the evaluation so that specific feedback can be given on individual movements in addition to the evaluation of the dance as a whole. A framework will also be developed to provide personalized feedback that takes the user’s history and orientation into account and to deliver appropriate coaching based on individual progress. To enable deeper analysis, actual feedback data from coaches will be used to improve the accuracy of the feedback. The usefulness of the feedback and of the evaluation-score history generated by the VR dance training system should also be evaluated quantitatively and qualitatively to improve the system’s functionality and accuracy. A motion capture system that covers a wider area will be needed for on-stage practice. The dataset will be expanded to cover a variety of dance styles, and new annotations will be added to the existing data. Finally, with the goal of generating detailed and complex advice, a generative large language model will be utilized to provide more effective and specific feedback.
The source codes for the machine learning models and dance annotation system developed in this study, along with a demonstration video, are provided in the Supplementary Materials as indicated below.

Supplementary Materials

The following supporting information can be downloaded at: https://github.com/kazuhiro1999/Automatic-Evaluation-of-Dance-Movements (source code of the training models, the annotation system, and the dataset created in this study) and https://youtu.be/3j6oWCV4Xdk (demonstration video); both links accessed on 10 June 2024.

Author Contributions

Conceptualization, K.E. and K.N.; data curation, K.E.; methodology, K.E. and K.N.; project administration, K.N.; supervision, K.N.; writing—original draft, K.E. and K.N.; writing—review and editing, K.N. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The dataset we created (limited to data for which consent for public release was obtained) is available; see the Supplementary Materials section above.

Acknowledgments

The authors wish to express gratitude to the students and instructors at the dance studio, who participated as subjects in the experiments.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Autoencoder

Autoencoders [59], also called self-encoders, are deep-learning architectures for compressing input data into a low-dimensional latent space. They are unsupervised learning models that aim to output the same data as the original input data and are widely used in feature extraction and dimensionality reduction. The model consists of an encoder that compresses the input data into a low-dimensional latent space and a decoder that reconstructs the original data from the compressed representation. Through this process, the model learns important features of the data and can eliminate useless information. The learning of the autoencoder aims to minimize the difference between the original input data and the reconstructed data (reconstruction error). This reconstruction error serves as a measure of how well the model captures important features of the data.
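As a minimal illustration only (the model in this paper additionally uses graph convolutions and bidirectional LSTMs; see Figure 12), a sequence autoencoder for joint-position data can be sketched in PyTorch as follows:

```python
import torch
import torch.nn as nn

class MotionAutoencoder(nn.Module):
    def __init__(self, n_features, latent_dim=64):
        super().__init__()
        self.encoder = nn.LSTM(n_features, latent_dim, batch_first=True)
        self.decoder = nn.LSTM(latent_dim, n_features, batch_first=True)

    def forward(self, x):              # x: (batch, time, n_features)
        z, _ = self.encoder(x)         # compressed latent sequence
        recon, _ = self.decoder(z)     # reconstruction of the input motion
        return recon, z

# Training minimizes the reconstruction error, e.g.:
# loss = nn.L1Loss()(recon, x)
```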

Appendix B. Graph Convolutional Network

A Graph Convolutional Network (GCN) [60] is a deep learning architecture that efficiently captures relationships in graph-structured data. Just as ordinary convolution extracts local features by aggregating information from neighboring positions, graph convolution aggregates information from neighboring nodes, whose connections are given by an adjacency matrix, allowing each node to grasp the overall structure of the graph.
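As a minimal illustration (not this paper’s implementation), a single graph-convolution layer over skeleton joints can be sketched in PyTorch as follows; the adjacency matrix encodes which joints are connected, and the self-loop normalization follows the standard normalized-adjacency formulation:

```python
import torch
import torch.nn as nn

class GraphConvLayer(nn.Module):
    def __init__(self, in_dim, out_dim, adjacency):
        super().__init__()
        # Normalized adjacency with self-loops: D^(-1/2) (A + I) D^(-1/2).
        a_hat = adjacency + torch.eye(adjacency.size(0))
        d_inv_sqrt = torch.diag(a_hat.sum(dim=1).pow(-0.5))
        self.register_buffer("norm_adj", d_inv_sqrt @ a_hat @ d_inv_sqrt)
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, x):              # x: (batch, n_joints, in_dim)
        # Each joint aggregates features from its neighbors, then a shared
        # linear layer and ReLU produce the new joint features.
        return torch.relu(self.linear(self.norm_adj @ x))
```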

References

  1. Kico, I.; Zelnicek, D.; Liarokapis, F. Assessing the Learning of Folk Dance Movements Using Immersive Virtual Reality. In Proceedings of the 2020 24th International Conference Information Visualisation (IV), Melbourne, VIC, Australia, 7–11 September 2020; IEEE: Melbourne, VIC, Australia, 2020; pp. 587–592. [Google Scholar]
  2. Senecal, S.; Nijdam, N.A.; Aristidou, A.; Magnenat-Thalmann, N. Salsa Dance Learning Evaluation and Motion Analysis in Gamified Virtual Reality Environment. Multimed. Tools Appl. 2020, 79, 24621–24643. [Google Scholar] [CrossRef]
  3. Magar, S.T.; Suk, H.J. The Advantages of Virtual Reality in Skill Development Training Based on Project Comparison (2009–2018). Int. J. Contents. 2020, 16, 19–29. [Google Scholar]
  4. Wang, P.; Wu, P.; Wang, J.; Chi, H.-L.; Wang, X. A Critical Review of the Use of Virtual Reality in Construction Engineering Education and Training. Int. J. Environ. Res. Public Health 2018, 15, 1204. [Google Scholar] [CrossRef]
  5. Bernstein, R.; Shafir, T.; Tsachor, R.; Studd, K.; Schuster, A. Laban Movement Analysis Using Kinect. World Academy of Science, Engineering and Technology. Int. J. Comput. Electr. Autom. Control. Inf. Eng. 2015, 9, 1574–1578. [Google Scholar]
  6. Hachimura, K.; Takashina, K.; Yoshimura, M. Analysis and evaluation of dancing movement based on LMA. In Proceedings of the ROMAN 2005, IEEE International Workshop on Robot and Human Interactive Communication, Nashville, TN, USA, 13–15 August 2005; IEEE: Nashville, TN, USA, 2005; pp. 294–299. [Google Scholar]
  7. Aristidou, A.; Stavrakis, E.; Chrysanthou, Y. Motion Analysis for Folk Dance Evaluation. In Proceedings of the Eurographics Workshop on Graphics and Cultural Heritage, Darmstadt, Germany, 6–8 October 2014. [Google Scholar] [CrossRef]
  8. Kim, H.J.; Neff, M.; Lee, S.-H. The Perceptual Consistency and Association of the LMA Effort Elements. ACM Trans. Appl. Percept. 2022, 19, 1–17. [Google Scholar] [CrossRef]
  9. Ajili, I.; Mallem, M.; Didier, J.-Y. Human Motions and Emotions Recognition Inspired by LMA Qualities. Vis. Comput. 2019, 35, 1411–1426. [Google Scholar] [CrossRef]
  10. Wang, S.; Li, J.; Cao, T.; Wang, H.; Tu, P.; Li, Y. Dance Emotion Recognition Based on Laban Motion Analysis Using Convolutional Neural Network and Long Short-Term Memory. IEEE Access 2020, 8, 124928–124938. [Google Scholar] [CrossRef]
  11. Lei, Y.; Li, X.; Chen, Y.J. Dance Evaluation Based on Movement and Neural Network. J. Math. 2022, 2022, 1–7. [Google Scholar] [CrossRef]
  12. Zhai, X. Dance Movement Recognition Based on Feature Expression and Attribute Mining. Complexity 2021, 2021, 9935900. [Google Scholar] [CrossRef]
  13. Jin, Y.; Suzuki, G.; Shioya, H. Detecting and Visualizing Stops in Dance Training by Neural Network Based on Velocity and Acceleration. Sensors 2022, 22, 5402. [Google Scholar] [CrossRef]
  14. Dias Pereira Dos Santos, A.; Loke, L.; Yacef, K.; Martinez-Maldonado, R. Enriching Teachers’ Assessments of Rhythmic Forró Dance Skills by Modelling Motion Sensor Data. Int. J. Hum.-Comput. Stud. 2022, 161, 102776. [Google Scholar] [CrossRef]
  15. Davis, S.; Thomson, K.M.; Zonneveld, K.L.M.; Vause, T.C.; Passalent, M.; Bajcar, N.; Sureshkumar, B. An Evaluation of Virtual Training for Teaching Dance Instructors to Implement a Behavioral Coaching Package. Behav. Anal. Pract. 2023, 16, 1–13. [Google Scholar] [CrossRef]
  16. Choi, J.-H.; Lee, J.-J.; Nasridinov, A. Dance Self-Learning Application and Its Dance Pose Evaluations. In Proceedings of the 36th Annual ACM Symposium on Applied Computing, Virtual, 22–26 March 2021; pp. 1037–1045. [Google Scholar]
  17. Guo, H.; Zou, S.; Xu, Y.; Yang, H.; Wang, J.; Zhang, H.; Chen, W. DanceVis: Toward Better Understanding of Online Cheer and Dance Training. J. Vis. 2022, 25, 159–174. [Google Scholar] [CrossRef]
  18. Krasnow, D.; Chatfield, S.J. Development of the “Performance Competence Evaluation Measure”: Assessing Qualitative Aspects of Dance Performance. J. Dance Med. Sci. 2009, 13, 101–107. [Google Scholar] [CrossRef]
  19. Guo, H.; Zou, S.; Lai, C.; Zhang, H. PhyCoVIS: A Visual Analytic Tool of Physical Coordination for Cheer and Dance Training. In Computer Animation and Virtual Worlds; Wiley: Hoboken, NJ, USA, 2021; Volume 32. [Google Scholar] [CrossRef]
  20. Gupta, A.; Arun, A.; Chaturvedi, S.; Sharan, A.; Deb, S. Interactive Dance Lessons through Human Body Pose Estimation and Skeletal Topographies Matching. Int. J. Comput. Intell. IoT 2018, 2, 4. [Google Scholar]
  21. Esaki, K.; Nagao, K. VR Dance Training System Capable of Human Motion Tracking and Automatic Dance Evaluation. PRESENCE Virtual Augment. Real. 2023, 31, 23–45. [Google Scholar] [CrossRef]
  22. Chan, J.C.P.; Leung, H.; Tang, J.K.T.; Komura, T. A Virtual Reality Dance Training System Using Motion Capture Technology. IEEE Trans. Learn. Technol. 2011, 4, 187–195. [Google Scholar] [CrossRef]
  23. Iqbal, J.; Sidhu, M.S. Acceptance of Dance Training System Based on Augmented Reality and Technology Acceptance Model (TAM). Virtual Real. 2022, 26, 33–54. [Google Scholar] [CrossRef]
  24. Li, D.; Yi, C.; Gu, Y. Research on College Physical Education and Sports Training Based on Virtual Reality Technology. Math. Probl. Eng. 2021, 2021, 6625529. [Google Scholar] [CrossRef]
  25. Xie, B.; Liu, H.; Alghofaili, R.; Zhang, Y.; Jiang, Y.; Lobo, F.D.; Li, C.; Li, W.; Huang, H.; Akdere, M.; et al. A Review on Virtual Reality Skill Training Applications. Front. Virtual Real. 2021, 2, 645153. [Google Scholar] [CrossRef]
  26. Ahir, K.; Govani, K.; Gajera, R.; Shah, M. Application on Virtual Reality for Enhanced Education Learning, Military Training and Sports. Augment. Hum. Res. 2020, 5, 7. [Google Scholar] [CrossRef]
  27. Izard, S.G.; Juanes, J.A.; García Peñalvo, F.J.; Estella, J.M.G.; Ledesma, M.J.S.; Ruisoto, P. Virtual Reality as an Educational and Training Tool for Medicine. J. Med. Syst. 2018, 42, 50. [Google Scholar] [CrossRef] [PubMed]
  28. Lugaresi, C.; Tang, J.; Nash, H.; McClanahan, C.; Uboweja, E.; Hays, M.; Zhang, F.; Chang, C.-L.; Yong, M.G.; Lee, J.; et al. MediaPipe: A Framework for Building Perception Pipelines. arXiv 2019, arXiv:1906.08172. [Google Scholar]
  29. Grishchenko, I.; Bazarevsky, V.; Zanfir, A.; Bazavan, E.G.; Zanfir, M.; Yee, R.; Raveendran, K.; Zhdanovich, M.; Grundmann, M.; Sminchisescu, C. BlazePose GHUM Holistic: Real-Time 3D Human Landmarks and Pose Estimation. arXiv 2022, arXiv:2206.11678. [Google Scholar]
  30. Bazarevsky, V.; Grishchenko, I.; Raveendran, K.; Zhu, T.; Zhang, F.; Grundmann, M. BlazePose: On-Device Real-Time Body Pose Tracking. arXiv 2020, arXiv:2006.10204. [Google Scholar]
  31. Zhang, F.; Zhu, X.; Dai, H.; Ye, M.; Zhu, C. Distribution-Aware Coordinate Representation for Human Pose Estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
  32. Jiang, T.; Lu, P.; Zhang, L.; Ma, N.; Han, R.; Lyu, C.; Li, Y.; Chen, K. RTMPose: Real-Time Multi-Person Pose Estimation Based on MMPose. arXiv 2023, arXiv:2303.07399. [Google Scholar]
  33. Xu, Y.; Zhang, J.; Zhang, Q.; Tao, D. ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation. arXiv 2022, arXiv:2204.12484. [Google Scholar]
  34. Xu, L.; Jin, S.; Liu, W.; Qian, C.; Ouyang, W.; Luo, P.; Wang, X. ZoomNAS: Searching for Whole-Body Human Pose Estimation in the Wild. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 5296–5313. [Google Scholar] [CrossRef] [PubMed]
  35. Zhang, Y.; An, L.; Yu, T.; Li, X.; Li, K.; Liu, Y. 4D Association Graph for Realtime Multi-Person Motion Capture Using Multiple Video Cameras. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 1321–1330. [Google Scholar]
  36. Desmarais, Y.; Mottet, D.; Slangen, P.; Montesinos, P. A Review of 3D Human Pose Estimation Algorithms for Markerless Motion Capture. Comput. Vis. Image Underst. 2021, 212, 103275. [Google Scholar] [CrossRef]
  37. Kanko, R.M.; Laende, E.K.; Davis, E.M.; Selbie, W.S.; Deluzio, K.J. Concurrent Assessment of Gait Kinematics Using Marker-Based and Markerless Motion Capture. J. Biomech. 2021, 127, 110665. [Google Scholar] [CrossRef]
  38. Chen, L.; Ai, H.; Chen, R.; Zhuang, Z.; Liu, S. Cross-View Tracking for Multi-Human 3D Pose Estimation at Over 100 FPS. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 3276–3285. [Google Scholar]
  39. Zeng, A.; Ju, X.; Yang, L.; Gao, R.; Zhu, X.; Dai, B.; Xu, Q. DeciWatch: A Simple Baseline for 10x Efficient 2D and 3D Pose Estimation. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022. [Google Scholar]
  40. He, Y.; Yan, R.; Fragkiadaki, K.; Yu, S.-I. Epipolar Transformers. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 7776–7785. [Google Scholar]
  41. Kim, J.-W.; Choi, J.-Y.; Ha, E.-J.; Choi, J.-H. Human Pose Estimation Using MediaPipe Pose and Optimization Method Based on a Humanoid Model. Appl. Sci. 2023, 13, 2700. [Google Scholar] [CrossRef]
  42. Iskakov, K.; Burkov, E.; Lempitsky, V.; Malkov, Y. Learnable Triangulation of Human Pose. In Proceedings of the International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019. [Google Scholar]
  43. Zhu, W.; Ma, X.; Liu, Z.; Liu, L.; Wu, W.; Wang, Y. Learning Human Motion Representations: A Unified Perspective. arXiv 2023, arXiv:2210.06551. [Google Scholar]
  44. Zhang, Y.; Li, Z.; An, L.; Li, M.; Yu, T.; Liu, Y. Lightweight Multi-Person Total Motion Capture Using Sparse Multi-View Cameras. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021; pp. 5540–5549. [Google Scholar]
  45. Malleson, C.; Collomosse, J.; Hilton, A. Real-Time Multi-Person Motion Capture from Multi-View Video and IMUs. Int. J. Comput. Vis. 2020, 128, 1594–1611. [Google Scholar] [CrossRef]
  46. Tu, H.; Wang, C.; Zeng, W. VoxelPose: Towards Multi-Camera 3D Human Pose Estimation in Wild Environment. arXiv 2020, arXiv:2004.06239. [Google Scholar]
  47. Tsuchida, S.; Fukayama, S.; Hamasaki, M.; Goto, M. AIST Dance Video Database: Multi-Genre, Multi-Dancer, and Multi-Camera Database for Dance Information Processing. In Proceedings of the 20th International Society for Music Information Retrieval Conference, Delft, The Netherlands, 4–8 November 2019. [Google Scholar]
  48. Li, R.; Yang, S.; Ross, D.A.; Kanazawa, A. AI Choreographer: Music Conditioned 3D Dance Generation with AIST++. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021. [Google Scholar]
  49. Ionescu, C.; Papava, D.; Olaru, V.; Sminchisescu, C. Human3.6M: Large Scale Datasets and Predictive Methods for 3D Human Sensing in Natural Environments. IEEE Trans. Pattern Anal. Mach. Intell. 2014, 36, 1325–1339. [Google Scholar] [CrossRef]
  50. Calabrese, E.; Taverni, G.; Easthope, C.A.; Skriabine, S.; Corradi, F.; Longinotti, L.; Eng, K.; Delbruck, T. DHP19: Dynamic Vision Sensor 3D Human Pose Dataset. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Long Beach, CA, USA, 16–17 June 2019; pp. 1695–1704. [Google Scholar]
  51. Chen, T.; Kornblith, S.; Norouzi, M.; Hinton, G. A Simple Framework for Contrastive Learning of Visual Representations. arXiv 2020, arXiv:2002.05709. [Google Scholar]
  52. Chen, X.; He, K. Exploring Simple Siamese Representation Learning. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 15745–15753. [Google Scholar]
  53. Li, J.; Zhou, P.; Xiong, C.; Hoi, S.C.H. Prototypical Contrastive Learning of Unsupervised Representations. arXiv 2021, arXiv:2005.04966. [Google Scholar]
  54. Singh, A.; Chakraborty, O.; Varshney, A.; Panda, R.; Feris, R.; Saenko, K.; Das, A. Semi-Supervised Action Recognition with Temporal Contrastive Learning. arXiv 2021, arXiv:2102.02751. [Google Scholar]
  55. Khosla, P.; Teterwak, P.; Wang, C.; Sarna, A.; Tian, Y.; Isola, P.; Maschinot, A.; Liu, C.; Krishnan, D. Supervised Contrastive Learning. arXiv 2021, arXiv:2004.11362. [Google Scholar]
  56. Lee, S.-E.; Shibata, K.; Nonaka, S.; Nobuhara, S.; Nishino, K. Extrinsic Camera Calibration From a Moving Person. IEEE Robot. Autom. Lett. 2022, 7, 10344–10351. [Google Scholar] [CrossRef]
  57. Takahashi, K.; Mikami, D.; Isogawa, M.; Kimata, H. Human Pose as Calibration Pattern: 3D Human Pose Estimation with Multiple Unsynchronized and Uncalibrated Cameras. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Salt Lake City, UT, USA, 18–22 June 2018; pp. 1856–18567. [Google Scholar]
  58. Shikanai, N.; Sawada, M.; Ishii, M. Development of the Movements Impressions Emotions Model: Evaluation of Movements and Impressions Related to the Perception of Emotions in Dance. J. Nonverbal Behav. 2013, 37, 107–121. [Google Scholar] [CrossRef]
  59. Bank, D.; Koenigstein, N.; Giryes, R. Autoencoders. arXiv 2021, arXiv:2003.05991. [Google Scholar]
  60. Ullah, I.; Manzo, M.; Shah, M.; Madden, M.G. Graph convolutional networks: Analysis, improvements and results. Appl. Intell. 2022, 52, 9033–9044. [Google Scholar] [CrossRef]
  61. Geng, X. Label Distribution Learning. arXiv 2016, arXiv:1408.6027. [Google Scholar] [CrossRef]
  62. Cohen, J. Weighted Kappa: Nominal Scale Agreement Provision for Scaled Disagreement or Partial Credit. Psychol. Bull. 1968, 70, 213–220. [Google Scholar] [CrossRef]
Figure 1. System configuration of immersive dance training system.
Figure 2. Workflow of system.
Figure 3. Dance training system in use (real scene on the left; virtual space scene on the right).
Figure 4. Dance training scenes.
Figure 5. Dance training system in use (evaluation scene on the left; reflection scene on the right).
Figure 6. Motion capture process.
Figure 7. MR-based camera setup.
Figure 8. Feedback example.
Figure 9. Overview of our dance evaluation and feedback.
Figure 10. Dance annotation system.
Figure 11. Distribution of evaluation scores.
Figure 12. Autoencoder model architecture.
Figure 13. Model architecture for contrastive learning.
Figure 14. Architecture of reference-guided model.
Figure 15. Example of creating a correct label based on LDL.
Figure 16. Autoencoder learning curve.
Figure 17. Autoencoder visualization results (blue lines indicate the correct joint positions and orange lines indicate the results of visualizing the restored positions in a human body skeleton).
Figure 18. Encoder learning curve with contrastive learning (Euclidean distance).
Figure 19. Confusion matrices of evaluation model predictions (the three dancers in the test data; left, baseline; right, proposed method). The background color of each cell in the matrix is lighter for smaller values and darker for larger values.
Figure 20. Radar charts of evaluation model predictions (the three dancers in the test data). The blue line represents the correct score and orange the prediction score.
Figure 21. Confusion matrices of evaluation model predictions (the last count in the test data; left, baseline; right, proposed method). The background color of each cell in the matrix is lighter for smaller values and darker for larger values.
Figure 22. Radar charts of evaluation model predictions (the last count in the test data).
Table 1. Comparison with previous studies. TAKAHASHI [57]: Metrics obtained using a dual-camera setup. LEE (L) [56]: Metrics obtained with a five-camera setup using linear calibration only. LEE (L + B) [56]: Metrics obtained when bundle adjustment is additionally applied to LEE (L). OURS: Evaluation metrics obtained with our proposed dual-camera method.
Method | E_R | E_t | E_3D | N_c
Takahashi [57] | 0.368 | 5.355 | - | 2
Lee (L) [56] | 0.043 | 1.414 | 1.817 | 5
Lee (L + B) [56] | 0.020 | 0.053 | 0.041 | 5
Ours | 0.036 | 0.102 | 0.098 | 2

Table 2. Professional advice.
Frequency Order | Advice Content
1 | You are nervous and your body is not moving, so dance as hard as you can.
2 | Dance a whole lot bigger.
3 | Where you stop moving, stop tight.
3 | Move your hands smoothly and with awareness of the flow of your hands.
3 | Bend your knees and drop your center of gravity more.
6 | Move your hands more clearly.
6 | Go over your choreography again.

Table 3. Dance evaluation items.
Evaluation Item | Evaluation Content
Dynamics | Elements of movement such as power, weight, and vigor
Sharpness | Continuity of movement, like precision and crispness
Scalability | Spatial aspects of movement, such as the magnitude and width of the dancer's stride
Timing | Temporal features like rhythm and pace
Accuracy | Exactness of movements, including choreography, facial orientation, and rhythm
Stability | Aspects such as body control and balance, reflecting the steadiness and control of a dancer's movements

Table 4. Content of dataset.
Subjects | 1 coach; 20 practitioners (7 elementary school students, 6 junior high school students, 5 high school students, 2 university students)
Number of dance data | 44
Length of dance | 14.9 s
Number of videos | 176
Number of annotations | 32 × 6 categories
Table 5. General criteria for kappa score.
Kappa Score | Interpretation
−1.0 | Perfect mismatch
0.0 | Match by chance
0.01–0.20 | Slight match
0.21–0.40 | General match
0.41–0.60 | Moderate match
0.61–0.80 | Fairly close match
0.81–0.99 | Almost match
1.0 | Perfect match
Table 6. Example of quadratic weighted kappa weights.
Weight | 1 | 2 | 3 | 4 | 5
1 | 1.0000 | 0.9375 | 0.7500 | 0.4375 | 0.0000
2 | 0.9375 | 1.0000 | 0.9375 | 0.7500 | 0.4375
3 | 0.7500 | 0.9375 | 1.0000 | 0.9375 | 0.7500
4 | 0.4375 | 0.7500 | 0.9375 | 1.0000 | 0.9375
5 | 0.0000 | 0.4375 | 0.7500 | 0.9375 | 1.0000

Table 7. Autoencoder model parameters.
Parameter | Search Range | Value
Number of GCN layers | 1–3 | 2
Number of GCN filters | 16–512 | 128
Bidirectional (encoder) | True, False | True
Bidirectional (decoder) | True, False | False
Number of LSTM units | 32–1024 | 512
Number of units in dense layers | 32–1024 | 256
Dropout rate | 0.0–0.5 | 0.1
Learning rate | 0.00001–1 | 0.005

Table 8. Evaluation model learning results (autoencoder).
Model | Dynamics | Sharpness | Scalability | Timing | Accuracy | Stability
LMA + RF | 0.575 | 0.548 | 0.450 | 0.562 | 0.585 | 0.539
AE | 0.752 | 0.752 | 0.615 | 0.566 | 0.636 | 0.674

Table 9. Parameters of encoder model with contrastive learning.
Parameter | Search Range | Value
Number of LSTM layers | 1–3 | 2
Bidirectional | True, False | True
Number of LSTM units | 32–1024 | 512
Dropout rate | 0.0–0.5 | 0.4
Representation vector size | 32–1024 | 64
Embedding size | 32–1024 | 256
Threshold | 0.0–6.0 | 2.0
Learning rate | 0.00001–1 | 0.0005
Batch size | 8–256 | 32

Table 10. Learning results of encoder model with contrastive learning.
Distance Calculation Method | Swap Error
Euclidean distance | 0.039
Manhattan distance | 0.146

Table 11. Parameters of evaluation model (reference-guided).
Parameter | Search Range | Value
Vector combining method | Sub, Concat, Mul | Sub
Number of dense layers | 1–3 | 2
Number of units in dense layers | 8–1024 | (256, 64)
Dropout rate | 0.0–0.5 | 0.2
Learning rate | 0.00001–1 | 0.0005
Batch size | 8–256 | 32

Table 12. Evaluation model learning results (contrastive learning).
Model | Dynamics | Sharpness | Scalability | Timing | Accuracy | Stability
LMA + RF | 0.575 | 0.548 | 0.450 | 0.562 | 0.585 | 0.539
CL (Euclidean) | 0.785 | 0.762 | 0.666 | 0.747 | 0.759 | 0.797
CL (Manhattan) | 0.658 | 0.635 | 0.593 | 0.660 | 0.571 | 0.700
Table 13. Evaluation model learning results (dynamics).
Model | Quadratic Weighted Kappa | Kappa | Correlation Coefficient | MAE
LMA + RF | 0.575 | 0.035 | 0.562 | 1.829
CL | 0.785 | 0.117 | 0.793 | 1.238
AE | 0.752 | 0.048 | 0.767 | 1.361
CL + AE | 0.882 | 0.192 | 0.846 | 1.027

Table 14. Evaluation model learning results (sharpness).
Model | Quadratic Weighted Kappa | Kappa | Correlation Coefficient | MAE
LMA + RF | 0.546 | 0.003 | 0.585 | 1.895
CL | 0.762 | 0.075 | 0.800 | 1.423
AE | 0.752 | −0.003 | 0.805 | 1.402
CL + AE | 0.890 | 0.185 | 0.846 | 1.032

Table 15. Evaluation model learning results (scalability).
Model | Quadratic Weighted Kappa | Kappa | Correlation Coefficient | MAE
LMA + RF | 0.450 | 0.005 | 0.460 | 2.046
CL | 0.666 | 0.060 | 0.675 | 1.578
AE | 0.615 | 0.036 | 0.634 | 1.675
CL + AE | 0.720 | 0.162 | 0.718 | 1.500

Table 16. Evaluation model learning results (timing).
Model | Quadratic Weighted Kappa | Kappa | Correlation Coefficient | MAE
LMA + RF | 0.562 | 0.152 | 0.523 | 1.862
CL | 0.747 | 0.033 | 0.759 | 1.462
AE | 0.566 | 0.102 | 0.614 | 1.617
CL + AE | 0.775 | 0.058 | 0.778 | 1.310

Table 17. Evaluation model learning results (accuracy).
Model | Quadratic Weighted Kappa | Kappa | Correlation Coefficient | MAE
LMA + RF | 0.585 | 0.102 | 0.595 | 2.060
CL | 0.759 | 0.038 | 0.854 | 1.505
AE | 0.636 | 0.018 | 0.732 | 1.765
CL + AE | 0.838 | 0.290 | 0.851 | 1.182

Table 18. Evaluation model learning results (stability).
Model | Quadratic Weighted Kappa | Kappa | Correlation Coefficient | MAE
LMA + RF | 0.539 | 0.170 | 0.581 | 1.461
CL | 0.797 | 0.139 | 0.810 | 1.173
AE | 0.674 | 0.100 | 0.625 | 1.415
CL + AE | 0.886 | 0.436 | 0.881 | 0.735
Table 19. Comparison with and without reference.
Model | Dynamics | Sharpness | Scalability | Timing | Accuracy | Stability
LMA + RF | 0.575 | 0.548 | 0.450 | 0.562 | 0.585 | 0.539
No reference | 0.858 | 0.791 | 0.837 | 0.838 | 0.756 | 0.825
Reference-guided | 0.874 | 0.880 | 0.873 | 0.885 | 0.908 | 0.920

Table 20. Degree of agreement on items requiring improvement.
Model | Accuracy | Precision | Recall | F1 Score
LMA + RF | 0.536 | 0.441 | 0.544 | 0.487
CL + AE | 0.619 | 0.676 | 0.556 | 0.610

Table 21. Lowest score accuracy.
Model | Accuracy
LMA + RF | 0.500
CL + AE | 0.679
