Investigating the Learning Process of Folk Dances Using Mobile Augmented Reality

Learning how to dance is not an easy task, and traditional teaching methods remain the main approach. Digital technologies (such as video recordings of dances) have already been successfully used in combination with traditional methods. However, other emerging technologies, such as virtual and augmented reality, have the potential to provide greater assistance, speeding up the process as well as supporting learners. This paper presents a prototype mobile augmented reality application for assisting the process of learning folk dances. Initially, a folk dance was digitized based on recordings from professional dancers. Avatar representations (either male or female) are synchronized with the digital representation of the dance. To assess the effectiveness of mobile augmented reality, it was comparatively evaluated against a large back-projection system under laboratory conditions. Twenty healthy participants took part in the study; their movements were captured using a motion capture system and then compared with the recordings from the professional dancers. The experimental results indicate that the augmented reality (AR) application has the potential to be used in the learning process.


Introduction
Augmented reality (AR) has the ability to integrate virtual information into the real environment in real time [1]. Nowadays, with simultaneous localization and mapping (SLAM) algorithms, this integration is easier and gives the illusion that real and virtual objects co-exist in the same space. One of the important applications of AR is in the entertainment domain. Dance is a performing art consisting of sequences of different movements. It can carry various messages according to context; e.g., traditional dances are strongly connected to the culture and cultural heritage of a place or nation [2]. Dances are mainly taught in two ways: attending dance lessons, or self-learning by watching and imitating demonstrations presented in videos or three-dimensional (3D) virtual environments [3]. A major drawback of videos and online websites where dances can be found is the lack of interactive feedback, but virtual reality (VR) and/or AR technology can overcome this limitation [4,5]. These approaches can also improve interaction and enable more functionality, e.g., zooming in, watching the dancer from different angles, and receiving feedback.
On the other hand, an obvious advantage of an AR application is that numerous dances can be added to the same application, so users can learn many dances this way. Typically, a teacher or professional dancer is recorded using a motion capture system [6], and the resulting animations are applied to avatars available in the application. During the learning procedure, users' movements are also captured and compared to the teacher's; feedback is then provided, or the captured data are stored for further offline analysis. Besides providing additional functionality and feedback, AR applications for dance learning can be used whenever and wherever users want, allowing them to learn and practice at their own pace.

Background
Various applications for dance learning have been proposed using different technologies. Magnenat Thalmann et al. [8] developed a 3D viewer for watching a dance; interaction is provided by changing the point of view, the zoom level, or the playback speed. Another interface for dance learning, presented in [9], includes functionalities such as start, stop, zoom, focus, changing the camera position, and changing the speed. Creating game-like applications is also a popular way to encourage specific behaviors and increase motivation and engagement [10]. Examples of game-like environments and interfaces can be found in [4,11,12]. They all propose a similar way of learning: an avatar of the teacher is shown, and the user is supposed to imitate the teacher's movements. The user's movements are streamed to the application, and feedback is given in the form of a score with a corresponding message. Once the task is completed successfully, the user can proceed to the next step.
VR offers an alternative, experimental way in which dances can be taught. Kyan et al. [13] proposed a cave automatic virtual environment (CAVE) for ballet dance training. The environment consists of projector screens placed on the walls of a room-sized cube, where the user can watch the virtual teacher and replicate the movements. A training system based on motion capture and VR technologies is presented in [14]: the virtual teacher's movements are projected on a wall screen, and the user must imitate them. In another prototype VR simulator, users can preview folk dance segments performed by a 3D avatar and repeat them [15]; intuitive feedback based on comparison with the template motions is provided. Folk dance training is described in [16], where a head-mounted display (HMD) was used to present the dance in VR; users' movements were captured and compared to the teacher's, and feedback was given in real time. Moreover, a VR application used a 3D computer-generated animation of the teacher's movements [17]. Mousas [18] presented a method for controlling a virtual partner's motion based on a performance capture process: a hidden Markov model (HMM) was trained to learn the structure of the dance motions, and at runtime the system predicts the progress of the user's motion. A user study was conducted to understand the naturalness of the synchronized motion and the control the user has over the partner's synthesized motion.
AR has also been used experimentally in entertainment and dance [1,19]. A conceptual tool for simulated dancing uses smart glasses to project virtual dancers into the environment [20]; both multi-user and single-user scenarios are proposed. In the former, two users dance together without being physically present in the same room, and in the latter, the user dances with avatars that have predesigned animations. A large-scale AR mirror for user training is presented in [21], where the mirror allows users to see their movements, providing them with visual feedback. In [22], AR is used to present a virtual dancer as an instructor or a dance partner and enables users to see their body movements as an external observer. Clay et al. [23] adapted and integrated AR-related tools to enhance the emotion involved in cultural performances; as part of that work, the stage in a live performance was augmented, with dance as an application case.
Our prototype, presented in [24], offers users the ability to visualize an avatar of a professional dancer with a prerecorded animation, as shown in Figure 3. Users can switch between different parts of the dance without restriction: there is no need to complete one step to proceed to the next, so users can organize the learning procedure at their own pace. In addition, the music of the dance is available so that they can coordinate their movements with it. Similarly to [8,9], it is possible to change the speed and/or play/pause the dance, offering a personalized experience.

Capturing of Professional Dancers
The first step was the recruitment and recording of professional dancers; recruitment was done through social media. The recordings from the professional dancers are used as ground truth for the data analysis. After user testing, animations were exported from the recordings and used to animate the avatars shown in the application. To obtain the data, the following procedure was used. Professional dancers with many years of experience were invited to perform a folk dance that is usually danced in pairs. The dancers wore motion capture suits with 37 passive markers from [25]. The markers are placed at specifically determined positions on the dancer's body and are coated with a reflective material that reflects light produced near the cameras' lenses [3]. First, the male dancer was asked to put on a suit with markers and performed his part of the dance alone with music, three times in a row. Then, the male dancer performed three more times with music, this time with a partner; the male dancer was still wearing the motion capture suit, while the female dancer was not. The next step was to perform the dance in a pair, three times, with music, with both dancers wearing suits with markers. The same procedure was followed for the professional female dancer [26].
After the recording phase, post-processing was done in Motive [27]. Motive has several built-in interpolation methods for filling gaps, e.g., constant, linear, cubic, model-based, and pattern-based interpolation. In this work, cubic interpolation was applied, since it is the simplest method that offers continuity between segments. It is possible to set the maximum size of gap filled by the selected method: gaps in marker trajectories were filled using cubic interpolation, with a maximum gap size of 10 frames for the professional dancers and 100 frames for the participants in the experiment. The filtered data were exported as animations and comma-separated values (CSV) files. The animations with the fewest markers with gaps and with the smallest gap sizes were used in the application, and the corresponding CSV files were used for further offline analysis. The data recorded from the professional dancers and used in both applications can be found in [28].
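The gap-filling step can be sketched as follows. This is an illustrative approximation of cubic gap filling with a maximum gap size, not Motive's actual implementation; the function name `fill_gaps` and the choice of fitting a local cubic polynomial through neighboring known frames are our assumptions.

```python
import numpy as np

def fill_gaps(traj, max_gap=10):
    """Fill NaN gaps in a 1-D marker trajectory with cubic interpolation,
    skipping gaps longer than max_gap frames (illustrative sketch)."""
    out = traj.copy()
    isnan = np.isnan(out)
    n = len(out)
    idx = 0
    while idx < n:
        if isnan[idx]:
            start = idx
            while idx < n and isnan[idx]:
                idx += 1
            end = idx  # gap occupies frames [start, end)
            # only fill interior gaps no longer than max_gap
            if end - start <= max_gap and start > 0 and end < n:
                lo, hi = max(0, start - 2), min(n, end + 2)
                # fit a cubic through the known samples around the gap
                known = np.arange(lo, hi)[~isnan[lo:hi]]
                deg = min(3, len(known) - 1)
                coeffs = np.polyfit(known, out[known], deg)
                out[start:end] = np.polyval(coeffs, np.arange(start, end))
        else:
            idx += 1
    return out
```

In this sketch, a gap of at most `max_gap` frames is replaced by evaluating a cubic fitted to up to two known frames on each side, while longer gaps are left untouched, mirroring the 10-frame and 100-frame limits described above.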

Projection Screen Visualization
An alternative way of learning the dance resembled what learners typically use (a TV or computer monitor); in our case, instead of a large TV, a projection screen was used. The same application was ported to a computer and projected on the screen, and interaction was provided using a Bluetooth mouse. The available options were the same as in the learning mode of the AR application: to pause the dance, participants pressed the left button, and to change the speed, the right button; to go to the previous or next segment, they held and released the left or right button, respectively. Figure 4 depicts a user interacting with the application on the projection screen during the learning phase, and Figure 5 shows a dance sequence of the female avatar presented to users on the projection screen.

Mobile Augmented Reality Visualization
A mobile AR prototype application was developed for assisting the process of learning folk dances [29]; a brief overview is shown in Figure 3. The implementation was based on the ARCore platform because it provides motion tracking, environmental understanding, and light estimation [30]. Motion tracking allows the phone to track its position relative to the world; this technology is used for identifying features and tracking how they move over time. In addition, ARCore can detect flat surfaces, e.g., the floor, which is what was used in this research.
To facilitate learning in AR, an avatar of the professional dancer performing the folk dance was superimposed in front of the user, as shown in Figure 6. The avatar was placed at a user-defined position. To ensure that the avatar stayed in the same pose (position and orientation), a virtual anchor was used. This allows the user to walk around the avatar and examine the dance from different viewpoints and angles.

Procedure
First, the professional folk dancers were recorded, and the animations were exported in FBX file format. Using Adobe Fuse CC [31], male and female avatars were created and uploaded to Mixamo [32]. The characters were rigged using the Auto-Rigger and downloaded in FBX format. In Unity3D [33], the exported animations were attached to the avatars and added to the application.
A comparative user study was conducted to assess the effectiveness of the AR application. Participants were divided into two groups: one group used the AR application with a Google Cardboard head-mounted viewer, and the other used a back-projection screen. Both applications consist of a showcase and a learning mode, and the AR application can be used in portrait and landscape orientation. In the showcase mode, users can watch the whole dance without any interaction before they start learning; the idea is to introduce users to the dance so they can see what they are expected to learn and reproduce. After this step, they can move on to the learning mode. In this mode, the dance is split into two parts, since the dance consists of a singing and a dancing phase. Users can choose to watch the dance part by part or the whole dance at once.
In both cases, a user interface is available; Figure 5 shows the user interface of the AR application in landscape mode. The part of the interface shown at the top of the screen (on the back-projection screen and in the portrait mode of the AR application) is the same for the showcase and learning modes, while the panel at the bottom changes depending on the selected mode. The mode can be switched by selecting the matching button. In the showcase mode, users can pause/play the dance, change the speed of the animation, and change the current part of the dance using a slider. In the learning mode, users can also pause/play the dance and change the animation speed; instead of a slider, two buttons (previous and next) are used to switch between parts of the dance, and it is also possible to choose to play all the steps. In addition, textual instructions are available (e.g., six times left hand swing). Using the interface, it is possible to switch between the male and female versions of the dance and to move the avatar: after pressing the corresponding button, a message asks the user to tap on the green surface to place the avatar at the desired position.

Participants
A total of 20 healthy participants (12 males and 8 females) without prior knowledge of the dance (Pašovskà sedlckà) were recruited for the study. The experiment was approved by the Research Ethics Committee of Masaryk University (reference number EKV-2019-067).
One of the participants was over 50 years old, while the rest were between 18 and 49 years old. The number of participants using the AR application and the projection screen was equal, with 10 in each group. Two participants were excluded: one due to technical issues and the other because their results represented an outlier. The order of male and female participants and the assignment to the AR application or the projection screen were randomized, as shown in Table 1. The duration of the experiment was approximately 60 min. None of the participants reported any problems, and all successfully completed the task. Participants were informed that they could drop out at any time during the experiment. All participants were informed about the task and gave their written consent.

Data Collection
Data collection was done using an optical motion capture system with passive markers, OptiTrack [1], and its proprietary software, Motive. Our configuration contains 16 Prime 13W cameras and suits with 37 passive markers. Participants were asked to wear a suit during the training and performing phases, and data were collected during both. The frame rate was set to 120 frames per second. Participants were first introduced to the experiment and given the consent form and a pre-experiment questionnaire to fill in. After that, they were instructed to put on the suit with markers and to step to the central position of the motion capture area. The experimenters placed the markers at the determined positions on the participant's body, and a skeleton was created in Motive. The next step was to explain how to interact with the application. Participants then had 10 min to learn the dance using the application. The dance in the application is short and contains only two movements, and is therefore suitable for participants without prior knowledge of folk dances. A longer dance would require participants to spend more time learning it, and it might be difficult for some participants to remember. In addition, the experiment itself would then last longer than 60 min, which could result in participants withdrawing.
Since this dance is performed in pairs, male participants had to learn the male part of the dance, with the male avatar shown to them during the learning phase, and female participants had to learn the female version by observing the female avatar. The dance was manually synchronized with the music, which was provided in both applications. After the learning phase, participants took off the headset (or the projection screen was turned off) and were asked to perform the dance three times in a row with the music only, still wearing the suit. During the performing phase, they had to begin every recording in a T-pose. The T-pose is the default pose of a 3D model's skeleton before it is animated; it is also known as a bind pose, in which a character stands with their arms outstretched [34]. After the recording had started and the music began playing, participants moved to the position from which they started the dance. When participants finished the third performance, they took off the suit and were asked to fill in the post-experiment questionnaires.

Questionnaires
Four questionnaires were used in total. Before the experiment, a pre-exposure simulator sickness questionnaire (SSQ) [35] was used. This four-point-scale questionnaire includes a list of symptoms rated on a scale from 0 = "None" to 3 = "Severe"; the same questionnaire was filled in after the experiment. The SSQ is the most commonly used measure of cybersickness symptoms. It was developed to measure sickness in the context of simulation and was derived from a measure of motion sickness [36]. Four representative scores can be computed: the nausea-related sub-score (N), the oculomotor-related sub-score (O), the disorientation-related sub-score (D), and the total score (TS); they were calculated according to [35,37]. After the experiment, in addition to the post-exposure SSQ, participants were asked to fill in two more questionnaires. To characterize the experience in the environment, the presence questionnaire (PQ) [38] was used. This questionnaire originally consists of 24 questions on a seven-point scale, from 1 = "Not at all" to 7 = "Completely". Since no haptics were included in the virtual environment, participants did not answer the questions related to haptics; they answered 22 questions, and the scores were computed as suggested in [38]. The NASA task load index (NASA-TLX) [39,40] was used to assess the cognitive workload of participants. It consists of six questions on 21-point scales, ranging from very low to very high for mental, physical, and temporal demand, effort, and frustration, and from perfect to failure for performance. At the end, a debriefing session was held, and participants were asked to give their comments and impressions.
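As an illustration of NASA-TLX scoring, the commonly used unweighted ("raw") variant can be sketched as follows; whether the weighted or raw variant was used here is not stated, so this sketch, including the function name `raw_tlx`, is an assumption.

```python
import numpy as np

def raw_tlx(ratings):
    """Raw (unweighted) NASA-TLX workload: the mean of the six subscale
    ratings (mental, physical, temporal demand, performance, effort,
    frustration). Because 'performance' already runs from perfect (low)
    to failure (high), no reversal is needed before averaging."""
    ratings = np.asarray(ratings, dtype=float)
    if ratings.shape[-1] != 6:
        raise ValueError("expected six subscale ratings")
    return ratings.mean(axis=-1)
```

With per-participant rows stacked into a 2-D array, the same call returns one workload score per participant.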

Motion Data Analysis Using Dynamic Time Warping
Motive allows users to export data in several formats, e.g., FBX, CSV, BVH, and C3D. The filtered motion data were exported to CSV files and analyzed using MATLAB [41]. Quaternions were extracted from the CSV files and stored in 19 matrices (one per joint), where the number of rows in each matrix equals the total number of frames. The initial T-pose was cut from each performance. For the data analysis, only the data from the performing phase were used and compared to the data from the professional dancers. The quaternions were normalized, and for each joint, the angle between the quaternions in two successive frames was calculated and stored in a vector. The corresponding vectors from the participants were compared to the vectors from the professional dancers (e.g., the participant's hip vector was compared with the professional dancer's hip vector) using dynamic time warping (DTW).
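The per-joint feature extraction above can be sketched as follows. The analysis was done in MATLAB; this is an equivalent Python sketch, with the function name `quaternion_angles` being our own, and the angle computed as the standard rotation angle between successive unit quaternions.

```python
import numpy as np

def quaternion_angles(quats):
    """Angles between quaternions in successive frames.

    quats: (n_frames, 4) array of one joint's rotations (w, x, y, z).
    Returns a vector of n_frames - 1 rotation angles in radians.
    """
    # normalize each quaternion to unit length
    q = quats / np.linalg.norm(quats, axis=1, keepdims=True)
    # |q_t . q_{t+1}| handles the double-cover sign ambiguity
    dots = np.abs(np.sum(q[:-1] * q[1:], axis=1))
    dots = np.clip(dots, -1.0, 1.0)  # guard against round-off
    return 2.0 * np.arccos(dots)
```

Applying this to each of the 19 joint matrices yields the 19 per-joint angle vectors that are then compared with DTW.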
DTW is a well-known technique for finding an optimal alignment between two sequences [7,37]. A cost matrix containing the similarities between the compared motions is calculated, and the goal is to find the alignment with the minimal overall cost. MATLAB's built-in DTW function was used in this work: its arguments were the two vectors to be compared, and it returned the distance between them, where a smaller distance means higher similarity. The total score for the whole body was computed as the mean of the values calculated for all joints [42].
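The DTW distance computed by MATLAB's built-in function can be illustrated with a minimal classic implementation (absolute-difference local cost, symmetric step pattern); the function name `dtw_distance` and these exact choices are assumptions, not the MATLAB internals.

```python
import numpy as np

def dtw_distance(a, b):
    """Classic DTW between two 1-D sequences; smaller = more similar."""
    n, m = len(a), len(b)
    # D[i, j] = minimal cumulative cost of aligning a[:i] with b[:j]
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j],      # insertion
                                 D[i, j - 1],      # deletion
                                 D[i - 1, j - 1])  # match
    return D[n, m]
```

The whole-body score would then be the mean of `dtw_distance(participant_vec, teacher_vec)` over the 19 per-joint angle vectors.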

Motion Capture Data Comparison
The filtered data were analyzed as described in Section 5.1. Our hypothesis was that the mobile AR application is better for learning how to dance than a large projection screen. Mean values and standard deviations (SD) per device, for individual best scores (best performance AR and best performance projection screen (PS)) and for the average of the three performances (average AR and average PS), obtained from the filtered data, are shown in Table 2; a lower score means a better result. Statistical tests were performed using IBM SPSS. A Shapiro-Wilk test was performed on the best performance and average scores per visualization type; the data have a normal distribution, except for the best performance PS scores (p = 0.027), presented in Figure 7. An independent-samples t-test was used to examine significant differences between the average scores and the best performance scores per device (AR application or 3D animation on the projection screen), and p-values were read from the outcome table. Table 2 presents the results for the filtered data, showing the mean values and SD of the scores per device. The mean values for the AR application are lower, but without a significant difference (p = 0.682 for average scores and p = 0.891 for best performances). Figure 8 shows a device-based comparison of the participants' different body parts. The greatest impact on the total scores came from the left and right arms and the left and right legs, for both AR and the back-projection screen, so the comparison is based on these body parts. Participants who used AR performed better for the right arm, while participants who used the back-projection screen performed better for the left arm. Again, this supports the result that there is no significant difference between AR and the back-projection screen.
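The statistics were computed in IBM SPSS; as an illustration, the pooled-variance independent-samples t statistic can be sketched in plain Python (the function name `independent_t` is ours, and in practice the p-value would then be taken from a t-distribution, e.g., via a stats library or table).

```python
import numpy as np

def independent_t(x, y):
    """Student's independent-samples t statistic with pooled variance.
    Returns (t, degrees of freedom); df = n1 + n2 - 2."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n1, n2 = len(x), len(y)
    v1, v2 = np.var(x, ddof=1), np.var(y, ddof=1)  # sample variances
    sp2 = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)  # pooled
    t = (np.mean(x) - np.mean(y)) / np.sqrt(sp2 * (1 / n1 + 1 / n2))
    return t, n1 + n2 - 2
```

For two groups of 10 scores each, as in this study, the statistic would be evaluated against a t-distribution with 18 degrees of freedom.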

Questionnaires Evaluation
The pre- and post-SSQ were used to examine cybersickness symptoms, since one group of participants used AR. Statistical analysis was done for the four representative scores, shown in Figure 9. The mean value and standard deviation of the scores per device were calculated. Although there is no significant difference, the mean values for the oculomotor, disorientation, and total scores for AR on the phone were higher after the experiment, indicating that participants experienced the symptoms more strongly once the experiment was done; the largest difference was in the disorientation score. In the NASA-TLX questionnaire, the mean values for mental demand and performance for the phone were higher than for the projection screen, but again without a significant difference. According to this questionnaire, participants experienced almost the same level of physical demand, while effort, temporal demand, and frustration were lower for participants who used the AR application. The results from this questionnaire are shown in Figure 10. Since the lowest score on the scale represents a better result, participants who used the AR application rated themselves as more successful in completing the task than participants who used the application on the projection screen. Regarding the PQ, again without a significant difference, participants reported the experience of using the AR application to be more realistic, with better interface quality and better interaction with the environment. In terms of qualitative feedback, all participants noted that they enjoyed the experiment. None of them had ever used mobile AR in combination with a motion tracking system; a significant factor here is, of course, the 'wow' effect that is common among users inexperienced with AR applications. Another point commented on as impressive was the graphical representation of the avatar (by both male and female participants). Some participants underestimated how well they had done in their final self-evaluation: since they did not receive any feedback, the evaluation was highly subjective and dependent on their self-confidence. In the end, all participants managed to remember and perform the dance without difficulty; according to their comments, they found the experiment a good and refreshing experience.
Moreover, four participants (one who used the projection screen and three who used AR) mentioned that they could not see the avatar of the professional dancer while they were spinning. One participant complained that parts of the suit stuck together during movement, and another reported feeling odd wearing the suit. In addition, one participant did not realize that they had to follow the music. Seven participants from both groups found the experiment fun, nice, and a refreshing experience. One participant wrote that the organization of the experiment was good. One participant suggested that it would be good to divide the moves according to body parts, e.g., legs and hands, and at the end to see the full-body movement. Two participants found interaction with the Bluetooth mouse clumsy and had trouble during the learning part, since they had to hold the mouse.

Discussion
Participants' motion capture data were compared with ground truth data from the professional dancers to measure the similarity between their movements. The determination of ground truth for motion capture is still an open research question; human motion data obtained using an optical motion capture system are considered the gold standard. For the professional dancers, the maximum gap size was set to 10 frames, and after filtering, all gaps were filled, from which we can conclude that the gaps were at most 10 frames long. Since the frame rate of the system was 120 Hz, we consider these data good enough to be used as ground truth.
The prototype mobile AR application was designed to provide easier interaction and an environment in which participants could examine the dance from different angles and distances, which should facilitate learning. However, most participants did not use this advantage. In addition, the interaction in AR is based on swipe gestures, which was expected to feel natural to participants, since they all use smartphones in everyday life. A possible reason why they struggled with the commands is that the phone was placed in the head-mounted cardboard; after the explanation from the experimenter at the beginning, participants did not have any time to try and repeat the commands before the learning phase started. Some participants spent a significant amount of the time intended for learning the dance on learning how to interact with the application. This is supported by the PQ results, where the score related to 'Possibility to Act', which includes questions such as 'How much were you able to control events?' and 'How completely were you able to actively survey or search the environment using vision?', was in favor of the application presented on the projection screen.
In our experiment, some participants reported feeling some symptoms more strongly after the experiment, but without a significant difference. Since the folk dance they had to learn includes spinning, this could affect the answers, because participants who used the projection screen also had to spin and could experience vertigo or dizziness. This finding is in line with [43], which reported that only a few participants experienced minimal discomfort. Moreover, Pettijohn et al. [17] reported that using an AR/VR headset does not exacerbate motion sickness. It has also been indicated that variation of latency over time can result in decreased performance and increased simulator sickness [44].
For both testing modes (AR and projection screen), the main limitation was avatar visualization within the field of view. When users move in space, the avatar does not follow them, so at some moments they may not be able to see it. Interaction through the cutout in the Google Cardboard is another aspect to be improved: users could not see the cutout while they were in AR and had to remember where it was placed. Gesture recognition using the camera could be a better solution.

Conclusions
This paper examined the use of an AR application for learning folk dances. To assess the feasibility of the application, it was compared with a back-projection screen, which mimics traditional digital methods of learning how to dance. The experimental results, although without a significant difference, indicate a tendency for the AR application to be better than the 3D animation presented on the projection screen. The results from the NASA-TLX support the expectation of higher mental demand in AR, since participants are immersed in the environment during the experiment. Even though the tested sample (20 participants) was too small to draw more generalized and inclusive conclusions, our results indicate how AR can be used as an assisting tool for teaching folk dances. The purpose is not to replace the teacher, but to provide a tool that could help learners adapt and learn faster and more easily. In the future, feedback will be provided at the end of a performance in the form of a score. In addition, more dances will be incorporated into the application and evaluated.