Real-Time Musical Conducting Gesture Recognition Based on a Dynamic Time Warping Classifier Using a Single-Depth Camera

Abstract: Gesture recognition is a human–computer interaction method that is widely used for educational, medical, and entertainment purposes. Humans also use gestures to communicate with each other, and musical conducting uses gestures in this way. In musical conducting, conductors wave their hands to control the speed and strength of the music played. However, beginners may have a limited comprehension of the gestures and might not be able to properly follow the ensembles. Therefore, this paper proposes a real-time musical conducting gesture recognition system to help music players improve their performance. We used a single-depth camera to capture image inputs and establish a real-time dynamic gesture recognition system. The Kinect software development kit created a skeleton model by capturing the palm position. Different palm gestures were collected to develop training templates for musical conducting. The dynamic time warping algorithm was applied to recognize the different conducting gestures at various conducting speeds, thereby achieving real-time dynamic musical conducting gesture recognition. In the experiment, we used 5600 examples of three basic types of musical conducting gestures, including seven capturing angles and five performing speeds for evaluation. The experimental results showed that the average accuracy was 89.17% at 30 frames per second.


Introduction
Conducting is the art of directing a musical performance. The primary duties of a conductor include interpreting the scores in a manner that reflects specific indications, keeping the performance on the beat, and communicating with the performers about what emotions are to be presented in the performance. However, to become a professional musical conductor, an individual needs to be equipped with musical knowledge and a sense of musical beat to gesture the tempos correctly. Considerable time is required to become a professional musical conductor. In this study, we developed a musical conducting recognition system to help conducting learners improve their conducting [1]. The recognition system also helps musical performers follow the conductor's instructions despite limitations such as weak eyesight or ongoing tasks, such as turning a page or looking at the score.
Hand gestures are an important tool that helps us communicate with each other. Recently, human-computer interaction, which involves employing new ideas and technologies, has become increasingly important in solving many problems in our daily lives [2]. Gesture recognition is a human-computer interaction method wherein human gestures are processed and recognized by the machine. Humans often express their actions, intents, and emotions through hand gestures, and gesture recognition is useful for processing some tasks automatically. For example, American Sign Language helps those with speaking or hearing disabilities to communicate through hand gestures alone [3][4][5][6]. Another example is traffic police officers who direct traffic during a traffic jam or car accident. Some researchers have also used hand gesture recognition to detect whether a person has Alzheimer's disease [7] or to interpret hand-drawn diagrams [8]. In musical performances, some use the Kinect to recognize drum-hitting gestures [9]. In musical conducting, hand gestures are used to express the conductor's style and emotion and to make performers understand the conductor's guidance [10,11]. Although every conductor has their own style of conducting, a standard conducting rule exists. In 2004, researchers attempted to use computer vision methods to recognize musical conducting gestures [12].
In musical conducting, the conductor is the person who guides the whole performance and keeps everyone on the same tempo. Due to the constant change in tempo and emotion of a musical piece, it is difficult for musicians to constantly focus on the conductor, as they also need to look at the music score and play simultaneously [13]. Musicians may also encounter problems when they have to turn a page while playing or have lost their place in the musical piece. It might even be difficult for the musicians to see the conductor if they are sitting far away or have to wear glasses to read the music sheet. To solve these problems, we studied a conductor's gestures and enabled the computer to recognize and understand the command as the conductor was gesturing. We believe that musicians can use our recognition system to improve their performance during practice and concerts.
Traditionally, the cultivation of a music conductor requires educational training by a mentor so that they can correctly express the timing of a given musical piece [14]. However, this process requires considerable time and the cost of training is high. Therefore, in this study, we developed a system that can help prospective conductors to practice by themselves [15,16]. In musical conducting, there are many types of gestures, including single, double, triple, quadruple, quintuple, sextuple, septuple, and octuple, where the prefixes are taken from the Latin names of the numerals. The gestures are named by their articulation point counts, so in principle there is no limit to the number of gesture types. In this study, we chose the three most common types of conducting gestures, namely 2/4 duple, 3/4 triple, and 4/4 quadruple, as shown in Figure 1.
Appl. Sci. 2019, 9, 528
In computer vision systems, tracking the position of hands and palms requires a certain amount of preprocessing, and the accuracy of the result is not guaranteed. Therefore, in this study, we chose the Kinect to achieve hand tracking [17][18][19][20]. The Kinect generates skeleton images of the human body as graphs with edges and vertices through its depth sensors.
We propose a musical conducting recognition system that classifies the types of musical gesture and calculates the conducting speed in order to help conductors improve their performance. In this section, we have introduced musical conducting gesture recognition and explained how computer vision methods are helpful in assisting musical conductors to perform better. In Section 2, we present our proposed method for musical conducting gesture recognition using the Kinect and dynamic time warping (DTW). In Section 3, the experimental results of our proposed method are presented, and in Section 4, the conclusion and possible future work are described.

Proposed Method
In this section, we describe our proposed method for musical conducting gesture recognition. In the first part, we exploit the Kinect SDK (Software Development Kit) to capture the skeleton image of the conductor, and in the second part, the DTW [21][22][23] method is used to recognize the conducting gestures. The proposed method can detect multiple gestures at different conducting speeds. A diagram of our proposed method is presented in Figure 2. Although DTW has been used for various gesture recognition tasks [24][25][26], its effectiveness for musical conducting has not been evaluated to a great extent. Our proposed method contributes to musical conducting gesture recognition in that it runs in real time, requires little computation time, and achieves high recognition accuracy.

Hand Tracking
Our system uses the depth image sensor Xbox 360 Kinect (Microsoft, Redmond, WA, USA, 2013) to capture the required skeleton image of the human body [27], which includes 20 feature points [28]. As in previous literature, for musical gesture recognition, only palm feature points are required [29]. For robustness, there are three inputs in our system, including the original input, a skeleton model, and the depth image, as shown in Figure 3. Among the 20 feature points, the system uses the significant palm points for musical conducting gesture recognition. Compared to methods that rely on extraction and segmentation from the original input, the advantage of using the Kinect for hand tracking is that it is robust to background noise and to background colors that are similar to the conductor's clothes and skin [30]. Figure 4 shows example trajectories captured from the hand feature points.


Separating Continuous Gestures into Single Gestures
In a musical performance, the conducting gesture is continuous, which helps the musicians stay on tempo and prompts them to change the music dynamics. The separation point of two gestures is easy to find for musicians, but not for computer systems. In Figure 5, we show a simplified continuous gesture. The X-axis is the time (t) in seconds, and the Y-axis is the position of the palm of the hand in terms of height (h). In the graph, the palm position is represented by its height and the gesture is a duple 2/4 gesture. The height of the starting point is similar to that of the end point, but dissimilar to other points within the gesture. After successfully segmenting each gesture, the beats per minute can be calculated, and then the type of gesture can be classified.
To implement gesture segmentation, first, we find the starting position of the whole sequence using the following equation: where h(s) is the starting height of the gesture, h(t) is the current height of the palm, and b is the instance of a beat.
After the first starting point is found, we set the threshold (τ) based on the height h(s). Subsequently, we find the end point of the first gesture (p) by the following equation: where τ is a threshold of tolerance, τ = h(s) − δ, and δ is set to 15 cm empirically. Figure 6 shows an example segmentation result with the h(s) and h(p) of each gesture.
Figure 6. Visualization of a simplified graph of the palm position during a duple gesture after performing segmentation.
Figure 6 shows how a continuous gesture is segmented into six different sections, and how each section is individually processed and classified based on the information between the starting and end points. Figure 7 shows one of the segments from Figure 6. We use relative positions rather than absolute positions in our system, so the method is not affected by the height of the conductor or the way in which the conductor makes his or her gesture, since the height threshold in the algorithm can be changed according to the conductor.
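The segmentation step described above can be sketched in Python. The equations themselves are not reproduced in this text, so this is only one possible reading and not necessarily the authors' formulation. It assumes the palm height is sampled once per frame in centimetres, that the first sample is the first starting point s, that τ = h(s) − δ stays fixed afterwards, and that an end point p is the first frame at which the palm returns to at least τ after having dropped below it. The function name `segment_gestures` and the sample data are hypothetical.

```python
from typing import List, Tuple

DELTA_CM = 15.0  # tolerance delta from the text (set empirically to 15 cm)


def segment_gestures(heights: List[float],
                     delta: float = DELTA_CM) -> List[Tuple[int, int]]:
    """Split a palm-height sequence into (start, end) frame index pairs.

    Assumptions (not confirmed by the source):
    - index 0 is the first starting point s;
    - tau = h(s) - delta is computed once and kept fixed;
    - a gesture ends at the first frame whose height is back at or
      above tau after the palm has dropped below tau;
    - the next gesture starts where the previous one ended.
    """
    segments: List[Tuple[int, int]] = []
    if not heights:
        return segments
    start = 0
    tau = heights[start] - delta
    went_below = False
    for t in range(1, len(heights)):
        if heights[t] < tau:
            went_below = True
        elif went_below:
            segments.append((start, t))
            start = t
            went_below = False
    return segments


# Hypothetical palm heights (cm) for two consecutive gestures.
example = [100, 80, 60, 80, 100, 80, 60, 80, 100]
print(segment_gestures(example))
```

Because the threshold is derived from the conductor's own starting height, the same sketch applies regardless of how tall the conductor is, which matches the relative-position argument in the text.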

Tempo Recognition
Depending on the musical score, each measure may or may not have the same tempo, and the conductor is responsible for keeping everyone on the same rhythm. As mentioned earlier, there are three basic gestures that a conductor can make during a performance, but the speed at which each gesture is made is the most important part of keeping the whole orchestra or choir together. The tempo is universally measured in beats per minute (BPM).
After the segmentation of the gesture in the previous section, we can begin to analyze the data and determine the speed of the gesture. To determine the BPM of the gesture, we first need to know the number of beats made in the gesture and the exact time when each beat was made. We can define this beat as h(b). Once we have this information, we can determine the speed of the gesture and calculate the BPM. Our equation for determining the beats in the gesture is as follows: where b represents the beat, and t and s are the time points with units in seconds. We want to find b that meets the conditions that the height at time t is greater than that at time t − 1, and the height at time t − 1 is less than that at time t − 2. All values of t are greater than the starting time, s.
Once we know where the beats are, we can calculate the BPM of the musical gesture.
After collecting the data for the beats, we can define the BPM of the gesture using the following equation: The output of the algorithm gives the BPM of the gesture, and this works for any gesture irrespective of the number of beats. Capturing the tempo allows conductors to see whether they are making the gesture at the correct speed. The information can also be sent to the musicians if they are looking at a digital music sheet, so they can have a reminder of what speed to play at.
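The beat condition and the tempo calculation can be illustrated with a short Python sketch. The beat test follows the condition stated in the text (the height at t is greater than at t − 1, and the height at t − 1 is less than at t − 2, with t later than s). Everything else is an assumption made for illustration: the beat is placed at frame t − 1, where the palm is lowest, and the BPM is taken as 60 times the number of beats divided by the gesture duration in seconds at 30 frames per second. The function names `find_beats` and `estimate_bpm` are hypothetical.

```python
from typing import List


def find_beats(heights: List[float], start: int = 0) -> List[int]:
    """Return frame indices of beats within one segmented gesture.

    A beat is detected when h(t) > h(t-1) and h(t-1) < h(t-2), i.e. at
    a local minimum of palm height. Assumption: the beat is reported
    at frame t-1, the lowest point of the stroke.
    """
    beats: List[int] = []
    for t in range(start + 2, len(heights)):
        if heights[t] > heights[t - 1] and heights[t - 1] < heights[t - 2]:
            beats.append(t - 1)
    return beats


def estimate_bpm(num_beats: int, start: int, end: int,
                 fps: float = 30.0) -> float:
    """Estimate tempo as beats per minute over frames [start, end].

    Assumption: BPM = 60 * beats / duration_in_seconds; the source's
    own BPM equation is not reproduced in the text.
    """
    duration_s = (end - start) / fps
    if duration_s <= 0:
        raise ValueError("end must be later than start")
    return 60.0 * num_beats / duration_s


# Hypothetical duple gesture: two downward strokes.
heights = [100, 50, 100, 50, 100]
beats = find_beats(heights)
print(beats)
# Two beats spread over 60 frames (2 s at 30 fps).
print(estimate_bpm(len(beats), start=0, end=60))
```

Reporting the tempo per segmented gesture, rather than over the whole recording, is what lets the value follow tempo changes from one measure to the next.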

Dynamic Time Warping (DTW) Classifier
DTW is an algorithm used for measuring the similarity between two temporal sequences that may vary in speed [31]. DTW was originally designed for speech recognition and has been applied to temporal sequences in audio, video, and graphical data [32]. It can be used to analyze any data in a linear sequence. To implement DTW, two time sequences are needed: the original gesture template and the conducting gesture to recognize. DTW takes these two sequences and then calculates an optimal match between the two [33]. The gesture with the best match, in other words, the one with the highest similarity, is the recognized gesture. In this section, we visualize the implementation of DTW. In the visualization, five averaged template samples were used. Figure 8 illustrates the averaged samples in orange and a sample input from the user in blue. A difference was observed between the two, and it was possible to match them correctly by using DTW.
To determine the degree of match, we used a distance matrix (D) to measure the two time sequences. Using the aforementioned example, we took the data and used it to create a distance matrix by employing the DTW algorithm in the following equation: where h_t(i) is the time sequence of the known musical gesture of a template, shown on the horizontal axis in Figure 9, and h_u(j) is the time sequence of an input gesture from the user, shown on the vertical axis in Figure 9. This equation gives a distance matrix between the two time sequences. We can find the warping path between these two sequences by starting from (h_t(1), h_u(1)) and determining the minimum among the distances to the right, below, and diagonally. If the warping path goes diagonally, it is counted as a match. Otherwise, it is just counted as another step in the warping path. To calculate the percentage of accuracy from the template matching, we count the number of matches in the DTW algorithm and compare it with the number of matches that a perfect gesture would have. If the match percentage is over the threshold, the gesture is considered a match. For our system, we set the threshold to 60%. We took the previous example and input the data into the algorithm. The distance matrix and the process of finding an optimal warping path are shown in Figure 9.
This warping path shows where the points of the two gestures are connected and how the path warps according to the distances between the two points. For example, in Figure 9, if there is no point match, the path will move horizontally until the end; otherwise, it will make turns by going down one step. In short, when the path goes down one step, it means that there is a point match. In our system, if the gesture made by the user had more than a 60% match with the template gesture, it was counted as a match and recognized as such. In Figure 10, we visualize the warping path by graphing the points and connecting them to each other where they match.
In this case, the total number of matches in the aforementioned example was 40 out of 60, which provided an accuracy above the threshold. Our system takes each user's gesture and compares it, using DTW, with every template available and records the gesture with the highest accuracy as the output.
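The matching procedure can be sketched in Python. This is the textbook DTW formulation (accumulated cost matrix with backtracking from the end), not necessarily the authors' exact implementation, whose equation is not reproduced in the text. The following parts are assumptions made for illustration: the local distance is the absolute height difference, diagonal steps of the warping path are counted as matches, a perfect gesture is taken to produce `len(template) - 1` diagonal steps, and the names `dtw_path`, `match_ratio`, and `classify` are hypothetical. Only the 60% threshold comes from the text.

```python
from typing import Dict, List, Optional, Sequence, Tuple

Path = List[Tuple[int, int]]


def dtw_path(template: Sequence[float],
             query: Sequence[float]) -> Tuple[float, Path]:
    """Textbook DTW: return (total cost, warping path) for 1-D sequences."""
    n, m = len(template), len(query)
    if n == 0 or m == 0:
        raise ValueError("both sequences must be non-empty")
    inf = float("inf")
    # acc[i][j]: cost of the best alignment of template[:i] and query[:j]
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(template[i - 1] - query[j - 1])
            acc[i][j] = d + min(acc[i - 1][j - 1],
                                acc[i - 1][j],
                                acc[i][j - 1])
    # Backtrack from the last cell; the diagonal is listed first so
    # that it wins ties.
    i, j = n, m
    path: Path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        candidates = [(acc[i - 1][j - 1], i - 1, j - 1),
                      (acc[i - 1][j], i - 1, j),
                      (acc[i][j - 1], i, j - 1)]
        _, i, j = min(candidates, key=lambda c: c[0])
        path.append((i - 1, j - 1))
    path.reverse()
    return acc[n][m], path


def match_ratio(path: Path, template_len: int) -> float:
    """Fraction of diagonal steps relative to a perfect alignment."""
    diagonal = sum(1 for (a, b), (c, d) in zip(path, path[1:])
                   if c == a + 1 and d == b + 1)
    return diagonal / max(template_len - 1, 1)


def classify(query: Sequence[float],
             templates: Dict[str, Sequence[float]],
             threshold: float = 0.6) -> Optional[str]:
    """Return the best-matching template label, or None below threshold."""
    best_label, best_ratio = None, 0.0
    for label, template in templates.items():
        _, path = dtw_path(template, query)
        ratio = match_ratio(path, len(template))
        if ratio > best_ratio:
            best_label, best_ratio = label, ratio
    return best_label if best_ratio > threshold else None


# Hypothetical height templates, for illustration only.
templates = {"duple": [0, 1, 2, 3, 4], "triple": [4, 0, 4, 0, 4]}
print(classify([0, 1, 2, 3, 4], templates))
```

Comparing the input with every template and keeping the highest ratio mirrors the selection rule in the text, and returning `None` below the threshold leaves room for non-tempo gestures such as page turns.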

Experimental Setup
Our system environment was based on Microsoft Visual Studio C++ 2010 and Kinect for Windows SDK 1.7. We also used the Xbox 360 Kinect depth camera to film and record our depth videos. The camera filmed at around 30 frames per second and the size of the videos captured was approximately 640 × 640 pixels. Our system was run on a personal computer with an Intel® Core™ i5-4460 CPU @ 3.20 GHz and 8.00 GB of RAM. Table 1 lists the hardware and software used in the experiment.

Database Setup
We developed a database of three gestures made by four different music teachers and used their gestures to create a template for comparison in our system. In our experiment, the users made the gestures in a continuous motion, one after another, and the algorithm recognized each gesture as it was being made. A total of 5600 gestures were made. There were 1750 gestures for each basic music gesture and 350 for other gestures that may appear in a performance but are not tempo gestures, such as turning the page or pointing at a section of the orchestra. These gestures were made at a combination of different BPM and while facing the camera at different angles. We also gestured to five different BPM for the same interval of time to find the mean squared error and to show the robustness of our system. Finally, we gestured along to three complete songs. Our experimental results are separated into two parts: the accuracy of the recognition of single musical gestures and the accuracy of the recognition of continuous musical gestures. All our experimental videos were filmed indoors. In Figures 11-13, we show the three basic gestures being made at different speeds.
different BPM for the same interval of time to find the mean squared error and to show the robustness of our system.Finally, we gestured along to three complete songs.Our experimental results are separated into two parts-the accuracy of the recognition of single musical gestures and the accuracy of the recognition of continuous musical gestures.All our experimental videos were filmed indoors.In Figures 11-13, we show the three basic gestures being made at different speeds.

Results of Single Musical Gesture Recognition
In our experiment, we used the template shown in Section 3.2 to test and verify the gesture recognition results. Five people used the real-time system to perform the dynamic gesture recognition. The users had to stand in front of the camera with nothing in between so that the camera could accurately detect where the palms were. The background of the footage and the clothing of the user did not affect the results, because we used a depth camera. Only one user's gestures can be detected at a time, and our system detects each of the user's palms separately. To verify the result, we used half of the data to train the model and the other half to verify the accuracy of the model. The accuracy of the musical gesture recognition is defined as in [34].

The original images, skeleton images, and depth maps of the frames of the duple, triple, and quad musical conducting gestures are shown in Figures 14–16, respectively. Each user performed 350 gestures for each musical gesture and 70 gestures for the other category. The accuracy of the experiment is presented in Table 2.
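The half-and-half verification and the accuracy measure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: accuracy is assumed to be the fraction of verification gestures whose predicted label matches the true label, and the data are assumed to be split by alternating samples. The helper names `split_half` and `accuracy` are hypothetical, and the labels are toy values.

```python
from typing import List, Sequence, Tuple


def split_half(samples: Sequence) -> Tuple[List, List]:
    """Alternate samples between a training half and a verification half.

    Assumption: the paper does not say how the two halves were chosen.
    """
    return list(samples[0::2]), list(samples[1::2])


def accuracy(true_labels: Sequence[str], predicted_labels: Sequence[str]) -> float:
    """Fraction of gestures whose predicted label equals the true label.

    Assumption: this is the usual definition; the exact form in [34] may differ.
    """
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label sequences must have the same length")
    if not true_labels:
        raise ValueError("at least one gesture is required")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


# Toy labels only; these are not the paper's data.
truth = ["duple", "triple", "quad", "other"]
guess = ["duple", "triple", "duple", "other"]
print(accuracy(truth, guess))  # 3 of 4 correct -> 0.75
```

Keeping the "other" category in the label set means that non-tempo gestures, such as turning a page, count as errors when they are mistaken for a tempo gesture.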
Conducting gestures at different speeds affects the accuracy. Table 3 shows the conducting gesture recognition results at different conducting speeds. In the experiment, we used a metronome to help the user stay at a constant BPM. The metronome allowed us to determine the exact tempo that the user should be gesturing at, which made the experiment more stable. There were still some misclassifications in our experiment, and one of the main reasons is that the speed was too high for the camera to capture the gesture clearly. Our camera could only capture 30 frames per second, which made it difficult to capture a gesture made at 200 BPM, which is approximately 3.33 beats per second. In addition, because the gesture was made so quickly, the movement was much slower to compensate for the quickness, which made it difficult for our system to clearly capture the hand gesture.

Table 4 shows the total accuracy at each speed. Below 150 BPM, the classification accuracy was 100%, which is ideal for real-world use. However, the accuracy started to drop at 150 BPM, and at 200 BPM it dropped dramatically. As mentioned before, the main reason for the drop in accuracy was that the camera filmed at 30 frames per second, so the resolution of a gesture decreased as the BPM increased. Fortunately, in real-life applications, very few musical pieces actually go up to 200 BPM, and thus our proposed system is usable on most occasions.
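The frame-rate argument above can be made concrete with a short calculation. The sketch below simply divides the number of frames per minute by the BPM; the function name `frames_per_beat` is hypothetical, and the three tempos are the ones named in the text.

```python
def frames_per_beat(bpm: float, fps: float = 30.0) -> float:
    """Number of camera frames available to sample a single beat."""
    if bpm <= 0 or fps <= 0:
        raise ValueError("bpm and fps must be positive")
    # fps * 60 is the number of frames per minute; bpm is beats per minute.
    return fps * 60.0 / bpm


for bpm in (60, 150, 200):
    print(bpm, frames_per_beat(bpm))
# 60 30.0
# 150 12.0
# 200 9.0
```

At 200 BPM each beat is sampled by only nine frames, compared with 30 frames at 60 BPM, which is consistent with the loss of gesture resolution at high tempos.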
Our experiment comprised conducting gesture recognition not only at different speeds but also at different angles. Our system can accurately recognize the gesture as long as the palms are clearly captured by the depth camera, because it recognizes the gestures based on the vertical movement of the palms. The accuracy of our musical conducting recognition system at different viewing angles is shown in Table 5. At 90°, a part of the conductor's body was blocked, so the accuracy was slightly lower. Our method can recognize three basic gestures made at 60–200 BPM and can recognize the gestures at any angle within 45° upward or downward and within 90° to the left or right.

Results of Continuous Musical Gesture Recognition
In the previous experiments, the gestures were separated from each other and only one gesture was input at a time. In this experiment, however, we used actual gestures from a performance, where all the gestures were connected and continuous. The experiment on continuous gestures was divided into two parts. In the first part, we used continuous gestures at five different BPM and calculated the mean squared error between the computed BPM and the actual ones. Table 6 lists the mean squared errors for the gestures at different BPM. The mean squared error shows the average error that the user makes when making the gestures.
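The BPM comparison described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' procedure: each beat is assumed to be located at a frame index, the BPM is assumed to be computed from the interval between successive beats at 30 frames per second, and the error is taken against a single target tempo. The function names and the beat positions are hypothetical toy values.

```python
from typing import List, Sequence


def bpm_from_beat_frames(beat_frames: Sequence[int], fps: float = 30.0) -> List[float]:
    """Convert frame indices of successive beats into per-interval BPM values."""
    bpms = []
    for prev, curr in zip(beat_frames, beat_frames[1:]):
        interval = curr - prev  # frames between two consecutive beats
        if interval <= 0:
            raise ValueError("beat frames must be strictly increasing")
        bpms.append(60.0 * fps / interval)
    return bpms


def mean_squared_error(estimated: Sequence[float], target_bpm: float) -> float:
    """Average squared difference between estimated BPM values and the target."""
    if not estimated:
        raise ValueError("at least one BPM estimate is required")
    return sum((e - target_bpm) ** 2 for e in estimated) / len(estimated)


# Toy beat positions (in frames) for a user aiming at 120 BPM.
beats = [0, 15, 30, 46, 60]
estimates = bpm_from_beat_frames(beats)
print(estimates)
print(mean_squared_error(estimates, 120.0))
```

Because a one-frame timing error changes the estimate more at short intervals, the same jitter yields a larger squared error at higher tempos.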
In the second part of our experiment, we chose three musical pieces for the actual performance. Unlike in the previous experiments, the musical conductors changed their gesture styles in different parts of the music to convey its emotion. Table 7 shows the results of gesturing along with the music. The experimental results show that our system can recognize the gestures accurately with variable BPM.

Comparison of Existing Methods and Our Method
As seen from our experimental results, we achieved a total accuracy of 89.17%. Our experiment comprised five tempo ranges, three gestures, and seven filming angles. Our method can also be applied to a complete song with very few errors.
Our experimentation is more complete than that of similar methods for recognizing a musical gesture. Being a musical conductor means moving around and turning toward different musicians at different times, and our method allows a range of 180°. It also allows the camera to be above or below the conductor, which is more useful in the real world. For example, if a camera is placed directly in front of the conductor, it may block some musicians and will not be aesthetically pleasing for the audience. However, if a camera is hidden behind the musicians or above the conductor, it will not block the musicians or the conductor, and the audience will not be able to see it.
We compared our method with the methods reported by three other studies that are closely related to ours. The first one is by H. Je, J. Kim, and D. Kim, who used a depth camera to recognize some musical gestures through feature points [5]. The second one is by N. Kawarazaki, Y. Kaneishi, N. Saito, and T. Asakawa [35], who used a depth camera and an inertial sensor to detect the conductor's movements and where the conductor was facing. The final method is by S. Cosentino, Y. Sugita, and M. Zecca [36], who used an inertial glove that could transmit signals to a robot to let the robot know what gesture the conductor was making. Tables 8 and 9 show the comparisons of the related work with our method. Although our method did not reach the highest accuracy among the compared methods, we experimented with more gestures, angles, and tempos than the method with the highest accuracy. In addition, we only used a depth camera for our experiments, which is cost-effective for the results we achieved.

Conclusions
This paper proposes a musical conducting recognition system that can classify three basic musical conducting gestures and recognize these gestures up to 150 BPM. We used the Kinect to capture the depth map of a musical conductor and formed a skeleton model with palm data points. The time-varying palm data points formed the input sequence. The value of the data points was relative to the musical conductor's height, so the system was adaptive to the height of most users. Our proposed method is based on the DTW algorithm, which is a dynamic programming method for sequential data analysis. Combined with the Kinect depth camera and the Kinect SDK, the system can capture and track the movement of users' palms and classify the gestures. Before classification, the continuous gestures are segmented into separate ones. Additionally, the BPM of each gesture is calculated to help users practice the gestures. Once the BPM of the gesture is determined, the gesture can be identified and classified. We collected the musical gestures of three musical professionals and combined their gestures to create a template for comparison in our DTW algorithm.
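The pipeline summarized above can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the authors' implementation: palm heights are assumed to be normalized by dividing by the conductor's height, the DTW recurrence is the textbook one with the absolute difference as the local cost, and a segment is assumed to take the label of its nearest template. The function names and the toy templates are hypothetical.

```python
from typing import Dict, List, Sequence


def normalize_by_height(palm_heights: Sequence[float], body_height: float) -> List[float]:
    """Express palm heights relative to the conductor's height (assumed scheme)."""
    if body_height <= 0:
        raise ValueError("body_height must be positive")
    return [h / body_height for h in palm_heights]


def dtw_distance(template: Sequence[float], user: Sequence[float]) -> float:
    """Textbook DTW cost between two 1-D sequences using |a - b| as local cost."""
    n, m = len(template), len(user)
    if n == 0 or m == 0:
        raise ValueError("sequences must be non-empty")
    inf = float("inf")
    cost = [[inf] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            local = abs(template[i] - user[j])
            if i == 0 and j == 0:
                best_prev = 0.0  # initial condition: D(1, 1) = |t(1) - u(1)|
            else:
                best_prev = min(
                    cost[i - 1][j] if i > 0 else inf,
                    cost[i][j - 1] if j > 0 else inf,
                    cost[i - 1][j - 1] if i > 0 and j > 0 else inf,
                )
            cost[i][j] = local + best_prev
    return cost[n - 1][m - 1]


def classify(segment: Sequence[float], templates: Dict[str, Sequence[float]]) -> str:
    """Return the label of the template with the smallest DTW cost."""
    return min(templates, key=lambda name: dtw_distance(templates[name], segment))


# Toy templates with invented shapes; real templates come from recorded gestures.
toy_templates = {
    "duple": [0.0, 1.0, 0.0],
    "triple": [0.0, 1.0, 0.5, 1.0, 0.0],
}
slow_duple = [0.0, 0.0, 1.0, 1.0, 1.0, 0.0]  # time-stretched copy of the duple template
print(classify(slow_duple, toy_templates))  # duple
```

The time-stretched segment aligns with the duple template at zero cost, which illustrates why DTW tolerates the same gesture being made at different speeds.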
In the experiment, we used three different musical gestures of five users at five different speeds and seven different angles for evaluation. We collected all the results and achieved a total accuracy of 89.17%. Our system can recognize three different musical gestures at five different speeds and seven different angles, with the fastest speed being 200 BPM and the largest angle being 90°, at an overall frame rate of 30 frames per second (FPS).
In this study, we recognized the three most common gestures in musical conducting, which are the "duple", "triple", and "quad" gestures. For future work, we recommend adding more types of musical gestures for recognition. Also, the current tolerance angle between the musical conductor and the camera is 90° to the left or right and 45° upward or downward, which can be extended. Finally, in this experiment, the recognition often failed when a gesture was halted abruptly, which means that the robustness of the system to a sudden unexpected change in the behavior of musical conductors can be improved.

Figure 2. Diagram of our proposed method.

Figure 3. Three inputs for our proposed system: (a) original input, (b) skeleton image, and (c) depth image with the target body marked in green.

Figure 4. Data points of the continuous gestures captured from tracking users' palms: (a) duple, (b) triple, and (c) quad.

Figure 5. Simplified graph of the height of the palm position during a duple gesture.

Figure 6. Visualization of a simplified graph of the palm position during a duple gesture after performing segmentation. Figure 6 shows how a continuous gesture is segmented into six different sections, and how each section is individually processed and classified based on the information between the starting and end points. Figure 7 shows one of the segmentations from Figure 6.

Figure 9, and i and j are the numbers of frames in each sequence. This equation has an initial condition of $D(h_t(1), h_u(1)) = |h_t(1) - h_u(1)|$, with the minimum being zero.

Figure 8. Both the template and user data graphed together, where the blue curve indicates the template gesture and the orange curve indicates the user gesture.

Figure 9. Dynamic warping path marked in red between the template data, marked in blue, and the user data, marked in orange.

Figure 10. Visual representation of dynamic time warping (DTW) with the optimal warp, where the blue line represents the template and the orange line represents the users' gesture: (a) duple gesture, (b) triple gesture, and (c) quad gesture.

Figure 14. Screenshots of a single musical conducting gesture (duple) being made while facing the camera, including the 2D video, a skeleton model of the musical conductor, and the depth video.

Figure 15. Screenshots of a single musical conducting gesture (triple) being made while facing the camera, including the 2D video, a skeleton model of the musical conductor, and the depth video.

Figure 16. Screenshots of a single musical conducting gesture (quad) being made while facing the camera, including the 2D video, a skeleton model of the musical conductor, and the depth video.

Table 1. Hardware and software used in our system.

Table 2. Accuracy of musical conducting gesture recognition.

Table 3. Accuracy of musical conducting gesture recognition classified by BPM.

Table 4. Accuracy of musical gestures classified by BPM.

Table 5. Accuracy of musical conducting gesture recognition at different viewing angles.

Table 6. Mean squared error results.

Table 7. Accuracy of musical conducting gesture recognition in a whole musical piece.

Table 8. Comparisons of related work with our method.

Table 9. Accuracy comparison of related work with our method.