Classification of K-Pop Dance Movements Based on Skeleton Information Obtained by a Kinect Sensor

This paper proposes a method of classifying Korean pop (K-pop) dances based on human skeletal motion data obtained from a Kinect sensor in a motion-capture studio environment. To accomplish this, we construct a K-pop dance database with a total of 800 dance-movement data points covering 200 dance types performed by four professional dancers, from skeletal joint data obtained by a Kinect sensor. Our classification of movements consists of three main steps. First, we obtain six core angles representing important motion features from 25 markers in each frame. These angles are concatenated into a feature vector over all of the frames of each point dance. Then, dimensionality reduction is performed with a combination of principal component analysis (PCA) and Fisher's linear discriminant analysis (LDA), which we call fisherdance. Finally, we design an efficient Rectified Linear Unit (ReLU)-based Extreme Learning Machine Classifier (ELMC) whose input layer receives the feature vectors transformed by fisherdance. In contrast to conventional neural networks, the presented classifier achieves a rapid processing time without iterative weight learning. Experiments conducted on the constructed K-pop dance database reveal that the proposed method achieves better classification performance than conventional methods such as KNN (K-Nearest Neighbor), SVM (Support Vector Machine), and ELM alone.


Introduction
The past decade has witnessed rapid growth in the number of motion capture applications, ranging from sports sciences and motion analysis to motion-based video games and movies [1][2][3][4][5]. Generally defined, motion capture is the process of recording the movements of humans. It refers to recording the actions of human actors and using that information to animate digital character models in 2D or 3D computer animation sequences. Recently, we have also witnessed the popularity of Korean pop (K-pop) music spread throughout the world. K-pop is a musical genre originating from South Korea that is characterized by a wide variety of audiovisual elements. Although it includes all genres of popular music in South Korea, the term is more often used in a narrower sense to describe a modern form of South Korean pop music covering a range of styles including dance-pop, pop ballads, electro-pop, rock, jazz, and hip-hop. One possible reason that K-pop has become so popular globally is that aspiring dancers may feel inclined to view skilled young K-pop dancers as role models and to copy their dance styles. This can lead to plagiarism issues in both dance and music, which is our main motivation for classifying K-pop dance movements for the development of both video-based retrieval systems and dance training systems.
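The proposed feature extraction proceeds in three stages. First, concatenated vectors are produced from six important angles specifying K-pop dance movements. Next, PCA is performed by projecting the high-dimensional vectors into a lower-dimensional space. Finally, feature vectors with discriminating capabilities are obtained by LDA.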

Generating Concatenated Vectors
In the first stage of our analysis, concatenated vectors are generated. Figure 1 illustrates the six core angles that distinguish each dance movement. As shown in Figure 1, these angles are related to the positions of both elbows, both knees, and both shoulders. Figure 2 illustrates an angle between two joints, which is calculated from the positions of the joints involved. The total concatenated angles are generated by connecting these values within each frame, as shown in Figure 3. In general, the frame lengths of dance movements differ according to the dance type. To solve this problem, we apply zero padding to set all frame sequences to the same size.
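The angle equations themselves did not survive in this text, so the following is only a minimal sketch of one common way to compute a joint angle, assuming the angle is measured at a middle joint between the two segments that meet there (arccosine of the normalized dot product). The function name `joint_angle` and the sample coordinates are hypothetical and are not taken from the paper.

```python
import numpy as np

def joint_angle(a, b, c):
    """Angle in degrees at joint b, formed by the segments b->a and b->c."""
    a, b, c = (np.asarray(p, dtype=float) for p in (a, b, c))
    u, v = a - b, c - b
    cos_theta = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    # Clipping guards against round-off pushing the cosine outside [-1, 1].
    return np.degrees(np.arccos(np.clip(cos_theta, -1.0, 1.0)))

# Hypothetical 3D positions (in metres) for a shoulder, elbow, and wrist.
shoulder = [0.0, 1.4, 2.0]
elbow = [0.0, 1.1, 2.0]
wrist = [0.3, 1.1, 2.0]
elbow_angle = joint_angle(shoulder, elbow, wrist)

# A fully extended limb gives a straight angle.
straight = joint_angle([0, 0, 0], [1, 0, 0], [2, 0, 0])
```

Evaluating six such angles per frame (both elbows, both knees, and both shoulders) would give the per-frame feature values that are then concatenated across frames.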

Combination of PCA and LDA for Dimensional Reduction
The method combining PCA and LDA for dimensional reduction is insensitive to large variations in movement. By maximizing the ratio of the between-class scatter to the within-class scatter, LDA produces well-separated dance movement categories in a low-dimensional subspace. In what follows, we briefly describe the method, which we refer to as "fisherdance" by analogy with the well-known fisherface method [19]. This method consists of the two steps shown in Figure 3. In the first step, PCA projects the concatenated vectors from the high-dimensional space into a lower-dimensional space. In the second step, LDA finds the optimal projection from a classification perspective, which is known as a class-specific method. Accordingly, we first project the K-pop dance movements into a lower-dimensional space using PCA, so that the resulting within-class scatter matrix is nonsingular, before computing the optimal projection with LDA.
We denote the training set of $N$ different dance movements as $Z = (z_1, z_2, \ldots, z_N)$ and define the covariance matrix as follows:
$$R = \frac{1}{N} \sum_{i=1}^{N} (z_i - \bar{z})(z_i - \bar{z})^T, \qquad \bar{z} = \frac{1}{N} \sum_{i=1}^{N} z_i,$$
where $z_i$ is the concatenated vector of a dance movement. Then, both the eigenvalues and eigenvectors of the covariance matrix $R$ are calculated. Let $E = (e_1, e_2, \ldots, e_r)$ contain the eigenvectors corresponding to the $r$ largest eigenvalues. For a set of original dance movements $Z$, the corresponding reduced feature vectors, $X = (x_1, x_2, \ldots, x_N)$, can be obtained by projecting $Z$ into the PCA-transformed space according to the following equation:
$$x_i = E^T (z_i - \bar{z}), \qquad i = 1, 2, \ldots, N.$$
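The PCA step can be sketched as follows. This is an illustrative sketch rather than the authors' implementation: it obtains the leading eigenvectors of the covariance matrix from a singular value decomposition of the centred data, which is mathematically equivalent to eigen-decomposing the covariance matrix but avoids forming it explicitly. The function names and the synthetic data are assumptions.

```python
import numpy as np

def pca_fit(Z, r):
    """Z: (N, D) array with one concatenated vector per row.
    Returns the sample mean and a (D, r) matrix E of leading eigenvectors."""
    z_mean = Z.mean(axis=0)
    Zc = Z - z_mean
    # Right singular vectors of the centred data are the eigenvectors of the
    # covariance matrix, ordered by decreasing eigenvalue.
    _, _, Vt = np.linalg.svd(Zc, full_matrices=False)
    return z_mean, Vt[:r].T

def pca_transform(Z, z_mean, E):
    """Project rows of Z into the PCA space: x_i = E^T (z_i - mean)."""
    return (Z - z_mean) @ E

# Synthetic stand-in for the concatenated dance vectors (not real data).
rng = np.random.default_rng(0)
Z = rng.normal(size=(20, 50))
z_mean, E = pca_fit(Z, r=5)
X = pca_transform(Z, z_mean, E)
```

Each row of `X` is the reduced feature vector of the corresponding dance movement; the value of `r` would be chosen empirically, as described in the experiments.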
The second step, which is based on the use of LDA, can be described as follows. Consider $c$ classes with $N$ samples in total. Let the between-class scatter matrix be defined as
$$S_B = \sum_{i=1}^{c} N_i (\mathbf{m}_i - \mathbf{m})(\mathbf{m}_i - \mathbf{m})^T,$$
where $N_i$ is the number of samples in the $i$th class $C_i$, $\mathbf{m}$ is the mean of all of the samples, and $\mathbf{m}_i$ is the mean of class $C_i$. The within-class scatter matrix is defined as
$$S_W = \sum_{i=1}^{c} \sum_{x_k \in C_i} (x_k - \mathbf{m}_i)(x_k - \mathbf{m}_i)^T = \sum_{i=1}^{c} S_{W_i},$$
where $S_{W_i}$ is the scatter (unnormalized covariance) matrix of class $C_i$. The optimal projection matrix, $W_{FLD}$, is obtained as the matrix with orthonormal columns that maximizes the ratio of the determinant of the between-class scatter matrix of the projected samples to the determinant of their within-class scatter matrix, as in the following expression:
$$W_{FLD} = \arg\max_{W} \frac{|W^T S_B W|}{|W^T S_W W|} = [w_1 \; w_2 \; \cdots \; w_m],$$
where $\{w_i \mid i = 1, 2, \ldots, m\}$ is the set of generalized discriminant vectors of $S_B$ and $S_W$ corresponding to the $m \le c - 1$ largest generalized eigenvalues $\{\lambda_i \mid i = 1, 2, \ldots, m\}$, i.e.,
$$S_B w_i = \lambda_i S_W w_i, \qquad i = 1, 2, \ldots, m.$$
Thus, the feature vectors $V = (v_1, v_2, \ldots, v_N)$ for any dance movement $z_i$ can be calculated as follows:
$$v_i = W_{FLD}^T x_i.$$
To complete the classification of a new dance pattern $z'$, we compute the distance between $z'$ and the patterns in the training set and select the training pattern $z$ such that
$$d(z, z') = \min_{j} \| v_j - v' \|.$$
The measure $d(z, z')$ is defined as the distance between the training dance movement $z$ and a given movement $z'$ in the test set. Note that this distance is computed from $v$ and $v'$, which are the LDA-transformed feature vectors of dance movements $z$ and $z'$, respectively. While the distance function $\|\cdot\|$ can be broadly interpreted, we usually confine ourselves to the Euclidean distance.
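A minimal sketch of the LDA step and the nearest-neighbor decision rule is given below. It is an illustration under assumptions rather than the authors' code: the generalized eigenproblem is solved by applying a standard eigen-solver to $S_W^{-1} S_B$, which requires $S_W$ to be nonsingular (the reason PCA is applied first), and the function names and synthetic data are hypothetical.

```python
import numpy as np

def lda_fit(X, y, m):
    """X: (N, d) PCA-reduced features, y: (N,) class labels.
    Returns a (d, m) matrix whose columns are the leading discriminant vectors."""
    d = X.shape[1]
    mean_all = X.mean(axis=0)
    S_B = np.zeros((d, d))
    S_W = np.zeros((d, d))
    for label in np.unique(y):
        Xc = X[y == label]
        mean_c = Xc.mean(axis=0)
        diff = (mean_c - mean_all)[:, None]
        S_B += len(Xc) * (diff @ diff.T)          # between-class scatter
        S_W += (Xc - mean_c).T @ (Xc - mean_c)    # within-class scatter
    # S_B w = lambda S_W w, rewritten as (S_W^{-1} S_B) w = lambda w.
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_W, S_B))
    order = np.argsort(eigvals.real)[::-1][:m]
    return eigvecs[:, order].real

def nearest_neighbor(V_train, y_train, v):
    """Label of the training vector closest to v in Euclidean distance."""
    return y_train[np.argmin(np.linalg.norm(V_train - v, axis=1))]

# Synthetic three-class example (not real dance data).
rng = np.random.default_rng(0)
centers = np.array([[0, 0, 0, 0], [5, 0, 0, 0], [0, 5, 0, 0]], dtype=float)
X = np.vstack([c + 0.3 * rng.normal(size=(10, 4)) for c in centers])
y = np.repeat(np.arange(3), 10)
W = lda_fit(X, y, m=2)
V = X @ W                       # rows are v_i = W^T x_i
pred = nearest_neighbor(V, y, centers[1] @ W)
```

With three classes, at most two discriminant vectors carry class information, which mirrors the $c - 1$ bound stated above.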

Design of ReLU-Based ELMC
In this section, we design the ReLU-based ELMC based on the feature vectors obtained by PCA and LDA. This classifier combines a simple tuning-free network with a fast learning speed. Unlike in conventional learning theories, the hidden node parameters of an ELM are independent of the training data. Although hidden nodes are important and critical, they generally do not need to be tuned.

ELMC
Most studies on neural networks are based on conventional learning theories, in which hidden nodes are adjusted through learning. Many researchers have performed intensive research on developing good learning methods over the past few decades. In contrast to conventional neural networks, we develop an ELMC with real-time learning and high classification ability for classifying dance movements. Figure 3 shows the architecture of the ELMC. The hidden nodes are generated randomly, where $a_i$ and $b_i$ are the weight and the bias between the input and hidden layers, respectively. Although we do not know the true output functions of biological neurons, most of them are nonlinear piecewise continuous functions covered by ELM theories. The output function of a generalized single-layer feedforward network is expressed as
$$f_L(x) = \sum_{i=1}^{L} \beta_i G(a_i, b_i, x) = h(x)\beta.$$
The output function of the hidden layer mapping is as follows:
$$h(x) = [G(a_1, b_1, x), \ldots, G(a_L, b_L, x)].$$
The output functions of hidden nodes can take various forms. Many different types of networks exist, including sigmoid networks, radial basis function (RBF) networks, polynomial networks, complex networks, Fourier series networks, and wavelet networks, some of which are represented by:
$$\text{Sigmoid: } G(a, b, x) = \frac{1}{1 + \exp(-(a \cdot x + b))}, \qquad \text{RBF: } G(a, b, x) = \exp(-b \|x - a\|^2), \qquad \text{Fourier series: } G(a, b, x) = \cos(a \cdot x + b),$$
where conventional random projection is just a specific case of ELM random feature mapping when an additive linear hidden node is used. This not only proves the existence of the networks but also provides learning solutions. In this paper, we use the ReLU activation function, which is utilized effectively in convolutional neural networks and is given as follows:
$$f(x) = \max(0, x),$$
where $x$ is the input to a neuron. In contrast to the sigmoid function, the major advantage of the ReLU function is that it alleviates the vanishing gradient problem in neural network design. Furthermore, the constant gradient of the ReLU function results in faster learning.
Given a training set $\{(x_j, t_j)\}_{j=1}^{N}$, the hidden node output function $G(a, b, x)$, and the number of hidden nodes $L$, the ELM determines both the hidden node parameters and the output weights using the following three steps:
[Step 1] Assign the hidden node parameters $(a_i, b_i)$, $i = 1, 2, \ldots, L$, randomly.
[Step 2] Calculate the hidden layer output matrix $H = \begin{bmatrix} h(x_1) \\ \vdots \\ h(x_N) \end{bmatrix}$.
[Step 3] Calculate the output weights $\beta$ using the least-squares estimate $\beta = H^{\dagger} T$, where $H^{\dagger}$ is the Moore-Penrose generalized inverse of $H$ and $T = [t_1, \ldots, t_N]^T$ is the target matrix.
The ELM has several notable properties. First, the hidden layer does not need to be tuned. Second, the hidden layer mapping $h(x)$ satisfies universal approximation conditions. Third, the parameters of the ELM are obtained by minimizing both the training error $\|H\beta - T\|^2$ and the norm of the output weights $\|\beta\|$, so that the ELM satisfies both ridge regression theory and neural network generalization theory. Finally, it fills the gaps and builds bridges among neural networks, SVMs, random projections, Fourier series, matrix theories, and linear systems. Figure 4 shows the point-dance classification process flow, comprising angle calculation between joints, frame normalization, dimensional reduction, and ELM classification.
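The three training steps can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes standard-normal random input weights and biases, one-hot class targets, and an unregularized pseudo-inverse solution. The function names and the synthetic two-class data are hypothetical.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def elm_train(X, T, L, rng):
    """X: (N, d) inputs, T: (N, c) one-hot targets, L: number of hidden nodes."""
    a = rng.normal(size=(X.shape[1], L))   # Step 1: random input weights
    b = rng.normal(size=L)                 #         and random biases
    H = relu(X @ a + b)                    # Step 2: hidden layer output matrix
    beta = np.linalg.pinv(H) @ T           # Step 3: least-squares output weights
    return a, b, beta

def elm_predict(X, a, b, beta):
    """Predicted class index for each row of X."""
    return np.argmax(relu(X @ a + b) @ beta, axis=1)

# Synthetic, well-separated two-class data (not real dance features).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=2.0, scale=0.3, size=(20, 2)),
               rng.normal(loc=-2.0, scale=0.3, size=(20, 2))])
labels = np.repeat([0, 1], 20)
T = np.eye(2)[labels]
a, b, beta = elm_train(X, T, L=20, rng=rng)
accuracy = np.mean(elm_predict(X, a, b, beta) == labels)
```

Because the hidden parameters are never updated, training reduces to a single pseudo-inverse computation, which is the source of the fast learning speed discussed above.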

Experimental Results
This section reports on a comprehensive set of comparative experiments performed to evaluate the performance of the proposed approach.

Construction of K-Pop Dance Database
A K-pop dance database was constructed containing 200 point-dance movements from each of four professional dancers (two men and two women), obtained by a motion capture system that produced skeletal forms. Thus, there were 800 dance-movement data points in total. To construct this database, we recorded the skeletal information of these point-dances using a Kinect v2 sensor. The point-dances included in the K-pop dance database were composed of movements lasting for 4-9 s, and 25 skeletal joints were considered. Among these joints, we selected 13 to obtain six core angles. The shortest and longest dance movements captured contained 147 and 276 frames, respectively. As mentioned in the previous section, we used a zero-padding method to produce frame sequences of the same size. Zero padding padded the concatenated vector with zeros on both sides. Thus, the size of the resultant vector for a point-dance motion was 6 × 276 = 1656 elements.
We performed two different experiments. In the first experiment, the 800 total dance movements were divided into training and test sets of 400 movements each (one man and one woman per set). The total size of the training data set was 400 × 1656 elements. Here we used the data sequences showing the best results. In the second experiment, we performed 4-fold cross validation to test whether the algorithm was independent of the dancer. Here we obtained the average rate of the four classification results. Furthermore, we also performed experiments using the normalized coordinates of the shoulder, elbow, and knee joints. Figure 5 shows the environment of database construction using a Kinect camera. Figure 6 illustrates three examples of dance movements with sequential images.
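The frame normalization can be sketched as follows. The text states only that zeros are added on both sides, so this sketch assumes the padding is split as evenly as possible between the start and the end of the sequence and that the angles are concatenated frame by frame; the function name `pad_and_concatenate` is hypothetical.

```python
import numpy as np

N_ANGLES = 6       # six core angles per frame
MAX_FRAMES = 276   # longest movement in the database

def pad_and_concatenate(angles, max_frames=MAX_FRAMES):
    """angles: (n_frames, 6) array of per-frame core angles.
    Returns a 1-D vector of length 6 * max_frames, zero-padded on both sides."""
    angles = np.asarray(angles, dtype=float)
    n_frames = angles.shape[0]
    if n_frames > max_frames:
        raise ValueError("sequence is longer than max_frames")
    missing = max_frames - n_frames
    before = missing // 2          # assumed: near-even split of the padding
    after = missing - before
    padded = np.pad(angles, ((before, after), (0, 0)))  # zero-filled by default
    return padded.reshape(-1)      # concatenate frame by frame

# The shortest movement (147 frames), filled with ones for illustration.
vector = pad_and_concatenate(np.ones((147, N_ANGLES)))
```

For the shortest movement, 129 zero frames are added in total, and every movement ends up as a vector of 6 × 276 = 1656 elements.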

Experiments and Results
In the first experiment, we compared the proposed method with conventional methods, namely KNN, SVM, and ELM alone. Figure 7 shows the right elbow and right knee angles, which were among the six angles representing a point-dance movement in each frame. After obtaining the concatenated vector, we selected the number of eigenvectors r that yielded the maximal recognition rate with the PCA method. Next, we determined the number of discriminant vectors m by increasing the number of features in the LDA method. As a result, we selected the 100 eigenvectors that corresponded to the maximum recognition rate. From the obtained eigenvectors, we determined that the use of 40 discriminant vectors provided the maximum recognition rate, as shown in Figure 8.
Figure 9 shows the variation in classification rates as the number of hidden nodes in the ReLU-based ELMC increases after the fisherdance method has been applied. We obtained a maximum classification rate of 96.5% with 120 hidden nodes. Table 1 compares the classification performance of the proposed method and the conventional methods. As listed in Table 1, the proposed method generally led to better classification results than the KNN, SVM, and ELM methods alone. Notably, the conventional ELM performed worse than the conventional machine learning methods. Figure 10 shows fisherdance images representing the discriminant vectors defined in Equation (9). Here we visualize 20 discriminant vectors with a size of 1656 × 20. Each discriminant vector is converted into an image with a 24 × 69-pixel array with gray levels ranging from 0 to 255.
In the second experiment, we performed 4-fold cross validation to test whether the proposed method is independent of the dancer. That is, we used four data sets, each consisting of the 200 dance movements performed by one professional dancer. Here, we also performed experiments using the normalized coordinates of the shoulder, elbow, and knee joints. Figure 11 visualizes the classification rates obtained by 4-fold cross validation. Table 2 lists the average rate of the four classification results for the 4-fold cross validation method. As shown in Figure 11 and Table 2, the proposed method performed well in comparison with the SVM, KNN, and ELM methods with sigmoid and hard-limit activation functions. Table 3 lists the average classification rates for the 4-fold cross validation method with normalized coordinates. The results indicated that the normalization method in this study did not perform as well as the general method without normalization.
Figure 11. Each classification rate obtained by 4-fold cross validation.
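The dancer-wise split used in the second experiment can be sketched as follows, assuming that each fold holds out all 200 movements of one dancer for testing and trains on the other three dancers. That reading is an interpretation of the description above, and the function name `dancer_folds` is hypothetical.

```python
import numpy as np

def dancer_folds(dancer_ids):
    """Yield (held_out_dancer, train_indices, test_indices) for each dancer."""
    dancer_ids = np.asarray(dancer_ids)
    for dancer in np.unique(dancer_ids):
        test_idx = np.flatnonzero(dancer_ids == dancer)
        train_idx = np.flatnonzero(dancer_ids != dancer)
        yield dancer, train_idx, test_idx

# Four dancers with 200 point-dance movements each (800 samples in total).
dancer_ids = np.repeat(np.arange(4), 200)
folds = list(dancer_folds(dancer_ids))
```

Averaging the classification rate over the four folds would then give a single dancer-independent figure of the kind reported in Table 2.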

Conclusions
We performed point-dance movement classification via a combination of the fisherdance method and the ReLU-based ELMC. Furthermore, we constructed the first K-pop dance database, with a total of 800 dance movements covering 200 dance types obtained from four professional dancers by a Kinect sensor. The experimental results revealed that the proposed approach performed well in comparison with the methods used in previous works, including KNN, SVM, and ELM alone. The results also confirmed that the feature extraction of the concatenated vectors, the dimensional reduction performed by fisherdance, and the design of the proposed classifier were able to classify point-dance movements successfully. These results led us to the conclusion that the proposed method can be used effectively for various applications, such as dance plagiarism identification, dance training systems, and dance retrieval. In future research, we will analyze different sequential dance motions using DTW (Dynamic Time Warping) to overcome the limitation of the fixed length of the feature vector. Furthermore, we will design a dance-movement classification system by integrating skeletal motion data with depth image sequences based on both a large dance movement database and deep learning.