Human Skeleton Data Augmentation for Person Identification over Deep Neural Network

Abstract: With the advancement in pose estimation techniques, skeleton-based person identification has recently received considerable attention in many applications. In this study, a skeleton-based person identification method using a deep neural network (DNN) is investigated. In this method, anthropometric features extracted from the human skeleton sequence are used as the input to the DNN. However, training the DNN with an insufficient training dataset makes the network unstable and may lead to overfitting during the training phase, causing significant performance degradation in the testing phase. To cope with the shortage of data, we investigate a novel data augmentation method for skeleton-based person identification that exploits the bilateral symmetry of the human body. To achieve this, augmented vectors are generated by exchanging the anthropometric features extracted from one side of the human body with those from the other side, and vice versa. Thereby, the total number of anthropometric feature vectors is increased 256-fold, which enables the DNN to be trained while avoiding overfitting. The simulation results demonstrate that, with the augmentation, the average accuracy of person identification on public datasets improves to 100%.


Introduction
Biometric-based person identification systems have attracted considerable attention owing to their advantages in a wide range of applications, such as access control, home security monitoring, surveillance, and personalized customer services [1][2][3], where accuracy in the identification of individuals is paramount. To achieve this goal, various biometric technologies have been developed, such as ear, face, fingerprint, gait, iris, palmprint, and voice [4][5][6][7][8][9][10]. According to the type of biometric information used, these technologies can be categorized into active and passive systems. In the active system, the user is required to make physical contact with a certain interface or device for the extraction of biometric information. Fingerprint-, iris-, and palmprint-based person identification systems are prime examples of the active system. By contrast, in the passive system, the biometric information of the user is extracted at a distance by utilizing sensor data without the need for physical contact. Ear-, face-, gait-, and voice-based person identification systems are prime examples of the passive system.
Although no physical contact is required, the user's cooperation is still necessary to enable recognition of their ear, face, or voice. For example, if a user has long hair that covers their ears and wears a face mask and hat that conceal parts of the face, it is difficult to clearly recognize the ears and face owing to the occlusion. To improve the accuracy, cooperation with the person identification process would be necessary. However, the user may feel uncomfortable revealing their

Related Work
The idea of using anthropometric features in person identification was first proposed by Araujo et al. [22]. The authors calculated the length of 11 human body parts captured by the Kinect sensor. The parts used were as follows: (1) left and right upper arms, (2) left and right forearms, (3) left and right shanks, (4) left and right thighs, (5) thoracic spine, (6) cervical spine, and (7) height. Using the average length of each body part, the authors tested four learning algorithms, including MLP and k-NN, on their own data consisting of eight subjects. According to their results, k-NN outperformed the other algorithms, achieving about 99.6% accuracy.
Andersson et al. [23] investigated the usefulness of anthropometric features on large-scale datasets. After capturing skeleton data from 164 individuals using the Kinect sensor, they tested two learning algorithms, MLP and k-NN, on their own data. The results showed that as the number of subjects for identification increased, the accuracy of both algorithms decreased. From the evaluation with different numbers of subjects, k-NN was shown to always outperform MLP. Andersson et al. [24] extended their previous work by exploring the effect of different numbers of subjects on the person identification accuracy. In addition, to improve the accuracy, they proposed using anthropometric features for all limbs rather than for only a limited subset. As a result, 20 anthropometric features, including the average lengths of the left and right shoulders, hands, hips, and feet, were used for person identification. Using these features, the authors tested three learning algorithms: SVM, k-NN, and MLP. As the number of subjects increased in steps of 15 over the range from 5 to 140, the accuracy of all the algorithms decreased. The k-NN algorithm outperformed the other algorithms except when the number of subjects was less than 35. By utilizing the extended anthropometric features, the accuracy of k-NN increased from approximately 80% [23] to approximately 85.4% [24].
The same authors analyzed their previous results in a subsequent study [25]. Focusing on k-NN, they explored the importance of each anthropometric feature in identifying individuals. By removing one anthropometric feature at a time, the authors measured the average accuracy of k-NN using the remaining features; the importance of the removed feature was then assessed by the resulting decrease in the average accuracy. The results showed that the importance of the anthropometric features in person identification differed from feature to feature; in particular, the average length of the right foot was the most important anthropometric feature for identifying individuals. Motivated by these results, Yang et al. [26] proposed applying majority voting to the person identification algorithm. In their study, a person was identified by the voting results of 100 k-NN classifiers, each trained on 10 anthropometric features randomly selected from the total of 20. The authors tested their proposed method on the public person identification dataset proposed by Andersson et al. [25], improving the average accuracy by 2.2%, from 83.9% to 86.1%.
Sun et al. [27] created a connection between 2D silhouette- and 3D skeleton-based person identification methods. Using the Kinect sensor, they built a person identification dataset composed of 2D silhouette and 3D skeleton sequences captured from 52 subjects. In their study, they extracted eight anthropometric features from the 3D skeleton sequences and then calculated a score defined as the weighted sum of the feature values. To determine the weight of each feature, they measured the correct classification rate when the corresponding feature alone was used for person identification; after this measurement was completed for all features, they set the weight of each feature to its correct classification rate. They trained k-NN using the obtained scores and tested it on their own dataset.

Motivation
As reviewed in Section 2, three machine-learning algorithms, k-NN, SVM, and MLP, have been widely used for person identification, and k-NN generally demonstrates better performance than the other methods [22][23][24][25]. To verify this, we implemented the MLP used in previous studies [22][23][24][25], which has a single hidden layer, setting the number of hidden units to 10, 20, and 40. We attempted to train the MLP using the simulation parameters used previously [22][23][24][25] but failed to achieve the desired accuracy. From this investigation, we found that the numbers of hidden layers and hidden units of the MLP were too small to learn the person identification task.
To overcome this issue, we used a DNN consisting of two dense layers with 512 fully connected nodes each. With this DNN, a training set accuracy of almost 100% can be achieved on the same datasets used previously [22][23][24][25]. Figure 1 shows the training loss and training set accuracy of the DNN. As shown in the figure, after approximately 360 epochs, the training loss converges and the training set accuracy reaches approximately 100%. Nevertheless, the maximum test accuracy of the DNN is 57.5%. From this simulation, we found that overfitting occurs during the training phase, which in turn leads to significant performance degradation in the testing phase. To address the overfitting, we applied several techniques, such as batch normalization [38], dropout [39], and L2 regularization [40], while varying the number of hidden layers and the number of hidden units in each layer. However, we failed to avoid overfitting owing to the shortage of data. The dataset contains 160 subjects in total with 5 sequences each, for a total of 800 sequences (= 160 subjects × 5 sequences per subject). Assuming that 5-fold cross-validation is used, 640 sequences (= 160 subjects × 4 sequences per subject) can be used for training, which is insufficient to train the DNN without overfitting. Therefore, to overcome the lack of data, we propose a novel data augmentation method that achieves both training set enrichment and overfitting prevention. For this augmentation, we focus on increasing the number of anthropometric feature vectors rather than the number of 3D skeleton sequences, based on the fact that the human body has bilateral symmetry. The main motivation for this approach is as follows: in general, a synthetic skeleton sequence can be generated by rotation and translation in 3D coordinates without loss of information [41,42].
However, even if the number of skeleton sequences is increased by generating such synthetic skeleton sequences, the anthropometric features extracted from the original and synthetic sequences are similar, as shown in Figure 2, so increasing the number of synthetic skeleton sequences alone offers limited performance gains. It is therefore essential to augment the anthropometric feature vectors themselves to train the DNN more accurately while preventing overfitting. We note that some anthropometric features, such as the average lengths of the left and right arms, forearms, thighs, and shanks, reflect the bilateral symmetry of the human body. Inspired by this, we utilize this bilateral symmetry for data augmentation, which is highly effective in improving performance on small-scale datasets. Figure 3 shows a schematic overview of our proposed method for skeleton-based person identification. In the proposed method, for each anthropometric feature vector, we augment new vectors by exchanging the anthropometric features extracted from the left (or right) side of the human body with the corresponding right (or left) features. With this data augmentation method, for the Kinect skeleton, the size of the anthropometric feature vector set can be increased 256 times. Using these augmented datasets, the DNN can be effectively trained without overfitting. In the next section, the 3D human skeleton model and anthropometric features are described in detail.

3D Human Skeleton Model and Anthropometric Features
Assume that a 3D human skeleton model consists of N joints, with index set J = {1, · · · , N}, and M limbs, which are represented by line segments between two joints. Let L = {(i, j) | i ≠ j, i ∈ J, j ∈ J} be the set of joint pairs constructing the limbs [43]. The values of N and M depend on the product specification of the motion capture sensor. For the Kinect sensor, N and M are 20 and 19, respectively, as shown in Figure 4. The details of the joint and limb information of the Kinect skeleton are provided in Tables 1 and 2, respectively. Let P_j = (x_j, y_j, z_j) be the position of the jth joint in the world coordinate system, where x_j, y_j, and z_j are the coordinates of the jth joint on the X-, Y-, and Z-axes, respectively. The Euclidean distance between the ith and jth joints, d(P_i, P_j), is used as a distance feature.
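As a concrete illustration, the distance feature reduces to a plain Euclidean norm over joint coordinates; a minimal NumPy sketch (the coordinate values below are hypothetical, not taken from the dataset):

```python
import numpy as np

def distance(p_i, p_j):
    """Euclidean distance d(P_i, P_j) between two joint positions."""
    return float(np.linalg.norm(np.asarray(p_i, dtype=float) - np.asarray(p_j, dtype=float)))

# Hypothetical world coordinates (meters) for two joints of a Kinect skeleton.
p_shoulder = (0.20, 1.40, 2.50)
p_elbow = (0.25, 1.12, 2.48)
upper_arm_length = distance(p_shoulder, p_elbow)
```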
The anthropometric features can be categorized into seven types: (1) length of each limb, (2) height, (3) length of upper body, (4) length of lower body, (5) ratio of upper body length to lower body length, (6) chest size, and (7) hip size. After these values are calculated for each frame, their average values over all frames are used as the anthropometric features for person identification. For the frame index n and the frame length F, the anthropometric feature is defined as follows.
(1) f_1: Average length of each limb. In most cases, the length of each limb differs from person to person; for this reason, many studies have used this anthropometric feature for person identification [22][23][24][25][26][27][28]. The length of each limb is calculated for each frame, and the average length of the limb connecting joints i and j over all frames is given by

f_1(i, j) = (1/F) ∑_{n=1}^{F} d(P_i[n], P_j[n]),   (1)

where P_i[n] and P_j[n] are the positions of the ith and jth joints at the nth frame, respectively. Since the 19 limbs shown in Figure 5a are used, the number of feature dimensions of f_1 is 19.

(2) f_2: Average height. Human height is also one of the most important anthropometric features used for person identification and is measured as the vertical distance from the foot to the head [22][23][24][25][26][28]. In general, a person is asked to stand up straight while their height is measured. However, for a walking person, it is difficult to measure the true height due to the posture difference. To alleviate this issue, as shown in Figure 5b, the person's height is calculated as the sum of the neck length, the upper and lower spine lengths, and the averages of the right and left hip, thigh, and shank lengths. To clarify, let H[n] be the subject's height at the nth frame, defined by (2) as follows:

H[n] = ℓ_neck[n] + ℓ_uspine[n] + ℓ_lspine[n] + (ℓ_Rhip[n] + ℓ_Lhip[n])/2 + (ℓ_Rthigh[n] + ℓ_Lthigh[n])/2 + (ℓ_Rshank[n] + ℓ_Lshank[n])/2,   (2)

where ℓ_limb[n] denotes the length of the indicated limb at the nth frame. Using H[n] in (2), the average height over all frames is obtained as

f_2 = (1/F) ∑_{n=1}^{F} H[n].   (3)

(3) f_3: Average length of upper body. This feature was introduced by Nambiar et al. [44,45]. As shown in Figure 5c, the upper body length is defined as the sum of the neck length and the upper and lower spine lengths. Let U[n] be the upper body length at the nth frame, defined by

U[n] = ℓ_neck[n] + ℓ_uspine[n] + ℓ_lspine[n].   (4)

Using U[n] in (4), the average length of the upper body over F frames becomes

f_3 = (1/F) ∑_{n=1}^{F} U[n].   (5)

(4) f_4: Average length of lower body. This feature was introduced by Nambiar et al. [44,45]. As shown in Figure 5d, the lower body length is defined as the sum of the averages of the right and left hip, thigh, and shank lengths. The lower body length at the nth frame, W[n], is then

W[n] = (ℓ_Rhip[n] + ℓ_Lhip[n])/2 + (ℓ_Rthigh[n] + ℓ_Lthigh[n])/2 + (ℓ_Rshank[n] + ℓ_Lshank[n])/2.   (6)

Using W[n] in (6), the average length of the lower body is

f_4 = (1/F) ∑_{n=1}^{F} W[n].   (7)

(5) f_5: Average ratio of upper body length to lower body length. This feature was introduced by Nambiar et al. [44,45] and is written as

f_5 = (1/F) ∑_{n=1}^{F} U[n]/W[n].   (8)

(6) f_6: Average chest size. As shown in Figure 5f, the chest size is defined as the distance between the 3rd and 4th joints (i.e., right and left shoulders) [27]. The average chest size is

f_6 = (1/F) ∑_{n=1}^{F} d(P_3[n], P_4[n]).   (9)

(7) f_7: Average hip size. As shown in Figure 5e, the hip size is defined as the distance between the 13th and 14th joints (i.e., right and left hips). The average hip size is

f_7 = (1/F) ∑_{n=1}^{F} d(P_13[n], P_14[n]).   (10)

After the anthropometric feature extraction is completed, the extracted features are concatenated into the feature vector

V = [f_1, f_2, f_3, f_4, f_5, f_6, f_7],   (11)

whose dimension is 1 × 25. Figure 6 shows the procedure to generate V from the skeleton sequence, which is then input to the DNN for identifying individuals. Figure 6. Overall procedure to generate feature vector V from skeleton sequence.
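The feature-extraction procedure of Figure 6 can be sketched as follows. This is a simplified, hypothetical NumPy implementation: `limb_avg_lengths` computes f_1 from an (F, N, 3) skeleton sequence, and `feature_vector` assembles the 1 × 25 vector V from f_1 and the scalar averages that, in the real pipeline, would be read from the appropriate limb entries; note that f_5 is approximated here by the ratio of the averages rather than the average of per-frame ratios:

```python
import numpy as np

def limb_avg_lengths(seq, limb_pairs):
    """f1: average length of each limb over all F frames.
    seq: (F, N, 3) array of joint positions; limb_pairs: list of (i, j) joint indices."""
    seq = np.asarray(seq, dtype=float)
    i, j = zip(*limb_pairs)
    # Per-frame limb lengths, then the mean over frames (one value per limb).
    return np.linalg.norm(seq[:, list(i)] - seq[:, list(j)], axis=2).mean(axis=0)

def feature_vector(f1, neck, uspine, lspine, hips, thighs, shanks, chest, hip):
    """Assemble V = [f1, f2, ..., f7] (dimension 1 x 25 when f1 has 19 entries)."""
    U = neck + uspine + lspine                               # f3: upper body length
    W = np.mean(hips) + np.mean(thighs) + np.mean(shanks)    # f4: lower body length
    H = U + W                                                # f2: height
    return np.concatenate([f1, [H, U, W, U / W, chest, hip]])
```

For the full Kinect skeleton, `limb_pairs` would hold the 19 joint pairs of Table 2, so f_1 has 19 entries and V has 25.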

Human Bilateral Symmetry-Based Data Augmentation
Here, we propose a novel data augmentation method that exploits the bilateral symmetry of the human skeleton structure. In general, the lengths of the limbs on the left and right sides can be regarded as equal. As shown in Figure 7, eight pairs of limbs in the skeleton are symmetric. Using the limb definitions in Table 2, the pairs are: (Limb #1, Limb #12), (Limb #2, Limb #13), (Limb #3, Limb #14), (Limb #4, Limb #15), (Limb #5, Limb #16), (Limb #6, Limb #17), (Limb #7, Limb #18), and (Limb #8, Limb #19). Using Equation (1), the average lengths of these limbs are calculated. In the feature concatenation process in Figure 6, the calculated values are located at the 1st to 8th elements and the 12th to 19th elements of V. Figure 7 depicts the proposed data augmentation method, where V(t) denotes the tth element, shown in its symbol color. In the proposed method, secondary versions of V are generated by exchanging the elements corresponding to the symmetric limbs, and this exchange is performed for different combinations of pairs. For example, suppose that the exchange is performed for five out of the eight pairs, and let C(p, q) be the number of q-combinations of a set of p elements. Then 56 combinations can be generated (i.e., C(8, 5) = 8!/(5!3!) = 56, where r! denotes the factorial of r). Because the number of symmetric limb pairs is eight, the total number of combinations over all q ∈ {1, 2, · · · , 8} is

∑_{q=1}^{8} C(8, q) = 2^8 − 1 = 255.

Each of these 255 combinations is given by q distinct index numbers representing the pairs selected from the eight. In the proposed method, these index numbers determine the elements of V to be exchanged. For example, suppose that only Pair 1 is selected (i.e., q = 1); then the first pair (Limb #1, Limb #12) is exchanged.
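The element-exchange procedure over all 255 pair combinations can be sketched as follows (a NumPy sketch; the 0-based index pairs correspond to the 1st–8th and 12th–19th elements of V described above):

```python
from itertools import combinations
import numpy as np

# 0-based positions in V of the eight symmetric limb pairs:
# pair k exchanges element k (Limb #k+1) with element k+11 (Limb #k+12).
SYMMETRIC_PAIRS = [(k, k + 11) for k in range(8)]

def augment(v):
    """Generate the 255 secondary versions of feature vector v by exchanging
    symmetric-limb elements for every non-empty subset of the eight pairs."""
    out = []
    for q in range(1, 9):
        for subset in combinations(SYMMETRIC_PAIRS, q):
            w = np.array(v, dtype=float)
            for a, b in subset:
                w[a], w[b] = w[b], w[a]
            out.append(w)
    return out
```

Including the original vector, each V yields 256 vectors, so the 640 training vectors of one cross-validation fold grow to 163,840.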
The secondary version of V is then generated as shown in Figure 8a. Therefore, for q ∈ {1, 2, · · · , 8}, the total number of secondary versions of V is ∑_{q=1}^{8} C(8, q) = 255. We present the pseudocode in Algorithm 1, where S is the set of secondary versions of V and its dimension is 255 × 25. According to the pseudocode, the total number of anthropometric feature vectors is increased up to 256 times (including the original vector), as shown in Figure 9.

Deep Neural Network for Person Identification

Figure 10 illustrates the architecture of the DNN. The DNN accepts anthropometric feature vectors as input. A dense layer with 512 fully connected nodes is added after the input layer. Let B be the batch selected at each epoch and S be its size; stacking the S feature vectors row-wise, the output of the dense layer is represented by (12) as follows:

O_1 = B W,   (12)

where W is a 25 × 512 dimensional weight matrix. The dimension of O_1 is then S × 512. To prevent overfitting and improve the speed, performance, and stability of training, we added a batch normalization layer after the dense layer. At the batch normalization layer, the mean of the batch is calculated by (13) as follows:

μ_B = (1/S) ∑_{s=1}^{S} O_1(s),   (13)

where O_1(s) is the sth row vector of O_1 and the dimension of μ_B is 1 × 512. In addition, using μ_B in (13), the variance of the batch is

σ²_B = (1/S) ∑_{s=1}^{S} (O_1(s) − μ_B)²,   (14)

where σ²_B has the dimension of 1 × 512.
The normalized output is then computed by (15) as follows:

Ô_1(s, v) = (O_1(s, v) − μ_B(v)) / √(σ²_B(v) + ε),   (15)

where s ∈ {1, 2, · · · , S}, v ∈ {1, 2, · · · , 512}, and ε is a small constant added for numerical stability. In this study, the value of ε is set to 0.001. Using (15), the output of the batch normalization layer is

O_2(s, v) = γ(v) Ô_1(s, v) + β(v),   (16)

where the parameters γ(v) and β(v) are learned during the optimization process of the training phase.
In the DNN, an activation layer is added after the batch normalization layer, in which we use the rectified linear unit (ReLU) activation function proposed in [46] to add non-linearity to the DNN. The output of the activation layer is therefore

O_3(s, v) = max(O_2(s, v), 0),   (17)

where max(a, 0) returns a for a > 0 and 0 otherwise. After the activation layer, a dropout layer is added; during the training phase, its input units are randomly dropped at a specified rate, which is set to 0.1 in the DNN. A second dense layer, batch normalization layer, activation layer, and dropout layer are then successively added as described above. The number of dense nodes of the second dense layer is the same as that of the first, as shown in Figure 10. The ReLU activation function is used at the second activation layer, and the dropout rate of the second dropout layer is also 0.1.
At the end of the DNN, a softmax layer is added to classify the label of each person, using the softmax function as the activation function. The activation values from the softmax layer are therefore normalized by

y_θ = exp(ŷ_θ) / ∑_{θ'=1}^{Θ} exp(ŷ_θ'),   (18)

where exp(·) is the exponential function and θ is the label of the θth person. Here, y_θ and ŷ_θ are the normalized and unnormalized activation values, respectively. If the person identification dataset contains Θ subjects in total, the total number of labels is Θ, and the softmax layer outputs a 1 × Θ-dimensional vector whose θth value is y_θ. To train the DNN, we use the sparse categorical cross-entropy loss function as the objective function.
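The layer stack described in this section (dense → batch normalization → ReLU, twice, followed by a softmax output) can be sketched framework-free in NumPy; the weights here are untrained random values, and dropout is omitted because it is active only during training:

```python
import numpy as np

rng = np.random.default_rng(0)

def dense(x, w):
    """Dense layer: O = X W (biases omitted for brevity)."""
    return x @ w

def batch_norm(o, gamma, beta, eps=0.001):
    """Batch normalization over the batch axis, with eps = 0.001 as in the text."""
    mu = o.mean(axis=0)                      # batch mean, one value per node
    var = o.var(axis=0)                      # batch variance, one value per node
    return gamma * (o - mu) / np.sqrt(var + eps) + beta

def relu(o):
    return np.maximum(o, 0.0)

def softmax(o):
    e = np.exp(o - o.max(axis=1, keepdims=True))  # shifted for numerical stability
    return e / e.sum(axis=1, keepdims=True)

# Forward pass for a batch of S = 4 feature vectors (25-dim), Theta = 160 labels.
S, D, H, Theta = 4, 25, 512, 160
x = rng.normal(size=(S, D))
h = relu(batch_norm(dense(x, rng.normal(size=(D, H)) * 0.1), 1.0, 0.0))
h = relu(batch_norm(dense(h, rng.normal(size=(H, H)) * 0.1), 1.0, 0.0))
probs = softmax(dense(h, rng.normal(size=(H, Theta)) * 0.1))
```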

Dataset and Evaluation Protocol
We evaluated the proposed method on an existing publicly available person identification dataset, called Andersson's dataset [23][24][25]. To the best of our knowledge, this dataset contains the largest number of subjects among publicly available datasets, and for this reason it has been widely used to test person identification methods. It contains 164 subjects in total, with five sequences per subject. However, there are only three or four sequences for the following four subjects: "Person002," "Person015," "Person158," and "Person164." We eliminated these four subjects from the dataset. In addition, there are six sequences for the following seven subjects: "Person003," "Person034," "Person036," "Person052," "Person053," "Person074," and "Person096." After inspecting the six sequences of each of these subjects, we decided to eliminate the noisiest one. Consequently, the 4th sequence of "Person003," 1st sequence of "Person034," 3rd sequence of "Person036," 5th sequence of "Person052," 5th sequence of "Person053," 3rd sequence of "Person074," and 6th sequence of "Person096" were eliminated from the dataset. As a result, the dataset contains a total of 800 (= 160 subjects × 5 sequences per subject) sequences.
We used 5-fold cross-validation to evaluate the performance of each method. In each fold, four of the five sequences of each subject were selected; the selected 640 sequences (= 160 subjects × 4 sequences per subject) formed the training dataset, and the remaining 160 (= 800 − 640) sequences formed the testing dataset.
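One plausible reading of this protocol can be sketched as follows (the per-fold selection rule is an assumption here: fold f holds out the f-th sequence of every subject):

```python
def five_fold_splits(n_subjects=160, n_seqs=5):
    """Yield (train, test) lists of (subject, sequence) pairs, one pair of
    lists per fold; in fold f, sequence f of every subject is held out."""
    for fold in range(n_seqs):
        train, test = [], []
        for subj in range(n_subjects):
            for seq in range(n_seqs):
                (test if seq == fold else train).append((subj, seq))
        yield train, test
```

Each fold then trains on 640 sequences and tests on the remaining 160, matching the counts above.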

Four Benchmark Methods
For performance comparison, we implemented four person identification methods: (1) C-SVM, (2) nu-SVM, (3) k-NN, and (4) MLP. Figure 11 shows the block diagram of the training and testing phases of the four methods. Each classifier was trained to learn the person identification task. The anthropometric features described in Section 4 were extracted from the input skeleton sequence and input to each classifier in vector form. The four methods are detailed as follows:
• C-SVM and nu-SVM: To implement C-SVM and nu-SVM with multi-class classification support, we used LIBSVM, an open-source library for SVMs [47]. Following the recommendation in [48], we used the radial basis function (RBF) kernel for both. C-SVM has a cost parameter, denoted c, whose value ranges from 0 to ∞. The nu-SVM method has a regularization parameter, denoted g, with value range [0, 1]. The RBF kernel has a gamma parameter, denoted γ. The grid search method was used to find the best parameter combination for both C-SVM and nu-SVM.
• k-NN: According to the results of previous studies [22][23][24][25], k-NN achieved the best performance in the person identification task. To implement k-NN, we used the MATLAB function fitcknn and determined the best hyperparameter configuration using the hyperparameter optimization supported by fitcknn.
• MLP: According to the results of previous studies [22][23][24][25], MLP showed the worst performance in the person identification task. In [22], the MLP had 10 hidden units; in [23], 20 hidden units; and in [24,25], 40 hidden units. For performance comparison, we implemented the three MLPs using TensorFlow [49]. However, as mentioned in Section 3, we observed in the experiments that the training set accuracy of each MLP was lower than 10% when it comprised only a single hidden layer.
Therefore, to overcome this issue and improve the training set accuracy, a batch normalization layer, activation layer, and dropout layer were successively added after the hidden layer of each MLP. For simplicity, we call the three MLPs MLP-10, MLP-20, and MLP-40, where the suffix denotes the number of hidden units in the hidden layer.
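The grid search over the SVM parameters can be sketched generically. The exponentially spaced grids below follow the common recommendation for RBF-kernel SVMs and are an assumption here, and `train_eval` is a hypothetical stand-in for a LIBSVM training-plus-validation wrapper:

```python
from itertools import product

def grid_search(train_eval, c_grid, gamma_grid):
    """Return the (c, gamma) pair maximizing validation accuracy.
    train_eval(c, gamma) -> accuracy is supplied by the caller."""
    return max(product(c_grid, gamma_grid), key=lambda cg: train_eval(*cg))

# Exponentially spaced candidate grids (an assumption, following common practice).
c_grid = [2.0 ** k for k in range(-5, 16, 2)]
gamma_grid = [2.0 ** k for k in range(-15, 4, 2)]
```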

Results and Comparisons
In each cross-validation fold, the grid search method was used to find the best parameters c and γ for C-SVM. Figure 12 shows the grid search results for C-SVM over the 5-fold cross-validation; C-SVM achieved an average accuracy of 82.625%. Likewise, the grid search method was used in each fold to find the best values of the g and γ parameters for nu-SVM. Figure 13 shows the grid search results for nu-SVM; it achieved an average accuracy of 83%. Table 3 lists the hyperparameter optimization results for k-NN over the 5-fold cross-validation. As shown in the table, the optimal hyperparameters found in each fold were mutually different. The table also lists the person identification accuracy of each k-NN over the 5-fold cross-validation; k-NN achieved an average accuracy of 86.5%. Table 3. Hyperparameter optimization results for k-NN over 5-fold cross-validation. Figure 14 shows the training loss and training set accuracy of each MLP over the 5-fold cross-validation. In the experiments, the batch size was set to 64, the Adam optimizer proposed in [50] was used for training, a ReLU activation function was used, and the dropout rate was set to 0.1. As shown in the figure, the training loss of MLP-40 converged fastest among the MLP-based methods, and its training set accuracy outperformed those of MLP-10 and MLP-20: MLP-40 achieved a training set accuracy of approximately 100%, whereas MLP-10 and MLP-20 achieved approximately 60% and 90%, respectively. In particular, we observed that the training set accuracy of each MLP was lower than 10% when the batch normalization layer was not added. From these results, batch normalization was seen to effectively improve the person identification accuracy of the MLP-based methods.
For this reason, the batch normalization layer was added after the hidden layer of each MLP. Table 4 shows the testing set accuracy of each MLP over the 5-fold cross-validation. As shown in the table, MLP-40 achieves the best average testing accuracy of 72.752%. However, compared with the training set accuracy of MLP-40 in Figure 14, its accuracy on the testing datasets is significantly lower; the accuracies of MLP-10 and MLP-20 are likewise lower on the testing datasets than the results in Figure 14. These results show that overfitting occurred in the training phase of each MLP. Figure 15 shows the accuracy of each method over the 5-fold cross-validation. In the figure, the proposed DNN trained with the augmented training dataset is denoted "Proposed." In the training phase of each fold, the proposed data augmentation method was applied to the 640 anthropometric feature vectors V, so the total number of vectors became 163,840 (= 640 × 256). Using this augmented training dataset, the proposed DNN could be trained without overfitting. During the training phase, the batch size was set to 1000 and the Adam optimizer was used. As a result, as shown in Figure 15, the proposed DNN achieved 100% person identification accuracy in all cross-validation folds. The average accuracy of each method is listed in Table 5. Among the benchmark methods, k-NN achieves the best average accuracy of 86.5%, whereas MLP-10 exhibits the worst with 57.324%. The person identification accuracy of the proposed method ("Proposed DNN+Proposed DA") is greater than that of k-NN by 13.5%. Moreover, to validate the effectiveness of the proposed data augmentation method, we also trained MLP-10, MLP-20, and MLP-40 using the augmented training dataset, denoting them "MLP-10+Proposed DA," "MLP-20+Proposed DA," and "MLP-40+Proposed DA."
As shown in the table, the proposed data augmentation method improves the person identification accuracy of "MLP-10+Proposed DA," "MLP-20+Proposed DA," and "MLP-40+Proposed DA" compared to MLP-10, MLP-20, and MLP-40, respectively. For MLP-40, the average accuracy is improved by 25.62%; for MLP-20, by 24.252%; and for MLP-10, by 18.204%. Therefore, the results show that the proposed data augmentation method improves the performance of neural network models by preventing overfitting, and the proposed DNN achieves the best performance among all the methods. Table 6 shows the identification accuracy of the proposed and benchmark data augmentation methods. For the benchmark, the skeleton sequence is divided into several segments of equal length. In these experiments, we divided the original sequence using five different segment lengths; the frame division process generates 15 segments in total: one of length F, two of length F/2, three of length F/3, four of length F/4, and five of length F/5. The anthropometric feature vectors are then calculated for each segment, so the feature vectors used for training the DNN are augmented in the benchmark method. In "Benchmark DA #1," the anthropometric feature vectors were calculated for 3 segments (one for F and two for F/2), so the total number of training vectors becomes 1920 (= 3 × 640). In "Benchmark DA #2," the feature vectors were calculated for 6 segments (one for F, two for F/2, and three for F/3), giving 3840 (= 6 × 640) training vectors. In "Benchmark DA #3," the feature vectors were calculated for 10 segments (one for F, two for F/2, three for F/3, and four for F/4), giving 6400 (= 10 × 640) training vectors.
In "Benchmark DA #4," the feature vectors were calculated for 15 segments (one for F, two for F/2, three for F/3, four for F/4, and five for F/5), giving 9600 (= 15 × 640) training vectors. As shown in the table, the proposed data augmentation method outperforms all the benchmark methods. Among the benchmarks, "Benchmark DA #4" performed the best because its number of augmented training vectors was the largest. Table 7 shows the average accuracy of the DNN according to the number of dense layers and the number of dense nodes per layer. Determining how many dense layers and dense nodes per layer are needed to guarantee the accuracy of the DNN is an important design issue, and we used the grid search method for this purpose. Let D_L be the number of dense layers of the DNN and D_N be the number of dense nodes per layer. To find the best parameter configuration, D_L was varied over {1, 2, 3, 4, 5} and D_N over {2^1, 2^2, · · · , 2^9}. As shown in the table, as D_N increases, the average accuracy of the DNN increases; in other words, if D_N is not sufficiently large, the accuracy of the DNN is not guaranteed. When D_N = 512 and D_L ∈ {2, 3, 4, 5}, the DNN achieved 100% accuracy. Based on these results, we used a DNN consisting of 2 dense layers with 512 dense nodes per layer. To find the best activation function for the DNN, we evaluated its average accuracy with each of the following activation functions: ReLU, sigmoid, softmax, softplus, softsign, tanh, the scaled exponential linear unit (SELU) [51], the exponential linear unit (ELU) [52], and the exponential activation function. Table 8 shows the evaluation results, obtained with the DNN consisting of 2 dense layers with 512 dense nodes per layer.
As shown in the table, the DNN achieved 100% accuracy with the ReLU, sigmoid, softplus, softsign, SELU, and ELU activation functions. On the other hand, with the softmax, tanh, and exponential activation functions, the accuracy of the DNN was not guaranteed. Therefore, based on the results in the table, any of the six former activation functions can be used; in this study, we used the ReLU activation function.

To find the best dropout rate for the DNN, we evaluated its average accuracy according to the dropout rate. Table 9 shows the evaluation results. In these experiments, the DNN consisting of 2 dense layers with 512 dense nodes per layer was used, and the ReLU activation function was used in each activation layer. The dropout rate was varied over {0.0, 0.1, 0.2, 0.3, 0.4}. For all the dropout rates, the DNN achieved a training set accuracy of 100%. However, as shown in the table, the average testing set accuracy decreases as the dropout rate increases; the DNN achieved a testing set accuracy of 100% only when the dropout rate was set to either 0.0 or 0.1. Therefore, based on the results in the table, either of these two dropout rates can be used; in this study, we used a dropout rate of 0.1.
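The selected configuration (2 dense layers of 512 ReLU units, dropout rate 0.1, softmax output over the identities) can be sketched in plain NumPy as follows. The layer sizes come from the grid search above, while the input dimension and the number of identities are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def init_mlp(in_dim, hidden=512, n_layers=2, n_ids=16):
    # Configuration chosen by the grid search: D_L = 2, D_N = 512.
    dims = [in_dim] + [hidden] * n_layers + [n_ids]
    return [(rng.normal(0, np.sqrt(2.0 / d_in), (d_in, d_out)), np.zeros(d_out))
            for d_in, d_out in zip(dims[:-1], dims[1:])]

def forward(params, x, dropout=0.1, training=False):
    h = x
    for W, b in params[:-1]:
        h = relu(h @ W + b)
        if training and dropout > 0.0:  # dropout is active only during training
            mask = rng.random(h.shape) >= dropout
            h = h * mask / (1.0 - dropout)
    W, b = params[-1]
    return softmax(h @ W + b)

# 16 anthropometric features per vector, 16 identities (illustrative sizes):
params = init_mlp(in_dim=16)
probs = forward(params, rng.normal(size=(4, 16)))
print(probs.shape)  # (4, 16); each row is a distribution over the identities
```

Scaling the retained activations by 1/(1 − p) during training (inverted dropout) lets the same forward pass be used unchanged at test time, matching the evaluation setup above where dropout is disabled.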

Discussion
The skeleton model estimated from an RGB image (or depth image) generally contains noise and errors (referred to as uncertainty) introduced by keypoint/joint detection methods [53]. This uncertainty causes differences between the anthropometric features extracted from the symmetric parts of the human body (i.e., one extracted from the right side and the other from the left side). By exchanging these features, the proposed data augmentation method generates new anthropometric feature vectors. However, with the advancement of pose estimation techniques, if the uncertainty is significantly reduced and the anthropometric features extracted from the left (or right) side of the human body become exactly the same as the corresponding right (or left) features, the feature vectors augmented by the method all become identical. In this case, the effectiveness and usefulness of the proposed augmentation method may be limited.
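The exchange of symmetric features can be illustrated as follows. This is a minimal sketch, not the authors' exact implementation: it assumes 8 left/right feature pairs, so independently choosing the left or right value for each pair yields 2^8 = 256 augmented vectors per sequence, matching the 256-fold increase reported above.

```python
from itertools import product
import numpy as np

def swap_augment(left, right):
    """Generate every combination of taking each symmetric anthropometric
    feature from either the left or the right side of the body."""
    left, right = np.asarray(left), np.asarray(right)
    assert left.shape == right.shape
    vectors = []
    for choice in product((0, 1), repeat=left.size):  # 2^n left/right selections
        mask = np.array(choice) == 0
        picked = np.where(mask, left, right)
        # The complementary selection supplies the other side's features,
        # so each pair is either kept as-is or swapped:
        mirror = np.where(mask, right, left)
        vectors.append(np.concatenate([picked, mirror]))
    return np.stack(vectors)

# 8 symmetric limb-length features per side (illustrative values, in cm):
left = np.array([30.1, 25.4, 40.2, 38.9, 15.3, 9.8, 20.5, 12.7])
right = np.array([29.7, 25.9, 39.8, 39.3, 15.1, 10.1, 20.2, 12.9])
aug = swap_augment(left, right)
print(len(aug))  # 2^8 = 256 augmented feature vectors
```

Note that the sketch makes the limitation above concrete: if `left` and `right` were numerically identical, all 256 generated vectors would collapse into one.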

Conclusions
In this study, we proposed a skeleton-based person identification method using a DNN. The anthropometric features extracted from the human skeleton sequence were used as input to the DNN. However, when the DNN was trained on the existing public dataset, overfitting occurred during the training phase, and the DNN could not achieve the desired testing set accuracy. To prevent overfitting and improve the testing set accuracy, we proposed a novel data augmentation method in which augmented vectors are generated by exchanging the anthropometric features extracted from the left side of the human body with the corresponding features extracted from the right side. Using this method, the total number of anthropometric feature vectors was increased by 256 times. Experimental results demonstrated that the proposed DNN identified individuals with 100% accuracy when trained on the augmented training dataset.
Author Contributions: Conceptualization, methodology, software, validation, formal analysis, investigation, resources, data curation, and writing-original draft preparation, B.K.; writing-review and editing, B.K. and S.L.; supervision and funding acquisition, S.L. All authors have read and agreed to the published version of the manuscript.