Data Augmentation for Human Keypoint Estimation Deep Learning based Sign Language Translation

Electronics 2020, 9, x FOR PEER REVIEW

Abstract: Deep learning technology has developed steadily and is applied in many fields. To apply deep learning techniques correctly, sufficient training must come first. Several conditions are necessary for sufficient training, and one of the most important is training data. Collecting sufficient training data is fundamental, because deep learning does not work properly when the training data are insufficient. Datasets exist for many tasks, but not for all of them, so the data may have to be collected directly, which takes a lot of time and hard work. To reduce this effort, data augmentation is used to increase the training data. Some data augmentation methods are general, but specific data often require different methods. For example, sign language recognition uses video data processed with openpose. In this paper, we propose a new data augmentation method for sign language data used for learning translation, and we expect the proposed method to improve learning performance.


Introduction
Recently, artificial intelligence has been actively applied to sign language, in which language is expressed by hand, as shown in Figure 1. In this area, artificial intelligence (AI) is used to translate sign language captured by a camera into text or voice. Sufficient learning is required for deep learning [1] to translate a sign into text or voice properly, and one of the most important factors in this learning is the amount of sign language data. Sign language data are difficult to obtain. Since sign language is a motion, not a still image, the data must be produced as video, which is harder than producing images. In addition, since sign language is not a language that most people understand, unlike English or Korean, producing data requires someone who fully understands the language. Sign language also has many words, so collecting every word is difficult. For these reasons, producing and securing a sufficient amount of sign language data is very difficult. Therefore, this paper proposes a method to enlarge a sign language dataset. We expect to improve translation accuracy by enlarging the dataset with three augmentation methods and training the network on the result.
This paper is organized as follows. Chapter 2 describes related research. Chapter 3 describes in detail the data augmentation [2,3] methods for sign language data and explains how they were applied. Chapter 4 verifies the performance of the proposed method by comparing simulation results obtained with data augmented by the proposed method. Chapter 5 summarizes the proposed method and the experiment. Finally, Chapter 6 presents the evaluation and discussion of the paper.

Related Work
Frame skip sampling was used as a data augmentation method in an existing paper on sign language translation [4]. Random frame skip sampling is mainly used when handling video data: all frames are divided into n sections, one frame is randomly extracted from each section, and the extracted frames are combined into a new sequence. In that paper, frame skip sampling increased the data by 10 to 100 times. The learning process of that paper is shown in Figure 2. The videos increased through data augmentation are converted into videos composed only of keypoints using openpose [5], and the transformed videos are then used to train a GRU-based encoder-decoder model. In the experiment, the accuracy was best when the videos were increased 50 times by data augmentation. Since random frame skip sampling is a method for general video data, this paper instead proposes data augmentation methods that generate data according to the characteristics of sign language.
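The random frame skip sampling described above can be sketched as follows. This is a minimal illustration, not the implementation of [4]: the function name, the equal-width sections, and the uniform choice within each section are assumptions made for the example.

```python
import random


def random_frame_skip_sample(num_frames, n_sections, rng=None):
    """Pick one random frame index from each of n_sections contiguous sections.

    Hypothetical helper: the sections are equal-width slices of the frame range,
    and the frame inside each section is drawn uniformly at random.
    """
    if not 1 <= n_sections <= num_frames:
        raise ValueError("n_sections must be between 1 and num_frames")
    if rng is None:
        rng = random.Random()
    indices = []
    for s in range(n_sections):
        start = s * num_frames // n_sections        # first frame of the section
        end = (s + 1) * num_frames // n_sections    # one past the last frame
        indices.append(rng.randrange(start, end))
    return indices


# Each call yields a different shortened sequence from the same video,
# which is how one video can be multiplied many times.
sample = random_frame_skip_sample(num_frames=100, n_sections=10, rng=random.Random(0))
```

Because the indices are drawn section by section, the sampled frames stay in temporal order, so the motion of the sign is preserved while its exact frames vary.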

Materials and Methods
The data augmentation proposed in this paper is divided into the following three types.
1. Finger length conversion; 2. Random keypoint removal; 3. Camera angle conversion.
The proposed data augmentation was applied to the dataset, and learning was performed in the process shown in Figure 3. The videos to be trained were augmented by applying combinations of camera angle conversion, finger length conversion, and random keypoint removal. Then, the augmented data were input into openpose [6], and the generated keypoint videos were converted into frame images. Each frame image was passed through a convolutional neural network (CNN) [7,8] to extract features [9,10], and the extracted features were input to a long short-term memory (LSTM) model [11] to learn what the video means [12]. Learning progressed on these input features, and the video was translated into the word it represents. In addition, in order to recognize only the sign language motion, every data augmentation step included a pre-processing step in which openpose converted the video into a video composed of keypoints, as shown in Figure 4. The generated video consisted of the upper body, arms, and fingers of both hands, which are needed to recognize sign language. The reason for learning with such videos is that, as long as openpose succeeds, the result is not greatly affected by the surrounding environment.

Finger Length Conversion
Data augmentation options for an image of the type shown in Figure 4 are very limited. The first proposed method is to randomly increase or decrease the finger length. Finger length varies from person to person, so by randomly shortening or lengthening the fingers, a single video can be multiplied as if it had been recorded by various people.
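One possible way to realize finger length conversion on keypoint coordinates is sketched below. The paper does not give the exact procedure, so the details are assumptions: the hand is taken to be a list of 21 (x, y) points in the openpose hand layout (index 0 is the wrist, and each finger is a chain of four consecutive joints), every finger bone is scaled by the same factor, and the range 0.8 to 1.2 is chosen only for illustration.

```python
import random

# openpose hand layout: 0 = wrist, then four joints per finger from base to tip.
FINGER_CHAINS = [
    [1, 2, 3, 4],      # thumb
    [5, 6, 7, 8],      # index finger
    [9, 10, 11, 12],   # middle finger
    [13, 14, 15, 16],  # ring finger
    [17, 18, 19, 20],  # little finger
]


def scale_fingers(hand, scale):
    """Return a copy of the 21 hand keypoints with every finger bone scaled.

    The base joint of each finger stays in place; each following joint is
    rebuilt from the previous one using the original bone direction and a
    length multiplied by `scale`.
    """
    out = list(hand)
    for chain in FINGER_CHAINS:
        for prev, cur in zip(chain, chain[1:]):
            dx = hand[cur][0] - hand[prev][0]
            dy = hand[cur][1] - hand[prev][1]
            out[cur] = (out[prev][0] + scale * dx, out[prev][1] + scale * dy)
    return out


def random_finger_length(hand, low=0.8, high=1.2, rng=None):
    """Scale the fingers by a random factor (the range is an illustrative assumption)."""
    if rng is None:
        rng = random.Random()
    return scale_fingers(hand, rng.uniform(low, high))
```

To keep the augmented video consistent, the same factor would presumably be applied to every frame of one video, so that the signer appears to have the same hand throughout.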

Random Keypoint Removal
The random keypoint removal method randomly removes keypoints from a video created using openpose, such as the one shown in Figure 4. As shown in Figure 5b, a keypoint is randomly removed so that a specific node of the finger appears to be missing. The reason for increasing data in this way is as follows. Because openpose may fail to recognize some keypoints, we randomly removed keypoints to lower the dependency of learning on openpose. In other words, training on data with missing keypoints reduces the drop in translation accuracy that occurs when the recognition rate of openpose falls and keypoints are missing. In addition, because keypoints disappear at random, various videos can be generated, which improves the generalization of learning.
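Random keypoint removal can be sketched on a list of keypoints as follows. This is only an illustration under stated assumptions: the paper does not say how many keypoints are removed or whether the choice is made per frame or per video, so the independent per-keypoint drop probability `drop_prob` and the use of `None` to mark a missing keypoint are hypothetical.

```python
import random


def remove_random_keypoints(keypoints, drop_prob=0.1, rng=None):
    """Return a copy of the keypoints where each one is dropped with probability drop_prob.

    A dropped keypoint is replaced by None, imitating a joint that openpose
    failed to detect; the list length and the order of joints are preserved.
    """
    if rng is None:
        rng = random.Random()
    return [None if rng.random() < drop_prob else kp for kp in keypoints]


# Example: drop roughly one in ten of the 21 hand keypoints.
hand = [(float(i), float(i)) for i in range(21)]
augmented = remove_random_keypoints(hand, drop_prob=0.1, rng=random.Random(0))
```

When the keypoints are rendered back into a frame image, a `None` entry would simply not be drawn, which gives the missing-node appearance described for Figure 5b.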

Camera Angle Conversion
The data used to learn sign language translation were in video format, recorded with a camera. A camera can shoot from various angles, and since the size and perspective of an object look different depending on the shooting angle, this can be used to produce varied data for augmentation. If the shooting angle changes, the locations of the keypoints and the lengths of the nodes in the video shown in Figure 4 also change, so the data can be augmented effectively. However, shooting the video again from a different angle is difficult. Therefore, instead of changing the angle of the actual camera, the image itself was processed as shown in Figure 6 so that it appears to have been shot from a different angle. The camera angle was changed according to Equation (1). The equation gives the changed x and y values by multiplying a fixed 3 × 3 matrix by the x and y values of each pixel. Using Figure 6 as an example, x and y are the coordinates of each pixel in the original Figure 6a, and x' and y' are the coordinates of each pixel in the converted Figure 6b. t is a remaining value that is not actually used, h is the total height of the video, and w is the total width of the video. The value a is a weight that determines how much the camera angle is changed. As a increases, the effect of a camera looking down from a higher angle is obtained.
In this way, camera angle conversion generated videos at various angles, and accordingly various training data. In addition, the learned network generalized better because the videos were more diverse.
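The mechanics of mapping a pixel through a 3 × 3 matrix can be sketched as below. The matrix entries of Equation (1) are not reproduced in this text, so `illustrative_matrix` is a hypothetical placeholder (a simple vertical squeeze that grows with a) and must not be read as the matrix used in the paper. Appending 1 as the third input coordinate is also an assumption.

```python
def apply_matrix(m, x, y):
    """Multiply the 3 x 3 matrix m by the column (x, y, 1) and return (x', y').

    The third output component (t in the text) is not needed, so it is not computed.
    """
    xp = m[0][0] * x + m[0][1] * y + m[0][2]
    yp = m[1][0] * x + m[1][1] * y + m[1][2]
    return xp, yp


def illustrative_matrix(a, h):
    """Hypothetical stand-in for the matrix of Equation (1), NOT the paper's matrix.

    It compresses the vertical axis by h / (h + a), so a = 0 leaves the image
    unchanged and a larger a flattens it more.
    """
    s = h / (h + a)
    return [
        [1.0, 0.0, 0.0],
        [0.0, s, 0.0],
        [0.0, 0.0, 1.0],
    ]


# Map one pixel of a frame of height 480 with weight a = 100.
x_new, y_new = apply_matrix(illustrative_matrix(a=100, h=480), 320, 240)
```

In practice the same matrix would be applied to every pixel of every frame (or directly to the keypoint coordinates), and one augmented copy of the video would be produced per value of a.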

Experimental Results
To verify the proposed method, a sign language dataset from KETI (Korea Electronics Technology Institute) was used. From the KETI dataset, 20 classes were used: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, bear, ear, gas, leak, leg, next year, police, eyes, stairs, and stun did. Each class contained 20 images, for a total of 400 data. The network used in the experiment is shown in Figure 3.
The experiment was conducted in the following four ways, and the results were compared. The rest of the learning environment was the same, except for the number of data sets.

a. Learning is done with only the original data. That is, using 400 data in total, the learning process is as shown in Figure 3.
b. Data augmentation is performed using camera angle conversion and random keypoint removal at the same time before learning. Using camera angle conversion, the original data are augmented about five times, and this is doubled through the random keypoint removal method to learn using 4000 data, which is 10 times the total of the original data set.
c. Finger length conversions and random keypoint removal are used simultaneously to perform data augmentation before learning. Using the finger length conversion, the original data are augmented by about five times, and this is doubled through the random keypoint removal method to learn using 4000 data, which is 10 times the total of the original data set.
d. All three methods are used simultaneously: camera angle conversion, finger length conversion, and random keypoint removal, and data augmentation is performed before learning. Using the finger length conversion, the original data are augmented about five times, and this is doubled through the random keypoint removal method, and the data are again augmented five times using the camera angle conversion. As a result, learning is performed using 20,000 data, which is 50 times the original data set.
The three methods increase the data as follows. In the camera angle conversion method, data generated with four values of the camera angle weight a (50, 100, 150, and 200) are combined with the original, so the data are increased five times. The finger length conversion method uses four variants, a random finger length increase and a random decrease in the left hand and a random increase and a random decrease in the right hand, which are combined with the original to increase the data five times. Lastly, the random keypoint removal method doubles the data by combining the original with one version in which keypoints are randomly removed from the whole.
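The planned dataset sizes follow directly from these multiplication factors, as the short calculation below shows. The constant and function names are illustrative. The sizes are the planned ones; the number of usable samples can be smaller when openpose fails on some augmented videos.

```python
ORIGINAL_SIZE = 400    # 20 classes x 20 samples per class
CAMERA_FACTOR = 5      # original + four values of the weight a
FINGER_FACTOR = 5      # original + four finger length variants
KEYPOINT_FACTOR = 2    # original + one version with removed keypoints


def planned_size(*factors, original=ORIGINAL_SIZE):
    """Multiply the original dataset size by each augmentation factor in turn."""
    size = original
    for factor in factors:
        size *= factor
    return size


size_b = planned_size(CAMERA_FACTOR, KEYPOINT_FACTOR)                 # 400 * 5 * 2
size_c = planned_size(FINGER_FACTOR, KEYPOINT_FACTOR)                 # 400 * 5 * 2
size_d = planned_size(FINGER_FACTOR, KEYPOINT_FACTOR, CAMERA_FACTOR)  # 400 * 5 * 2 * 5
```

This gives 4000 samples for b and c and 20,000 samples for d, matching the figures stated in the list above.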
The results of the experiment using the above method are shown in Table 1. The maximum number of training sessions was set to 1000, but early stopping was set so that training ended once the loss converged; the batch size was 32, and the learning rate was set to e−6. For a, trained with the 400 original data, learning did not proceed properly, so no result could be obtained: because the dataset was small, the training loss did not change. Methods b and c were both intended to increase the data 10 times. c did so successfully, while b increased the data only eight times, to 3200 data. The data were not increased 10 times because openpose failed to recognize the keypoints when the camera angle changed excessively, so part of the dataset was not created properly. The accuracy was also low, as the amount of data was reduced. The learning process of b is shown in Figure 7. The total number of learning sessions was 240, and the loss converged at 2.48.
c shows that the data were generated successfully and learning succeeded, with a considerable accuracy of 80%; the learning process is shown in Figure 8. The total number of learning sessions was 80, and the loss converged at 0.365.
d shows a high accuracy of 96.2% by using a sufficiently large dataset produced by combining all the previous methods; the learning process is shown in Figure 9. The total number of learning sessions was 500, and the loss converged at 0.272.
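The early stopping rule mentioned above can be sketched as follows. The paper only states that training ended when the loss converged, so the convergence test used here (no improvement larger than `min_delta` for `patience` consecutive sessions) and both parameter values are assumptions. The list of losses stands in for a real training loop.

```python
def run_with_early_stop(session_losses, max_sessions=1000, patience=10, min_delta=1e-4):
    """Consume per-session losses and stop once the loss has stopped improving.

    Returns (sessions_run, best_loss). Training ends when max_sessions is reached
    or when `patience` consecutive sessions fail to improve the best loss by
    more than `min_delta`.
    """
    best = float("inf")
    stale = 0
    sessions_run = 0
    for loss in session_losses:
        if sessions_run >= max_sessions:
            break
        sessions_run += 1
        if best - loss > min_delta:
            best = loss   # meaningful improvement: remember it and reset the counter
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                break     # loss has converged
    return sessions_run, best


# A loss curve that flattens out at 0.4 stops well before the 1000-session cap.
losses = [1.0, 0.5, 0.4] + [0.4] * 100
sessions, best_loss = run_with_early_stop(losses, patience=3)
```

With a rule of this kind, the number of sessions actually run differs between experiments, which is consistent with the different session counts reported for b, c, and d.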
As can be seen in Table 1, increasing the amount of data through data augmentation has a great influence on the learning result, and good learning results can be obtained by training on enough data produced by properly combining the proposed methods.
In addition, the comparison with the frame skip sampling data augmentation method for sign language translation [4] mentioned in Related Work is shown in Table 2. The frame skip sampling method reaches an accuracy of up to 93.2% when the data are increased 50 times, while method d reaches an accuracy of up to 96.2% when the data are increased 40 times. Thus d achieves a higher accuracy with fewer data. Figure 10 shows the result of running the program, which estimates the sign language through the model in Figure 3 and outputs it as text.

Conclusions
In this paper, we proposed an augmentation method for sign language data to improve learning performance. Three methods were proposed: finger length conversion, random keypoint removal, and camera angle conversion. These methods were combined as in b, c, and d to augment the data by as little as eight times and as much as 40 times. The resulting accuracy ranged from a lowest value of 41.9% to a highest of 96.2%, and d, which uses all three methods, was 54.3 percentage points more accurate than b. These results verify that the proposed method is very helpful when sign language data are insufficient.

Discussion
In this paper, we verified three new data augmentation methods for sign language translation. As the experimental results above show, a combination of two of the three proposed methods, as in b and c, can provide the minimum data needed for learning. When all three methods are used in combination, a high accuracy of 96.2% is obtained. In addition, the results in Table 2 show that learning results vary depending on the data augmentation method. If the data augmentation method suits the data better, the best results can be obtained with fewer data, which means better learning efficiency per data item.
Therefore, when openpose-processed sign language video is used as training data, the proposed data augmentation method can achieve high performance while saving the time and effort of collecting data.

Conflicts of Interest:
The authors declare no conflict of interest.