DGU-HAU: A Dataset for 3D Human Action Analysis on Utterances

Constructing diverse and complex multi-modal datasets is crucial for advancing human action analysis research: such datasets provide ground truth annotations for training deep learning networks and enable the development of robust models across real-world scenarios. Generating natural and contextually appropriate nonverbal gestures is essential for immersive and effective human–computer interaction in various applications, including video games, embodied virtual assistants, and conversations within a metaverse. However, existing speech-related human datasets focus on style transfer and are therefore unsuitable for 3D human action analysis studies such as human action recognition and generation. We therefore introduce DGU-HAU, a novel multi-modal dataset of 3D human actions on utterances that commonly occur during daily life. We validate the dataset using Action2Motion (A2M), a state-of-the-art 3D human action generation model.


Introduction
Human action analysis research is important for understanding and interpreting human behavior from various perspectives. This field is crucial for multiple applications, including video surveillance, healthcare, robotics, sports analysis, and entertainment. By accurately recognizing, predicting, and modeling human actions, such research can enhance safety, efficiency, and automation in various industries.
Constructing datasets suitable for human action analysis is significant for advancing this research domain. These datasets provide essential ground truth annotations for training and evaluating deep learning networks across tasks such as action recognition, action prediction, action generation and modeling, pose estimation, and real-time action analysis. Because diverse and complex human actions span multiple contexts and environments, such datasets allow researchers to develop robust models that generalize well to real-world scenarios. Furthermore, well-structured datasets foster healthy competition within the research community, inspiring the development of more accurate and efficient action analysis techniques. Previous human action analysis datasets were unimodal, mainly based on RGB images or videos [1]. These datasets do not contain depth information, so they require pre-processing to reconstruct the 3D skeleton. With the introduction of depth sensors, such as Microsoft Kinect [2,3] and IR cameras [4,5], it became possible to build multi-modal human action analysis datasets containing RGB, depth, and 3D skeleton data. Therefore, in this paper we introduce a general-purpose human action analysis dataset and validate it with a human action generation model [6].
The generation of human-like movements has garnered significant attention and research across disciplines such as computer vision, graphics, and animation. This field aims to develop algorithms and models that can produce realistic and natural movements resembling those of humans. By capturing the intricacies of human motion, many studies strive to enhance the quality and believability of virtual characters, avatars, and animations, thereby creating immersive experiences in various domains, including entertainment, virtual reality, and robotics.
To achieve truly immersive and effective human–computer interactions, generating nonverbal gestures that appear natural and appropriate is crucial across a range of conversational scenarios. This necessity has emerged in various applications, including communication with characters in video games, embodied virtual assistants, and avatars conversing in a metaverse. In video games, lifelike character animations convey emotions, intentions, and interactions, enabling players to engage with the virtual world more deeply. Embodied virtual assistants, such as chatbots or virtual agents, can benefit from natural gestures that enhance their expressiveness and facilitate more intuitive communication with users. Moreover, as the concept of the metaverse continues to evolve, avatars engaging in conversations within this virtual realm will require contextually appropriate nonverbal gestures, enabling users to connect and communicate effectively in this immersive environment. In all these scenarios, research on generating natural nonverbal gestures aims to bridge the gap between verbal communication and nonverbal expression, enhancing the overall effectiveness and believability of human–computer interactions.
Previous studies on human action generation include Generative Adversarial Network (GAN)-based [7,8], conditional temporal Variational Auto-Encoder (VAE)-based [6], Graph Convolutional Network (GCN)-based [9], and Transformer-based models [10]. The associated datasets [4,5,11,12] have focused on generating human motion for daily activities. Despite significant progress, a gap remains in human action generation datasets: no action generation dataset covers conversational situations. Research on gesture generation for speech involves studying the unique gesture style of an individual, replicating it, and applying it to other subjects or contexts. The primary focus of that task is the creation of a specific speech style, or style transfer, rather than generating the general human behaviors that may occur during a conversation.
This paper introduces a novel dataset, DGU-HAU, a dataset of 3D human actions on utterances that commonly occur during daily life. Our dataset is divided into two categories: single-person presentations and conversations involving two or four people. The two categories contain four and ten scenarios, respectively, and each scenario includes approximately ten action classes, resulting in 142 distinct action classes across 14 scenarios. The subjects comprised 166 participants of different age groups, with an almost equal distribution of males and females. We collected approximately 100 motion capture data samples for each class, resulting in 14,116 motion capture data samples described by 2408 JSON annotation files; each annotation file contains annotation information for about six motion capture data samples. We validated our dataset using the Action2Motion (A2M) [6] model, which represents the state of the art in 3D human action generation as of 2023. A2M has been validated across various datasets, employing the Fréchet Inception Distance (FID) [13] as an evaluation metric. Moreover, A2M operates on conditional temporal VAE principles and crafts physically plausible human actions by leveraging Lie algebra. Hence, we adopted A2M to validate our dataset by generating physically plausible human actions.
The structure of this paper is as follows. Section 2 reviews previous research on 3D-based human action analysis. Section 3 describes the structure of the proposed dataset and how we pre-processed it. Section 4 explains the dataset evaluation results with the A2M model and the performance analysis. Finally, Section 5 summarizes the paper and discusses future work.

Generative Pre-Trained Transformer
Recent research trends in GPT (Generative Pre-trained Transformer) [14][15][16] have shown significant advancements in natural language processing. GPT, a state-of-the-art language model [16][17][18][19] based on the Transformer [20] architecture, has gained tremendous attention and popularity in the research community. It has demonstrated remarkable capabilities in various natural language understanding and generation tasks, including machine translation, text summarization, question answering, and conversational agents. Researchers have been actively exploring novel techniques to improve the performance, efficiency, and generalization ability of GPT models. Recent studies have addressed challenges such as model size, training efficiency, fine-tuning techniques, and ethical considerations in language generation. Additionally, efforts have been made to extend the capabilities of GPT models to handle multi-modal tasks that involve both textual and visual inputs. This paper studies building a multi-modal dataset for such generative models. Therefore, we aim to develop a general-purpose 3D human action analysis dataset for tasks involving both text and visual input, addressing the limitation that GPT studies are biased toward natural language processing.

Gesture Generation
In human action analysis, action generation largely divides into gesture generation and 3D human action generation. Gesture generation is a crucial area of research for understanding and enhancing an individual's speech pattern. The primary objective is to generate expressive gestures that align with the speech context. Several studies have explored different aspects of gesture generation. One notable work, by Ginosar et al. [21], focused on understanding and learning the unique conversational gestures of ten celebrities. By analyzing a large dataset of their speeches, the study aimed to capture and reproduce the distinct styles of these individuals in gesture generation. Another relevant research direction is style transfer in gesture animation. The authors of [22] investigated the transfer of gesture styles between individuals; the goal was to learn consistent gesture styles from multiple individuals and apply those styles to different subjects. However, this task does not match our dataset, which aims to generate general gestures or synthesize actions during a speech to fit the speech context.

3D Human Action Generation
In 3D human action generation, several related works have explored various aspects of this research area. The authors of [7] generate realistic and consecutive human actions using an autoencoder and a generative adversarial network. A bi-directional GAN-based model [8] generates action sequences from noise; its authors propose modeling smooth and diverse transitions for action generation using a latent space of lower dimensionality. Unlike standard action prediction methods, ref. [8] can generate action sequences from pure noise without any conditioning action poses. The authors of [9] suggest a modified version of GCNs that selectively uses self-attention to sparsify a complete action graph in the temporal domain. The work in [10] presents a generative VAE-Transformer architecture using SMPL [23] for 3D mesh modeling. A conditional temporal VAE-based model [6] uses Lie algebra to generate physically plausible human actions. We use this model to validate our dataset precisely because of its Lie-algebra representation.
The majority of studies on human action generation use the following datasets: Human3.6M [11], NTU RGB+D [4,5], HumanAct12 [6], and UESTC [12]. The configuration of these datasets is described in Table 1. The dataset in [11] contains 3.6 million frames of motion capture data from 11 subjects performing 17 motions, captured by four cameras; it provides 3D skeleton motion capture data for 17 action classes (walking, running, cycling, clapping, lifting, squats, etc.) and section tagging annotation. The work in [5] includes 114,480 motion capture sequences performed by 106 subjects, covering 120 motions such as hand waving, picking up objects, sitting, standing up, walking, and running; it also includes RGB+D data, 3D skeleton motion capture data, and section tagging annotation. The work in [6] includes motion capture data for 12 action categories, such as warming up and lifting a dumbbell, and 34 subcategories, including warming up the elbow and lifting a dumbbell with the right hand, along with segment tagging annotations. The work in [12] contains 40 categories of aerobic exercise with 118 subjects, as shown in Table 1. Our dataset has the largest number of action classes and subjects among the datasets shown in Table 1.
Table 1. Comparison of the proposed DGU-HAU dataset and other 3D human action datasets. The Anno. (Annotation) entry denotes the section tagging annotation data of each data sample together with metadata such as actor information, action code, conversation or presentation scenario information, action class, and its code. Our dataset has a text script modality: the textualization of the audio extracted from each video data sample. Motion capture data (MCD) represent the 3D joint information of the human body. Since gesture generation is a study that learns and imitates a specific person's style as it can occur in conversation scenarios, the datasets used for gesture generation comprise various gestures of each particular person. In contrast, since 3D human action generation learns and creates general human actions for a specific action class, its datasets consist of data in which various people perform each action class. Our dataset corresponds to 3D human action generation rather than gesture generation because it contains general human actions for specific action classes that can occur in presentation and conversation scenarios. Additionally, as can be seen in Table 1, the HumanAct12 [6] dataset used in Action2Motion is smaller than our dataset. Therefore, the Action2Motion model was used to validate our dataset.

Dataset Structure and Building Process
This section describes the data collection environment, tools, methods, and structure. The overall data-building process for each data modality is schematized in Figure 1.
Figure 1. All data modalities were collected and built simultaneously. The fingers' motion capture data were collected using MoCap Pro Super Splay, a hand motion data collection device, separately from the body data, and were merged with the body motion capture coordinates according to the human skeleton's hierarchical structure.

Collection Setups
We used 12 Qualisys Arqus A9 devices for motion capture data in a 6∼15 m square space and 3 Miqus video devices for video data, as shown in Figure 2. Two-dimensional coordinates of each joint marker of the human body are generated from multiple motion capture cameras (Arqus A9, Qualisys, Göteborg, Sweden), and these two-dimensional coordinate data are analyzed by the Qualisys Track Manager (QTM) software (https://www.qualisys.com/software/qualisys-track-manager/ (accessed on 10 October 2023)) to calculate coordinates in three-dimensional space. We used MoCap Pro Super Splay, a glove with 16 sensors, to acquire the hand motion capture data. Motion capture data for the human body and hands were collected separately and then integrated using QTM to create a complete motion capture of the whole human body. RGB video data were collected from three distinct viewpoints, and footage was shot at 60 fps or higher. The RGB video is full HD with 1920 × 1080 resolution.

Data Modalities
The proposed dataset, DGU-HAU, provides 14,116 motion capture data samples with 2408 annotation data samples, and there are four data modalities: motion capture data, RGB video, scenario script, and annotation data. The overall building steps for each modality of our dataset are shown in Figure 1. Samples of each data modality are shown in Figure 3.

Motion Capture Data (MCD)
Our dataset provides 14,116 motion capture data samples of the human body in BVH (BioVision Hierarchy) format. This common file format delivers motion capture data that represent 3D coordinates of human joints, as shown in Figure 4. The BVH format consists of a hierarchy section and a motion section. In the hierarchy section, the information on the human skeleton joints is represented in a tree structure, and each joint has an offset and a channel list; the channel list is a transformation list for motion at that joint. First, we collected raw samples of motion capture data in FBX (Filmbox) format using 12 Qualisys Arqus A9 devices, as shown in Figures 1 and 2. Then, we cleaned up the collected data by labeling body part markers and recovering missing data and noise. After cleaning up, we verified the collected data and extracted the skeleton information from the raw data. We collected body and finger joint data separately to create a dataset that can capture precise finger movements. Thus, we obtained 3D coordinates for 75 joints, consisting of 27 for the body and 24 for each of the left and right hands (48 finger joints in total). We labeled the motion capture data, extracted each 3D coordinate, and converted it to JSON format for ease of use. We selected and employed 24 representative joints out of the 75 for data verification. Figure 4 displays the position, number, and label of the joints selected in this paper.
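As a small illustration of the BVH layout described above, the following Python sketch parses the HIERARCHY section of a toy, hand-written BVH snippet and lists joint names with their channel counts. The joint names here are illustrative only and do not reflect DGU-HAU's actual 75-joint skeleton.

```python
def parse_bvh_hierarchy(text):
    """Collect (joint name, channel count) pairs from a BVH HIERARCHY section."""
    joints = []
    current = None
    for line in text.splitlines():
        tok = line.strip().split()
        if not tok:
            continue
        if tok[0] in ("ROOT", "JOINT"):
            # A new joint declaration starts here.
            current = {"name": tok[1], "channels": 0}
            joints.append(current)
        elif tok[0] == "CHANNELS" and current is not None:
            # CHANNELS <count> <channel names...>
            current["channels"] = int(tok[1])
    return joints

# Toy two-joint hierarchy; a real BVH file would continue with a MOTION section.
sample = """HIERARCHY
ROOT Hips
{
  OFFSET 0.0 0.0 0.0
  CHANNELS 6 Xposition Yposition Zposition Zrotation Xrotation Yrotation
  JOINT Spine
  {
    OFFSET 0.0 10.0 0.0
    CHANNELS 3 Zrotation Xrotation Yrotation
  }
}"""
print(parse_bvh_hierarchy(sample))
```

The root joint typically carries six channels (translation plus rotation), while child joints carry three rotation channels, which is why the channel list varies per joint.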

RGB Video
Our dataset provides 1352 RGB videos in MP4 format at 1920 × 1080 resolution, covering various versions of each action class related to the utterance. Each RGB video contains more than one action class, and the section tagging information for each action class is in the annotation data. As shown in Figure 2, we filmed the 14 scenarios, each including about ten action classes, from three different views (front, left, and right side) using Qualisys Miqus devices. After collecting the video data, we anonymized the videos to protect personal information and then verified that the anonymization process was successful, as shown in Figure 1. Simultaneously, we extracted the audio data in MP3 format from the verified video data for the scenario script modality described in the next section.

Scenario Scripts
Our dataset provides 1352 scenario scripts as text files, based on the audio extracted from the RGB videos, as shown in Table 1. We wrote each scenario script from the audio data extracted from the corresponding RGB video. We then checked the audio-to-text scenario scripts for errors such as typos and profanity and verified that the lines matched the uttering actors and timestamps. As shown in Table 2, there are fourteen scenario types in total: four are scenarios for one-person presentation circumstances, and ten are scenarios for two- or four-person conversation circumstances. We collected data samples for the 14 scenario types with various combinations of actors.

Annotation Data
Our dataset provides 2408 annotation data samples, and each annotation file contains the annotation information for about 6 motion capture data samples, yielding 14,116 motion capture data samples in total. The annotation data serve as the metadata of the other modalities: they include information on the dataset, annotation of the video, and, for each action, the actor, scenario, action information, corresponding video section information, and motion capture data, as shown in Table 3. The motion capture annotation contains the frame ID and the 3D coordinates of the body joints shown in Figure 4. The name of each action class is tagged based on predefined start and end points and the conversation content. There are 75 body joints in total, of which 48 are finger joints, as shown in Table 3; these represent the 3D coordinates of 24 joints for each of the right and left hands. The remaining 27 joints represent important joints of the human body. For data verification, we referred to [6] and selected 24 of the 75 body joints that appropriately express the human body to construct a skeleton. A more detailed body joint label annotation is described in [26].
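To make the annotation structure concrete, the sketch below reads one annotation file and lists its tagged action sections. The field names (`video_id`, `actions`, `action_code`, `start_frame`, `end_frame`, ...) are hypothetical placeholders for illustration, not the dataset's actual JSON schema.

```python
import json

# Hypothetical annotation content; real DGU-HAU annotation files describe
# about six motion capture samples each, with actor/scenario metadata.
annotation = json.loads("""
{
  "video_id": "S01_C03_0001",
  "actions": [
    {"action_code": "A001", "actor": 17, "start_frame": 120, "end_frame": 480},
    {"action_code": "A002", "actor": 17, "start_frame": 500, "end_frame": 910}
  ]
}
""")

# Each action entry tags a frame section of the video/motion capture data.
sections = [(a["action_code"], a["end_frame"] - a["start_frame"])
            for a in annotation["actions"]]
print(sections)  # [('A001', 360), ('A002', 410)]
```

In practice, such section tags are what allow a single RGB video containing multiple action classes to be sliced into per-class motion capture clips.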

Subjects
The dataset includes 166 subjects of different ages and genders, each with a unique actor ID number. The subjects were recruited with an equal number of males and females and varying ages to mitigate bias. Table 4 shows the number of data samples by age and gender group. The subjects were divided into three age groups: the young group in their teens and twenties, the middle group in their thirties and forties, and the old group in their fifties and sixties. Y denotes the young group, M the middle group, and O the old group. There are 1128 data samples containing female subjects and 1280 data samples containing male subjects, so the data have an almost 1:1 gender ratio. The number of data samples by group is 226, 430, and 472 in the order of old, middle, and young groups for female subjects, respectively, and 344, 509, and 427 for male subjects. In the conversation scenarios, each scenario comprised more than 84 different subjects; in the presentation scenarios, each scenario included more than 170 different subjects.
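The per-group counts above can be cross-checked in a few lines; summing each gender's group counts should reproduce the stated totals of 1128 female and 1280 male samples.

```python
# Sample counts by age group, as reported in Table 4.
female = {"old": 226, "middle": 430, "young": 472}
male = {"old": 344, "middle": 509, "young": 427}

print(sum(female.values()), sum(male.values()))  # 1128 1280
```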

Action Classes
Our dataset has 142 action categories across four presentation and ten conversation scenarios. The four presentation types comprise sitting and standing presentations as well as explanations. The ten conversations cover various scenarios that commonly occur daily, including daily life (specifically watching TV), consoling someone, celebrating a birthday, etc. Tables 5 and 6 describe the scenario code, scenario name, action class code, and action class name of the action classes and scenarios. The tables include only brief information about the action classes and scenarios; the full tables are in [27,28].
Table 5. Configuration of the 37 action classes and their action codes in the four presentation scenarios. The work in [27] shows the full table of the configuration of the 37 action classes.

We validated our dataset using Action2Motion [6], as mentioned earlier. To generate plausible 3D human actions, we trained the A2M model with our data and used the trained model to create 3D human actions as GIF files. To do this, we pre-processed the data according to the input format of the A2M model. The overall pre-processing and action generation steps for our dataset are shown in Figure 5. We obtained the first pre-processed 3D human action data in JSON format through annotation of the BVH motion capture data, and the second pre-processed 3D coordinate data in NumPy format through the extraction of 3D coordinates, as shown in Figure 5. The extraction of 3D coordinates involves two main steps. First, for each data sample, we extract the information of the 24 joints we selected from among the 75 joints during the frame section corresponding to the action class. Second, the 3D coordinate values of the extracted 24 joints are converted to NumPy format to match the input format of the model. The final pre-processed dataset was used to train the A2M model to create new action syntheses, and we measured their similarity to the ground truth using the FID metric.
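The second pre-processing step described above (selecting 24 of the 75 joints over one action's frame section and converting to a NumPy array) can be sketched as follows. The joint indices used here are placeholders; the paper's actual 24-joint selection follows [6] and Figure 4.

```python
import numpy as np

# Placeholder indices for the 24 selected joints (not the paper's actual list).
SELECTED_JOINTS = list(range(24))

def extract_action_clip(coords, start, end, joints=SELECTED_JOINTS):
    """Slice one action's frame section and keep only the selected joints.

    coords: (total_frames, 75, 3) array of 3D joint positions from one sample.
    Returns a float32 array shaped (end - start, len(joints), 3).
    """
    clip = coords[start:end, joints, :]
    return np.asarray(clip, dtype=np.float32)

# Dummy sample: 1000 frames of 75 joints; the tagged action spans frames 120-480.
coords = np.zeros((1000, 75, 3))
clip = extract_action_clip(coords, 120, 480)
print(clip.shape)  # (360, 24, 3)
```

The resulting per-action arrays can then be saved with `np.save` to match a NumPy-based model input pipeline.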

Evaluation Results for A2M Model
This paper uses the A2M model to evaluate our dataset. Following [6], four metrics are considered to validate our dataset: FID, accuracy, diversity, and multi-modality. FID [13] is a metric used to measure the performance of generative models; it assesses the similarity between generated and ground truth data by computing the Fréchet distance between their Inception feature distributions. The FID is obtained by applying the following operation:

FID = ||µ_real − µ_fake||² + Tr(Σ_real + Σ_fake − 2(Σ_real Σ_fake)^{1/2}),

where µ_real and µ_fake represent the means of the feature vectors for real and generated data, respectively, and Σ_real and Σ_fake represent the covariances of the feature vectors for ground truth and generated data, respectively.
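The formula above can be sketched directly in NumPy. This is a minimal illustration operating on raw feature statistics, not the exact implementation of A2M's evaluation code (which extracts the feature vectors with a trained motion classifier); the trace term uses the symmetric form Σ_real^{1/2} Σ_fake Σ_real^{1/2}, which has the same trace as (Σ_real Σ_fake)^{1/2} and keeps the computation to symmetric matrices.

```python
import numpy as np

def sqrtm_psd(a):
    """Matrix square root of a symmetric positive semi-definite matrix."""
    vals, vecs = np.linalg.eigh(a)
    vals = np.clip(vals, 0.0, None)  # guard against tiny negative eigenvalues
    return (vecs * np.sqrt(vals)) @ vecs.T

def fid(mu_real, cov_real, mu_fake, cov_fake):
    """Fréchet distance between two Gaussians fitted to feature vectors."""
    s = sqrtm_psd(cov_real)
    covmean = sqrtm_psd(s @ cov_fake @ s)  # same trace as (Σr Σf)^{1/2}
    return float(np.sum((mu_real - mu_fake) ** 2)
                 + np.trace(cov_real + cov_fake - 2.0 * covmean))

mu = np.zeros(3)
cov = np.eye(3)
print(fid(mu, cov, mu, cov))                         # identical stats -> 0.0
print(fid(mu, cov, np.array([1.0, 0.0, 0.0]), cov))  # mean shift of 1 -> 1.0
```

With identical statistics the distance is zero, and with equal covariances it reduces to the squared mean difference, matching the two terms of the formula.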
FID is related to human intuition in that it is based on feature vectors that capture an abstract representation of generated and real data. Therefore, FID matches well with general intuition about how similar generated data feel to real data. Additionally, FID calculates the Fréchet distance between two multivariate normal distributions, which quantifies and compares the similarities and differences between the two distributions through statistical methods. Therefore, we used FID as the main data evaluation metric. Diversity gauges the deviation of the generated motions across all action categories. Unlike diversity, multi-modality assesses the extent to which the generated motions exhibit diversity within each action category.
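The diversity metric described above can be sketched as the mean L2 distance between randomly paired motion feature vectors, following the Action2Motion-style evaluation protocol; this assumes features are already extracted, and the number of sampled pairs is a hyperparameter chosen here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed for reproducibility

def diversity(features, n_pairs):
    """Mean L2 distance between randomly chosen pairs of feature vectors.

    features: (n_samples, feature_dim) array of motion features.
    """
    idx_a = rng.integers(0, len(features), n_pairs)
    idx_b = rng.integers(0, len(features), n_pairs)
    return float(np.mean(np.linalg.norm(features[idx_a] - features[idx_b],
                                        axis=1)))

# Degenerate check: identical features yield zero diversity.
print(diversity(np.zeros((10, 4)), 5))  # 0.0
```

Multi-modality follows the same idea but restricts the pairing to motions generated within a single action category, so it measures within-class rather than across-class spread.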
The hardware specifications for the dataset evaluation are listed in Table 7, and the hyperparameter settings of the A2M model are shown in Table 8 below. We trained with the original hyperparameter settings of the A2M model [6], using an NVIDIA GeForce RTX 3090 GPU for training and evaluation. Since the action class classifier of the A2M model can classify at most 13 action classes, we trained the 14 scenarios separately. According to Tables 5 and 6, no scenario in our dataset has more than 13 action classes. Therefore, we trained the A2M model for each scenario, and the evaluation results are shown in Table 9. According to Table 9, the FID for real motion, the ground truth data for each scenario, ranges from 0.039 to 0.062, showing that the distribution of the original data was learned very well. In addition, the FID for the generated motion ranges from 0.342 to 1.532 per scenario, confirming that the model learned the distribution of the original data well enough to generate good data. The evaluation metrics in the result table were constructed by referring to [6] for comparison. Accuracy refers to whether A2M's classifier assigns the right action label. For real motion, the accuracy is very high, at 97% to 99%, but for generated motion, the accuracy ranges from 35% to 93%, a very large deviation. The low accuracy in some scenarios is due to the presence of multiple similar action classes. Diversity is an evaluation metric for whether the model generates diverse data; Table 9 shows that our data have a diversity of about 6 for all scenarios.
To compare our dataset with other datasets, we trained the A2M on our dataset under the same experimental conditions as the three datasets [4,6,29] on which A2M was originally trained. As mentioned earlier, A2M can classify up to 13 action classes, so we selected 13 by random sampling, just as A2M did for two of the other datasets [4,29]. We sampled seven subsets, each consisting of 13 action classes of our dataset, and trained the A2M model on them. The results of the comparison experiment are shown in Table 10.
Table 10. Evaluation results of our dataset and comparison with other datasets based on [6]. The lower the FID and the higher the Accuracy, the better the performance. According to Table 10, the FID value for the ground truth is 0.041. This value is the second smallest among the four datasets, surpassed only by NTU RGB+D [4] by a mere 0.01. The generated motion has an FID value of 0.992, again second-best after NTU RGB+D, with a difference of 0.66.
Regarding accuracy, real motion recorded the lowest value among the four datasets at 87.2%. The value was lowered because similar action classes co-occurred when only 13 classes were randomly selected from 142. According to Table 9, the accuracy of real motion for each individual scenario is 97% to 99%, similar to the other datasets in Table 10. The accuracy of generated motion is 75.9%, ranking third out of the four datasets; this slightly lower accuracy appears to have the same cause as for real motion.
In the case of diversity, both real and generated motion scored about 6, similar to the other datasets, indicating that the model can generate diverse data from our dataset.

Discussion
A 3D human action dataset on utterances, DGU-HAU, is introduced in this paper. Our dataset provides 14,116 motion capture data samples, 1352 RGB videos, their textualized scenario scripts based on the audio extracted from the videos, and 2408 annotation data samples in JSON format. The human actions are based on 14 scenarios occurring in daily life, each including about ten action classes, for 142 action classes in total. Also, 166 subjects recorded the action classes in various combinations according to age group, gender, and body shape. Our dataset is a general-purpose dataset that can be used for multiple studies analyzing 3D human actions. In this paper, the dataset was verified using a generative model, Action2Motion [6], but various other models, such as those for human action recognition and human-object interaction, are also applicable. Action2Motion is a 3D human action generation model that leverages Lie algebra for physically plausible human actions. In the experimental results, the FID values for real motion ranked, from best to worst, NTU RGB+D, our dataset, CMU Mocap, and HumanAct12; the FID values for generated motion ranked NTU RGB+D, our dataset, HumanAct12, and CMU Mocap. Our dataset differed by 0.01 in real motion and 0.66 in generated motion from NTU RGB+D, the best-performing dataset, while it differed by 0.05 in real motion and 1.89 in generated motion from the worst-performing dataset. While NTU RGB+D consists of various types of actions that can occur in everyday life, our data consist of series of actions occurring in the flow of a conversation. Because of this continuity of motion, even different actions within one scenario may look similar. For these reasons, NTU RGB+D has clear distinctions between actions, whereas our data sometimes contain actions that overlap or occur twice within a scenario, so the distinction between actions may be less clear than in NTU RGB+D. Additionally, our data yielded slightly lower performance than NTU RGB+D because the scale of the actions is not large. Overall, these results confirm that our dataset is well-constructed and validates successfully.
In future work, we plan to apply various models that study 3D human actions in different ways, such as the human action recognition model, to our dataset.
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Informed Consent was obtained from all the human subjects who participated in the data collection.

Figure 2 .
Figure 2. Setup of the data collection environment.

Figure 4 .
Figure 4. Configuration of the body joints and label in our dataset.

Figure 5 .
Figure 5. Overall steps of pre-processing our dataset to train the A2M [6] model. We extracted coordinates from labeled motion capture data to generate the NumPy file of 3D coordinates of human action. We trained the A2M model with the pre-processed data and generated new 3D human actions in GIF format.

Table 2 .
Data modality and description of each modality.

Table 3 .
Configuration of annotation data in JSON format. A more detailed body joint label annotation is described in [26].

Table 4 .
Gender and age ratio of the subjects.FM denotes female, and MA denotes male.

Table 6 .
Configuration of the 105 action classes and their action codes in the ten conversation scenarios. The work in [28] shows the full table of the configuration of the 105 action classes.

Table 7 .
The hardware specifications.

Table 8 .
Configuration of the hyperparameters.

Table 9 .
Evaluation results of our dataset for each scenario. The lower the FID and the higher the Accuracy, the better the performance.