# Generative Model for Skeletal Human Movements Based on Conditional DC-GAN Applied to Pseudo-Images

## Abstract

## 1. Introduction

- spatial information: strong correlations between adjacent joints, which makes it possible to learn about body structural information within a single frame (intra-frame);
- temporal information: makes it possible to learn about temporal correlation between frames (inter-frame); and
- co-occurrence relationship between the spatial and temporal domains when taking joints and bones into account.

## 2. Related Works

#### 2.1. Generative Models for Skeletal Human Movements

#### 2.2. Pseudo-Image Representation for Skeletal Pose Sequences

## 3. Materials and Methods

#### 3.1. NTU_RGB+D Dataset

#### 3.2. Tree Structure Skeleton Image (TSSI)

#### 3.3. Data Preparation

#### 3.4. Conditional Deep Convolution Generative Adversarial Network (Conditional DC-GAN)

#### 3.5. Loss Function and Training Process

#### 3.6. Transformation of Generated Pseudo-Images into Skeletal Sequences

#### 3.7. Model Evaluation—Fréchet Inception Distance (FID)

## 4. Results

#### 4.1. Qualitative Result for Generated Skeletal Actions

#### 4.2. Quantitative Model Evaluation Using Fréchet Inception Distance (FID)

#### 4.3. Importance of the Modification of Generator Using Upsample + Convolutional Layers

#### 4.4. Importance of Reordering Joints in TSSI Spatiotemporal Images

#### 4.5. Hyperparameter Tuning

#### 4.6. Analysis of Latent Space

## 5. Conclusions

## Author Contributions

## Funding

## Conflicts of Interest

## Appendix A

## References

**Figure 1.** Illustration (adapted from [20]) of Tree Structure Skeleton Image (TSSI): (**a**) skeleton structure and order in NTU_RGB+D, with traversal order that respects spatial relations, and (**b**) joint arrangements of TSSI. The shape is (T, J’, d), where T is the sequence duration, J’ is the number of joints in traversal order (the J joints, some of them repeated for the TSSI), and d is the dimension (d = 3 for (x, y, z) 3D positions).
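The reordering described in this caption can be sketched as a column gather along the joint axis. The snippet below is a minimal illustration, not the authors' code: the 5-joint tree `CHILDREN`, the helper names `dfs_traversal` and `to_tssi`, and the depth-first revisiting rule are all assumptions chosen for illustration, and the real NTU_RGB+D traversal order from [20] is not reproduced here.

```python
import numpy as np

# Hypothetical toy skeleton (NOT the NTU_RGB+D layout): joint -> list of children.
CHILDREN = {0: [1, 2], 1: [3], 2: [4], 3: [], 4: []}


def dfs_traversal(children, root=0):
    """Depth-first walk that revisits a joint after returning from each child,
    so that neighbouring entries in the result are always connected joints."""
    order = [root]
    for child in children[root]:
        order.extend(dfs_traversal(children, child))
        order.append(root)
    return order


def to_tssi(sequence, order):
    """Gather joints along axis 1: (T, J, d) -> (T, J', d), with J' = len(order)."""
    return sequence[:, order, :]


order = dfs_traversal(CHILDREN)
sequence = np.arange(2 * 5 * 3, dtype=float).reshape(2, 5, 3)  # T=2, J=5, d=3
pseudo_image = to_tssi(sequence, order)
```

With this toy tree, J = 5 joints become J’ = 9 columns, because interior joints are repeated each time the walk passes back through them.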

**Figure 2.** Architecture of our conditional Deep Convolutional Generative Adversarial Network (DC-GAN).

**Figure 3.** Examples of generated actions compared with a typical real training sequence for 4 action classes: (**a**) sitting down, (**b**) standing up, (**c**) hand waving, and (**d**) kicking something.

**Figure 4.** Checking the consistency of the Fréchet Inception Distance (FID) score with the qualitative appearance of generated actions: the action “standing up” generated by the model after (**a**) 150 epochs of training and (**b**) 200 epochs of training, and (**c**) the corresponding curve of FID as a function of training epochs.

**Figure 5.** FID curves evaluated on (**0–7**) 8 different actions, (**avg**) their average score, and (**total**) the entire dataset.

**Figure 6.** Learning curves of our model (using upsample and convolutional layers) (orange) and the original method (using convolutional-transpose layers) (blue): Loss_D is the discriminator loss, calculated as the average of the losses for all real and all fake samples, and Loss_G is the average generator loss.
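As a rough illustration of the two quantities plotted here, the sketch below computes Loss_D and Loss_G from discriminator outputs. It assumes the binary cross-entropy objective commonly used with DC-GAN and reads "average" as the mean of the real-batch and fake-batch losses; neither detail is confirmed by the caption, and the function names are hypothetical.

```python
import math


def bce(prob, target, eps=1e-12):
    """Binary cross-entropy for one probability, clamped to avoid log(0)."""
    p = min(max(prob, eps), 1.0 - eps)
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))


def discriminator_loss(d_real, d_fake):
    """Assumed Loss_D: mean of the real-batch loss (target 1) and fake-batch loss (target 0)."""
    loss_real = sum(bce(p, 1.0) for p in d_real) / len(d_real)
    loss_fake = sum(bce(p, 0.0) for p in d_fake) / len(d_fake)
    return 0.5 * (loss_real + loss_fake)


def generator_loss(d_fake):
    """Assumed Loss_G: the generator wants fake samples to be scored as real (target 1)."""
    return sum(bce(p, 1.0) for p in d_fake) / len(d_fake)
```

Under these assumptions, a discriminator that outputs 0.5 for every sample gives both losses the value ln 2 ≈ 0.693, which is a useful reference level when reading such curves.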

**Figure 7.** Example of spatiotemporal images: (**a**) TSSI of a real sequence; (**b**) an image generated by the original conditional DC-GAN model with convolutional-transpose layers, with checkerboard artifacts; and (**c**) an image generated by our model, without checkerboard artifacts.
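One common explanation for the checkerboard pattern is the uneven kernel overlap of a strided transposed convolution. The 1D NumPy sketch below illustrates that effect under simplifying assumptions (all-ones input and kernel, stride 2, kernel size 3); it is not the generator from this paper, and both helper functions are hypothetical.

```python
import numpy as np


def transposed_conv1d(x, kernel, stride):
    """Scatter each input value, scaled by the kernel, into a strided output."""
    out = np.zeros((len(x) - 1) * stride + len(kernel))
    for i, value in enumerate(x):
        out[i * stride : i * stride + len(kernel)] += value * kernel
    return out


def upsample_then_conv1d(x, kernel, scale):
    """Nearest-neighbour upsampling followed by an ordinary stride-1 convolution."""
    upsampled = np.repeat(x, scale)
    return np.convolve(upsampled, kernel, mode="same")


x = np.ones(4)
kernel = np.ones(3)
uneven = transposed_conv1d(x, kernel, stride=2)     # interior alternates between 1 and 2
even = upsample_then_conv1d(x, kernel, scale=2)     # interior is constant
```

Because the kernel size is not divisible by the stride, the transposed convolution covers some output positions twice and others once, whereas the upsample-then-convolve path covers every interior position equally.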

**Figure 8.** Sequence of action “drinking” generated by (**a**) our model (using upsample + convolutional layers) after 69 epochs of training and (**b**) the original model (using convolutional-transpose layers) after 349 epochs of training.

**Figure 9.** Sequences of action “standing” generated by the models trained (**a**) with non-TSSI pseudo-images (no reordering of joints) and (**b**) with TSSI-reordered pseudo-images.

**Figure 10.** Learning curves for our model trained with different learning rates: 0.001, 0.0005, 0.0002, and 0.0001.

**Figure 11.** Learning curves for our model trained with different batch sizes (of input data): 16, 32, 64, and 128.

**Figure 12.** Learning curves for the model trained with different numbers of dimensions of the latent space z (input of G): 5, 30, and 100. For the average FID, we plot the values at 50, 100, 110, 120, and 200 epochs.

