Modeling the Conditional Distribution of Co-Speech Upper Body Gesture Jointly Using Conditional-GAN and Unrolled-GAN

Abstract: Co-speech gestures are a crucial, non-verbal modality for humans to communicate. Social agents also need this capability to be more human-like and comprehensive. This study aims to model the distribution of gestures conditioned on human speech features. Unlike previous studies that try to find injective functions that map speech to gestures, we propose a novel, conditional GAN-based generative model to not only convert speech into gestures but also to approximate the distribution of gestures conditioned on speech through parameterization. An objective evaluation and user study show that the proposed model outperformed the existing deterministic model, indicating that generative models can approximate real patterns of co-speech gestures better than the existing deterministic model. Our results suggest that it is critical to consider the nature of randomness when modeling co-speech gestures.


Introduction
Human-like robots and virtual agents have human appearances, and they are expected to use both verbal and non-verbal behaviors to communicate, like humans do when interacting with others. One crucial non-verbal behavior is the use of hand gestures [1,2]. These spontaneous hand movements accompany speech to complement or even supplement the information relayed by a speaker [3]. The modeling of the relationship between gestures and speech can be incorporated in human-like agents to express themselves comprehensively.
Recently, machine learning and deep learning have achieved great success in generating gestures. The related studies mainly aim at optimizing the parameters of a model to convert speech features into gesture sequences. For instance, the effect of recurrent models such as gated recurrent unit (GRU) and long short-term memory (LSTM) on mapping Mel-frequency cepstrum coefficient (MFCC) features of speech to gestures has been analyzed in a study in which a bi-directional LSTM network learned how to map MFCC features to 3D joint coordinates on a skeleton from a dataset collected using motion capture (MOCAP) hardware and software [4]. However, these generation methods are based on a strong assumption: the mapping from speech to gesture is injective, i.e., only one gesture can be generated by these models for one speech segment. In reality, there are alternatives to almost any gesture. Numerous examples help to explain this phenomenon, such as using the left hand, the right hand, or both hands, or placing the hands at different heights and radii. Additionally, a human may perform new gestures that have never been performed before. We consider this randomness to be an essential part of co-speech gestures and thus aim to design a generative model that incorporates it.
Inspired by the success of generative adversarial nets (GANs) for image generation, we propose a GAN-based generative model that can convert speech into gestures while preserving randomness. To optimize the model, we used a discriminator to give dynamic feedback on the generator results. Furthermore, the effect of mode collapse, which is a common type of failure in GAN training, is minimized by using the unrolled generative adversarial net (Unrolled-GAN) algorithm. We experimented with our model on a Japanese speech/gesture dataset. The evaluation shows that the proposed model approximates real gesture distributions better than the baseline. User studies also confirm that the proposed model is effective, showing a significant difference between the results generated by the proposed model and those of the baseline.
The contribution of this work is three-fold: (1) We propose a novel deep-learning-based generative model to generate co-speech gestures. (2) We propose a strategy for changing gesture patterns by manipulating the randomly sampled vector, and we improve the performance. (3) We confirmed that the proposed model outperformed the existing deterministic model through objective and subjective experiments.
The rest of this article is organized as follows: In Section 2, we discuss the research related to the present study. Section 3 briefly mentions the existing methods that are substantial to our work and describes the details of the proposed model and implementation. In Section 4, the objective evaluation metrics and user study are explained, and the obtained results and interpretation are presented. In Section 5, we discuss observations made during our experiment and the limitations and future directions of our approach. Our implementation is available at https://github.com/wubowen416/co-speech-gesturegeneration-using-CGAN.

Generative Adversarial Nets (GAN)
The essence of GAN is a min-max game between a generator and a discriminator. While the discriminator is optimized to recognize whether its inputs are sampled from real data or are fake data generated by the generator, the generator tries to deceive the discriminator by learning how to generate data that resembles real data. This adversarial system will reach a Nash equilibrium once the generator learns to generate data that is indistinguishable from real data. Intuitively, this is equivalent to the generator approximating the real data distribution. Reference [5] confirmed this hypothesis by proving that the generator tries to minimize the Jensen-Shannon divergence between the generated distribution and the real data distribution when the discriminator is optimal.
Conditional generative adversarial nets (CGAN) can generate an entity in a specific category [6]. CGAN feeds the same conditional labels to both the generator and the discriminator. Mathematically, the distribution that the GAN's generator tries to approximate is replaced by the conditional distribution given a specific category. Reference [7] used CGAN to model head motion with speech as the conditional input.
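For reference, the two objectives described above can be written in their commonly cited forms. The symbols $x$ (a real sample), $y$ (a condition label), $p_{\mathrm{data}}$, and $p_z$ are generic notation introduced only for this illustration and are not used elsewhere in this article.

```latex
% GAN: unconditional min-max game
\min_{G}\max_{D}\; V(D, G) =
  \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p_{z}(z)}\big[\log\big(1 - D(G(z))\big)\big]

% CGAN: both players additionally receive the condition y
\min_{G}\max_{D}\; V(D, G) =
  \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\big[\log D(x \mid y)\big]
  + \mathbb{E}_{z \sim p_{z}(z)}\big[\log\big(1 - D(G(z \mid y) \mid y)\big)\big]
```

The only difference between the two is the conditioning on $y$, which is why the generator's target becomes a conditional distribution.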
Mode collapse is a common failure in GAN training, i.e., the generator outputs identical results for any noise vector from the prior. By unrolling the discriminator, unrolled-GAN allows the generator to "look into the future" to prevent the discriminator from overfitting on a specific training sample, thereby reducing the effect of mode collapse [8].

Gesture Generation
Studies on the generation of human-like gestures for robots started years ago. Early on, robot gestures were only designed for a few pre-defined scenarios [9]. The first automatic method was the so-called rule-based method. A set of human gesture patterns was recorded as sequences of joint data, and their occurrences were statistically studied in relation to the lexicon. These results were then summarized as a number of rules to decide which gesture to select from the recorded database [10]. An advanced rule-based method was proposed to separately model different parts of the human body and combine them into whole-body gestures [11].
Beyond writing rules, statistical models were also adopted. These models learn co-occurrences between pre-defined high-level speech features and gesture features from the collected data. In [12], for example, abstract concepts were selected from speech text using WordNet. Then, the extracted concepts were mapped to a gesture sample cluster based on gesture functions (i.e., iconic, metaphoric, and so forth) using data-driven probabilistic modeling. The prosody peak of the speech was automatically analyzed to indicate timing and perform a pre-defined beat gesture. The relationship between iconic gestures and lexicon was automatically learned from the corpus using a Bayesian decision network [13]. A dynamic Bayesian network was also utilized to model several meaningful behaviors (e.g., nod) while considering synchronization with speech [14]. The relationship between the prosodic features of speech and rhythmic gestures was modeled using modified hierarchical factored conditional restricted Boltzmann machines (HFCRBMs) [15]. Various characteristics of natural language were analyzed to determine gesture type and posture by using conditional random fields [16]. However, the methods proposed in these studies require elaborate feature engineering of the data collected from humans, and the shapes of the generated gestures are constrained to those appearing in the collected data.
Since data analysis is tedious and time-consuming, machine learning and deep learning approaches have been utilized to automatically map speech to gestures. A hidden Markov model was used to generate pointing gestures from audio features [17]. The effect of recurrent models, such as gated recurrent unit (GRU) and long short-term memory (LSTM), on mapping Mel-frequency cepstrum coefficient (MFCC) features of speech to gestures has been analyzed [4,18]. Text has also been used as input to generate meaningful gestures using sequence-to-sequence neural networks [19]. In [20], text was encoded using bidirectional encoder representations from transformers (BERT) and concatenated with audio features to generate gesture sequences. Because human motion is high-dimensional, a denoising autoencoder (DAE) was used to reduce the number of motion dimensions and help the neural network generalize [21]. Reference [22] made use of labeled gesture phase information to constrain the dynamics of the generated gestures. Individual style was addressed by separately training different neural networks with the L1 distance and a discriminative loss on a particular person's data [23]. A style transfer model aimed at generating gestures with a personal style from the voices of others was also proposed [24]. Relatively few studies have dealt with probabilistic generation. Reference [25] used MoGlow to generate gestures while controlling the height, radius, or speed by inputting a control variable. However, that work uses mel-frequency power spectrograms as speech features, whereas we use solely prosodic features of speech as the input to the model. The premise of the above studies is that correlations exist between speech and gesture. In this study, we generate multiple gesture sequences for one utterance.
By treating speech features as conditional input, we utilized the concept of CGAN, through which a Gaussian distribution is mapped to the gesture distribution conditioned on the speech features, and realized a one-to-many mapping from speech to gesture.

Problem Formulation
The notation used in the rest of this article is as follows: for a speech segment of length $T$, the features extracted from the audio signal are $s = [s_t]_{t=1:T}$. The sequence of absolute positions of each joint in three-dimensional space is $j = [j_{t,k}]_{t=1:T,\,k=1:K}$, where $K$ is the total number of joints. The problem of generating gestures from speech can then be defined as parameterizing a model $G$ by a parameter set $\theta$ such that $j^{(m)} = G_\theta(s^{(m)})$. Furthermore, we aim to model the conditional distribution $X_j$ conditioned on the distribution $X_s$. To achieve this, the model takes a random variable $z$ sampled from a normal distribution $N(0, 1)$. Thus, the problem becomes one of finding a parameter set $\theta$ such that $p(j|s) = G_\theta(z|s)$, $j \sim X_G$, $s \sim X_s$, $z \sim N(0, 1)$. The error between the parameterized distribution and the real distribution, $\mathrm{dist}\big(p(j|s)_{j \sim X_G},\, p(j|s)_{j \sim X_j}\big)$, is used to optimize $G_\theta$. A discriminator $D_\phi$, parameterized by $\phi$, is optimized to measure this error. The method of optimizing $D_\phi$ and $G_\theta$ is discussed in Section 3.3.

Feature Extraction
The motion data in the corpus is composed of joint rotations and offsets of each joint. We used the protocol provided in [21] to convert the joint's rotation values into absolute position values (APV) in 3D space, which is how our problem is posed in Section 3.1. As the active movements are mostly of the upper body, we used only the upper body's APVs as the training labels.
The speech features used in this study are prosodic features. Prosodic features include fundamental frequency (f0), intensity, and their first and second derivatives; they reflect the rhythm of speech. Although MFCC features are frequently used in automatic speech recognition (ASR), they are not preferred here because the extracted features are used as conditions in model D. Low-dimensional features are expected to yield better results than high-dimensional ones, since high-dimensional conditions drastically reduce the number of samples that fall under any given condition. An open-source audio signal processing package, Parselmouth, was used to extract the intensity and fundamental frequency from the speech signal. First, 200 feature frames per second were extracted by using a window size of 40 ms and a hop length of 5 ms. Then, the features were averaged every ten frames to obtain 20 frames per second (fps), matching the frame rate of the motion data.
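The frame-rate reduction described above can be sketched with NumPy alone. This is a minimal illustration and not the released implementation: the helper names `add_derivatives` and `average_pool` are hypothetical, finite differences are assumed for the derivatives, computing the derivatives before averaging is an assumption, and random numbers stand in for the f0 and intensity tracks that Parselmouth would provide.

```python
import numpy as np


def add_derivatives(features: np.ndarray) -> np.ndarray:
    """Append first and second time derivatives (finite differences, an assumption)."""
    d1 = np.gradient(features, axis=0)
    d2 = np.gradient(d1, axis=0)
    return np.concatenate([features, d1, d2], axis=1)


def average_pool(features: np.ndarray, factor: int = 10) -> np.ndarray:
    """Average every `factor` consecutive frames, e.g. 200 fps -> 20 fps for factor=10."""
    n_frames = (features.shape[0] // factor) * factor  # drop an incomplete tail
    trimmed = features[:n_frames]
    return trimmed.reshape(-1, factor, features.shape[1]).mean(axis=1)


# Two seconds of placeholder (f0, intensity) tracks at 200 frames per second.
rng = np.random.default_rng(0)
raw = rng.random((400, 2))
prosody = average_pool(add_derivatives(raw))
print(prosody.shape)  # (40, 6): 20 fps, two features plus their derivatives
```

Averaging, rather than simply keeping every tenth frame, uses all 200 frames per second while still matching the 20 fps motion data.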

Methodology
Our model utilizes the architecture of CGAN, where speech features are used as a condition. An overview is shown in Figure 1. During the generating phase, a randomly sampled vector (noise vector) $z$ from the Gaussian prior is replicated to have the same length as the speech features. Next, $z$ and the speech features are processed by fully-connected layers (FC1 and FC2), respectively; they are then concatenated and fed into a two-layer bidirectional long short-term memory (bi-LSTM) network [26]. A sequence-wise fully-connected layer then takes the output of the previous layers and outputs a sequence of vectors indicating each joint's absolute position in 3D space. We replicate a fixed-length random vector, instead of sampling a separate random vector at every time step, because we want to keep the output motion consistent along the entire sequence.
To optimize the generator, we simultaneously optimize the discriminator, which computes the error between the generated distribution and the real distribution conditioned on speech features. The motion sequence and the corresponding speech features are concatenated and fed into a two-layer bi-LSTM. The output is squashed between 0 and 1 by a sigmoid function, and the value indicates whether the input motion is real and corresponds to the speech features. Instead of outputting only one scalar for the whole sequence, the discriminator outputs one scalar for each time step. The reason is that although LSTM is claimed to be capable of capturing long-term dependencies, in practice, its effectiveness decreases when the sequence grows relatively long. The generator and discriminator are optimized with the CGAN objective

$$\min_{G}\max_{D}\; \frac{1}{m}\sum_{i=1}^{m}\Big[\log D\big(j^{(i)} \mid s^{(i)}\big) + \log\Big(1 - D\big(G(z^{(i)} \mid s^{(i)}) \mid s^{(i)}\big)\Big)\Big],$$

where $m$ is the number of samples, $G$ is the generator, $D$ is the discriminator, $j$ is the value of the joint positions, $s$ is the speech features, and $z$ is the noise vector.
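Two details of this design, replicating one noise vector along the time axis and scoring every time step separately, can be illustrated without a deep learning framework. The sketch below is an assumption-laden illustration: `replicate_noise` and `adversarial_losses` are hypothetical names, the discriminator outputs are plain arrays rather than network outputs, and the generator loss uses the minimax form, which the text does not specify.

```python
import numpy as np


def replicate_noise(z: np.ndarray, num_frames: int) -> np.ndarray:
    """Repeat one noise vector per sample along time: (m, dim) -> (m, T, dim)."""
    return np.repeat(z[:, None, :], num_frames, axis=1)


def adversarial_losses(d_real: np.ndarray, d_fake: np.ndarray, eps: float = 1e-7):
    """Losses from per-time-step discriminator outputs in (0, 1), each of shape (m, T).

    The mean runs over both samples and time steps, so every frame contributes
    its own real/fake judgement instead of one score per sequence.
    """
    d_real = np.clip(d_real, eps, 1.0 - eps)
    d_fake = np.clip(d_fake, eps, 1.0 - eps)
    # The discriminator maximizes log D(real) + log(1 - D(fake)); negate to get a loss.
    d_loss = -np.mean(np.log(d_real) + np.log(1.0 - d_fake))
    # Minimax generator loss (assumed): minimize log(1 - D(fake)).
    g_loss = np.mean(np.log(1.0 - d_fake))
    return d_loss, g_loss


rng = np.random.default_rng(0)
z_seq = replicate_noise(rng.standard_normal((4, 20)), num_frames=100)
undecided = np.full((4, 100), 0.5)  # a discriminator that cannot tell real from fake
d_loss, g_loss = adversarial_losses(undecided, undecided)
```

With an undecided discriminator, `d_loss` equals 2 log 2 and `g_loss` equals log 0.5, the values expected at the equilibrium of the game.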
In our experiment, we found that each noise vector corresponds to a particular pattern of motion, i.e., motions with the same pattern are generated when the same noise vector is used throughout the sequence, which is not desirable. To increase the variation of the generated motions, we propose a strategy of varying the noise vector over the course of a speech sequence. Specifically, multiple independently sampled noise vectors with the same length are concatenated to form the noise vector input to the model. The length of the concatenated noise vector is the same as the length of the speech feature input. The algorithm is shown in Algorithm 1.

Algorithm 1 [listing incomplete in this copy; surviving steps: "Sample P ∼ Uniform(0, 1)", "end if", "end for"]
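The idea of concatenating several independently sampled noise vectors can be sketched as follows. This is not a reconstruction of Algorithm 1: the function name `segmented_noise` and the parameter `segment_len` are hypothetical (the latter may or may not correspond to F), segments of equal length are assumed, and the step that samples P from Uniform(0, 1) is omitted because its role cannot be determined from the text.

```python
import numpy as np


def segmented_noise(num_frames: int, segment_len: int, noise_dim: int,
                    rng: np.random.Generator) -> np.ndarray:
    """Build a (num_frames, noise_dim) noise input from independent segments.

    Each segment holds one Gaussian noise vector replicated over `segment_len`
    frames; segments are concatenated and cut to the speech length.
    """
    n_segments = -(-num_frames // segment_len)          # ceiling division
    vectors = rng.standard_normal((n_segments, noise_dim))
    z_seq = np.repeat(vectors, segment_len, axis=0)      # hold each vector constant
    return z_seq[:num_frames]


rng = np.random.default_rng(0)
z_seq = segmented_noise(num_frames=100, segment_len=40, noise_dim=20, rng=rng)
print(z_seq.shape)  # (100, 20)
```

Holding each vector constant within a segment preserves the motion consistency that motivates replication, while a new vector at each segment boundary allows the pattern to change.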
On the other hand, a common failure during GAN training is mode collapse, i.e., the generator outputs identical results for any noise vector from the prior. In practice, we found that the unrolled-GAN algorithm reduced the effect of mode collapse that appeared in our experiment setting. However, since we use LSTM layers, the original unrolled-GAN algorithm substantially increases the training time. To avoid this problem, we simplified the algorithm and found in our experiment that a similar result was achieved with a shorter training time. Note that we are not claiming that the original algorithm is replaceable by this simplified version. The proposed algorithm is shown in Algorithm 2. As a brief explanation, in every iteration, the discriminator is trained once, and its parameters are stored. Then, the discriminator is trained multiple times and used as the loss function to train the generator once. Finally, before the iteration ends, the parameters of the discriminator are restored to the previously stored values.

Algorithm 2 Algorithm for training the proposed model.
Require: α, the learning rate; $k_{\mathrm{unroll}}$, the number of unrolling steps; m, the batch size; iteration, the number of training iterations.
Require: $\phi_0$, initial discriminator parameters; $\theta_0$, initial generator parameters.
Require: $(X_j, X_s)$, pairs of joint-position values and speech features.
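The store, look-ahead, and restore pattern described in the text can be sketched independently of any framework. The sketch below is an illustration under assumptions, not the listing of Algorithm 2: the function `train_simplified_unrolled` and its callback arguments are hypothetical, the discriminator is represented by a plain parameter dictionary, the look-ahead is assumed to run `k_unroll` times, and reusing the same batch for the look-ahead steps is also an assumption.

```python
import copy


def train_simplified_unrolled(disc_params, batches, k_unroll, d_step, g_step):
    """Run the simplified unrolling loop over `batches`.

    d_step(disc_params, batch) updates the discriminator parameters in place.
    g_step(disc_params, batch) updates the generator against the given discriminator.
    """
    for batch in batches:
        d_step(disc_params, batch)              # 1. regular discriminator update
        saved = copy.deepcopy(disc_params)      # 2. store the discriminator
        for _ in range(k_unroll):               # 3. look-ahead discriminator updates
            d_step(disc_params, batch)
        g_step(disc_params, batch)              # 4. generator trains against the look-ahead D
        disc_params.clear()                     # 5. restore the stored discriminator
        disc_params.update(saved)
    return disc_params


# Toy stand-ins: the "discriminator" is a counter, the "generator step" records what it saw.
seen_by_generator = []


def toy_d_step(params, batch):
    params["w"] += 1


def toy_g_step(params, batch):
    seen_by_generator.append(params["w"])


final = train_simplified_unrolled({"w": 0}, batches=[None, None], k_unroll=3,
                                  d_step=toy_d_step, g_step=toy_g_step)
print(seen_by_generator, final)  # [4, 5] {'w': 2}
```

In the toy run, the generator always sees a discriminator that is three updates ahead, while the discriminator that persists between iterations advances by only one update per iteration.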

Corpus
We evaluated our model on the dataset proposed in [27], in which pairs of recorded audio and motion are provided. The content is an undergraduate student answering questions in Japanese, as in an interview, while standing and gesturing. The motion data were recorded in a motion capture studio. The motion data files contain information on the offset and rotation of each joint, from which each joint's absolute position can be derived. The audio is saved as WAV files (sampling rate 22,050 Hz, 16 bits). There are 1049 sentences in this dataset: 68.41% are metaphoric gestures, 23.73% are beat gestures, and the others are iconic and deictic gestures. The dataset is 298 minutes long.

Implementation
Since the motions are represented as absolute positions in 3D space, the means and variances of each joint's values are considerably different, which can drastically decrease the model's performance. Therefore, we applied min-max scaling to the motion features by using Equations (3) and (4) to squash the features within the range of −1 to 1. The speech features were also scaled using Equations (3) and (4) so that their magnitudes are compatible with those of the motion features. Note that the scaling was performed using parameters calculated only from the data in the training set.

$$X_{\mathrm{std}} = \frac{X - X_{\min}}{X_{\max} - X_{\min}} \tag{3}$$

$$X_{\mathrm{scaled}} = X_{\mathrm{std}}\,(b - a) + a, \qquad a = -1,\; b = 1 \tag{4}$$

where $X_{\min}$ and $X_{\max}$ are calculated from the split training set.
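A minimal sketch of min-max scaling to the range −1 to 1, with the minimum and maximum taken from the training split only, is given here. The class name `RangeScaler` is hypothetical, and the guard for constant features is an addition that the text does not mention.

```python
import numpy as np


class RangeScaler:
    """Per-feature min-max scaling to [low, high], fitted on training data only."""

    def __init__(self, low: float = -1.0, high: float = 1.0):
        self.low = low
        self.high = high

    def fit(self, train: np.ndarray) -> "RangeScaler":
        self.x_min = train.min(axis=0)
        self.x_max = train.max(axis=0)
        return self

    def transform(self, x: np.ndarray) -> np.ndarray:
        # Avoid division by zero for features that are constant in the training set.
        span = np.where(self.x_max > self.x_min, self.x_max - self.x_min, 1.0)
        x_std = (x - self.x_min) / span
        return x_std * (self.high - self.low) + self.low


train = np.array([[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]])
test = np.array([[20.0, 40.0]])
scaler = RangeScaler().fit(train)
train_scaled = scaler.transform(train)  # rows map to -1, 0, and 1
test_scaled = scaler.transform(test)    # test values may fall outside [-1, 1]
```

Because the statistics come from the training split alone, held-out values outside the training range are mapped outside −1 to 1 rather than leaking test information into the scaler.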
Numerous studies on gesture generation cut the gesture sequence into several slices to approximate the effect of data augmentation. Instead, we used the entire sequence of speech and motion as samples. The hyper-parameters used for training the proposed model in our experiment are listed in Table 1. The number of nodes in each layer of the proposed model is detailed in Table 2. The Adam optimizer was used to update the parameters. The initial parameters of all layers were drawn from a Gaussian distribution with zero mean and unit variance. We saved the trained model every ten iterations and generated samples using speech utterances in the test set. After assessing the quality of these generated results, we chose the generator from iteration 1000.

Baseline
To compare the proposed model with the deterministic generation method, the model proposed in [21] was selected as a baseline. We used the protocol provided by the authors and reproduced the reported results. To make the comparison, we extracted the upper-body motion from the motion generated by the baseline model. Since the dataset for the baseline model is already split into training, development, and test sets, we used the split test set for the evaluation. There are 45 samples in the test set.

Quantitative Evaluation
It is common for a deterministic model to use the L1 distance or average position error (APE) to evaluate the generated results. Since our motivation is to model the distribution of gestures, it is not appropriate to evaluate the precision of generated key points in comparison with the ground truth. Instead, kernel density estimation (KDE) is a useful tool for approximating the distribution of the data; it was used in [5] for image generation and in [7] for head motion generation. The output of KDE is the log-likelihood of the input samples based on the fitted density function using reference samples. In this study, we used the generated gesture sequences from the speech input in the test set to fit the density function and used the ground truth as the input of KDE. Therefore, as the output value tends to 0, the generator better fits the real data distribution.
We used Algorithm 1 to generate one motion sequence for every speech sample in the test set. The generated motions were used to fit a distribution. The optimal bandwidth in the KDE model was obtained using a grid search with 3-fold cross-validation. Then, the log-likelihood of the real motions in the test set was calculated using the fitted distribution. We also studied how F in Algorithm 1 affects the results. The results are shown in Table 3. The values are the average of five calculations.

Table 3. Quantitative comparison between models. Ground truth is the log-likelihood of real motions in the test set in the kernel density estimation (KDE) distribution fitted using the ground truth itself, indicating the best results that can be expected. * uses replicated noise vectors to generate motions. ** jointly uses the proposed model and the proposed Algorithm 1.
[Table 3 columns: Log-Likelihood, Standard Error; row values are missing from this copy.]
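The KDE-based metric can be illustrated with a small NumPy implementation of a Gaussian kernel density estimate. This is a sketch under assumptions rather than the evaluation code: the function name `kde_log_likelihood` is hypothetical, each row is treated as one sample, the mean (not the sum) of the per-sample log-likelihoods is returned, and the bandwidth is passed in directly instead of being chosen by the cross-validated grid search described above.

```python
import numpy as np


def kde_log_likelihood(reference: np.ndarray, queries: np.ndarray,
                       bandwidth: float) -> float:
    """Mean log-likelihood of `queries` under a Gaussian KDE fitted on `reference`.

    reference: (n, d) generated samples used to fit the density.
    queries:   (q, d) ground-truth samples to be scored.
    """
    n, d = reference.shape
    diff = queries[:, None, :] - reference[None, :, :]      # (q, n, d)
    log_kernel = -0.5 * np.sum(diff ** 2, axis=-1) / bandwidth ** 2
    log_norm = np.log(n) + 0.5 * d * np.log(2.0 * np.pi * bandwidth ** 2)
    # Log-sum-exp over the reference samples for numerical stability.
    peak = log_kernel.max(axis=1, keepdims=True)
    log_density = peak[:, 0] + np.log(np.exp(log_kernel - peak).sum(axis=1)) - log_norm
    return float(log_density.mean())


rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=(200, 3))
close_model = rng.normal(0.0, 1.0, size=(200, 3))   # samples resembling the real data
far_model = rng.normal(5.0, 1.0, size=(200, 3))     # samples from a shifted distribution
ll_close = kde_log_likelihood(close_model, real, bandwidth=0.5)
ll_far = kde_log_likelihood(far_model, real, bandwidth=0.5)
```

In this synthetic setup, `ll_close` is larger than `ll_far`, which is the behavior the metric relies on: a generator whose samples cover the real data assigns it a higher log-likelihood.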

Motion Dynamics Distribution
Motion dynamics (i.e., velocity) are important to human perception. As we aim to model the distribution of human gestures, we assumed that one reason the proposed model outperforms the baseline model is that the velocity distribution of its generated motion is more similar to the ground truth than that of the baseline model. We confirmed this assumption by plotting histograms of the average velocity of all joints, the shoulder, the wrist, and the hand: the histograms of the proposed model were more similar to the ground truth than those of the baseline, except for the hand, for which the velocity distributions of both methods were comparable to the ground truth (Figure 2).
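A velocity histogram of this kind can be computed from joint positions as sketched below. The function name `mean_joint_speed` is hypothetical, speed is assumed to be the frame-to-frame displacement magnitude multiplied by the 20 fps frame rate, and a synthetic straight-line motion stands in for real data.

```python
import numpy as np


def mean_joint_speed(positions: np.ndarray, fps: float = 20.0) -> np.ndarray:
    """Average speed over joints for each frame transition.

    positions: (T, K, 3) absolute joint positions; returns an array of shape (T - 1,).
    """
    displacement = np.diff(positions, axis=0)                # (T - 1, K, 3)
    speed = np.linalg.norm(displacement, axis=-1) * fps      # units per second
    return speed.mean(axis=1)


# Synthetic motion: every joint moves one unit along x per frame.
T, K = 50, 8
positions = np.zeros((T, K, 3))
positions[:, :, 0] = np.arange(T)[:, None]
speeds = mean_joint_speed(positions)
counts, bin_edges = np.histogram(speeds, bins=10, range=(0.0, 40.0))
```

Histograms computed this way for generated and ground-truth motion can then be overlaid for a chosen subset of joints, such as the wrists or hands.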

User Study
The ultimate goal of gesture generation is to generate human-like motions. Here, we conducted a user study to subjectively evaluate the motions generated by the baseline and the proposed model against the ground truth. The Likert scale in the baseline paper was used to evaluate motions on three different scales based on three specific statements for each (Table 4).

Table 4. Likert scale used in the user study.
Scale: Statements (translated from Japanese)
Naturalness: Gesture was natural; Gesture was smooth; Gesture was comfortable
Time consistency: Gesture timing was matched to speech; Gesture speed was matched to speech; Gesture pace was matched to speech
Semantics: Gesture was matched to speech content; Gesture well described speech content; Gesture helped me understand the content

Before the evaluation, participants viewed three ground-truth videos to help them understand the real motions that would be played. The first part of the questionnaire was a ranking task. We prepared 12 sets, with three videos in each set. There were four sets each for ranking (1) the baseline and full proposed model, (2) CGAN with or without unrolling, and (3) the ground truth, baseline, and full proposed model. After watching a set of videos, participants were asked to rank the gestures depicted in the videos in order of naturalness. The second part was to assign a score to each statement within each scale. This part compared the baseline, ground truth, and the proposed model. After watching each video, participants were asked to assign a score to each statement. The value ranged from (0) to (7), where (0) indicates strongly disagree and (7) indicates strongly agree. There were five videos each for the baseline, ground truth, and the proposed model, and the score for each scale was the average of the scores of its three statements. As a result, five scores per condition were obtained on each scale in Table 4 from each participant. Proposed (F = 40) was used to generate videos for the full proposed model.
We recruited 38 participants (19 male, 19 female, all native Japanese speakers, average age 34 years) through a crowdsourcing service. Analysis of variance (ANOVA) was conducted to test the difference between the three groups' scores. All three scales passed the ANOVA test with p < 0.001. Tukey's honestly significant difference test (Tukey HSD) was then conducted to test for significant differences between each pair of groups.
For the naturalness scale, there was a significant difference between the baseline (M = 3.51, SE = 0.09) and the full proposed model (M = 4.41, SE = 0.08), p < 0.002, and between the baseline and the ground truth (M = 4.27, SE = 0.08), p < 0.002. There was no significant difference between the full proposed model and the ground truth, p = 0.46. For the time consistency scale, there was a significant difference between the baseline (M = 3.82, SE = 0.08) and the full proposed model (M = 4.28, SE = 0.08), p < 0.002, and between the baseline and the ground truth (M = 4.38, SE = 0.08), p < 0.002. There was no significant difference between the full proposed model and the ground truth, p = 0.65. For the semantics scale, there was a significant difference between the baseline (M = 3.64, SE = 0.08) and the full proposed model (M = 4.23, SE = 0.08), p < 0.002, and between the baseline and the ground truth (M = 4.33, SE = 0.08), p < 0.002. There was no significant difference between the full proposed model and the ground truth, p = 0.68. The age distribution and scores on the scales are shown in Figure 3. These results indicate that the motions generated by the full proposed model were perceived as more natural than those of the baseline and were similar to the ground truth. The ranking tasks revealed similar results (Figure 4).
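The quantities reported above (M, SE, and the ANOVA test) follow standard formulas, which the sketch below spells out for three groups of scores. The function names are hypothetical, the scores are made-up numbers rather than study data, and the p-value computation and the Tukey HSD post-hoc test are deliberately left out.

```python
import numpy as np


def mean_and_se(scores):
    """Return the sample mean M and the standard error SE of the mean."""
    scores = np.asarray(scores, dtype=float)
    return scores.mean(), scores.std(ddof=1) / np.sqrt(scores.size)


def one_way_anova_f(*groups):
    """F statistic of a one-way ANOVA: between-group over within-group mean squares."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    grand_mean = np.concatenate(groups).mean()
    ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    df_between = len(groups) - 1
    df_within = sum(g.size for g in groups) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


# Made-up scores for three conditions (not the study's data).
group_a = [1.0, 2.0, 3.0]
group_b = [2.0, 3.0, 4.0]
group_c = [3.0, 4.0, 5.0]
m_a, se_a = mean_and_se(group_a)
f_stat = one_way_anova_f(group_a, group_b, group_c)
print(m_a, round(se_a, 3), f_stat)  # 2.0 0.577 3.0
```

A large F statistic indicates that the group means differ by more than the within-group spread would explain, which is what motivates the pairwise follow-up test.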

Inappropriateness of Using Euclidean Distance as a Loss Function
There are two main reasons that the Euclidean distance, i.e., L1-distance or L2-distance, is not suitable for the gesture generation task. Firstly, a motion may be realistic even though the Euclidean distance gives a large error; for example, suppose that the ground truth is a gesture with the left hand and the generated gesture is a mirror image of the ground truth performed by the right hand. Given the randomness of human gestures, it is not reasonable to penalize such realistic motions simply because they are not identical to the ground truth. Secondly, the Euclidean distance tends to ignore small unrealistic parts of motions, underestimating the error. For example, even if one frame of a real motion sequence is modified to be unrealistic, the Euclidean distance will still give a relatively low error since most of the sequence is correct. This is inconsistent with human perception because humans immediately notice unrealistic motions.
Instead of using the Euclidean distance as the loss function, the GAN architecture gives the error by looking at a low-dimensional manifold, i.e., the output of the last hidden layer of the discriminator. Specifically, the discriminator judges whether the low-dimensional manifold of the generated samples is similar to that of the real samples, thus preventing the motion from being unrealistic while allowing more variation in the generated motion. Another benefit of this approach is that by interpolating on a low-dimensional manifold, realistic motions that are not in the dataset can be generated.

Unrolling for More Variation
Because the model takes a noise vector as input, manipulating that vector lets us interpolate among motions and thereby generate new gestures that are not in the dataset. However, the ranking results shown in Figure 5 indicate that CGAN without unrolling was rated as natural as CGAN with unrolling, and better than the variant without prosody input in the discriminator. This similarity is probably because the results generated by CGAN without unrolling are already human-like when compared with those of the proposed model, even though they all follow the same pattern. The ranking task designed in the questionnaire cannot discriminate between performing the same pattern all the time and changing patterns occasionally. Intuitively, always performing the same pattern is not human-like, whereas occasionally changing patterns is.

The Role of the Noise Vector
To investigate the effect of changing the noise vector, we input a 5-second-long sinusoidal wave to the proposed model. The prosodic feature extraction yielded a total of 139 frames of speech features, and the generated motions had the same number of frames.
The noise vector controls the motion pattern. The results in Figure 6 show that the noise vector of the proposed model can serve as a controller of the movement pattern. Although we have not investigated this topic in depth, disentanglement of the noise vector in the proposed model is worthy of future investigation. We expect that a fixed noise vector maintains the gesture pattern across the whole utterance, i.e., the same pattern shifts according to the prosodic peak in the utterance. When the phase of the sinusoidal signal is shifted and the generated results are plotted, the apex of the gesture shifts along with the prosodic peak, as shown in Figure 7.

The Role of Prosody as a Condition
Since the prosodic features we used are the fundamental frequency and intensity, we generated motions with different f0 and intensity condition inputs to investigate their effect on the generated gesture. According to [28,29], f0 and intensity are correlated with the heights of the hands and size of the motion. Thus, here, we focused on the heights of the hands and the size of the motion.
The reference values of f0 were set to 100, 150, 200, and 250 Hz. First, a sinusoidal wave signal of a given f0 was generated. Then, using the trained model, motion sequences were generated. The corresponding results are shown in Figure 8. The size of the motion becomes larger and the height of the hand becomes higher as f0 increases. Correlations were also observed between intensity and both the heights of the hands and the size of the motion (Figure 9).
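Test signals of this kind are simple to synthesize. The sketch below generates one sinusoid per reference f0; the function name `sine_wave`, the 5-second duration, the 22,050 Hz sampling rate (borrowed from the corpus audio format), and the amplitude are all assumptions made for illustration, since the text does not state them for this experiment.

```python
import numpy as np


def sine_wave(f0_hz: float, duration_s: float = 5.0, sample_rate: int = 22050,
              amplitude: float = 0.5, phase: float = 0.0) -> np.ndarray:
    """Synthesize a pure tone whose fundamental frequency is `f0_hz`."""
    t = np.arange(int(duration_s * sample_rate)) / sample_rate
    return amplitude * np.sin(2.0 * np.pi * f0_hz * t + phase)


def dominant_frequency(signal: np.ndarray, sample_rate: int = 22050) -> float:
    """Frequency (Hz) of the largest peak in the magnitude spectrum."""
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(signal.size, d=1.0 / sample_rate)
    return float(freqs[np.argmax(spectrum)])


signals = {f0: sine_wave(f0) for f0 in (100, 150, 200, 250)}
peaks = {f0: dominant_frequency(sig) for f0, sig in signals.items()}
```

Varying `amplitude` instead of `f0_hz` gives the corresponding signals for an intensity experiment, and the `phase` argument can be used to shift the position of the prosodic peak.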

Conclusions
Human-like agents play an important role in human-computer interaction (HCI), and it is crucial to equip them with the capability of gesturing so that they can be expressive. We presented a model for producing co-speech gestures by modeling the conditional distribution of gestures conditioned on speech features. Incorporating unrolled-GAN and our proposed algorithm, our model outperformed the existing deterministic model in objective and subjective evaluations. Our work provides a powerful tool for human-like agents to express thoughts, thereby enhancing human-computer interactions. Moreover, the success of the distributional modeling suggests that future research in this field should focus more on gesture distribution. Human-like agents should be widely used in HCI; however, without the ability to gesture well, they are too inexpressive to be understood or empathized with by humans. Although our gesture generation model performs better in terms of naturalness and time consistency, the lack of semantics (i.e., meaningful gestures) is still a considerable obstacle to perfect modeling of human gestures; further research should focus on developing a model that generates semantically meaningful gestures.