Each part proposed in this paper, 3D character animation and asset generation, was evaluated based on qualitative and/or quantitative results.
  4.1. 3D Character Animation
For training the model that generates the next pose of the 3D character, we used the CMU motion dataset [
19], which contains enough diversity for our use-case. This dataset is composed of 144 motion capture scenarios, with different subjects. Each scenario is composed of a set of activities from a certain domain, like dancing, running, walking, basketball, riding vehicles and so on. Each activity is represented by an individual motion capture file, consisting of several hundred frames. In our case, each such file represents a “document” for the GPT, with a total of 2553 “documents”. All the activities have been retargeted to a single skeleton, so there is no need for extra motion retargeting. The skeleton is defined by a hierarchy of joints. Each joint is defined by the joint type (“ROOT” for the root node, “JOINT” for the joints and “End site” for end nodes).
An item from the dataset is represented by the data of a motion capture file, meaning a vector of size . However, in many cases the motion sequence is longer than the capacity of AnimGPT. Moreover, the animations needed for a game are not particularly long, so we decided to limit the number of frames to either 256 or 512. However, simply cutting the sequences into pieces is a bad idea, because the parts are logically linked. Instead, we opted for a random sampling strategy over the motion data. Sampling all the motion data files with equal probability is not correct, since this would lead to oversampling of shorter animations. To solve this, we weighted the sampling by linking more indices of the dataset to each sequence, proportionally to its length. More specifically, we divided the length of the sequence by the fixed motion sequence length and took the ceiling of the result. This results in a series of index intervals, one per sequence.
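The length-weighted indexing described above can be sketched as follows. This is a minimal illustration, not the training code: the function names are hypothetical, and the rule for picking a random window inside the selected sequence is an assumption, since the text only states how many indices each sequence receives.

```python
import bisect
import math
import random


def build_index_intervals(lengths, window):
    """Give each sequence ceil(length / window) dataset indices.

    Returns the cumulative interval ends, so sequence i owns the
    indices in [ends[i - 1], ends[i]).
    """
    ends, total = [], 0
    for n in lengths:
        total += math.ceil(n / window)
        ends.append(total)
    return ends


def locate(index, ends):
    """Map a dataset index back to the sequence that owns it."""
    return bisect.bisect_right(ends, index)


def sample_window(index, lengths, ends, window, rng=random):
    """Pick a random window from the owning sequence (assumed crop rule)."""
    seq = locate(index, ends)
    last_start = max(lengths[seq] - window, 0)
    start = rng.randint(0, last_start)
    return seq, start, min(start + window, lengths[seq])
```

With a window of 256 frames, sequences of 100, 600 and 256 frames receive 1, 3 and 1 indices respectively, so the longest file is visited three times as often as the others in one pass over the dataset.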
Another choice regarding the data was to eliminate the information regarding movement on the X and Z axes. Games are interactive, so the trajectory and speed of the motion are determined at run-time. This conflicts with the information from the X and Z axes, which represents horizontal movement in the environment. Because of this, we chose to simplify the problem and eliminate the need to predict this data. We eliminate it by zeroing it over the full dataset instead of lowering the dimensionality of the result, for a practical reason: keeping the full channel layout makes it easier to turn the model output back into motion data.
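A minimal sketch of this zeroing step is given below. The column indices of the root X and Z translation channels are an assumption made for illustration (the text does not state the channel order), as is the function name.

```python
import numpy as np

# Hypothetical channel layout: root X, Y, Z translation in columns 0, 1, 2.
ROOT_X, ROOT_Z = 0, 2


def zero_horizontal_translation(motion):
    """Return a copy of a (frames, channels) array with root X/Z set to 0."""
    out = np.array(motion, dtype=np.float32, copy=True)
    out[:, [ROOT_X, ROOT_Z]] = 0.0
    return out
```

The array keeps its full channel count, so the output can be written back in the same layout as the input files.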
The data for the denoising experiment is obtained in the same way, but we additionally add noise to it. We use zero-mean additive noise. Because values drawn from a normal distribution concentrate around the mean, most perturbations are small, so we apply the noise to every component instead of applying it to components with a set probability. We compute the standard deviation separately for each channel, over the whole dataset, to create a more believable noisy motion.
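The noising scheme can be illustrated with the sketch below. It assumes each sequence is a (frames, channels) array; the `scale` parameter is a hypothetical knob for the noise strength, since the text does not say how the per-channel deviation is scaled.

```python
import numpy as np


def channel_std(sequences):
    """Per-channel standard deviation over all frames of all sequences."""
    frames = np.concatenate(sequences, axis=0)  # (total_frames, channels)
    return frames.std(axis=0)


def add_noise(motion, std, scale=1.0, rng=None):
    """Add zero-mean Gaussian noise to every component of `motion`.

    `scale` is a hypothetical strength factor, not taken from the text.
    """
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.normal(loc=0.0, scale=1.0, size=motion.shape) * (scale * std)
    return motion + noise
```

A channel that is constant over the dataset (such as a zeroed translation channel) has a standard deviation of 0 and therefore receives no noise.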
For generation of the next pose, we started from the architecture of the base GPT2, because we found its size of 110 million parameters to be fitting for the rather low amount of data we had available (compared to the millions of documents the language model has been trained on).
The AnimGPT model changes are rather straightforward. The original GPT2 uses the PyTorch Embedding class for translating from tokens to embeddings. This is nothing more than a lookup table with learnable parameters for each entry, which requires the maximum number of entries as input. This approach works for natural language processing, because the words form a discrete space, but since we are dealing with motion, which forms a continuous space, we cannot use the same module. Each motion frame is a vector of 132 channels, which can be considered a point in the motion space. Instead of trying to discretize this space, we employ a linear projection to bring the point into a higher-dimensional space, more exactly the 768-dimensional space the regular Transformer usually processes. This operation is the equivalent of translating a token index to its embedding. Employing something more than a linear projection is superfluous, since we are using a deep neural network to process the data anyway.
A similar operation must be done at the end, by replacing the model head. Instead of having a linear projection that translates the obtained deep embedding into a vector of logits for each token of the vocabulary, we use a linear projection that translates the embedding into a 132 dimensional point from the motion space, the next frame.
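The two replaced layers can be sketched in terms of shapes alone. The snippet below uses plain NumPy as a stand-in and skips the Transformer blocks entirely; in PyTorch both projections would typically be `torch.nn.Linear` layers. The class name and the initialisation scale are illustrative assumptions.

```python
import numpy as np

MOTION_DIM, HIDDEN_DIM = 132, 768  # sizes stated in the text


class LinearProjection:
    """y = x @ W + b, a stand-in for a learnable linear layer."""

    def __init__(self, in_dim, out_dim, rng):
        self.weight = rng.normal(0.0, 0.02, size=(in_dim, out_dim))
        self.bias = np.zeros(out_dim)

    def __call__(self, x):
        return x @ self.weight + self.bias


rng = np.random.default_rng(0)
embed_frames = LinearProjection(MOTION_DIM, HIDDEN_DIM, rng)   # replaces the token embedding
predict_frame = LinearProjection(HIDDEN_DIM, MOTION_DIM, rng)  # replaces the vocabulary head

frames = rng.normal(size=(256, MOTION_DIM))  # one motion sequence
hidden = embed_frames(frames)                # (256, 768); the Transformer blocks would go here
next_frames = predict_frame(hidden)          # (256, 132), one predicted frame per position
```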
We train two models for 160 epochs, one processing sequences of 256 frames and one processing sequences of 512 frames. Both models work equally well. For the former we use a batch size of 32, while for the latter we can only afford a batch size of 20. We train both models from scratch, because the weights of the original GPT2 are optimised for a different problem. We start with a learning rate of  and decay it progressively during training.
A more important change concerns the generation itself. The usual GPT2 model inherits from the GenerationMixin module, which contains the common code for all generative models, with different generation techniques. Although our model is a GPT, its output is deterministic (a regressed frame rather than a distribution over tokens to sample from), so we do not use such methods. The solution was to override the generate function to use our own generation techniques. Two methods were implemented. The first one is for the generation of the next pose, where at each step we expect the correct sequence of frames as input and generate the next one. The second one is free autoregressive generation, where we give the model a start sequence and let it generate the rest.
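The difference between the two generation modes can be shown with a model-agnostic sketch. Here `predict_next` is a hypothetical callable standing in for the trained network (it takes the frames so far and returns the next one), and `max_context` is an assumed way of keeping the input within the 256 or 512 frame capacity.

```python
def generate_one_by_one(predict_next, ground_truth, start=1):
    """Next-pose mode: always condition on the correct frames so far."""
    return [predict_next(ground_truth[:t]) for t in range(start, len(ground_truth))]


def generate_free(predict_next, seed_frames, steps, max_context=None):
    """Free mode: feed the model's own predictions back as input."""
    frames = list(seed_frames)
    for _ in range(steps):
        context = frames if max_context is None else frames[-max_context:]
        frames.append(predict_next(context))
    return frames[len(seed_frames):]
```

In the first mode an error made at one step never reaches the next input, whereas in the second mode every prediction becomes part of the context, which is why errors can compound in free generation.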
For motion capture denoising, we trained the model on the same CMU motion capture dataset as in the next pose generation case, but applied a noising scheme to add artificial noise. We added Gaussian noise, which is more appropriate for this task. DenoiseAnimGPT mainly uses the same changes as AnimGPT. We opted for adding the noisy embedding to the input as an extra embedding, besides the motion embeddings and the position embeddings. For this, we added another linear projection at the input. To process it correctly, we apply a roll-over to the left to the noisy sequence, so that the masked self-attention attends to the full noisy sequence up to and including the current point.
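One possible reading of the roll-over step is sketched below. It assumes that position t of the model outputs frame t + 1 and that the shift is exactly one frame; neither detail is stated explicitly in the text, and the function name is hypothetical.

```python
import numpy as np


def align_noisy_frames(noisy):
    """Roll the noisy sequence one step to the left along the time axis.

    After the roll, row t holds noisy frame t + 1, so a causally masked
    position t can see the noisy version of the frame it has to output.
    The last row wraps around to frame 0 and would need to be ignored.
    """
    return np.roll(noisy, shift=-1, axis=0)
```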
Qualitatively, AnimGPT obtained good results. Not only does it generate good quality animations overall, but it also performs well on the next pose prediction problem. Some qualitative results can be observed in 
Figure 4. The top row is obtained by starting from the first several frames and then generating the movement freely. The middle row is obtained by a simulated animation process, which we call 1b1, in which the artist guides the generation step by step and the model only predicts the next pose. We used the ground truth as input and predicted the next pose at each step. The results are composed of the predicted next poses. The free generation follows the input motion initially, but starts drifting away from it after a while. The next pose prediction, however, provides results that are indistinguishable from the ground truth, at least to the naked eye.
For quantitative results, we use as a metric of comparison the mean absolute error between the ground truth motion vectors and the predicted motion vectors on the test set, a held out set of 569 motion data sequences. We also add the free generation to the comparison. We compute the results for the first 50, 100, 200, 350 and all the steps and report the results in 
Table 1. The free generation leads to increasing errors the more we run it, because it doesn’t necessarily generate the ground truth motion. However, this is to be expected and is in line with what is observed in [
3]. We tried comparing our solution to theirs, but they target a different dataset, so direct comparisons are not possible. We also target a vastly different time-frame for the evaluation. The free generation model manages to generate good motion most of the time, but not the ground truth one. For our intended next pose prediction case, though, the model displays the desired behaviour with a comparable, low error for all time steps.
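The metric can be written down in a few lines. This sketch computes the error for a single sequence stored as a (frames, channels) array; how the values are aggregated over the test set is not specified here, so that part is left out. The function name is hypothetical.

```python
import numpy as np


def mae_at_horizons(truth, predicted, horizons=(50, 100, 200, 350, None)):
    """Mean absolute error over the first k steps; None means all steps."""
    abs_err = np.abs(np.asarray(truth) - np.asarray(predicted))
    return {("all" if k is None else k): float(abs_err[:k].mean()) for k in horizons}
```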
We do not provide qualitative results for the DenoiseAnimGPT model here because they are not visible from pictures, only in video. However, we have observed that the model does a great job at denoising the input data. We measure this with the mean absolute error (see 
Table 2). There is no other publicly available model to compare against for this dataset. Two ways of using the model are presented, together with the error induced by the noisy data. The first one, denoised auto, starts from a sequence of manually denoised frames and lets the model autoregressively denoise the rest, without further intervention. In denoised 1b1, after a frame is denoised automatically, we consider that the user eliminates the remaining noise before going to the next frame, just as in the AnimGPT 1b1 case. The 1b1 version performs better than the autoregressive one, which is to be expected, since extra, clean information is available at inference. Comparing these results to AnimGPT shows that the denoising process leads to better animations for both the simple generation and the 1b1 generation [
20]. In the autoregressive case, the denoising model degrades far less over longer sequences than AnimGPT, which suffers greatly. This is because we feed it the noisy information, which keeps the generation on the right path and lets the model focus solely on the denoising. We can still observe a slight degradation for longer sequences, but it is not significant for the overall quality of the results.
  Ablation Study
In order to compare our results with some other existing ones, we started with the methods presented in 
Section 2 for 3D character animation. These methods are summarized in 
Table 3.
Since our method is trained on a different dataset (the CMU motion dataset) it is not feasible to make a direct comparison between our results and other existing ones (also, methods [
2,
4] used their own datasets).
Also, none of the models from the Related Work Section do exactly what AnimGPT did. In our case we generated the next pose based on all previous ones (correct and cleaned), so that it would be easier to animate the character. The main competitor is the ERD method from [
3], but the code is not public. This method extends an LSTM network by augmenting the model with encoder and decoder networks. Thus, to be able to check the performance of our method against other existing results, we replaced the Transformer in both AnimGPT and DenoiseAnimGPT with an LSTM network and retrained them on the same CMU motion dataset. In this way, we created an architecture similar to the one used in the ERD method.
Results for AnimGPT compared with the LSTM-like model are given in 
Table 4, using the mean absolute error between the ground truth motion vectors and the predicted motion vectors on the test set as the metric.
Results for DenoiseAnimGPT compared with the LSTM-like model are given in 
Table 5.
From 
Table 4 and 
Table 5, we can observe that AnimGPT performs better than the architecture similar to [
3]. With the LSTM network, the error keeps accumulating over time even in the 1b1 version.
For denoising, the LSTM-based model seems to introduce more noise than was initially present. Using 1b1 reduces the error, but the performance is still far below that of our solution.
  4.2. Asset Generation
For asset generation Stable Diffusion was used. The basic, text2image [
24], generation is handled by StableDiffusionPipeline. This receives only the text prompt as input. Inpainting is done with StableDiffusionInpaintPipeline, the most complex of all, with inputs represented by a text prompt, an image and the inpainting mask. Dreambooth [
25] is a standard text2image model, so StableDiffusionPipelines are used for it [
26].
For the background generation approach we used two diffusion models, the StableDiffusion-2-1-base (
https://huggingface.co/stabilityai/stable-diffusion-2-1-base) (accessed on 10 July 2024) with a discrete Euler scheduler for the background seed generation and the StableDiffusion-2-inpainting (
https://huggingface.co/stabilityai/stable-diffusion-2-inpainting) (accessed on 10 July 2024) for image outpainting. Since the two models must be used jointly, we run them using fp16. To remove unwanted artifacts, we used the negative prompts “logo, repetitions, writing, text, watermark”. To ensure a smoother transition, we blurred the margin of the outpainting mask with a Gaussian kernel, so that a part of the border of the last frame is slightly changed. The prompt used for both the seed image generation and the image completion is enriched with attributes to ensure better generation quality. The consistency of the image is ensured by outpainting, so we do not force the seeds to match.
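The blurred outpainting mask can be illustrated without running the diffusion models. The sketch below assumes the canvas is extended to the right, with the kept image on the left; the function names, the default sigma and the 3-sigma kernel radius are illustrative choices, not values from our pipeline.

```python
import numpy as np


def gaussian_kernel(sigma, radius=None):
    """Normalised 1D Gaussian kernel."""
    radius = int(3 * sigma) if radius is None else radius
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    return kernel / kernel.sum()


def blur_mask(mask, sigma):
    """Separable Gaussian blur with edge padding; values stay in [0, 1]."""
    kernel = gaussian_kernel(sigma)
    pad = len(kernel) // 2
    out = mask.astype(np.float64)
    for axis in (0, 1):
        widths = [(pad, pad) if a == axis else (0, 0) for a in (0, 1)]
        padded = np.pad(out, widths, mode="edge")
        out = np.apply_along_axis(
            lambda v: np.convolve(v, kernel, mode="valid"), axis, padded
        )
    return out


def outpainting_mask(height, known_width, new_width, sigma=4.0):
    """1 where new content is generated, 0 where the previous image is kept."""
    mask = np.zeros((height, known_width + new_width))
    mask[:, known_width:] = 1.0
    return blur_mask(mask, sigma)
```

After blurring, the last few columns of the kept image carry small non-zero mask values, which is what allows the border of the previous frame to change slightly and blend into the new content.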
For the asset generation use-case we employ a version of ControlNet trained on user doodles. This provides a powerful model that is capable of turning even the simplest sketches into detailed assets. Moreover, it offers artists the fine-grained control they need for obtaining exactly the assets they want. Since the model was trained on natural images scraped from the internet, the generated assets come with a background, so we make use of the salient object detection models bundled by the rembg (
https://github.com/danielgatis/rembg) (accessed on 10 July 2024) package. More specifically, we used 
U²-Net [
27], a clever pyramid of U-nets [
28], where different levels compute the salient regions at different scales. This is run using onnx-runtime. However, the results are not always ideal, with portions of the object missing in some cases. Since the doodles usually represent objects without holes, we used the initial sketch to build an outline mask. We fixed the automatically detected object mask by intersecting it with the filled outline mask. Since this approach is not fail-proof, we added an optional manual mask refinement step to correct any remaining mistakes, which can be integrated seamlessly in an application.
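The mask correction step can be sketched with a simple flood fill. This is an illustrative stand-in that assumes the sketch outline is a closed binary curve and uses 4-connectivity; the function names are hypothetical and the detected mask would come from the salient object detector.

```python
from collections import deque

import numpy as np


def fill_outline(outline):
    """Fill a closed outline: everything not reachable from the border is inside."""
    h, w = outline.shape
    outside = np.zeros((h, w), dtype=bool)
    queue = deque()
    for r in range(h):
        for c in range(w):
            on_border = r in (0, h - 1) or c in (0, w - 1)
            if on_border and not outline[r, c]:
                outside[r, c] = True
                queue.append((r, c))
    while queue:
        r, c = queue.popleft()
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            inside_image = 0 <= nr < h and 0 <= nc < w
            if inside_image and not outside[nr, nc] and not outline[nr, nc]:
                outside[nr, nc] = True
                queue.append((nr, nc))
    return ~outside


def refine_mask(detected, outline):
    """Intersect the detected object mask with the filled sketch outline."""
    return detected.astype(bool) & fill_outline(outline.astype(bool))
```

The intersection discards any detected pixels that fall outside the sketched shape.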
The visual effects pipeline requires two diffusion models, one for generating the initial texture image and one for editing the final effect. We use the same version of diffusion model, StableDiffusion-v1-5 (
https://huggingface.co/runwayml/stable-diffusion-v1-5) (accessed on 10 July 2024), for both, but with different pipelines. For texture generation we used the usual text2image pipeline, while for the editing we use an image2image approach. Effects cannot be generated without either of the two parts of the pipeline. Generating the starting image is crucial for consistency, otherwise the results would be unacceptably different from one another. The image2image process is crucial for result quality, otherwise the effect would be just a cutout from an image, which is not desirable. It must also be noted that our approach relies on the StableDiffusion property that white regions are kept white as much as possible, so instead of generating a full picture through image2image, the model stylises only the content part. We use the same seed for the image2image part for consistency.
We computed the superpixels using the SLIC superpixels algorithm [
29], as implemented in the scikit-image segmentation package. We use a compactness of 0.01 to focus more on colour than on spatial proximity and a sigma of 1. The number of segments is, however, the pivotal parameter. Choosing small values for it results in bigger superpixels, which leads to bigger elements being generated for the effect. Bigger values result in an image closer to the given mask. We obtain the outline mask of the character by dilating the mask and applying an XOR operation with the original. To have some overlap between the character and the mask, we apply a second dilation operation. The final mask is obtained by intersecting the outline mask with the superpixels, such that every superpixel that intersects the outline mask is kept. This approach leads to a bigger mask, which is then used for cutting the texture image. After the image2image diffusion, we combine the asset and the effect using the asset mask, which we first erode to ensure there are no artefacts. This result is passed again through the initial mask detection algorithm to obtain an asset with a transparent background.
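The mask construction can be sketched as below. The superpixel label map is taken as an input here; in practice it would come from something like `skimage.segmentation.slic(image, n_segments=..., compactness=0.01, sigma=1)`. The 3x3 structuring element and the `grow` and `overlap` iteration counts are assumptions chosen for illustration.

```python
import numpy as np


def dilate(mask, iterations=1):
    """Binary dilation with a 3x3 square element, built from shifted copies."""
    out = mask.astype(bool)
    for _ in range(iterations):
        padded = np.pad(out, 1, mode="constant", constant_values=False)
        h, w = out.shape
        grown = np.zeros_like(out)
        for dr in range(3):
            for dc in range(3):
                grown |= padded[dr:dr + h, dc:dc + w]
        out = grown
    return out


def effect_mask(asset_mask, superpixels, grow=2, overlap=1):
    """Outline = dilate(mask) XOR mask; keep every superpixel touching it."""
    asset_mask = asset_mask.astype(bool)
    outline = dilate(asset_mask, grow) ^ asset_mask
    outline = dilate(outline, overlap)  # second dilation: overlap with the asset
    touched = np.unique(superpixels[outline])
    return np.isin(superpixels, touched)
```

Because whole superpixels are kept, the resulting mask is larger and more irregular than the plain outline, and its granularity is controlled by the number of segments.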
Generating effects for non-moving assets, or for cases when there is not enough movement to generate enough diversity, can easily be done by translating the mask randomly. We compute a random translation matrix and apply it to the mask with numpy, then follow the algorithm exactly as in the former case. Once the image is generated, we apply the inverse transform to bring the effect back to its origin.
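The position jitter and its inverse can be sketched with integer pixel offsets. Note that this simplified version uses array slicing rather than an explicit translation matrix, and the `max_shift` bound is a hypothetical parameter.

```python
import numpy as np


def translate(image, dy, dx):
    """Shift an H x W (or H x W x C) array by (dy, dx) pixels, filling with zeros."""
    out = np.zeros_like(image)
    h, w = image.shape[:2]
    src_r = slice(max(-dy, 0), max(min(h - dy, h), 0))
    dst_r = slice(max(dy, 0), max(min(h + dy, h), 0))
    src_c = slice(max(-dx, 0), max(min(w - dx, w), 0))
    dst_c = slice(max(dx, 0), max(min(w + dx, w), 0))
    out[dst_r, dst_c] = image[src_r, src_c]
    return out


def random_jitter(max_shift, rng):
    """Draw an integer offset (dy, dx) with each component in [-max_shift, max_shift]."""
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    return int(dy), int(dx)
```

The mask is moved with `translate(mask, dy, dx)` before generation and the generated effect is moved back with `translate(effect, -dy, -dx)`. Content pushed outside the canvas by the first shift is lost, so the jitter should stay small relative to the image size.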
Obtained effects are good, but careful prompt editing should be done to obtain the best results. In general, the model does a great job with good descriptions of the outcome. The generation style is also important. The best use-cases we have identified for plain text guided image generation are the generation of backgrounds and settings for the action and the generation of starting points for the asset creation. Some results can be observed in 
Figure 5. The top row shows examples for the former. We can see the settings are believable and, if not ready to use, can be easily adapted for different projects. The bottom row targets the latter. While not perfect, the generated assets represent good starting points for the creative process. Either way, the results show great potential for creating moodboards.
Image to image generation represents a great use-case for asset generation, because it allows the artist to guide the model better, with minimal effort. Stable Diffusion obtains great results from simple doodles, as can be observed in 
Figure 6a. Artists get inspiration and better starting points for assets and characters, and can also edit the appearance of the characters. However, the appearance change approach is limited, because it changes the whole picture. In our example, the identity of the person is lost. A good property of this solution is that multiple, cascading edits can be made. For example, we can generate the character in the middle, then keep adding details to the photo until we are happy with the results.
Unlike image2image generation, the text guided inpainting part leaves the unmasked regions of the photo unchanged. Theoretically, we could mask the portion of the image we want to edit and use the prompt to replace that zone with what we want. In practice, this solution does not work as well as expected. In 
Figure 6b, we can see that the model fails to place a dog on the floor of the living room, as instructed by the prompt and instead eliminates the carpet. The solution is able to edit the other two photos, with somewhat limited performance (the glasses look unrealistic and that shirt is too white). However, the model is limited. For example, in the case of the rightmost game character, trying to add a top hat results in failure every time.
As can be seen, the model learns the concept presented in the pictures and is able to transpose that concept into multiple settings. For example, the model is able to transfer the features of a 2D drawing to a 3D object (
Figure 7—the cat sculpture). However, the usefulness of these results to our use-case is limited to providing inspiration, in game art or cutscenes.
By coupling ControlNet with background removal, we allow artists to create assets according to their needs. Some results can be observed in 
Figure 8. As can be seen, an inexperienced user like ourselves can generate high quality assets starting from nothing more than a vague description and a vague scribble. The generated results are realistic and highly detailed, and their resolution is high enough to be used in practice, thanks to latent diffusion. ControlNet makes it easy for users to generate a variety of assets, including characters like the guard.
To demonstrate the usefulness of asset generation, we’ve created an extremely simplistic demo game with Pygame. Most assets are created with generative techniques, with the exception of the cat sprite, which we have animated with our CharacterGAN. The assets are generated with our asset generation pipeline, while the background is generated with our background generation pipeline, presented in the previous section. One setting from the game can be observed in 
Figure 9. While there are problems with blending the assets, especially with the lighting, it looks more than adequate. This result proves that generative techniques are a great tool for artists, because we managed to obtain a good looking result with limited artistic skills and experience in the field. An experienced artist could effortlessly blend the assets in the image using tools like Photoshop to obtain better looking scenes.
A set of results of our effect generation approach can be observed in 
Figure 10. Our pipeline enables users to generate various effects, limited only by imagination and common sense (it would be hard to make an effect with couches). The results are of good quality and diverse. The leftmost three effects are generated with a full body, automatically generated mask, while the rightmost three are generated with a mask limited to the head region. By providing custom masks, users can generate effects in whatever region they please. All results use the effects as background effects, but they can also be coupled with background removal and placed in the front. This enables creation of effects such as spells.
The effects do not have to be bound to a character. By combining a prompt with a sensible mask of their choice, users can create an animated effect, as can be observed in 
Figure 11. We limit ourselves to providing just several frames of the animation. The position jitter ensures that the images are not identical, while the texture prior preserves the temporal coherence, resulting in the desired animated effect.