Realistic Image Generation from Text by Using BERT-Based Embedding

Abstract: Recently, multimodal learning has received much attention in the field of artificial intelligence because of its promise for improving AI performance and its range of potential applications. Text-to-image generation, one of these multimodal tasks, is a challenging topic in computer vision and natural language processing. GAN-based text-to-image generation models typically rely on a text encoder pre-trained with image-text pairs. However, such an encoder cannot obtain rich information about texts not seen during pre-training, so it is hard to generate an image that semantically matches a given text description. In this paper, we propose a new text-to-image generation model that uses pre-trained BERT, which is widely used in the field of natural language processing. The pre-trained BERT serves as the text encoder after fine-tuning on a large amount of text, so that it yields rich text representations suited to the image generation task. Through experiments on a multimodal benchmark dataset, we show that the proposed method improves on the baseline model both quantitatively and qualitatively.


Introduction
Although many deep learning methods have been developed for a single modality, the world we experience is multimodal, so research on multimodal deep learning is essential for AI to make meaningful progress [1]. Text-to-image generation is a representative example of multimodal learning [2,3]. Many images come with tags or descriptions, and the text serves to clarify the meaning of the image [4]. Text-to-image generation combines text and image, two of the most challenging modalities in deep learning, and it is difficult to learn because the input text and the output image have completely different characteristics. Three problems must be addressed to generate images from text. First, a text representation must be learned that captures what is visually important. Second, high-quality images resembling real images must be generated from that text representation. Third, high-quality feature representations must be extracted for texts not seen during training. Accordingly, a text-to-image generation model consists of a text encoder for text embedding and a GAN that generates an image from the embedding, and it is important that the model effectively learns to generate diverse images corresponding to the semantic information of the text.
In this paper, we propose a text-to-image model that combines BERT-based embedding with high-quality image generation using StackGAN. Existing text-to-image generation studies use a text encoder pre-trained for a zero-shot visual recognition task, which leaves empty regions between data points in the text manifold. We address this problem by fine-tuning the pre-trained BERT for the text-to-image generation task. When the fine-tuned BERT is used as the text encoder, there is little empty space between data points in the text manifold, so text representations can be extracted effectively, and we show that this efficient embedding makes it possible to generate more realistic images than existing methods. Experiments on the CUB multimodal benchmark show qualitative and quantitative improvements over existing methods.

Related Work
A generative adversarial network (GAN) is a deep neural network model consisting of a generator network and a discriminator network [5]. The generator neural network takes a random vector or a vector extracted from the latent space as input, and generates data such as images, audio, and text. The discriminator neural network discriminates between real data and fake data generated from the generator neural network. The generator and discriminator neural networks are trained simultaneously in competition with each other. As a result, we can get a generator neural network that generates data similar to real data.
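The adversarial objective can be made concrete with a small numerical sketch (NumPy only; the networks themselves are abstracted away, and only the standard binary cross-entropy losses on the discriminator's probability outputs are shown):

```python
import numpy as np

def discriminator_loss(d_real, d_fake):
    # Binary cross-entropy form of the GAN objective: the discriminator
    # wants D(real) -> 1 and D(fake) -> 0.
    return -np.mean(np.log(d_real)) - np.mean(np.log(1.0 - d_fake))

def generator_loss(d_fake):
    # Non-saturating generator loss: the generator wants D(fake) -> 1.
    return -np.mean(np.log(d_fake))

# Example discriminator outputs (probabilities of "real"):
d_real = np.array([0.9, 0.8])   # confident on real samples
d_fake = np.array([0.2, 0.1])   # confident on fakes

print(discriminator_loss(d_real, d_fake))  # low: D separates real from fake
print(generator_loss(d_fake))              # high: G is not yet fooling D
```

Training alternates between minimizing these two losses, which is the competition described above.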
For the generation of image data, convolutional neural network [6] has been introduced to the GAN model. Radford et al. [7] proposed a family of network architectures called deep convolutional GAN (DCGAN), which allows training a pair of deep convolutional generator and discriminator networks. DCGANs make use of convolutions, which allow the spatial downsampling and upsampling operators to be learned during training. Conditional GAN (CGAN) is a generative adversarial network that inputs a vector extracted from latent space and conditional information [8]. By using the conditional information, CGANs can generate data of a desired class. CGANs also have the advantage of being able to provide better representations for multimodal data generation. Image-to-image translation, image style transfer, and photo-realistic image generation using CNN and GAN model are also related to this research [9][10][11][12].
In text-to-image(T2I) tasks, not only should the image capture all the content of the given text description, but also the quality of the generated image should be good. In order to satisfy this condition, various T2I models have been proposed. The first GAN-based T2I model is GAN-INT-CLS [13] where a class label of an image is simply replaced by a text embedding in CGAN. GAN-INT-CLS has shown that it can generate images but cannot guarantee the quality of images synthesized using text encoders that have learned the relationship between labeled images and text in advance. Since it is difficult to generate high-resolution images with end-to-end direct learning, StackGAN, a two-stage approach of sketch and refinement, has been proposed [14]. StackGAN consists of the first stage of generating a low-resolution image and the second stage of generating a high-resolution image by integrating the low-resolution image and text. The conditioning augmentation method is applied to compensate for discontinuities in high-dimensional embeddings that hinder the learning of the image-generating part. Experimental results showed that it produces high-resolution images of 256 × 256, later this method was developed to generate 512 × 512 images using hierarchical discriminators HDGAN [15].
In the natural language processing (NLP) field, many studies using large amounts of text are being conducted [16,17]. In particular, BERT, a language model that uses only the encoder of the Transformer architecture, is used to obtain embeddings, i.e., vector representations of text, for natural language processing and machine translation [18]. By generating context-sensitive embeddings, BERT achieves the best performance on most tasks of GLUE [19], a natural language understanding benchmark, and is widely used in NLP. Because BERT can be adapted with a relatively small amount of domain data, it is used not only in NLP tasks such as machine translation but also in various fields requiring text embedding. In [20], BERT was used to draw a face from a text description. The advantages of BERT-based embedding are that it works for relatively long text descriptions and that, by using pre-trained BERT, it is possible to learn from a small amount of face data. In [21], BERT was also used for supervised image generation; in that work, the pre-trained BERT was used for text embedding without fine-tuning.

Realistic Image Generation from Text by Using BERT-Based Embedding
The realistic text-to-image generation model using BERT-based embedding proposed in this paper builds on the structure of the stacked generative adversarial network (StackGAN). The proposed model therefore consists of (1) BERT-based text embedding, (2) low-resolution image generation from text using the BERT-based embedding, and (3) realistic high-resolution image generation from text using the BERT-based embedding, as shown in Figure 1. In the text embedding process, pre-trained BERT is used and fine-tuned on the target dataset. A low-resolution image is generated by feeding the condition-augmented embedding and a random vector to the stage 1 generator, and the stage 1 discriminator is trained to distinguish the generated low-resolution image from the real image, given the text embedding. Generating realistic high-resolution images from text corresponds to stage 2 in Figure 1, where conditioning augmentation is again applied to the BERT-based embedding. A high-resolution image is generated by feeding the condition-augmented text embedding and the low-resolution image produced by the trained stage 1 generator to the stage 2 generator, and the stage 2 discriminator is trained to distinguish the generated high-resolution image from the real image, given the text embedding.
The three steps of the proposed model can be summarized as follows: (1) fine-tune the pre-trained BERT on the target dataset to obtain text embeddings; (2) in stage 1, generate a low-resolution image from the text embedding of the fine-tuned BERT and random noise; (3) in stage 2, generate a high-resolution image from the text embedding of the fine-tuned BERT and the low-resolution image from stage 1.
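The three steps above can be sketched as a single pipeline (an illustrative skeleton only; `bert_embed`, `stage1_generator`, and `stage2_generator` are hypothetical stand-ins for the fine-tuned BERT and the two StackGAN generators, here returning placeholder arrays of the corresponding shapes):

```python
import numpy as np

def bert_embed(text):
    # Step (1) stand-in: text -> 768-dim embedding (seeded for repeatability).
    rng = np.random.default_rng(len(text))
    return rng.standard_normal(768)

def stage1_generator(h, z):
    # Step (2) stand-in: embedding + noise -> 64x64 low-resolution image.
    return np.zeros((64, 64, 3))

def stage2_generator(h, x_low):
    # Step (3) stand-in: embedding + low-res image -> 256x256 image.
    return np.zeros((256, 256, 3))

def text_to_image(text):
    h = bert_embed(text)
    z = np.random.default_rng(0).standard_normal(100)  # random noise vector
    x_low = stage1_generator(h, z)
    return stage2_generator(h, x_low)

print(text_to_image("a small bird with gray wings").shape)  # (256, 256, 3)
```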

BERT-Based Text Embedding
In this paper, pre-trained BERT is used to obtain embedding vectors for the text descriptions. The pre-trained BERT is fine-tuned by repeatedly masking out tokens in the sentences of the target dataset, once per fine-tuning iteration.
Algorithm 1 describes the procedure for fine-tuning BERT. The pre-trained BERT parameters θ_BERT are initialized in the first line of the algorithm. In the second line, S_f is the number of fine-tuning iterations and T = {t_1, t_2, ..., t_n} is the set of text descriptions in the dataset. In the third and fourth lines, BERT is fine-tuned on the descriptions by performing an Adam update step on θ_BERT with T, repeated S_f times. During fine-tuning, the masked language modeling (MLM) task, a self-supervised learning method, is trained on the given text descriptions. The MLM task is a fill-in-the-blank problem: part of the input tokens are masked, and the model must predict them. Through this, the model learns token probabilities that take the bidirectional context into account. The result of Algorithm 1 is a text encoder whose embeddings are used for the image generation task.
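The masking step of the MLM task in Algorithm 1 can be sketched as follows (a simplified illustration; actual BERT fine-tuning operates on WordPiece token ids and also replaces some selected tokens with random tokens or leaves them unchanged):

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=None):
    """Randomly replace a fraction of tokens with [MASK] for the MLM task.

    Returns the masked sequence and the labels: the original token at each
    masked position, None elsewhere (positions that are not scored).
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            labels.append(tok)   # the model must recover this token
        else:
            masked.append(tok)
            labels.append(None)  # position not scored
    return masked, labels

tokens = "a bird with a medium orange bill and webbed feet".split()
masked, labels = mask_tokens(tokens, mask_prob=0.3, seed=0)
print(masked)
```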

Generating Low-Resolution Images from Text
The stage1 generator produces a low-resolution image that matches only the rough shape and color by using the text embedding vector and random vector extracted from the text encoder. The discriminator uses an image and a text embedding vector as inputs and determines whether the text matches the image or not.
Let D_1 be the stage 1 discriminator, G_1 the stage 1 generator, h the text embedding of the given text description obtained through the fine-tuned BERT, h_ca the condition-augmented embedding sampled from the Gaussian conditioning distribution, x_low the low-resolution image, z the random vector, and λ the regularization parameter. The stage 1 discriminator is trained to maximize Equation (1), and the stage 1 generator is trained to minimize Equation (2):

L_D1 = E_(x_low, t)~p_data [log D_1(x_low, h)] + E_z~p_z, t~p_data [log(1 − D_1(G_1(z, h_ca), h))],  (1)

L_G1 = E_z~p_z, t~p_data [log(1 − D_1(G_1(z, h_ca), h))] + λ D_KL(N(µ_0(h), Σ_0(h)) || N(0, I)),  (2)

where the second term of Equation (2) is the regularization term, a KL divergence that encourages the latent text representation to follow an independent Gaussian distribution. The Gaussian conditioning variable h_ca is sampled from N(µ_0(h), Σ_0(h)) to provide randomness.
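The conditioning augmentation and its KL regularizer can be sketched numerically (a NumPy illustration assuming a diagonal covariance, as in StackGAN; the stand-in arrays below play the role of µ_0(h) and the diagonal of Σ_0(h), which in the model are produced by a small learned network):

```python
import numpy as np

def conditioning_augmentation(mu, sigma, rng):
    # Reparameterization trick: h_ca = mu + sigma * eps with eps ~ N(0, I),
    # so sampling stays differentiable with respect to mu and sigma.
    eps = rng.standard_normal(mu.shape)
    return mu + sigma * eps

def kl_to_standard_normal(mu, sigma):
    # Closed-form D_KL( N(mu, diag(sigma^2)) || N(0, I) ),
    # the regularization term of the generator loss.
    return 0.5 * np.sum(sigma**2 + mu**2 - 1.0 - np.log(sigma**2))

mu = np.array([0.5, -0.2])    # stand-in for mu_0(h)
sigma = np.array([1.0, 0.8])  # stand-in for the diagonal of Sigma_0(h)
rng = np.random.default_rng(0)
h_ca = conditioning_augmentation(mu, sigma, rng)
print(h_ca, kl_to_standard_normal(mu, sigma))
```

The KL term is zero exactly when the conditioning distribution equals the standard normal, which is what pulls the text manifold toward a smooth, gap-free latent space.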
Algorithm 2 presents the process of generating a low-resolution image from text using BERT-based embedding; this corresponds to stage 1 of Figure 1. In line 2, θ_fine-tuned BERT, the fine-tuned BERT model, and the other parameters, such as the number of iterations S_1 and the learning rates α and β of the generator and discriminator, are set. T = {t_1, t_2, ..., t_n} is the set of text descriptions and X = {x_1, x_2, ..., x_n} is the set of images in the dataset. In the for loop, the mini-batch image x_i and text t_i are sampled and training is performed for S_1 iterations. A random vector is drawn in line 5, and the text embedding is obtained by feeding the mini-batch text to BERT in line 6. In line 8, h_ca is obtained through conditioning augmentation, and in line 10, a low-resolution image is generated from h_ca and the random vector. The discriminator then determines whether the image and the text match, and the parameters of the generator and the discriminator are updated in lines 14 and 15, respectively.


Generating High-Resolution Images from Text
In stage 2, a realistic high-resolution image is produced by filling in the details omitted from the image generated in stage 1. Stage 2 takes as input a low-resolution image and the text embedding from the fine-tuned BERT. Let D_2 be the stage 2 discriminator and G_2 the stage 2 generator. The stage 2 discriminator is trained to maximize Equation (3), and the stage 2 generator is trained to minimize Equation (4):

L_D2 = E_(x, t)~p_data [log D_2(x, h)] + E_x_low~p_G1, t~p_data [log(1 − D_2(G_2(x_low, h_ca), h))],  (3)

L_G2 = E_x_low~p_G1, t~p_data [log(1 − D_2(G_2(x_low, h_ca), h))] + λ D_KL(N(µ_0(h), Σ_0(h)) || N(0, I)).  (4)

Random noise is used only in stage 1, not in this stage; instead, the low-resolution image x_low generated by the stage 1 generator is used. The high-resolution image x_high is generated from the condition-augmented text embedding and the low-resolution image from the stage 1 generator, and the discriminator determines whether an image and a text embedding vector match.
Algorithm 3 presents the procedure for generating realistic high-resolution images from text using BERT-based embeddings. In line 2, θ_fine-tuned BERT, the number of iterations S_2, and the learning rates γ and ω of the generator and discriminator are set. In the for loop, the mini-batch image x_i and text t_i are sampled and training is performed for S_2 iterations. A random vector is drawn in line 5, and the text embedding is obtained by feeding the mini-batch text to BERT in line 6. In line 10, the text embedding h_ca and a random vector z are input together, and a low-resolution image is generated using the stage 1 generator. In line 11, the stage 2 discriminator determines whether the image and the text match, and the discriminator and the generator are updated in lines 15 and 16, respectively.


Experiments
We validate the proposed method both quantitatively and qualitatively. The experiments were conducted on CUB [22], one of the popular benchmark datasets for image generation from text. We used four RTX 5000 (16 GB) GPUs and two Intel(R) Xeon(R) Silver 4210R CPUs @ 2.40 GHz, with PyTorch 1.4.0 on Ubuntu 18.04.2. Table 1 summarizes the benchmark dataset used in the experiments: the training set, the test set, and the number of sentences per image of the CUB dataset. The CUB dataset consists of 200 classes and 11,788 images of birds, each with 10 textual descriptions. In about 80% of the CUB images, the object occupies less than half of the image. In our experiments, the images were therefore preprocessed with the objects' bounding boxes so that the object-to-image ratio was greater than 0.75.
The inception score (IS), the first metric for evaluating GANs, is computed with the Inception-v3 network, an image classification model composed of convolutional layers [25] pre-trained on the ImageNet dataset of 1000 classes and about 1.2 million images. The images generated from text are fed to the pre-trained Inception model, and the generative model is evaluated based on the outputs. Equation (5) gives the inception score:

IS = exp( E_x [ D_KL( p(y|x) || p(y) ) ] ),  (5)

where p(y|x) is the class distribution predicted for a generated image x and p(y) is the marginal class distribution over all generated images. The Kullback-Leibler divergence between the two distributions measures the information lost when sampling from one distribution is treated as sampling from the other. For example, if the predicted distribution for images of class A differs from those of the other classes, the inception score is large, which means the images of class A have characteristics distinct from images of other classes.
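Equation (5) can be computed directly from class-probability outputs (a NumPy sketch; `p_yx` stands for the Inception-v3 softmax outputs p(y|x) over a batch of generated images):

```python
import numpy as np

def inception_score(p_yx, eps=1e-12):
    """IS = exp( E_x [ KL( p(y|x) || p(y) ) ] ).

    p_yx: (num_images, num_classes) matrix of class probabilities.
    """
    p_y = p_yx.mean(axis=0)  # marginal class distribution p(y)
    # Per-image KL divergence between p(y|x) and p(y).
    kl = np.sum(p_yx * (np.log(p_yx + eps) - np.log(p_y + eps)), axis=1)
    return float(np.exp(kl.mean()))

# Sharp, diverse predictions -> high IS; identical predictions -> IS of 1.
diverse = np.eye(4)                # each image confidently a different class
collapsed = np.full((4, 4), 0.25)  # every image maximally uncertain
print(inception_score(diverse))    # close to 4 (the number of classes)
print(inception_score(collapsed))  # close to 1
```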
The Fréchet inception distance (FID) was proposed to address the drawback that the inception score does not use the distribution of real data. Equation (6) gives the FID, which measures the distance between the synthetic data distribution p_f and the real data distribution p_r:

FID = ||µ_r − µ_f||² + Tr( Σ_r + Σ_f − 2(Σ_r Σ_f)^(1/2) ),  (6)

where (µ_r, Σ_r) and (µ_f, Σ_f) are the mean and covariance of the Inception features of the real and synthetic images, respectively. The smaller the FID, the better the visual quality of the generated images.
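Equation (6) can be sketched in NumPy; the matrix square root is computed through the symmetric reformulation Tr((Σ_r Σ_f)^(1/2)) = Tr((Σ_r^(1/2) Σ_f Σ_r^(1/2))^(1/2)) so that only symmetric eigendecompositions are needed:

```python
import numpy as np

def sqrtm_psd(a):
    # Square root of a symmetric positive semi-definite matrix via
    # eigendecomposition; eigenvalues are clipped at 0 for stability.
    w, v = np.linalg.eigh(a)
    return (v * np.sqrt(np.clip(w, 0.0, None))) @ v.T

def fid(mu_r, cov_r, mu_f, cov_f):
    """Frechet inception distance between two Gaussians (Equation (6))."""
    diff = mu_r - mu_f
    s = sqrtm_psd(cov_r)
    # Tr((cov_r cov_f)^(1/2)) computed on the symmetric product s cov_f s.
    covmean_trace = np.trace(sqrtm_psd(s @ cov_f @ s))
    return float(diff @ diff + np.trace(cov_r) + np.trace(cov_f)
                 - 2.0 * covmean_trace)

mu, cov = np.zeros(3), np.eye(3)
print(fid(mu, cov, mu, cov))        # identical distributions -> 0
print(fid(mu, cov, mu + 1.0, cov))  # unit mean shift in each dim -> 3
```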

The Compared Models
The experimental results are compared with GAN-INT-CLS, GAWWN, and AttnGAN, which generate images directly, and with StackGAN, which uses a stacked approach.

StackGAN+BERT(ours)
StackGAN+BERT (ours) is the text-to-image generation model proposed in this paper. We use BERT as the text encoder to generate 256 × 256 images from text. Table 2 compares the quantitative results of generating realistic images from text using the proposed BERT-based embedding; the proposed model is denoted StackGAN+BERT. Quantitatively, the IS of the proposed model was about 0.74 higher than that of the existing stacked generative adversarial network, and the FID was 14.1 lower. The proposed model differs from previous studies in its embedding method: the pre-trained BERT, trained on a large amount of data, is fine-tuned for the text-to-image task. As a result, the space between data points in the text manifold is small, so features can be extracted even from texts not seen during training. In contrast, existing studies pre-train the text encoder on limited datasets for a zero-shot visual recognition task; the empty space between data points in the text manifold is therefore relatively large, and poor features are extracted for text descriptions not seen during training.

Quantitative Results
In general, deep learning achieves good outputs when good features are input. Likewise, in text-to-image tasks it is easier to create an image that matches the text description when a high-quality text representation is input. In other words, the T2I model using the BERT-based embedding proposed in this study improved the quantitative IS and FID scores by generating images from high-quality text embeddings.

Qualitative Results
In the image generation field, it is difficult to measure model performance with quantitative evaluation alone, so qualitative evaluation of the generated images is also necessary. Figure 2 shows a qualitative comparison on the CUB dataset, using text descriptions that were not used for training. The qualitative results show that the proposed model generates more realistic high-resolution images than existing studies. For example, for the text description "A bird with a medium orange bill white body gray wings and webbed feet", the existing stacked generative adversarial network generated images without webbed feet, whereas the proposed model generated images including them. Moreover, for the description "This small bird has a white breast, light gray head, and black wings and tail", the existing stacked generative adversarial network failed to generate a tail, but the proposed model succeeded. This shows that high-quality images are generated from high-quality text representations.

Discussion and Conclusions
In this paper, we proposed a T2I model capable of generating realistic images from text using BERT-based embedding. In the proposed model, the pre-trained BERT, which exhibits high performance in the natural language processing field, was fine-tuned for the text-to-image generation task. As a result, there is less empty space between data points in the text manifold, so relatively high-quality text embeddings could be extracted, compared with existing embedding methods, even for texts not seen during the fine-tuning process. Experimentally, the IS was about 0.74 higher and the FID 14.1 lower, showing that the proposed method is effective. Compared with existing text-to-image generation models, our method generates diverse high-resolution images for unseen textual descriptions. In the future, we will verify the effect of BERT-based embedding on text-to-image generation using various datasets, and will apply various keyword extraction algorithms for effective analysis of the input text. In addition, we plan to research the design of sophisticated loss functions and the generation of higher-resolution images from text using a small amount of data.


Conflicts of Interest:
The authors declare no conflict of interest.