CapGAN: Text-to-Image Synthesis Using Capsule GANs

Abstract: Text-to-image synthesis is one of the most critical and challenging problems of generative modeling. It is of substantial importance in the area of automatic learning, especially for image creation, modification, analysis and optimization. A number of works have been proposed in the past to achieve this goal; however, current methods still lack scene understanding, especially when it comes to synthesizing coherent structures in complex scenes. In this work, we propose a model called CapGAN to synthesize images from a single given text statement and to resolve the problem of global coherent structures in complex scenes. For this purpose, skip-thought vectors are used to encode the given text into a vector representation. This encoded vector is used as an input for image synthesis using an adversarial process, in which two models are trained simultaneously, namely: generator (G) and discriminator (D). The model G generates fake images, while the model D tries to predict whether a sample comes from the training data or was generated by G. The conceptual novelty of this work lies in integrating capsules at the discriminator level to make the model understand the orientational and relative spatial relationship between different entities of an object in an image. The inception score (IS) along with the Fréchet inception distance (FID) are used as quantitative evaluation metrics for CapGAN. The IS recorded for images generated using CapGAN is 4.05 ± 0.050, which is around 34% higher than that of images synthesized using traditional GANs, whereas the FID score calculated for synthesized images using CapGAN is 44.38, which is an almost 9% improvement over the previous state-of-the-art models. The experimental results clearly demonstrate the effectiveness of the proposed CapGAN model, which is exceptionally proficient in generating images with complex scenes.


Introduction
Text-to-image synthesis is the translation of a single sentence directly into pixels [1]. Automatically generating images from a single sentence is a primary problem in computer-aided design (CAD), automatic art generation (AAG) and various other applications. The difficulty of synthesizing images or illustrating visual information from text has gained interest in the research community, but it is far from being solved, in particular for complex scenes. In complex scenes, objects are composed of multiple entities in which the color(s)
1. Once the images are synthesized, any modification in a scene or an image can be implemented by means of text as an input instead of using advanced photo editing tools.
2. Text-to-image synthesis can improve the predictions of object classification problems, as the synthesis model is generating images from scratch; thus, it has good judgment about object features.
3. It will smooth the automatic learning process and art generation of, for example, animated images, clips, movies, etc.
4. The images synthesized using text can also be helpful to generate labeled data for further research.
The problem of generating images from text description is highly multimodal. This means that there can be multiple correct answers for a single input sentence. In terms of image synthesis from text, multimodality suggests that there can be several reasonable and possible configurations of pixels that can correctly, as well as acceptably, exemplify the same text description. For instance, a sentence, "This flower has petals that are white and has a green tip" or "A bird with yellow beak sitting on a tree" can have multiple possible solutions. An illustration of such cases is shown in Table 1, which presents multiple images generated for a given input text.

Table 1. Input text and the corresponding generated images (images not reproduced here).
Text:
• This flower has long red petals with black center.
• A water flower with light yellow petals and yellow pistils in the center.
• This flower has purple petals and a long stigma.
• This flower has rounded white petals which form a bright yellow shape in the center.
• This bird is dark grey in color and has a long wings and a black downward curved beak.
• The bird is a royal blue with black accents on the wings, tail and beak.
• White smiling dog.
The past research on solving multimodal problems in text-to-image synthesis has focused on various machine-learning-based algorithms, particularly generative adversarial networks (GANs) [2]. Many researchers have made attempts to synthesize images using text for single objects [1,3-5] by plugging conventional convolutional and deconvolutional layers in GANs. However, none of them succeeded in generating coherent images, and the proposed models failed to take into account the spatial and orientational association among diverse entities of an object in an image.
The models based on convolutional layers, e.g., convolutional neural networks (CNNs), have provided massive success for several deep learning applications; nonetheless, they have some limitations and drawbacks. For instance, large amounts of data are required for training a CNN. In addition, the internal data representation of a CNN does not take into account important spatial associations among objects. In order to clarify this, we present an example of a flower object in Figures 1 and 2. The spatial and orientational association dictates that for an object to be a flower, the petals should be aligned around the center, the leaves should be connected to the stem and all entities (petals, stem, leaves and stamen) should be connected.
For a CNN, both Figures 1 and 2 are flowers, as the mere presence of the entities (petals, stem, leaves and stamen) indicates object existence; a CNN does not take into account the spatial and orientational association among different entities of an object. However, for a capsule network, Figure 1 is a flower, whereas Figure 2 is not considered to be a flower. Capsule networks [6] are based on the concept of inverse rendering. In computer graphics, objects are constructed through rendering, which requires some geometric information that specifies where to draw an object, its scale, its angle, along with other spatial information. Capsules in capsule networks are designed to represent vectors or multi-dimensional information, whereas neurons in CNNs typically operate on scalar values or single-channel data. Therefore, unlike neurons in a CNN, the capsules extract the geometric information of an object in an image in the form of vectors and use it for inverse rendering. Consequently, in contrast to a CNN, a capsule network easily identifies Figure 1 as a flower and Figure 2 as a non-flower object, since in Figure 2 the spatial association between different entities of the flower is not satisfied.
We utilize the capsule network for image synthesis from a given text statement to overcome the problem of global coherent structures in complex scenes. In this work, we suggest an innovative model, named CapGAN, to synthesize images from a given single text statement using GANs. Our model takes as input a single text statement and utilizes skip-thought vectors as text encoders that can produce highly generic fixed length sentence representations [7]. As the synthesis of images from extracted features is highly multimodal, this makes GAN an ideal candidate for image synthesis problems. We feed these fixed length representations to GANs for image synthesis using an adversarial process, in which two models are trained at the same time, namely: generator (G) and discriminator (D). The model G generates fake images, while the model D tries to predict whether a sample comes from the training data or was generated by G. The conceptual novelty of this work lies in integrating capsules at the discriminator level to make the model understand the orientational and relative spatial relationship among various diverse entities of an object in an image. CapGAN is trained and evaluated on the Oxford-102 dataset [8] for flowers, Caltech-UCSD 200 [9] for birds and ImageNet [10] for images of dogs. Our model is evaluated using the most widely used inception score (IS) and Fréchet inception distance (FID) measures for image synthesis problems. The benchmarked results affirm the usefulness of capsule networks to capture and regenerate orientation, as well as spatial connection, among several entities of an object.
The rest of the paper is organized as follows: the second section briefly explores various studies associated with the problem of synthesizing images based on text. The implementation details are described in Section 3. Section 4 highlights the key results of this research, followed by discussion in Section 5. Finally, the paper is concluded in Section 6, along with some future recommendations.

Background
Text-to-image synthesis is highly multimodal. Shared representation across modalities and data prediction via retrieval or synthesis are two major challenges for multimodal problems [1]. Zhu et al. [11] have exploited the capabilities of artificial intelligence (AI) and machine learning (ML) to generate images, obtaining basic results. However, with the introduction of generative modeling, image generation from given text has improved drastically. Generative modeling is well suited for synthesis problems such as text-to-image synthesis [1,3,4,12,13], image-to-image translation [14-17], prediction of the next frame in video [18], super resolution [19], etc.
Reed et al. [1] demonstrated a GAN-based architecture for translating a single text into pixels. A single deep convolutional generative adversarial network (DC-GAN) is trained conditioned on text features encoded by a convolutional recurrent neural network (CRNN). Both the generator (G) and the discriminator (D) perform feed forward inference conditioned on the text feature. Text-to-image generation using a single GAN was a huge success; however, there are some limitations as well. First, the images generated are of very low resolution and blurred. Second, less text is used for training. Finally, it was reported in their study that upon closer inspection, generated scenes are usually not coherent, which means that when the model was tested on the MS COCO [20] dataset for complex scene generation, images synthesized for composite objects were not distinguishable, and the spatial relationship between multiple objects was not fulfilled.
Another study [3] proposed a text-conditioned auxiliary classifier generative adversarial network (TAC-GAN) to improve the resolution of images synthesized from a given text. The generated images are of resolution 128 × 128, and the objects are more distinguishable compared to the previous text-to-image synthesizer. The Caltech UCSD birds dataset [9] is used for training. The images generated by TAC-GAN are of a high resolution, and objects are more distinguishable; however, TAC-GAN was only tested for datasets having single objects.
StackGAN [4] advocated employing many GANs layered on top of each other to create photo-realistic image synthesis. StackGAN++ [21] is a variant of StackGAN that proposes a multi-stage GAN architecture for both conditional and unconditional generative tasks. Multiple generators and discriminators are grouped in a tree-like structure in StackGAN++. Various branches of the tree produce images of the same scene with different resolutions. StackGAN++ generates images for single items successfully but fails to make photo-realistic images for complex settings, as do other systems.
The attentional generative adversarial network (AttnGAN) [22] proposes picture synthesis using attention driven refinement. By paying attention to appropriate words in the text description, AttnGAN synthesizes fine-grain features at multiple sub-regions. For producing different sub-regions of the image, each attention model automatically retrieves the most relevant word vectors. However, for complex situations, AttnGAN was unable to capture a global coherent structure.
Capsule networks have gained significant attention and success in computer vision tasks. However, their application to the domain of generating images from textual descriptions remains largely unexplored and is a promising avenue. Capsule networks have primarily been utilized for detection and classification tasks in computer vision [23-25]. A few authors [26-28] have explored the use of capsules with GANs; however, their models have not reported any result for text-to-image synthesis. While traditional convolutional neural networks (CNNs) and recurrent neural networks (RNNs) have been the stalwarts in this field, their inherent limitations, such as handling variations in viewpoints and object relationships, call for innovative solutions. Capsule networks, on the other hand, with their ability to capture hierarchical features and model spatial relationships, hold the potential to revolutionize text-to-image synthesis.

Methodology
For automatic text-to-image synthesis, we use the concept of capsule networks in an adversarial process to better model the hierarchical relationships among the entities of an object. A simple yet effective model, CapGAN, is proposed to synthesize images from a given text, in which the last CNN layer at the discriminator is replaced by a state-of-the-art capsule layer to incorporate the relative spatial and orientational relationship among the various entities of an object. Photo-realistic visuals are synthesized from a given text in the following four main phases of the suggested model: (1) input sentence, (2) text encoding, (3) image production and (4) image discrimination.

Input Sentence
The input of the CapGAN architecture is a single English sentence for which an image needs to be synthesized. An example input sentence is shown in Figure 3, i.e., "The flower has white petals with yellow center". The input sentence in the next phase is encoded into a vector representation so that it can be fed to the model.

Text Encoding
The raw data from the previous step are encoded into numbers before we can use them to fit our model. For this purpose, we first use skip-thought vectors [29] for creating the text embedding. Skip-thought vectors are well-known neural network models that learn fixed length representations of sentences in any natural language. The skip-thought model consists of three parts:
1. Encoder Network: The model takes sentence i and generates a fixed length representation z using a recurrent neural network (RNN).
2. Previous Decoder Network: The model takes embedding z and tries to generate sentence i − 1 using RNN.
3. Next Decoder Network: The model takes embedding z and tries to generate sentence i + 1 using RNN.
Decoders are trained to minimize the reconstruction error, which is further backpropagated to the encoder for training. Additionally, noise is added before generating the fixed length representations. The reason for corrupting, or adding noise to, the learning process of the text embedding is to generate a more robust embedding space. Once trained, the encoder is used for generating a fixed length vector representation, as shown in step 2 of Figure 3. This fixed length representation enables our model to replace any sentence with an equivalent vector of numbers. This caption vector is used as input for CapGAN.

Image Generation
In this section, we first give a short background of the adversarial process for automatic image synthesis and then explain the generator step of the CapGAN architecture.

Generative Adversarial Networks
Goodfellow et al. introduced a paradigm for estimating generative models using an adversarial process in 2014 [2], in which two models, the generator G and discriminator D, are trained. G generates fresh data, whereas D evaluates whether the data it receives are genuine or generated by G [30].
For text-to-image synthesis, GAN takes the following steps:
• G receives text as an input and synthesizes an image.
• D accepts a generated image, as well as sample images from the actual dataset, and returns the probability that the image is real, with 1 indicating a real image and 0 indicating a fake image.
The core principle of GAN is depicted in Figure 5. G generates a sample of data G(z) from a random input z, where z is a sample from the probability distribution P(z). D takes an input from either real data x ∼ P_data(x) or fake data G(z) and attempts to predict whether the data are real or fake. D uses a sigmoid function to solve a binary classification problem of real or fake images and returns a value in the range of 0 to 1 [31]. GAN training takes the form of a duel between G and D. Mathematically, this can be expressed as:
$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim P_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim P(z)}[\log(1 - D(G(z)))]$$
where GANs are trained on a minimax game rather than on a standard optimization problem. The first term in function V(D,G) is the entropy when a sample from the real data is fed to D (best case scenario). D tries to maximize this term by pushing its output D(x) towards 1. The second term in function V(D,G) is the entropy when a sample generated from the random distribution is fed to D (worst case scenario). Here, D tries to push its output D(G(z)) towards 0. Overall, D is trying to maximize function V. On the contrary, G is trying to minimize function V so that D cannot differentiate between real and fake. This adversarial method of training, called the minimax game, is taken from game theory.
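To make the two terms of V(D, G) concrete, the following minimal sketch estimates the value function on a mini-batch from discriminator outputs. The function name `gan_value` and the example probabilities are hypothetical and chosen only for illustration; they are not taken from the CapGAN implementation.

```python
import numpy as np

def gan_value(d_real, d_fake, eps=1e-12):
    """Mini-batch estimate of V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))].

    d_real: discriminator outputs in (0, 1) for real samples x.
    d_fake: discriminator outputs in (0, 1) for generated samples G(z).
    """
    d_real = np.clip(np.asarray(d_real, dtype=float), eps, 1.0 - eps)
    d_fake = np.clip(np.asarray(d_fake, dtype=float), eps, 1.0 - eps)
    return float(np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake)))

# Hypothetical outputs: a discriminator that separates real from fake well ...
confident = gan_value(d_real=[0.90, 0.95], d_fake=[0.05, 0.10])
# ... and one that is completely fooled (outputs 0.5 everywhere).
fooled = gan_value(d_real=[0.5, 0.5], d_fake=[0.5, 0.5])
print(confident, fooled)
```

A well-performing D yields a value close to 0 (the maximum), while a fooled D yields −2 log 2, which is the value G drives the game towards.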
As the synthesis of images from extracted features is highly multimodal, the issue is not well solved by conventional deep learning models alone. For a single input text, GANs can generate several photo-realistic images. This multimodality makes GANs an ideal candidate for image synthesis problems. In line with this idea, the generator model of CapGAN is trained to synthesize images with basic shape and color.

Generator (G)
The output of the text encoding step, i.e., a caption vector, is the input of the generator network, as shown in step 3 of Figure 3. In the generator network, the caption vector of length 2400 obtained from skip-thought vectors is first compressed to acquire the text embedding of dimension 256, as shown in Figure 6. This is performed by passing the caption vector through fully connected layers, followed by LeakyReLU. The resulting text embedding is concatenated with noise, projected and reshaped into a tensor of dimension 4 × 4 × 1024. This tensor is passed through a series of deconvolutions for upsampling, and as a result, a tensor of dimension 64 × 64 × 3 is obtained. This tensor is a generated image from the given text, and it is fed to the discriminator for further training.
In convolution-based deep learning models such as CNN, the rotation and translation information among different pixel groups is not captured; therefore, using only convolution layers in GANs has limitations for precise image synthesis. Capsule networks [6,32,33] have recently been proposed to address this limitation of CNN. Capsules are locally invariant groups of neurons that learn to recognize the presence of visual entities in an image by encoding their properties into vector outputs [34]. In a CNN, higher level features are formed as a weighted sum of lower level features. Nowhere in this process is there a pose (translational and rotational) relationship between the features that make up higher level features. In a capsule network, there is a capsule corresponding to each entity in an image, which gives:
1. Probability that the entity exists.
2. Instantiation parameters of that entity.
Instantiation parameters are the properties of that entity in an image such as position, size, hue, etc. As opposed to a neuron's scalar output, a capsule outputs a vector that enables it to encapsulate all important information about the state of the feature.
Table 2 highlights the important differences between capsules and neurons. Neurons receive scalar input from low level neurons, whereas a capsule receives vector input either from low level input or from other neurons. Both the neuron and capsule perform various operations, which include transformation, weighting, summation and activation. The final output produced by the neurons at each layer is a scalar quantity, while capsules produce a vector output. The input and output vectors of capsules enable them to capture the relationship among entities of an object and make them an ideal choice for models aimed at precise image synthesis.
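The contrast between a neuron and a capsule can be illustrated with a small sketch. It assumes the commonly used squashing non-linearity from the capsule network literature, fixed (not routed) coupling coefficients, a ReLU neuron and arbitrary toy dimensions; all names are hypothetical and none of this is taken from the CapGAN code.

```python
import numpy as np

def squash(s, axis=-1, eps=1e-9):
    """Vector activation: keeps the direction of s and maps its length into [0, 1)."""
    sq_norm = np.sum(s ** 2, axis=axis, keepdims=True)
    scale = sq_norm / (1.0 + sq_norm)
    return scale * s / np.sqrt(sq_norm + eps)

def neuron(inputs, weights, bias=0.0):
    """Scalars in, scalar out: weighting, summation, scalar activation (ReLU assumed)."""
    return max(0.0, float(np.dot(weights, inputs) + bias))

def capsule(input_vectors, transforms, coupling):
    """Vectors in, vector out.

    input_vectors: (n_in, d_in) outputs of lower level capsules.
    transforms:    (n_in, d_out, d_in) one transformation matrix per input capsule.
    coupling:      (n_in,) weights; fixed here, learned by routing in practice.
    """
    predictions = np.einsum("ioj,ij->io", transforms, input_vectors)  # transformation
    s = np.sum(coupling[:, None] * predictions, axis=0)               # weighting, summation
    return squash(s)                                                  # vector activation

rng = np.random.default_rng(0)
out = capsule(rng.normal(size=(3, 8)), rng.normal(size=(3, 4, 8)), np.full(3, 1.0 / 3.0))
existence_probability = float(np.linalg.norm(out))  # length of the output vector
print(neuron(np.array([1.0, 2.0]), np.array([0.5, -0.1])), out.shape, existence_probability)
```

In this sketch the length of the capsule output plays the role of the probability that the entity exists, while the direction of the vector carries its instantiation parameters.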
The discriminator of a GAN intended for text-to-image synthesis can receive two types of input:
1. Real images with real text.
2. Synthesized/fake images with random text.
For the proposed CapGAN model, the two types of input are shown in step 4 of Figure 3. In the CapGAN discriminator, a capsule layer is used, along with CNN layers, so that more information is retained by the vectors, thus capturing the relationship among different entities of an object in the input image. Figure 7 shows the overall architecture of the discriminator used for the CapGAN model. Four CNN layers of stride 2 convolutions, each followed by LeakyReLU, are applied on the input image to perform downsampling. Additionally, the caption vectors of size 2400 are also transformed to text embeddings of size 256. This resultant vector, along with the output of the 4th convolutional layer, is then passed through a capsule layer, followed by the LeakyReLU and the squashing function [35]. Then, for further downsampling, the max pooling operation is applied at the output of the activation function. In the end, the discriminator resolves a binary classification problem of real or fake images using a sigmoid function, giving an output between 0 and 1. The detailed architecture utilized for the CapGAN model is shown in Table 3.
For real images, the discriminator just has to decide if an image is real. For fake images, the discriminator should distinguish two forms of errors: (1) a fake image with any text caption and (2) a real image with a mismatching text caption. In view of this, the discriminator (D) has to deal with the following three cases:
$$s_{rr} = D(x, k) \qquad (3)$$
$$s_{rw} = D(x, \hat{k}) \qquad (4)$$
$$s_{fr} = D(\hat{x}, k) \qquad (5)$$
The first scenario is presented in Equation (3), in which a real image x from the dataset, along with the real text k, is given as input to D, while D computes and returns a value in the range of 0 to 1, named s_rr, in response to this input.
Similarly, the second scenario is depicted in Equation (4), in which D is given a real image x as input, along with a fake (mismatching) text k̂, while D computes and returns a sigmoid value, named s_rw, in response to this input.
Likewise, the third scenario is presented in Equation (5), in which D is called with a fake image x̂ and the real text k, while D computes and returns the sigmoid value, named s_fr, in response to this input.
The three values received from Equations (3)-(5) are used to calculate the overall loss of D, named L_D, as shown in Equation (6).
The first term in Equation (6) is the entropy calculated when an image from the real data, along with the real text, is fed to D. D tries to maximize this output to 1. The second and third terms, on the other hand, correspond to a real image with incorrect text and a fake image with correct text being provided to D, respectively. D strives to keep its outputs for these two cases to a minimum. As a result, D is attempting to maximize function L_D, i.e., it is attempting to maximize the difference between its output on real and fake inputs.
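The way the three discriminator outputs s_rr, s_rw and s_fr enter the objective can be sketched as follows. The exact form of Equation (6) is not recoverable from the text above, so this sketch assumes the matching-aware objective of Reed et al. [1], in which the two "fake" terms are averaged; the 0.5 weighting and the function name are assumptions.

```python
import numpy as np

def discriminator_objective(s_rr, s_rw, s_fr, eps=1e-12):
    """Assumed form: L_D = log(s_rr) + 0.5 * (log(1 - s_rw) + log(1 - s_fr)).

    s_rr: D output for (real image, real text)   -> D wants this near 1.
    s_rw: D output for (real image, wrong text)  -> D wants this near 0.
    s_fr: D output for (fake image, real text)   -> D wants this near 0.
    D maximizes the returned value (averaged over the batch).
    """
    s_rr, s_rw, s_fr = (
        np.clip(np.asarray(s, dtype=float), eps, 1.0 - eps) for s in (s_rr, s_rw, s_fr)
    )
    per_sample = np.log(s_rr) + 0.5 * (np.log(1.0 - s_rw) + np.log(1.0 - s_fr))
    return float(np.mean(per_sample))

# Hypothetical outputs of a discriminator that handles all three cases well ...
good = discriminator_objective(s_rr=0.9, s_rw=0.1, s_fr=0.1)
# ... and of one that cannot tell the cases apart.
undecided = discriminator_objective(s_rr=0.5, s_rw=0.5, s_fr=0.5)
print(good, undecided)
```

Under this assumed form, an undecided discriminator scores 2 log 0.5, and the objective approaches 0 as D handles all three cases correctly.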

Results
The CapGAN architecture utilizes a capsule network for image synthesis from a given text statement to overcome the problem of global coherent structures in complex scenes. We conducted comprehensive experimentation using standardized datasets to evaluate the proposed model's performance. In the next subsections, we detail our experimental setup and also benchmark our key results.

Experimental Setup
The CapGAN model is evaluated on the Oxford-102 dataset [8], consisting of flower images, Caltech-UCSD Birds 200 [9] for bird images and ImageNet [10] for images of dogs. The details of each dataset utilized for training and testing of CapGAN are presented in Table 4. We conducted the experiments in a ten-fold cross validation setting, i.e., we conducted a total of 10 experimental rounds by performing random splits of the dataset at each round. For each round, our CapGAN model operates using a fixed number of parameters. Table 5 lists the most important parameters, along with their values, that are used during the execution of the CapGAN model. For training, Adam [36] is used as the optimizer, and sigmoid cross entropy given logits is utilized for calculating the generator and discriminator losses. We picked sigmoid cross entropy given logits because it is used to assess the probabilistic error in discrete classification problems where each class is independent and not mutually exclusive. For optimal performance, after several trials, the model is trained with a batch size of 32 and a learning rate of 0.0002 for 100 epochs. The proposed model is trained and tested on a Tesla K40c GPU by utilizing the OpenCV, Tensorflow, Keras and CuDNN libraries.
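As an illustration of the loss mentioned above, the following sketch implements sigmoid cross entropy given logits in its numerically stable form, max(x, 0) − x·z + log(1 + exp(−|x|)), which is the formulation documented for TensorFlow's `tf.nn.sigmoid_cross_entropy_with_logits`. The label assignment (real = 1, fake = 0, generator targets 1) and the helper names are assumptions for illustration and do not reproduce the authors' training code.

```python
import numpy as np

def sigmoid_cross_entropy_with_logits(labels, logits):
    """Element-wise loss for logits x and labels z: max(x, 0) - x*z + log(1 + exp(-|x|))."""
    labels = np.asarray(labels, dtype=float)
    logits = np.asarray(logits, dtype=float)
    return np.maximum(logits, 0.0) - logits * labels + np.log1p(np.exp(-np.abs(logits)))

def discriminator_loss(real_logits, fake_logits):
    """Assumed labeling: real inputs should be classified as 1, generated inputs as 0."""
    real_logits = np.asarray(real_logits, dtype=float)
    fake_logits = np.asarray(fake_logits, dtype=float)
    real = sigmoid_cross_entropy_with_logits(np.ones_like(real_logits), real_logits)
    fake = sigmoid_cross_entropy_with_logits(np.zeros_like(fake_logits), fake_logits)
    return float(np.mean(real) + np.mean(fake))

def generator_loss(fake_logits):
    """Assumed labeling: G is rewarded when D classifies generated inputs as real (1)."""
    fake_logits = np.asarray(fake_logits, dtype=float)
    return float(np.mean(sigmoid_cross_entropy_with_logits(np.ones_like(fake_logits), fake_logits)))

# Hyperparameters stated in the text.
HPARAMS = {"batch_size": 32, "learning_rate": 0.0002, "epochs": 100}

print(discriminator_loss([2.0, 3.0], [-2.0, -1.0]), generator_loss([-2.0, -1.0]))
```

Working on logits rather than on sigmoid outputs avoids taking the logarithm of values that round to 0 or 1 for large-magnitude logits.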

Evaluation Metric
Evaluation metrics for supervised learning tasks are straightforward, as the problem at hand always has a clearly defined and available ground truth. However, for text-to-image synthesis problems, the conventional evaluation metrics are not feasible, due to the elusive nature of the expected output, i.e., the results are highly multimodal, and no ground truth is available. Therefore, for our problem, we chose the most widely used inception score (IS) and Fréchet inception distance (FID) evaluation metrics. The details of these two metrics are given as follows:

Inception Score
The inception score is an evaluation metric for generative models that measures "on average how different is the score distribution of synthesized images from the overall class balance" [37]. The inception score uses two criteria for measuring GAN performance:
• Saliency: Saliency indicates that objects in an image should be recognizable. Given x as an input, the predicted output y should have a high probability. In terms of image generation, given an image, an object should be recognized easily. Thus, the conditional probability p(y|x) should be high, and as a result, the entropy is low.
• Diversity: Diversity indicates the variety of details in an image. This means, given a predicted output y, the marginal probability p(y) should be high. This implies that for diverse images, the data distribution of y should be uniform, thus resulting in high entropy.
For computing the inception score, the Kullback-Leibler (KL) divergence D_KL is used by plugging in both probabilities, i.e., the conditional probability p(y|x) and the marginal probability p(y), as shown in Equation (8).
$$IS(G) \approx \exp\left(\frac{1}{N}\sum_{i=1}^{N} D_{KL}\left(p(y|x^{(i)}) \,\|\, p(y)\right)\right) \qquad (8)$$
where N indicates the number of images generated and x^(i) denotes the i-th generated image. The intuition behind calculating the inception score is that the model should generate diverse but meaningful images. A higher value of the inception score indicates that the generated images are diverse and the objects in the images are highly predictable.
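The estimator of Equation (8) can be sketched in a few lines, assuming the class probabilities p(y|x) have already been obtained from an Inception classifier (that step is not shown). The function name and the toy probability matrices below are hypothetical.

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """Inception score from an (N, C) matrix whose rows are p(y | x_i).

    The marginal p(y) is estimated as the mean of the rows, and the score is
    exp of the average KL divergence between each row and the marginal.
    """
    probs = np.asarray(probs, dtype=float)
    p_y = probs.mean(axis=0, keepdims=True)
    kl = np.sum(probs * (np.log(probs + eps) - np.log(p_y + eps)), axis=1)
    return float(np.exp(kl.mean()))

# Confident predictions spread evenly over 3 classes: salient and diverse.
sharp_and_diverse = inception_score(np.eye(3))
# Every image gets the same uniform prediction: neither salient nor diverse.
uninformative = inception_score(np.full((4, 3), 1.0 / 3.0))
print(sharp_and_diverse, uninformative)
```

With C classes the score ranges from 1 (uninformative predictions) to C (confident predictions spread evenly over all classes), which matches the saliency and diversity criteria described above.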

Fréchet Inception Distance
Heusel et al. [38] proposed the Fréchet inception distance (FID), which is a variation of IS. FID is a technique for capturing the similarity between generated and real-world images. The IS calculates the quality of the synthetic images by combining the confidence of each synthesized image's conditional predictions with the marginal probability of the predicted class. Real images are never compared with generated images in IS. The goal of the FID score, in contrast, is to compare the statistics of a collection of synthetic images to the statistics of a collection of real images in order to evaluate the synthetic images.
In order to calculate FID, the generated samples are embedded into a feature space of the inception network. The mean and covariance are estimated for both the generated data and the real data by treating the embedding layer as a continuous multivariate Gaussian [39]. The Fréchet distance between these two Gaussians is then used to quantify the quality of the generated samples using the following equation:
$$FID = \|\mu_r - \mu_g\|^2 + \mathrm{Tr}\left(\Sigma_r + \Sigma_g - 2\left(\Sigma_r \Sigma_g\right)^{1/2}\right)$$
The estimated mean and covariance of the real and generated data are represented by µ_r, Σ_r and µ_g, Σ_g, respectively. The lower the FID, the more realistic the resulting images are. The positive linear association between the FID score and the distorted/poorly synthesized images is depicted in Figure 8, indicating that FID is sensitive to any disruption in a generated image.
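The Fréchet distance between the two Gaussians (µ_r, Σ_r) and (µ_g, Σ_g) can be sketched as below. The sketch assumes that feature embeddings from the inception network are already available, and, as an implementation choice not taken from the paper, it obtains the trace of the matrix square root from the eigenvalues of Σ_r Σ_g instead of computing the square root explicitly. Function names are hypothetical.

```python
import numpy as np

def frechet_distance(mu_r, sigma_r, mu_g, sigma_g):
    """||mu_r - mu_g||^2 + Tr(sigma_r + sigma_g - 2 (sigma_r sigma_g)^(1/2))."""
    diff = np.asarray(mu_r, dtype=float) - np.asarray(mu_g, dtype=float)
    # The eigenvalues of a product of two covariance matrices are real and
    # non-negative, so the trace of its square root is the sum of their roots.
    eigvals = np.linalg.eigvals(np.asarray(sigma_r) @ np.asarray(sigma_g))
    tr_sqrt = np.sum(np.sqrt(np.clip(eigvals.real, 0.0, None)))
    return float(diff @ diff + np.trace(sigma_r) + np.trace(sigma_g) - 2.0 * tr_sqrt)

def fid_from_features(real_feats, gen_feats):
    """FID from two (num_images, feature_dim) arrays of embedded images."""
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    sigma_r = np.cov(real_feats, rowvar=False)
    sigma_g = np.cov(gen_feats, rowvar=False)
    return frechet_distance(mu_r, sigma_r, mu_g, sigma_g)

identical = frechet_distance(np.zeros(2), np.eye(2), np.zeros(2), np.eye(2))
shifted = frechet_distance(np.zeros(2), np.eye(2), np.array([3.0, 4.0]), np.eye(2))
print(identical, shifted)
```

Identical statistics give a distance of 0, and any shift in the mean or change in the covariance increases it, consistent with lower FID values indicating more realistic images.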

Statistical Results
The original proposal for the inception score recommended applying the estimator (of Equation (8)) 10 times with N = 5000 (the number of target images). The mean and standard deviation of the obtained scores are then calculated [37]. The inception score calculated for 5000 generated images from given random captions using the CapGAN model after the 100th epoch of training was 2.28 ± 0.627, while the inception score on the entire Oxford-102 flowers dataset [8] after the 10-fold cross validation was 4.05 ± 0.050. The epoch-wise training results for 5000 generated images can be seen in Figure 9. The most remarkable result to emerge from these data is that as the model improves with training, the inception score rises significantly, while the standard deviation of the scores initially increases and then starts to decline. This indicates that the diversity of generated images and their predictability increase considerably with more training. We stopped our model training at the 100th epoch as the model was tending towards overfitting after this iteration. The calculated IS and FID for various complete datasets using CapGAN are listed in Table 6.
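The "mean ± standard deviation" reporting described above can be sketched as follows: the estimator is applied to several groups of images and the resulting scores are summarized. Splitting one array of predictions into equal consecutive groups is an assumption about the protocol, and the function names and toy data are hypothetical.

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """exp of the mean KL divergence between each row p(y | x_i) and the marginal p(y)."""
    probs = np.asarray(probs, dtype=float)
    p_y = probs.mean(axis=0, keepdims=True)
    kl = np.sum(probs * (np.log(probs + eps) - np.log(p_y + eps)), axis=1)
    return float(np.exp(kl.mean()))

def split_inception_score(probs, n_splits=10):
    """Apply the estimator to n_splits groups and return (mean, standard deviation)."""
    groups = np.array_split(np.asarray(probs, dtype=float), n_splits)
    scores = [inception_score(group) for group in groups]
    return float(np.mean(scores)), float(np.std(scores))

# Toy predictions: 10 groups, each holding one confident prediction per class.
toy_probs = np.tile(np.eye(3), (10, 1))
mean_is, std_is = split_inception_score(toy_probs, n_splits=10)
print(f"{mean_is:.2f} ± {std_is:.3f}")
```

A small standard deviation across groups, as in the reported 4.05 ± 0.050, indicates that the score is stable with respect to which images are evaluated.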

Visual Results
Another way of looking at the results is through visual inspection of the synthesized images. For this purpose, we present the images generated using CapGAN in Table 7. For each input caption, we report ten images of size 64 × 64 generated after the 25th and 100th epochs using CapGAN. As an example, for the first caption, i.e., "This flower has petals that are yellow and has black stamen.", the model instantly learns the trivial details such as the yellow color, as can be seen in the 25th epoch images, but the global coherent structure, e.g., the petals on the flower, is not learned well at this stage. As the learning continues, the sigmoid cross entropy given logits loss at the generator and discriminator reduces significantly, which ultimately improves the inception score. Figure 10 shows the losses for G and D calculated for GAN and CapGAN during training. After the 100th epoch, when the model is fully trained, the model learns the coherent structure details described in the input sentence, as can be seen in the second row of the first caption output in Table 7. Similarly, the same ability of the model can be observed for all other captions reported in Table 7. Moreover, samples of the various dog images generated by the CapGAN model trained for 100 epochs are shown in Figure 11. These results offer compelling evidence about the ability of the CapGAN model to learn the spatial relationships among different entities of an object in an image.

Comparative Results
CapGAN is compared to the earlier state-of-the-art models for text-to-image synthesis to further highlight the usefulness of the proposed model. Our model is compared against the GAN [1], StackGAN [4], StackGAN++ [21] and TAC-GAN [3] architectures. All these methods use a GAN architecture as the backbone for translating a single sentence directly into pixels. In text-to-image synthesis utilizing GANs, a single deep convolutional generative adversarial network (DC-GAN) is trained and conditioned on text features encoded by a convolutional recurrent neural network (CRNN). Feed-forward inference is performed by both the generator (G) and the discriminator (D) based on the text feature. Among all these models, the text-conditioned auxiliary classifier generative adversarial network (TAC-GAN) is specifically designed to improve the resolution of images synthesized for complex scenes from a given text. In TAC-GAN, the generator is a neural network made up of a series of transposed convolutional layers, while the discriminator is a network that takes an input image and passes it through a number of convolutional layers to determine whether the image is real or fake.

[Table 7 excerpt (generated images not reproduced). Columns: Input Text, Epoch, Examples of Generated Images. Input texts include: "This flower has petals that are yellow and has black stamen." and "The pretty flower has a lot of short blue petals."]

Table 8 lists the ISs and FIDs calculated using different models for the Oxford-102 flowers dataset [8], Caltech-UCSD 200 [9] and ImageNet [10] for images of dogs. It shows that CapGAN achieves the highest inception score and the lowest Fréchet inception distance, outperforming the previous models. In comparison to the other models, the higher inception score indicates that CapGAN's images are more recognizable and meaningful and carry a greater diversity of information. The lower FID scores suggest that the generated images are less distorted and closer to real-world images.
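To make the FID comparison concrete, the following sketch computes the Fréchet distance between two sets of feature vectors under the standard definition, FID = ||μ_r − μ_g||² + Tr(Σ_r + Σ_g − 2(Σ_r Σ_g)^{1/2}). It assumes that the features have already been extracted from a pretrained Inception network. The function name `frechet_distance` is hypothetical, and the trace of the matrix square root is obtained from the eigenvalues of Σ_r Σ_g, which is a choice made for this illustration rather than the procedure used by the authors.

```python
import numpy as np

def frechet_distance(feats_real, feats_fake):
    """Fréchet distance between Gaussians fitted to two feature sets.

    feats_real, feats_fake: arrays of shape (N, D) holding one feature
    vector per image (assumed to come from an Inception network).
    """
    a = np.asarray(feats_real, dtype=np.float64)
    b = np.asarray(feats_fake, dtype=np.float64)
    mu_a, mu_b = a.mean(axis=0), b.mean(axis=0)
    cov_a = np.atleast_2d(np.cov(a, rowvar=False))
    cov_b = np.atleast_2d(np.cov(b, rowvar=False))
    # The product of two positive semi-definite matrices has real,
    # non-negative eigenvalues, so Tr((cov_a cov_b)^(1/2)) is the sum of
    # their square roots; clipping guards against tiny negative round-off.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * tr_sqrt)
```

Identical feature sets give a distance of approximately zero, and the distance grows as the means or covariances of the two sets drift apart, which is why lower values indicate generated images that are statistically closer to real ones.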

Discussion
In complex scenes, objects are composed of multiple entities that are interlinked to form a whole; however, the color(s) and basic shape of each entity in the scene can be fully viewed and determined separately. For complex scenes, we proposed and evaluated a new model called CapGAN that utilizes a capsule network for image synthesis from a given text statement to overcome the problem of global coherent structures in complex scenes. Our model uses skip-thought vectors as text encoders to construct highly generic fixed-length sentence representations from a single text statement as input. This encoded vector is used as input for image synthesis through an adversarial approach in which two models, generator (G) and discriminator (D), are trained simultaneously. Our model is conceptually unique in that it integrates capsules at the discriminator level to allow it to grasp the orientational and relative spatial relationships between different elements of an object.
To better understand the effectiveness of using a capsule layer at the discriminator level, we compare the images generated using GAN (without capsules) with images synthesized using the CapGAN model. Table 9 shows images generated using GAN and CapGAN. For GAN, all layers are kept as conventional convolutional layers at the discriminator level. For CapGAN, however, capsule layers are integrated at the discriminator level. From Table 9, it is clear that images generated using the capsule layer at the discriminator level are visually more appealing than the images generated using conventional layers. It is also worth noting that the saliency (i.e., the probability of object presence in the synthesized images) and the diversity (i.e., the variations in the synthesized images) are considerably better for the CapGAN model than for the GAN model.

Multimodality Preservance
The problem of generating images from text descriptions is highly multimodal, meaning that there can be multiple correct answers for a single input sentence. For image synthesis from text, multimodality implies that there are numerous possible pixel configurations that can accurately depict the same description. The CapGAN model preserves this multimodality. To verify this, in Table 10 the images generated randomly from a given text using CapGAN are compared with images from the dataset. The images in the two columns differ from each other; however, it can clearly be seen that they are all correct visual illustrations of the given input text, i.e., the entities, color(s) and shape of each entity mentioned in the input text are present in the generated image. As an example, for the first sentence, i.e., "This flower has a white petal with a yellow center.", both the dataset image and the generated image show a flower with white petals and a yellow center. Similarly, for the text "This particular bird has a belly that is gray and white.", the model has generated a bird with gray and white coloring; however, close inspection of the ground truth shows that the bird also has a yellow beak. This information was not available in the text; thus, in the generated image, the bird has a white beak.

[Table fragment (generated images not reproduced). Columns: GAN, CapGAN. Input text: "This particular bird has a belly that is gray and yellow."]

Synthesis of Global Coherent Structures
It is also of interest to examine the relationship between coherent structures in complex scenes and the ability of CapGAN to synthesize them. For this purpose, we integrated capsule networks at the discriminator level in the CapGAN model. Capsule networks extract the geometric information of an object in an image in the form of vectors and use it for inverse rendering. Therefore, in contrast to conventional deep networks, a capsule network readily identifies the spatial associations among an object's several entities in a scene.
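The vector-valued output of a capsule can be illustrated with the squashing nonlinearity commonly used in capsule networks, v = (||s||² / (1 + ||s||²)) · s / ||s||. The sketch below assumes that the capsule layers in the discriminator follow this widely used form; this is an assumption made for illustration, and the function name `squash` and the parameter `eps` are hypothetical.

```python
import numpy as np

def squash(s, axis=-1, eps=1e-9):
    """Shrink each capsule vector to a length in [0, 1) without rotating it.

    The direction of the output vector encodes the pose of the detected
    entity, while its length can be read as the probability that the
    entity is present.
    """
    s = np.asarray(s, dtype=np.float64)
    sq_norm = np.sum(s * s, axis=axis, keepdims=True)
    scale = sq_norm / (1.0 + sq_norm)
    # eps avoids division by zero for all-zero capsule inputs.
    return scale * s / np.sqrt(sq_norm + eps)
```

Because only the length is rescaled, orientation information carried by the vector survives the nonlinearity, which is the property that lets a capsule represent spatial relationships rather than only the presence of a feature.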
The power of capsule networks appears to be well substantiated by the results produced using the CapGAN model. The images generated by CapGAN are closer to the given text and show stronger relative spatial and orientational associations between objects, as well as between groups of pixels, than images generated using conventional networks. For instance, in the images generated by CapGAN from the first text shown in Table 9, the flowers have long petals that are more curved down and a proper black center compared to the images generated by GAN. Moreover, the majority of the images generated by CapGAN are close to realistic images.

[Table 10 body (ground-truth and generated images not reproduced). Columns: Text, Ground Truth, Generated Images Using CapGAN. Input texts:
"This flower has a white petal with a yellow center."
"This flower has red petals with white center."
"This flower has a yellow petal with orange spots."
"This flower has pink petals with a pink center."
"This bird is yellow and black in color, with a long black beak."
"This particular bird has a belly that is gray and white."
"This is a brown and beige bird and brown on the crown"
"White Shih-Tzu"]

On the other hand, many images generated by GAN are far from reality when compared with the expected output for the input text. For example, for the second text shown in Table 9, white and yellow colors are merged together in the images generated by GAN, whereas for CapGAN the transition from the yellow center to the white petals is much smoother and closer to reality. Likewise, in the images generated from the third text using CapGAN, the petals are more pink, vertically layered and connected compared to the petals in images synthesized by GAN. In many GAN images, the connections between the petals and various parts of the flowers are missing, while they are preserved and well synthesized using CapGAN. Thus, all these findings correlate favorably with our argument and further support the idea that the CapGAN model excels at capturing the color(s) and basic shape of each entity in the scene, as well as the spatial relationships between objects in complex scenes.

Conclusions
In this paper, we proposed and tested a model called CapGAN for generating images from a given text statement. The proposed model is based on an adversarial process in which two models, generator (G) and discriminator (D), are trained simultaneously. The convolutional layers are replaced by capsule layers in CapGAN's discriminator stage. The capsules outperform traditional convolutional neural networks because they incorporate orientation and relative spatial interactions between various objects. The experimental findings convincingly demonstrate the usefulness of the proposed CapGAN model, which is especially important for generating images of complex scenes. For the image synthesis problem, the proposed model outperforms the existing state-of-the-art models. In the future, the model developed in this research can be scaled up to generate higher-resolution images. Since a GAN's capability is limited by the generator's potential, the traditional deconvolutional neural network at the generator level can in future be replaced by anti-capsule networks for better results. Furthermore, many approaches have used a multistage GAN architecture to increase the image resolution, where the output obtained in one phase is passed on to the next phase. It is believed that the results from CapGAN can be further improved by using such multistage architectures.
Author Contributions: H.U.R. and A.B. supplied the domain knowledge and framed the problem, while M.O. devised the idea of using a capsule network in conjunction with a convolutional neural network for text-to-image synthesis. The experiment code was written by M.O. and H.U.R. Both O.B.S. and G.P. assisted with the execution of experiments, as well as the analysis and preparation of figures. M.A. assisted with the execution of experiments, analysis, preparation of figures and visualization of the experimental results. In the production and revision of the work, all authors contributed equally. All authors have read and agreed to the published version of the manuscript.
Funding: We are extremely thankful to the Qatar National Library for supporting the Open Access publication charges of this publication.

Figure 3. An illustration of the proposed CapGAN architecture for text-to-image synthesis.

Figure 5. GAN: Idea of Generator Neural Network and Discriminator Neural Network.

Figure 7. The discriminator architecture used in the proposed CapGAN model for automatic text-to-image synthesis.

Figure 8. Rise in FID score observed as disturbance in images increases. (a) Salt and Pepper Noise. (b) Gaussian Noise.

Figure 9. The inception score plotted against epochs during training.

Figure 10. (a,b) show the losses for G and D for CapGAN, respectively. For D, the loss decreases as the epochs increase. However, the loss of G starts increasing after the 60th epoch, which indicates that D has become too strong relative to G. Beyond this point, G finds it almost impossible to fool D. When the D loss decreases to a small value (i.e., 0.1 to 0.2) and the G loss increases to a high value (i.e., 2 to 3), the model is considered trained, as G cannot be improved further. (c,d) show the losses for G and D of GAN, for which all layers are kept as convolutional layers. In comparison, the D loss for CapGAN is lower than that for GAN.

Figure 11. Sample of the dog images generated by the CapGAN model trained on the ImageNet dataset.

[Table 9 body (generated images not reproduced). Each input text is followed by images labeled GAN and CapGAN. Input texts:
"This flower has long yellow petals that are curved down and a black center with black anthers on it."
"This flower is white and yellow in color, and has petals that are yellow near the center."
"This flower is pink in color, and has petals that are oddly shaped and vertically layered."
"This is a bird with grey wings, a white neck and a black beak."
"This bird is red in color, with black wings."]

Table 1. Examples of multiple images generated from a single text statement.

Table 3. Details of discriminator level layers in the proposed CapGAN model for automatic text-to-image synthesis.

Table 4. Details for each dataset utilized for training and testing of CapGAN.

Table 6. IS and FID score calculated using CapGAN.

Table 7. Examples of early and final stage images generated from various input texts using CapGAN.

Table 8. The inception score (IS) and the Fréchet inception distance (FID) calculated using CapGAN.

Table 9. Comparison of images generated from a given text using the GAN and CapGAN models.

Table 10. Images generated vs. images from dataset.