Applied Sciences
  • Article
  • Open Access

Published: 16 September 2023

Novel Paintings from the Latent Diffusion Model through Transfer Learning

Affiliations:
1 Space Star Technology Co., Ltd., Beijing 100086, China
2 Space Innovation Technology Co., Ltd., Beijing 100070, China
3 Institute of Intelligent Manufacturing, Heilongjiang Academy of Sciences, Harbin 150090, China
* Author to whom correspondence should be addressed.
This article belongs to the Special Issue Advances and Applications of Digital Image Processing and Deep Learning

Abstract

With the development of deep learning, image synthesis has achieved unprecedented results in the past few years. Image synthesis models, represented by diffusion models, demonstrate stable and high-fidelity image generation. However, the traditional diffusion model computes in pixel space, which is memory- and compute-intensive. Therefore, to reduce this expensive computation and improve the accessibility of diffusion models, we train the diffusion model in latent space. In this paper, we are devoted to creating novel paintings from existing paintings based on powerful diffusion models. Because the latent diffusion model adopts a cross-attention layer, we can create novel paintings conditioned on text prompts. However, directly training the diffusion model on a limited dataset is non-trivial. Therefore, inspired by transfer learning, we train the diffusion model from pre-trained weights, which eases the training process and enhances the image synthesis results. Additionally, we introduce the GPT-2 model to expand text prompts for detailed image generation. To validate the performance of our model, we train the model on paintings of specific artists from the WikiArt dataset. To make up for the missing image context descriptions in WikiArt, we adopt a pre-trained language model to generate the corresponding descriptions automatically and clean wrong descriptions manually, and we will make this dataset publicly available. Experimental results demonstrate the capacity and effectiveness of the model.

1. Introduction

Image synthesis has made excellent progress with the rapid development of deep learning and computer vision, and it has wide applications in design and painting. The generative adversarial network [] made great progress in generating high-quality images. It is inspired by game theory: the generator and the discriminator compete with each other, which makes both evolve at the same time. However, due to its adversarial training nature, the training of the adversarial network is known to be unstable, and the diversity of the generated images is limited. Another generative model, the variational autoencoder [], is similar to the autoencoder but is deeply rooted in variational Bayesian and graphical models. The autoencoder is a neural network designed to learn the identity function and reconstruct its input in an unsupervised way. Different from the autoencoder, the variational autoencoder maps the original input into a distribution, and its training process is supervised by the Kullback–Leibler divergence loss.
Recently, diffusion probabilistic models [,,,,], built from denoising autoencoders, have shown unprecedented results in image synthesis, super-resolution, inpainting, and stylization. Compared with generative adversarial models, the diffusion model, with its likelihood-based nature, exhibits more stable training and models complicated image structures by exploiting shared parameters. There are two Markov chains in the diffusion model: the forward diffusion process and the reverse denoising process. The forward diffusion process adds Gaussian random noise to the given image in sequence until the disturbed sample satisfies a Gaussian distribution. The reverse denoising process generates the image from Gaussian noise conditioned on the given input, e.g., text, audio, or image. The forward diffusion destroys the image with random noise, and the reverse denoising learns to reconstruct the image gradually. The diffusion model, with its mode-covering characteristics, has a powerful ability to represent imperceptible details. However, repeated function and gradient evaluations in the diffusion model demand massive computing resources. Training the diffusion model [] requires plenty of computational resources and takes several hundred GPU days, which is quite costly and difficult for common use. Inference and evaluation of the trained model are also expensive in memory and time.
To ease massive computing resource consumption and improve the accessibility of the diffusion model, Rombach et al. [] propose the latent diffusion model, which reduces computational complexity without obvious performance degradation. The latent diffusion model accomplishes this by learning the denoising process in latent space, striking a balance between visual quality and computational cost. Furthermore, the model incorporates a cross-attention framework, enabling flexible image synthesis that can accommodate various input modalities, e.g., text, audio, and images. Finally, the decoder of the autoencoder translates the denoised latent code into a real image. Building upon the powerful latent diffusion model, we are dedicated to creating innovative paintings. Throughout history, numerous talented artists, such as Leonardo da Vinci and Vincent van Gogh, created many impressive artworks in their careers. Our aspiration is to invite those late talented artists to create novel paintings on modern topics through the diffusion model. We attempt to train the latent diffusion model on the paintings of a specific artist from the WikiArt dataset []. However, these late artists’ collections offer far fewer examples than the extensive training data required for neural networks. Therefore, inspired by transfer learning, we train the latent diffusion model with publicly available pre-trained weights. The pre-trained diffusion model has already processed millions of images and demonstrates excellent representation capabilities, which greatly benefits our model’s training process. After retraining, we can synthesize novel paintings from any given text prompt. To the best of our knowledge, we are the first to propose creating novel paintings from famous artists’ works based on the diffusion model.
For detailed image generation, we incorporate the large language model GPT-2 [] to enhance the text prompts. The prompt is fed into the pre-trained large language model, which uses its rich knowledge to understand, explain, and further extend the prompt, supporting the generation of more exquisite images. Because the existing WikiArt dataset lacks corresponding image context descriptions, we adopt an image tagging model and a vision-language model to generate these descriptions, and then use a contrastive model to remove incorrect image text descriptions. We will release the image context description dataset to the public. Experimental results show the high quality of the generated paintings and the effectiveness of our framework. The main contributions of our paper are summarized as follows:
  • We propose the painting model to create novel paintings from those late famous artists’ works for the first time, which is based on the latent diffusion model with transfer learning.
  • We propose text prompt expansion, which utilizes the strengths of large language models to complete text prompts and generate more detailed images.
  • We contribute the missing image context descriptions, which are complementary to the original WikiArt dataset, and we will release them to the public.
  • We demonstrate photo-realistic painting results by giving different text prompt inputs to the trained model.

3. Materials and Methods

3.1. Denoising Diffusion Probabilistic Model

Ho et al. [] proposed the denoising diffusion probabilistic model. There are two processes in this model, the diffusion process and the denoising process, as shown in Figure 4. In the diffusion process, given sample data $x_0$ from the data distribution $q(x)$, Gaussian noise is added to $x_0$ to acquire a sequence of disturbed samples $x_0, x_1, \ldots, x_T$:
$$
q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t \mathbf{I}\right), \qquad q(x_{1:T} \mid x_0) = \prod_{t=1}^{T} q(x_t \mid x_{t-1}),
$$
where $T$ denotes the number of diffusion steps, $\beta_t$ is a hyper-parameter in the range $(0, 1)$, $\mathcal{N}(\mu, \Sigma)$ is the Gaussian distribution with mean $\mu$ and variance $\Sigma$, and $\mathbf{I}$ is the identity matrix. We can obtain an arbitrary sample $x_t$ in the diffusion process from Equation (1) through the reparameterization trick:
$$
\begin{aligned}
x_t &= \sqrt{\alpha_t}\,x_{t-1} + \sqrt{1-\alpha_t}\,\epsilon_1 \\
&= \sqrt{\alpha_t \alpha_{t-1}}\,x_{t-2} + \sqrt{1-\alpha_t}\,\epsilon_1 + \sqrt{\alpha_t(1-\alpha_{t-1})}\,\epsilon_2 \\
&= \sqrt{\alpha_t \alpha_{t-1}}\,x_{t-2} + \sqrt{1-\alpha_t \alpha_{t-1}}\,\bar{\epsilon}_2 \\
&= \cdots \\
&= \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon_t,
\end{aligned}
$$
where $\alpha_t = 1 - \beta_t$, $\bar{\alpha}_t = \prod_{i=1}^{t} \alpha_i$, and $\epsilon_1, \epsilon_2, \bar{\epsilon}_2, \epsilon_t \sim \mathcal{N}(0, \mathbf{I})$. Usually, $\beta_1 < \beta_2 < \cdots < \beta_T$ and $\bar{\alpha}_1 > \cdots > \bar{\alpha}_T$, so $x_T$ converges to an isotropic Gaussian distribution as $T \to \infty$. Equation (2) can also be formulated as
$$
q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar{\alpha}_t}\,x_0,\ (1-\bar{\alpha}_t)\,\mathbf{I}\right).
$$
Figure 4. The diffusion and denoising process.
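To make the forward process concrete, the following minimal PyTorch sketch samples $x_t$ directly from $x_0$ using the closed form above. The linear $\beta$ schedule and its endpoints are illustrative assumptions of the sketch, not the exact values used in our experiments.

```python
import torch

# Minimal sketch of the closed-form forward process q(x_t | x_0).
# The linear beta schedule and its endpoints are illustrative assumptions.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)          # beta_1 < ... < beta_T
alphas = 1.0 - betas
alphas_bar = torch.cumprod(alphas, dim=0)      # \bar{alpha}_t = prod_{i<=t} alpha_i

def q_sample(x0: torch.Tensor, t: int) -> torch.Tensor:
    """Draw x_t ~ q(x_t | x_0) = N(sqrt(abar_t) x_0, (1 - abar_t) I)."""
    eps = torch.randn_like(x0)
    return alphas_bar[t].sqrt() * x0 + (1.0 - alphas_bar[t]).sqrt() * eps
```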
In the denoising process, to reconstruct the image from the Gaussian noise $x_T$, we need to reverse the diffusion process and learn a model $p_\theta$ to estimate these conditional probabilities:
$$
p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right), \qquad p_\theta(x_{0:T}) = p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t).
$$
The reverse conditional probability $q(x_{t-1} \mid x_t, x_0)$, which is tractable and can be computed by Bayes’ rule, can be regarded as the reference for learning $p_\theta(x_{t-1} \mid x_t)$:
$$
\begin{aligned}
q(x_{t-1} \mid x_t, x_0) &= \mathcal{N}\!\left(x_{t-1};\ \tilde{\mu}_t(x_t, x_0),\ \tilde{\beta}_t \mathbf{I}\right) = \frac{q(x_t \mid x_{t-1}, x_0)\, q(x_{t-1} \mid x_0)}{q(x_t \mid x_0)} \\
&\propto \exp\!\left( -\frac{1}{2} \left( \frac{\left(x_t - \sqrt{\alpha_t}\, x_{t-1}\right)^2}{\beta_t} + \frac{\left(x_{t-1} - \sqrt{\bar{\alpha}_{t-1}}\, x_0\right)^2}{1 - \bar{\alpha}_{t-1}} - \frac{\left(x_t - \sqrt{\bar{\alpha}_t}\, x_0\right)^2}{1 - \bar{\alpha}_t} \right) \right) \\
&= \exp\!\left( -\frac{1}{2} \left( \left( \frac{\alpha_t}{\beta_t} + \frac{1}{1 - \bar{\alpha}_{t-1}} \right) x_{t-1}^2 - \left( \frac{2\sqrt{\alpha_t}}{\beta_t}\, x_t + \frac{2\sqrt{\bar{\alpha}_{t-1}}}{1 - \bar{\alpha}_{t-1}}\, x_0 \right) x_{t-1} + C \right) \right) \\
&= \exp\!\left( -\frac{1}{2 \tilde{\beta}_t} \left( x_{t-1} - \tilde{\mu}_t \right)^2 \right),
\end{aligned}
$$
where $C$ is a term that does not involve $x_{t-1}$, $\tilde{\mu}_t$ depends on $x_t$ and $\epsilon_t$, and $\tilde{\beta}_t$ is a scalar:
$$
\tilde{\mu}_t = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}}\, \epsilon_t \right), \qquad \tilde{\beta}_t = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t}\, \beta_t .
$$
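A minimal sketch of one reverse step built from $\tilde{\mu}_t$ and $\tilde{\beta}_t$ is given below; it reuses the schedule arrays (`T`, `betas`, `alphas`, `alphas_bar`) from the forward-process sketch, and `eps_model` is a placeholder name for the trained noise predictor.

```python
import torch

@torch.no_grad()
def p_sample_step(eps_model, x_t: torch.Tensor, t: int) -> torch.Tensor:
    """One reverse step x_t -> x_{t-1} using the posterior mean and variance above.
    `eps_model(x_t, t)` is assumed to return the predicted noise epsilon_theta(x_t, t)."""
    beta_t, alpha_t, abar_t = betas[t], alphas[t], alphas_bar[t]
    eps = eps_model(x_t, t)
    mean = (x_t - (1.0 - alpha_t) / (1.0 - abar_t).sqrt() * eps) / alpha_t.sqrt()
    if t == 0:
        return mean                                   # no noise is added at the last step
    abar_prev = alphas_bar[t - 1]
    var = (1.0 - abar_prev) / (1.0 - abar_t) * beta_t  # \tilde{beta}_t
    return mean + var.sqrt() * torch.randn_like(x_t)
```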
The diffusion model is a likelihood-based model, which learns to denoise the variable with a sequence of denoising autoencoders. To supervise the training of the diffusion model, Ho et al. [] found that a simplified objective function achieves better performance:
$$
\mathcal{L}_t^{\mathrm{DM}} = \mathbb{E}_{x_0,\, t \sim [1,T],\, \epsilon_t \sim \mathcal{N}(0, \mathbf{I})} \left[ \left\lVert \epsilon_t - \epsilon_\theta\!\left(x_t, t\right) \right\rVert^2 \right] = \mathbb{E}_{x_0,\, t \sim [1,T],\, \epsilon_t \sim \mathcal{N}(0, \mathbf{I})} \left[ \left\lVert \epsilon_t - \epsilon_\theta\!\left(\sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon_t,\ t\right) \right\rVert^2 \right],
$$
where the conditional denoising network $\epsilon_\theta$ with parameters $\theta$ is designed to predict the noise from the disturbed sample $x_t$.
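The simplified objective above translates into a short training-loss sketch, again reusing the schedule arrays defined earlier; `eps_model` is a placeholder for any network that takes a noisy image and a timestep and returns a noise estimate.

```python
import torch
import torch.nn.functional as F

def ddpm_loss(eps_model, x0: torch.Tensor) -> torch.Tensor:
    """Simplified DDPM objective: predict the injected noise with an MSE loss."""
    b = x0.shape[0]
    t = torch.randint(0, T, (b,), device=x0.device)              # t ~ Uniform[1, T]
    eps = torch.randn_like(x0)
    abar = alphas_bar.to(x0.device)[t].view(b, 1, 1, 1)
    x_t = abar.sqrt() * x0 + (1.0 - abar).sqrt() * eps            # closed-form forward step
    return F.mse_loss(eps_model(x_t, t), eps)
```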

3.2. Latent Diffusion Model

To ease the demanding computing resources, we train the diffusion model in latent space instead of pixel space. In the latent diffusion model, the learning of the likelihood model is divided into perceptual compression and semantic compression. The perceptual compression learns the partial semantic variation and removes the high-frequency details. The semantic compression learns the contextual and semantic features of the image. Because most bits of a digital image contribute to eye-imperceptible details, the latent diffusion model is dedicated to searching for a computationally cheaper but perceptually equivalent space for high-fidelity image synthesis. The structure of the latent diffusion model is shown in Figure 5. The perceptual compression depends on an autoencoder model $\mathcal{E}$, which maps the image $x_0 \in \mathbb{R}^{H \times W \times 3}$ into the latent feature $z_0 = \mathcal{E}(x_0) \in \mathbb{R}^{h \times w \times c}$ with the downsampling rate $f = H/h = W/w = 2^m$, $m \in \mathbb{N}$.
Figure 5. The overview of framework.
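For illustration, with a downsampling factor of $f = 8$ (i.e., $m = 3$, an assumed value for this example rather than a statement of our exact configuration), a $512 \times 512$ RGB painting is compressed as

$$
x_0 \in \mathbb{R}^{512 \times 512 \times 3} \ \xrightarrow{\ \mathcal{E}\ }\ z_0 \in \mathbb{R}^{64 \times 64 \times c}, \qquad f = \frac{512}{64} = 2^3 = 8 .
$$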

3.2.1. Autoencoder with Regularization

The autoencoder is pre-trained to encode the image $x_0$ into the latent feature $z_0$ and decode the latent feature back to the original image. Self-attention blocks and residual neural networks are adopted in both the encoder $\mathcal{E}$ and the decoder $\mathcal{G}$, to combine the expressiveness of the transformer and the effectiveness of the convolutional neural network. To avoid arbitrarily high variance of the latent feature, a vector quantization layer and a Kullback–Leibler divergence regularization are adopted in the autoencoder to penalize the latent feature. The loss of the autoencoder is as follows:
(1) The reconstruction loss:
$$
\hat{x}_0 = \mathcal{G}\!\left( q\!\left( \mathcal{E}(x_0) \right) \right), \qquad \mathcal{L}_d = \left\lVert x_0 - \hat{x}_0 \right\rVert^2 ,
$$
where q ( · ) is the vector quantization layer.
(2) The Kullback–Leibler Divergence loss on the latent feature q ( E ( x ) ) :
$$
\mathcal{L}_{KL} = \sum_{c,h,w} \frac{\mu^2 + \sigma^2 - 1 - \log \sigma^2}{2} ,
$$
where μ , σ 2 are the mean and variance of the latent feature, respectively.
(3) The adversarial loss:
$$
\mathcal{L}_G = -\,\mathbb{E}\!\left[ D(\hat{x}_0) \right],
$$
$$
\mathcal{L}_D = \mathbb{E}\!\left[ \mathrm{ReLU}\!\left(1 - D(x_0)\right) \right] + \mathbb{E}\!\left[ \mathrm{ReLU}\!\left(1 + D(\hat{x}_0)\right) \right],
$$
where D is the discriminator, which is employed to penalize the distribution difference between the ground truth and predicted image and enhance the visual quality of the image.
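For clarity, the three loss terms can be combined as in the following sketch, where `mu`/`logvar` parameterize the latent distribution and `d_real`/`d_fake` are discriminator outputs; the omitted weighting factors between the terms are a simplification of the sketch.

```python
import torch
import torch.nn.functional as F

def autoencoder_losses(x0, x0_hat, mu, logvar, d_real, d_fake):
    """Illustrative combination of the reconstruction, KL, and adversarial losses above.
    `d_real = D(x0)` and `d_fake = D(x0_hat)`; loss weights are omitted for brevity."""
    # (1) Reconstruction loss between the input and the decoded image.
    l_rec = F.mse_loss(x0_hat, x0)
    # (2) KL divergence of the latent distribution N(mu, sigma^2) from N(0, I).
    l_kl = 0.5 * torch.sum(mu.pow(2) + logvar.exp() - 1.0 - logvar)
    # (3) Hinge-style adversarial losses for the generator and the discriminator.
    l_gen = -d_fake.mean()
    l_disc = F.relu(1.0 - d_real).mean() + F.relu(1.0 + d_fake).mean()
    return l_rec, l_kl, l_gen, l_disc
```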

3.2.2. Conditional Denoising Network

Similar to existing generative models, e.g., GAN and VAE, the diffusion model with a conditional encoder is able to model conditional distributions. With the conditional encoder encoding text, audio, semantic maps, and poses [], we easily gain control over the image synthesis process. To integrate the conditional encoder with the diffusion model and fuse different modality inputs effectively, the cross-attention mechanism [] with the U-Net structure [] makes up the conditional denoising network, which predicts the noise $\epsilon_t$. To process different modalities, a domain-specific encoder $\tau_\theta$ is adopted, which maps the signal $y$ to the latent feature $\tau_\theta(y)$. The cross-attention block fuses $\varphi_i(z_t)$ and $\tau_\theta(y)$, which enables flexible conditional image synthesis. The cross-attention layer is implemented by
$$
Q = W_Q^{(i)} \cdot \varphi_i(z_t), \quad K = W_K^{(i)} \cdot \tau_\theta(y), \quad V = W_V^{(i)} \cdot \tau_\theta(y), \quad \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left( \frac{Q K^{T}}{\sqrt{d}} \right) \cdot V,
$$
where $\varphi_i(z_t)$ denotes the intermediate feature of the U-Net structure, and $W_Q^{(i)}$, $W_K^{(i)}$, $W_V^{(i)}$ are learnable weights. Residual shortcuts exist within the U-Net structure, connecting input and output layers of the same size.
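A minimal single-head version of this cross-attention layer might look as follows; the class and argument names are illustrative, and the multi-head formulation used in practice is omitted for brevity.

```python
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    """Minimal single-head cross-attention: queries come from the U-Net features
    phi_i(z_t), keys and values from the condition tau_theta(y)."""
    def __init__(self, query_dim: int, context_dim: int, inner_dim: int):
        super().__init__()
        self.to_q = nn.Linear(query_dim, inner_dim, bias=False)    # W_Q
        self.to_k = nn.Linear(context_dim, inner_dim, bias=False)  # W_K
        self.to_v = nn.Linear(context_dim, inner_dim, bias=False)  # W_V
        self.scale = inner_dim ** -0.5                             # 1 / sqrt(d)

    def forward(self, features: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        q, k, v = self.to_q(features), self.to_k(context), self.to_v(context)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return attn @ v
```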
To achieve conditioned image synthesis and relieve the computational burden, we encode the input signal $y$ with the domain-specific encoder $\tau_\theta(\cdot)$ and train the diffusion model in latent space:
$$
\mathcal{L}_t^{\mathrm{LDM}} = \mathbb{E}_{\mathcal{E}(x_0),\, t \sim [1,T],\, \epsilon_t \sim \mathcal{N}(0, \mathbf{I})} \left[ \left\lVert \epsilon_t - \epsilon_\theta\!\left(z_t, t, \tau_\theta(y)\right) \right\rVert^2 \right] = \mathbb{E}_{\mathcal{E}(x_0),\, t \sim [1,T],\, \epsilon_t \sim \mathcal{N}(0, \mathbf{I})} \left[ \left\lVert \epsilon_t - \epsilon_\theta\!\left(\sqrt{\bar{\alpha}_t}\, z_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon_t,\ t,\ \tau_\theta(y)\right) \right\rVert^2 \right],
$$
where $\mathcal{E}(\cdot)$ is the pre-trained autoencoder, which encodes high-dimensional pixels into the low-dimensional latent space.
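A corresponding latent-space training sketch is shown below. It reuses the schedule arrays from the earlier sketches, and `vae_encoder`, `eps_model`, and `text_encoder` are placeholder names standing in for $\mathcal{E}$, $\epsilon_\theta$, and $\tau_\theta$.

```python
import torch
import torch.nn.functional as F

def ldm_loss(vae_encoder, eps_model, text_encoder, x0, tokens):
    """Sketch of the latent-space objective: encode the image, add noise to the latent
    code, and predict that noise conditioned on the encoded text."""
    with torch.no_grad():
        z0 = vae_encoder(x0)                     # E(x_0): pixel space -> latent space
        cond = text_encoder(tokens)              # tau_theta(y): text condition
    b = z0.shape[0]
    t = torch.randint(0, T, (b,), device=z0.device)
    eps = torch.randn_like(z0)
    abar = alphas_bar.to(z0.device)[t].view(b, 1, 1, 1)
    z_t = abar.sqrt() * z0 + (1.0 - abar).sqrt() * eps
    return F.mse_loss(eps_model(z_t, t, cond), eps)
```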

3.3. Text Prompt Expansion

Human-written text prompts often lack the detailed descriptions necessary for effective image generation. If the prompt is underspecified or the keywords are weakly related, the generated image is often poor and far from expectations. To overcome this challenge, we utilize the pre-trained large language model GPT-2 [] to complete and enrich the input prompt. The GPT-2 model is fine-tuned on 80k text prompts collected from [] to support the text-to-image task. Given any text prompt, the pre-trained large language model can complete it and generate complementary content. We only need to give one concise sentence or several keywords, and the pre-trained language model rapidly continues writing based on the prompt. The expanded prompt greatly improves the quality of the generated image.
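As an illustration, prompt expansion can be driven by the Hugging Face text-generation pipeline as sketched below; the base `gpt2` checkpoint and the sampling parameters are stand-ins, since our model is fine-tuned on the collected prompts, which this sketch does not reproduce.

```python
from transformers import pipeline

# Expand a short prompt with a GPT-2 text-generation pipeline.
# "gpt2" is the base checkpoint; the fine-tuning on text-to-image prompts is omitted.
expander = pipeline("text-generation", model="gpt2")

def expand_prompt(prompt: str, max_new_tokens: int = 40) -> str:
    """Let the language model continue a concise prompt into a more detailed one."""
    out = expander(prompt, max_new_tokens=max_new_tokens, num_return_sequences=1,
                   do_sample=True, temperature=0.9)
    return out[0]["generated_text"]

print(expand_prompt("An astronaut riding a horse"))
```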
The flowchart of our system is shown in Figure 6. In the training stage, the latent diffusion model, which takes text as input and outputs the image, is trained with pre-trained weights [] on a specific artist’s paintings to obtain specialized weights. The text prompt expansion module, which takes text as input and outputs expanded text, is also trained with pre-trained weights [] to obtain general weights. Both models are trained independently. After training, we can generate a high-quality painting with a simple sentence in the testing stage.
Figure 6. The flowchart of our system.

4. Experiments

4.1. Dataset

The WikiArt painting dataset collects 81,444 paintings from 1119 artists around the world, covering many different genres, styles, and art fields. The WikiArt dataset is the largest painting dataset available for research. Most paintings span from the 13th century to the 21st century. Each digital image is labeled with information about the artist, genre, style, and year. However, one notable shortcoming of the WikiArt dataset is the absence of contextual descriptions for the paintings. To generate the missing text descriptions, we employ the pre-trained image tagging model RAM [] and the vision-language model Tag2Text [] to generate the corresponding image text descriptions automatically. However, incorrect descriptions occasionally occur. To remove them, we utilize a contrastive pre-trained model [] to calculate the similarity between the actual images and the generated text descriptions. Any image–text pair with a similarity score lower than 0.2 is removed. Following the automated text generation and selection process, we additionally review inaccurate text descriptions manually, ensuring the dataset’s overall quality. The number of paintings contributed by each artist is shown in Table 1. Additionally, we include examples of images and corresponding text prompts in Figure 7 to provide a visual glimpse of the dataset.
Table 1. The number of paintings by each artist.
Figure 7. Sample images and text descriptions of WikiArt.
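The filtering step can be sketched with a public CLIP checkpoint as follows; the specific checkpoint name is an assumption of the sketch, since any contrastive image–text model with comparable embeddings could play this role.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Keep an image-text pair only if the cosine similarity of its CLIP embeddings
# exceeds the 0.2 threshold used for cleaning the generated descriptions.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def keep_pair(image_path: str, caption: str, threshold: float = 0.2) -> bool:
    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=[caption], images=image, return_tensors="pt", padding=True)
    img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    txt_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])
    sim = torch.cosine_similarity(img_emb, txt_emb).item()
    return sim >= threshold
```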

4.2. Implementation Details

In our scenario, collecting training data presents a formidable challenge, compounded by the fact that the number of artworks left by these late artists is fixed. Training the latent diffusion model from scratch on such a limited dataset is non-trivial and may lead to a significant degradation in the model’s performance. To make up for the insufficient training data, we utilize the core idea of transfer learning []: knowledge from a model trained on data from a different domain is reused to build a high-performance model on the target domain. Therefore, we leverage the pre-trained weights of the stable diffusion v1-5 model [] and train the latent diffusion model on an NVIDIA GeForce RTX 4090 GPU. Stable diffusion v1-5 is trained on a subset of LAION-5B [], so it has seen millions of images and has robust representation capabilities, which benefits our training process. The latent diffusion model is initialized with stable diffusion v1-5 and retrained on all paintings of 9 artists from the WikiArt dataset. To optimize memory usage, we adopt 16-bit and 32-bit floating-point mixed precision to train the latent diffusion model. The batch size is set to 1 and the learning rate is set to $1 \times 10^{-5}$. For each artist, the latent diffusion model undergoes independent retraining at a resolution of 512 × 512 for 50 epochs.
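A condensed sketch of this fine-tuning setup, built from the public diffusers components of stable diffusion v1-5, is given below. The data-loading and epoch loop are omitted, and freezing the VAE and text encoder while updating only the U-Net is a choice of the sketch rather than a statement of our exact configuration; the hyper-parameters match the values reported above.

```python
import torch
from diffusers import AutoencoderKL, DDPMScheduler, UNet2DConditionModel
from transformers import CLIPTextModel, CLIPTokenizer

# Initialize every component from the public stable-diffusion-v1-5 weights.
model_id = "runwayml/stable-diffusion-v1-5"
tokenizer = CLIPTokenizer.from_pretrained(model_id, subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(model_id, subfolder="text_encoder")
vae = AutoencoderKL.from_pretrained(model_id, subfolder="vae")
unet = UNet2DConditionModel.from_pretrained(model_id, subfolder="unet")
noise_scheduler = DDPMScheduler.from_pretrained(model_id, subfolder="scheduler")

vae.requires_grad_(False)                       # frozen in this sketch
text_encoder.requires_grad_(False)              # frozen in this sketch
optimizer = torch.optim.AdamW(unet.parameters(), lr=1e-5)
scaler = torch.cuda.amp.GradScaler()            # fp16/fp32 mixed precision

def training_step(pixel_values, input_ids):
    """One optimization step on a (painting, caption) pair at 512x512, batch size 1."""
    with torch.autocast("cuda"):
        latents = vae.encode(pixel_values).latent_dist.sample() * vae.config.scaling_factor
        noise = torch.randn_like(latents)
        t = torch.randint(0, noise_scheduler.config.num_train_timesteps,
                          (latents.shape[0],), device=latents.device)
        noisy = noise_scheduler.add_noise(latents, noise, t)
        cond = text_encoder(input_ids)[0]
        pred = unet(noisy, t, encoder_hidden_states=cond).sample
        loss = torch.nn.functional.mse_loss(pred, noise)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    optimizer.zero_grad()
    return loss.detach()
```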

4.3. Qualitative Results

We retrain the latent diffusion model on nine artists’ works and subsequently leverage it to generate novel paintings from text prompts. Notably, we propose to generate novel paintings from famous artists’ works based on the diffusion model for the first time; consequently, our comparative analysis is restricted to our retrained model and the original latent diffusion model. Figure 8, Figure 9, Figure 10 and Figure 11 show images generated by the original latent diffusion model and by the model retrained on the WikiArt dataset. The first row of Figure 8, Figure 9, Figure 10 and Figure 11 shows the images generated by the original model. Based on the retrained latent diffusion model, we can ask these late famous artists to create novel paintings in their styles. From the second row to the last row, paintings in the styles of Vincent van Gogh, Leonardo da Vinci, Pablo Picasso, and other renowned artists are shown. It can be seen from Figure 8, Figure 9, Figure 10 and Figure 11 that the generated paintings effectively capture the distinctive and creative styles associated with each artist. Vincent van Gogh is famous for masterpieces like The Starry Night, Sunflowers, and his self-portraits. He had great enthusiasm for rural life and landscapes, particularly sunflowers and wheat fields. His paintings were characterized by vibrant and exaggerated colors, and his unique style is clearly discernible in the second row of Figure 8, Figure 9, Figure 10 and Figure 11. In contrast, the fourth row of these figures unmistakably showcases works in the style of Pablo Picasso, celebrated for pioneering highly exaggerated and distorted artistic techniques. Picasso’s paintings often delved into cubism and surrealism, utilizing geometric shapes to structure his compositions. These deeply emotive and condensed artworks were immensely popular, making his unique style instantly recognizable.
Figure 8. The generated novel paintings of input text prompts “An astronaut riding a horse” and “A woman holding a knife in her hand”.
Figure 9. The generated novel paintings of input text prompts “An emotional dog next to an alien landscape” and “An ethereal swan perched by a planet”.
Figure 10. The generated novel paintings of input text prompts “A gothic sea monster in the high seas” and “A drone hovering over the sky”.
Figure 11. The generated novel paintings of input text prompts “A tank in the lake” and “A computer keyboard with various keys”.
Additionally, we give some modern prompts, such as astronaut, drone, tank, and computer keyboard, and the figures show that the diffusion model can generate creative paintings that not only align with modern prompts but also follow the artists’ distinctive styles. However, there are also some failure cases, such as the images generated for the text prompt “A woman holding a knife in her hand”. It appears that the diffusion model lacks sufficient training images or the text prompt is not detailed enough, causing the knife to be lost in some pictures. Simple text prompts sometimes cause unreasonable and unexpected results due to insufficient contextual description. To mitigate this issue, we propose to use the pre-trained language model for automatic text prompt expansion. Table 2 presents the expanded text prompts in detail, generated with different random seeds. The visual results, shown in the second and fourth columns of Figure 8, Figure 9, Figure 10 and Figure 11, clearly illustrate the significant improvements achieved through text expansion. These expanded prompts infuse the generated images with intricate detail and an overall boost in quality.
Table 2. Expanded text prompts.

4.4. Quantitative Results

Though pairs of text prompts and images are available, the latent diffusion model may produce different results depending on the latent code. Furthermore, a given text prompt has no corresponding ground-truth image. Therefore, it is unreasonable to evaluate the model with pixel-aligned metrics such as PSNR and LPIPS. Instead, we employ the Fréchet Inception Distance (FID) to measure the distribution distance between generated paintings and real paintings. Quantitative results are shown in Table 3. The retrained model achieves more realistic results than the original model. With text prompt expansion, both the original model and the retrained model demonstrate better performance.
Table 3. Quantitative comparison on FID.
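FID can be computed, for example, with the torchmetrics implementation as sketched below; the tensor format (uint8, NCHW) and the 2048-dimensional Inception features are defaults of that library rather than specifics of our evaluation pipeline.

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

# Compare the distribution of generated paintings with real paintings of the same artist.
# `real_images` and `fake_images` are assumed to be uint8 tensors of shape (N, 3, H, W).
fid = FrechetInceptionDistance(feature=2048)

def compute_fid(real_images: torch.Tensor, fake_images: torch.Tensor) -> float:
    fid.reset()
    fid.update(real_images, real=True)
    fid.update(fake_images, real=False)
    return fid.compute().item()
```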

5. Discussion and Conclusions

In this paper, we propose to generate novel paintings from existing paintings of famous artists and demonstrate high-quality images based on the powerful latent diffusion model. Our method is dedicated to creating novel paintings by retraining the latent diffusion model and adopting text prompt expansion. After retraining on hundreds of paintings, our framework can generate artworks with modern themes. Experimental results show the effectiveness of our retrained model and text prompt expansion. However, there are still some problems with generated images. Unreasonable and unexpected results sometimes happen, which may be caused by incorrect image text descriptions in the training dataset, insufficient text prompts in the inference stage, or other reasons. Therefore, we will continue to research and develop the framework to work out these problems in the future.
The proposed framework will promote the development of image synthesis and painting creation, which has many applications, such as entertainment and movies. However, efficient painting synthesis may raise issues such as copyright and portrait rights, which may cause some legal disputes and ethical concerns. We suggest that policymakers should establish a strict regulatory system to supervise the practical use of this technology.

Author Contributions

S.S. proposed the initial idea; D.W. and C.M. conducted the experiment; D.W. and S.S. improved the model further; D.W. and C.M. wrote the manuscript; S.S. revised the manuscript. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Natural Science Foundation of Heilongjiang Province, China (Grant No. LH2023E092).

Data Availability Statement

The image and text context description dataset used in our paper is available at https://drive.google.com/file/d/1TKVkctNoYdDQPeksAndkrOk6FE7s-qcp/view?usp=sharing, accessed on 5 March 2023.

Conflicts of Interest

The authors declare no conflict of interest.


References

  1. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative Adversarial Nets. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 8–13 December 2014; Volume 27. [Google Scholar]
  2. Kingma, D.P.; Welling, M. Auto-encoding variational bayes. arXiv 2013, arXiv:1312.6114. [Google Scholar]
  3. Sohl-Dickstein, J.; Weiss, E.; Maheswaranathan, N.; Ganguli, S. Deep unsupervised learning using nonequilibrium thermodynamics. In Proceedings of the International Conference on Machine Learning, PMLR, Lille, France, 6–11 July 2015; pp. 2256–2265. [Google Scholar]
  4. Ho, J.; Jain, A.; Abbeel, P. Denoising diffusion probabilistic models. In Proceedings of the Advances in Neural Information Processing Systems, Virtual, 6–12 December 2020; Volume 33, pp. 6840–6851. [Google Scholar]
  5. Song, J.; Meng, C.; Ermon, S. Denoising diffusion implicit models. arXiv 2020, arXiv:2010.02502. [Google Scholar]
  6. Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 10684–10695. [Google Scholar]
  7. Dhariwal, P.; Nichol, A. Diffusion models beat gans on image synthesis. In Proceedings of the Advances in Neural Information Processing Systems, Virtual, 6–14 December 2021; Volume 34, pp. 8780–8794. [Google Scholar]
  8. WIKIART. Available online: https://www.wikiart.org/ (accessed on 5 March 2023).
  9. Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I. Language models are unsupervised multitask learners. OpenAI Blog 2019, 1, 9. [Google Scholar]
  10. Karras, T.; Laine, S.; Aila, T. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 4401–4410. [Google Scholar]
  11. Karras, T.; Laine, S.; Aittala, M.; Hellsten, J.; Lehtinen, J.; Aila, T. Analyzing and improving the image quality of stylegan. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 8110–8119. [Google Scholar]
  12. Karras, T.; Aittala, M.; Laine, S.; Härkönen, E.; Hellsten, J.; Lehtinen, J.; Aila, T. Alias-free generative adversarial networks. In Proceedings of the Advances in Neural Information Processing Systems, Virtual, 6–14 December 2021; Volume 34, pp. 852–863. [Google Scholar]
  13. Salimans, T.; Goodfellow, I.; Zaremba, W.; Cheung, V.; Radford, A.; Chen, X. Improved techniques for training gans. In Proceedings of the Advances in Neural Information Processing Systems, Barcelona, Spain, 5–10 December 2016; Volume 29. [Google Scholar]
  14. Radford, A.; Metz, L.; Chintala, S. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv 2015, arXiv:1511.06434. [Google Scholar]
  15. Arjovsky, M.; Bottou, L. Towards principled methods for training generative adversarial networks. arXiv 2017, arXiv:1701.04862. [Google Scholar]
  16. Arjovsky, M.; Chintala, S.; Bottou, L. Wasserstein GAN. In Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017. [Google Scholar]
  17. Higgins, I.; Matthey, L.; Pal, A.; Burgess, C.; Glorot, X.; Botvinick, M.; Mohamed, S.; Lerchner, A. beta-vae: Learning basic visual concepts with a constrained variational framework. In Proceedings of the International Conference on Learning Representations, Toulon, France, 24–26 April 2017. [Google Scholar]
  18. Van Den Oord, A.; Vinyals, O. Neural discrete representation learning. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Volume 30. [Google Scholar]
  19. Razavi, A.; Van den Oord, A.; Vinyals, O. Generating diverse high-fidelity images with vq-vae-2. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; Volume 32. [Google Scholar]
  20. Ramesh, A.; Pavlov, M.; Goh, G.; Gray, S.; Voss, C.; Radford, A.; Chen, M.; Sutskever, I. Zero-shot text-to-image generation. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual, 18–24 July 2021; pp. 8821–8831. [Google Scholar]
  21. Rombach, R.; Esser, P.; Ommer, B. Network-to-network translation with conditional invertible neural networks. In Proceedings of the Advances in Neural Information Processing Systems, Virtual, 6–12 December 2020; Volume 33, pp. 2784–2797. [Google Scholar]
  22. Gregor, K.; Papamakarios, G.; Besse, F.; Buesing, L.; Weber, T. Temporal difference variational auto-encoder. arXiv 2018, arXiv:1806.03107. [Google Scholar]
  23. Rezende, D.; Mohamed, S. Variational inference with normalizing flows. In Proceedings of the International Conference on Machine Learning, PMLR, Lille, France, 6–11 July 2015; pp. 1530–1538. [Google Scholar]
  24. Su, J.; Wu, G. f-VAEs: Improve VAEs with conditional flows. arXiv 2018, arXiv:1809.05861. [Google Scholar]
  25. Zhang, L.; Agrawala, M. Adding conditional control to text-to-image diffusion models. arXiv 2023, arXiv:2302.05543. [Google Scholar]
  26. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Volume 30. [Google Scholar]
  27. Hugging Face. Available online: https://huggingface.co/datasets/bartman081523/stable-diffusion-discord-prompts (accessed on 5 March 2023).
  28. Hugging Face. Available online: https://huggingface.co/runwayml/stable-diffusion-v1-5 (accessed on 5 March 2023).
  29. Hugging Face. Available online: https://huggingface.co/gpt2 (accessed on 5 March 2023).
  30. Zhang, Y.; Huang, X.; Ma, J.; Li, Z.; Luo, Z.; Xie, Y.; Qin, Y.; Luo, T.; Li, Y.; Liu, S.; et al. Recognize Anything: A Strong Image Tagging Model. arXiv 2023, arXiv:2306.03514. [Google Scholar]
  31. Huang, X.; Zhang, Y.; Ma, J.; Tian, W.; Feng, R.; Zhang, Y.; Li, Y.; Guo, Y.; Zhang, L. Tag2text: Guiding vision-language model via image tagging. arXiv 2023, arXiv:2303.05657. [Google Scholar]
  32. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual, 18–24 July 2021; pp. 8748–8763. [Google Scholar]
  33. Weiss, K.; Khoshgoftaar, T.M.; Wang, D. A survey of transfer learning. J. Big Data 2016, 3, 1–40. [Google Scholar] [CrossRef]
  34. LAION. Available online: https://laion.ai/blog/laion-5b/ (accessed on 5 March 2023).
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
