Lightweight Text-to-Image Generation Model Based on Contrastive Language-Image Pre-Training Embeddings and Conditional Variational Autoencoders
Abstract
1. Introduction
1.1. Lightweight Text-to-Image Approaches
1.2. CLIP-Based Text-to-Image Generation
1.3. Our Approach and Contributions
2. Related Work
2.1. Variational Autoencoders and CLIP-Based Generation
2.2. Lightweight Generation and Alternative Divergence Measures
3. Methods
3.1. Model Architecture
3.1.1. Conditional Variational Autoencoder
3.1.2. Text-to-Image Mapping Network
3.2. Training Procedure
3.2.1. CVAE Training
3.2.2. Mapping Network Training
3.3. Generation Procedure
4. Experiment
4.1. Dataset and Implementation Details
4.2. Divergence Metric (α Selection)
4.3. Evaluation of Mapping Network
- CVAE-Mapping (Proposed): Takes the input text prompt, extracts the 40-dimensional attribute vector using keyword matching, and encodes the text into a CLIP text embedding. The text embedding is then passed through the trained mapping network to produce a predicted 512-dimensional CLIP image embedding. The decoder receives the latent vector, the attribute vector, and the predicted CLIP image embedding as conditioning inputs.
- CVAE-Text (Baseline): Also takes the input text prompt, extracts the identical 40-dimensional attribute vector, and encodes the text into the same CLIP text embedding. However, it bypasses the mapping network: the original 512-dimensional CLIP text embedding directly fills the 512-dimensional conditional slot normally occupied by the CLIP image embedding. The decoder thus receives the latent vector, the attribute vector, and the original CLIP text embedding as conditioning inputs (a short sketch contrasting the two conditioning paths follows this list).
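To make the difference concrete, the following minimal PyTorch sketch decodes the same latent sample under both conditioning schemes. The module and argument names (`decoder`, `mapping_net`, `text_emb`, `attrs`) are illustrative placeholders for the components described above, not the exact identifiers used in our implementation.

```python
import torch

@torch.no_grad()
def decode_variants(text_emb, attrs, decoder, mapping_net, latent_dim=128):
    """Decode one sample under both conditioning schemes.

    text_emb: (1, 512) CLIP text embedding; attrs: (1, 40) attribute vector.
    decoder and mapping_net are the trained modules (see Appendix A); all names are illustrative.
    """
    z = torch.randn(1, latent_dim)                          # same latent sample for both variants
    img_mapping = decoder(z, attrs, mapping_net(text_emb))  # CVAE-Mapping: predicted CLIP image embedding
    img_text = decoder(z, attrs, text_emb)                  # CVAE-Text: raw CLIP text embedding fills the slot
    return img_mapping, img_text
```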
4.3.1. Quantitative Evaluation
4.3.2. Qualitative Evaluation
4.4. Efficiency and Comparative Performance Analysis
4.4.1. Performance on CelebA 64 × 64
4.4.2. Generalization to MS COCO 256 × 256
4.4.3. Overall Efficiency and Performance Summary
4.5. Analysis of Learned Latent Space Structure
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Appendix A. Network Architecture Details
Appendix A.1. Encoder and Decoder Network Specifications
Layer | Input Shape | Output Shape | Kernel Size | Stride | Padding | Activation | Batch Norm |
---|---|---|---|---|---|---|---|
Conv1 | (3 + 40 + 512, 64, 64) | (64, 32, 32) | 4 | 2 | 1 | ReLU | No |
Conv2 | (64, 32, 32) | (128, 16, 16) | 4 | 2 | 1 | ReLU | Yes |
Conv3 | (128, 16, 16) | (256, 8, 8) | 4 | 2 | 1 | ReLU | Yes |
Conv4 | (256, 8, 8) | (512, 4, 4) | 4 | 2 | 1 | ReLU | Yes |
Flatten | (512, 4, 4) | (8192) | - | - | - | - | - |
FC (μ) | (8192) | (128) | - | - | - | - | - |
FC (log σ²) | (8192) | (128) | - | - | - | - | - |
Layer | Input Shape | Output Shape | Kernel Size | Stride | Padding | Activation | Batch Norm |
---|---|---|---|---|---|---|---|
FC (Input) | (128 + 40 + 512) | (8192) | - | - | - | - | - |
Reshape | (8192) | (512, 4, 4) | - | - | - | - | - |
ConvTrans1 | (512, 4, 4) | (256, 8, 8) | 4 | 2 | 1 | ReLU | Yes |
ConvTrans2 | (256, 8, 8) | (128, 16, 16) | 4 | 2 | 1 | ReLU | Yes |
ConvTrans3 | (128, 16, 16) | (64, 32, 32) | 4 | 2 | 1 | ReLU | Yes |
ConvTrans4 | (64, 32, 32) | (3, 64, 64) | 4 | 2 | 1 | Tanh | No |
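The two tables above translate into a compact PyTorch implementation. The sketch below is a minimal reconstruction: layer widths, kernels, strides, activations, and BatchNorm usage follow the tables, while details such as weight initialization, the ordering of BatchNorm and ReLU, and the spatial broadcasting of the 40-d attribute vector and 512-d CLIP embedding (implied by the encoder input shape) are assumptions.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Maps a 64x64 image plus 40-d attributes and a 512-d CLIP embedding to (mu, log-variance)."""
    def __init__(self, latent_dim=128, attr_dim=40, clip_dim=512):
        super().__init__()
        in_ch = 3 + attr_dim + clip_dim                       # conditions concatenated as extra channels
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, 64, 4, 2, 1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, 4, 2, 1), nn.BatchNorm2d(128), nn.ReLU(inplace=True),
            nn.Conv2d(128, 256, 4, 2, 1), nn.BatchNorm2d(256), nn.ReLU(inplace=True),
            nn.Conv2d(256, 512, 4, 2, 1), nn.BatchNorm2d(512), nn.ReLU(inplace=True),
        )
        self.fc_mu = nn.Linear(512 * 4 * 4, latent_dim)
        self.fc_logvar = nn.Linear(512 * 4 * 4, latent_dim)

    def forward(self, x, attrs, clip_emb):
        b, _, h, w = x.shape
        cond = torch.cat([attrs, clip_emb], dim=1)[:, :, None, None].expand(b, -1, h, w)
        feat = self.conv(torch.cat([x, cond], dim=1)).flatten(1)
        return self.fc_mu(feat), self.fc_logvar(feat)

class Decoder(nn.Module):
    """Maps (z, attributes, 512-d conditioning embedding) back to a 64x64 image."""
    def __init__(self, latent_dim=128, attr_dim=40, clip_dim=512):
        super().__init__()
        self.fc = nn.Linear(latent_dim + attr_dim + clip_dim, 512 * 4 * 4)
        self.deconv = nn.Sequential(
            nn.ConvTranspose2d(512, 256, 4, 2, 1), nn.BatchNorm2d(256), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(256, 128, 4, 2, 1), nn.BatchNorm2d(128), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.BatchNorm2d(64), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(64, 3, 4, 2, 1), nn.Tanh(),    # output image in [-1, 1]
        )

    def forward(self, z, attrs, cond_emb):
        h = self.fc(torch.cat([z, attrs, cond_emb], dim=1)).view(-1, 512, 4, 4)
        return self.deconv(h)
```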
Appendix A.2. Text-to-Image Mapping Network Architecture
Layer | Input Shape | Output Shape | Activation | Batch Norm | Dropout |
---|---|---|---|---|---|
Linear 1 | (512) | (2048) | ReLU | Yes | 0.3 |
Linear 2 | (2048) | (1024) | ReLU | Yes | 0.3 |
Linear 3 | (1024) | (512) | ReLU | Yes | - |
Linear 4 | (512) | (512) | - | No | - |
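The mapping network table translates almost directly into PyTorch. The sketch below follows the listed layer widths, activations, BatchNorm, and dropout rates; the ordering of BatchNorm, ReLU, and Dropout within each block is an assumption.

```python
import torch.nn as nn

class MappingNetwork(nn.Module):
    """MLP mapping a 512-d CLIP text embedding to a predicted 512-d CLIP image embedding."""
    def __init__(self, dim=512, p_drop=0.3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, 2048), nn.BatchNorm1d(2048), nn.ReLU(inplace=True), nn.Dropout(p_drop),
            nn.Linear(2048, 1024), nn.BatchNorm1d(1024), nn.ReLU(inplace=True), nn.Dropout(p_drop),
            nn.Linear(1024, 512), nn.BatchNorm1d(512), nn.ReLU(inplace=True),
            nn.Linear(512, 512),                 # final projection: no activation, normalization, or dropout
        )

    def forward(self, text_emb):
        return self.net(text_emb)
```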
Appendix B. Implementation Details and Algorithms
Appendix B.1. Attribute Vector Generation (get_attributes_from_prompt)
- Initialization: The 40-dimensional vector is initialized to all zeros. This represents the neutral or default state for each attribute.
- Keyword Matching: The input text prompt is parsed to identify predefined keywords associated with each of the 40 CelebA attributes. A lexicon maps specific words or short phrases to their corresponding attribute index and target value (typically 1 for presence, potentially −1 or 0 for explicit absence, although our implementation primarily focuses on presence detection).
- Vector Update: If a keyword indicating the presence of an attribute is found in the text prompt, the corresponding element in the vector is set to 1 (a minimal code sketch of this procedure follows the list). For example:
- Input text: “A smiling woman with blond hair and wearing glasses”.
- Keywords detected: “smiling”, “blond hair”, “wearing glasses”.
- Corresponding indices in the attribute vector (representing ‘Smiling’, ‘Blond_Hair’, and ‘Wearing_Glasses’) are set to 1.
- All other 37 elements remain 0.
- Handling Negation (Limited): Basic negation (e.g., “no beard”) can be handled by mapping it to a specific value (e.g., 0 or potentially a negative value if the model was trained to interpret it) for the corresponding attribute (‘No_Beard’ or ‘Beard’). However, the robustness to complex phrasing or nuanced negation is limited.
- Robustness: It is sensitive to the exact phrasing used in the text prompt. Synonyms or complex sentence structures not containing the predefined keywords may fail to activate the intended attribute.
- Granularity: It cannot capture attribute nuances beyond the binary presence/absence defined in CelebA (e.g., intensity of a smile, specific style of hat).
- Scope: It is limited to the 40 attributes defined in the CelebA dataset.
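The sketch below illustrates the keyword-matching procedure described above. The lexicon shown is a tiny illustrative fragment (the full mapping covers keywords for all 40 CelebA attributes), and the function and argument names are assumptions rather than the exact identifiers used in our code.

```python
import torch

def get_attributes_from_prompt(prompt, attr_names, lexicon):
    """Derive a binary attribute vector from a text prompt by keyword matching.

    attr_names: list of attribute names defining the vector layout (40 entries for CelebA).
    lexicon:    dict mapping lowercase keywords/phrases to (attribute name, target value).
    """
    attrs = torch.zeros(len(attr_names))          # neutral/default state for every attribute
    text = prompt.lower()
    for keyword, (name, value) in lexicon.items():
        if keyword in text:                       # simple substring match; sensitive to exact phrasing
            attrs[attr_names.index(name)] = value
    return attrs

# Usage with a small illustrative subset (names follow the examples in this appendix).
attr_names = ["Smiling", "Blond_Hair", "Wearing_Glasses", "Wearing_Hat", "No_Beard"]
lexicon = {"smiling": ("Smiling", 1.0),
           "blond hair": ("Blond_Hair", 1.0),
           "wearing glasses": ("Wearing_Glasses", 1.0),
           "no beard": ("No_Beard", 1.0)}
print(get_attributes_from_prompt("A smiling woman with blond hair and wearing glasses",
                                 attr_names, lexicon))
# tensor([1., 1., 1., 0., 0.])
```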
Qualitative Analysis of Attribute Vector Generation Robustness
Text Prompt | Target Attribute | Generated Value (0 or 1) |
---|---|---|
“A smiling woman”. | Smiling | 1 |
“A joyful person”. | Smiling | 0 |
“Man with blond hair”. | Blond Hair | 1 |
“Woman, her hair is golden”. | Blond Hair | 0 |
“Person wearing glasses” | Wearing Glasses | 1 |
“A man with spectacles” | Wearing Glasses | 0 |
“A person not wearing a hat”. | Wearing Hat | 0 |
“A bald man”. | Wearing Hat | 0 |
Appendix B.2. Algorithms
Algorithm A1. CLIP Embedding-based Conditional VAE Training Procedure.
Algorithm A2. Mapping Network Training.
Algorithm A3. Text-to-Image Generation using CLIP and CVAE:
- Text encoding: encode the text prompt using the CLIP text encoder.
- Text-to-image mapping: predict the CLIP image embedding from the text embedding using the mapping network.
- Attribute vector derivation: derive the attribute vector from the text prompt using keyword mapping (see Appendix B.1).
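For readers who prefer code to pseudocode, the following is a hedged PyTorch sketch of the generation procedure in Algorithm A3. It assumes the OpenAI CLIP package for text encoding (ViT-B/32 is assumed only because its embeddings are 512-dimensional) and the illustrative modules sketched in Appendix A and Appendix B.1; all identifiers are placeholders.

```python
import torch
import clip  # OpenAI CLIP package (https://github.com/openai/CLIP)

@torch.no_grad()
def text_to_image(prompt, clip_model, mapping_net, decoder, attr_fn, device="cpu", latent_dim=128):
    """Sketch of Algorithm A3; mapping_net, decoder, and attr_fn are trained/illustrative modules."""
    mapping_net.eval(); decoder.eval()
    # 1. Text encoding: CLIP text encoder -> 512-d text embedding
    tokens = clip.tokenize([prompt]).to(device)
    text_emb = clip_model.encode_text(tokens).float()
    # 2. Text-to-image mapping: predict the corresponding CLIP image embedding
    img_emb = mapping_net(text_emb)
    # 3. Attribute vector derivation via keyword matching (Appendix B.1)
    attrs = attr_fn(prompt).unsqueeze(0).to(device)
    # 4. Sample z from the prior and decode, conditioned on (attrs, predicted image embedding)
    z = torch.randn(1, latent_dim, device=device)
    return decoder(z, attrs, img_emb)             # (1, 3, 64, 64), values in [-1, 1]

# Usage (placeholders):
# model, _ = clip.load("ViT-B/32", device="cpu")
# img = text_to_image("A smiling woman with blond hair", model, mapping_net, decoder,
#                     lambda p: get_attributes_from_prompt(p, attr_names, lexicon))
```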
Appendix C
Appendix C.1. Additional Generation Examples
Appendix D. Model Training Convergence
Appendix E. Baseline VAE Model Comparison Details
Appendix E.1. Baseline Models
- VAE-LD [37]: This model architecture is derived from the decoder principles of the GLIDE model. While often employed for text-conditional generation at higher resolutions (e.g., 256 × 256), its VAE foundation makes it a relevant baseline for evaluating generative quality within the VAE paradigm, even when adapted or evaluated at lower resolutions.
- VDVAE [38]: The Very Deep VAE represents a powerful hierarchical VAE architecture known for achieving strong likelihood scores and high-fidelity image generation across various benchmarks. Its complexity and performance serve as a high-end benchmark for VAE capabilities relevant to our comparison.
Appendix E.2. Evaluation Methodology and Metric Verification
- Framework: PyTorch 1.13.1.
- Dataset: CelebA, center-cropped and resized to 64 × 64, normalized to [−1, 1], using the standard test split for reference statistics.
- Architecture Adaptation: We focused on implementing the essential VAE encoder–decoder structures as described in the original papers, adjusting channel dimensions and potentially simplifying the hierarchical depth where appropriate to suit the 64 × 64 resolution while preserving the core architectural principles (e.g., residual blocks, attention mechanisms where applicable). The aim was to create faithful reproductions suitable for the target resolution.
- Training Details: Models were trained using the Adam optimizer [35] with a batch size of 128. Training proceeded for a number of epochs comparable to our CVAE-Mapping model (50 epochs), monitoring convergence via a validation-loss plateau on a held-out subset of the training data. Training parameters, including the learning rate, were kept consistent with our main experiments where feasible.
- FID Calculation: Crucially, FID was calculated using the exact same implementation (pytorch-fid package) and reference statistics (derived from the CelebA 64 × 64 test set) as used for evaluating our CVAE-Mapping model. This ensures a direct and fair comparison of the generative distribution quality under identical measurement conditions.
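Because every model is scored with the same pytorch-fid implementation and the same CelebA 64 × 64 reference statistics, the scoring step reduces to a single call to the package's Python API (the command-line entry point `python -m pytorch_fid <path1> <path2>` is equivalent). The directory names below are placeholders.

```python
import torch
from pytorch_fid.fid_score import calculate_fid_given_paths

# Placeholder directories: samples from the model under test and images from the CelebA 64x64 test split.
paths = ["generated_samples/", "celeba_test_64/"]
device = "cuda" if torch.cuda.is_available() else "cpu"

# dims=2048 selects the standard Inception-v3 pool3 features used for FID.
fid = calculate_fid_given_paths(paths, batch_size=128, device=device, dims=2048)
print(f"FID: {fid:.2f}")
```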
Appendix E.3. Performance Discussion
Appendix F. Exploratory Comparison of Mapping Network Architectures
Appendix F.1. Minimalist Transformer Encoder as an Alternative
- Input: CLIP text embeddings (dimension 512) treated as a sequence of length 1.
- Attention Heads: 2 heads for multi-head self-attention.
- Feed-Forward Network (FFN) Dimension: A hidden dimension of 1024 in the position-wise FFN (e.g., 512 -> 1024 -> 512).
- Output Layer: A final linear layer to project the Transformer output to the target 512-dimensional image embedding space.
- Normalization and Dropout: Layer normalization was applied, and a dropout rate of 0.1 was used.
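A minimal PyTorch sketch of this alternative mapping network follows the configuration above. The encoder depth is not stated in the text, so two layers are assumed here (roughly consistent with the reported ~5.12 M parameters); other details such as initialization are likewise assumptions.

```python
import torch
import torch.nn as nn

class TransformerMapper(nn.Module):
    """Minimalist Transformer-encoder mapping network (CLIP text embedding -> image embedding).

    The 512-d text embedding is treated as a sequence of length 1; 2 attention heads,
    FFN hidden size 1024, dropout 0.1, layer normalization inside each encoder layer,
    and a final linear projection to the 512-d image-embedding space.
    """
    def __init__(self, dim=512, heads=2, ffn_dim=1024, num_layers=2, dropout=0.1):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=ffn_dim, dropout=dropout,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.proj = nn.Linear(dim, dim)            # project to the target 512-d space

    def forward(self, text_emb):                   # text_emb: (batch, 512)
        x = text_emb.unsqueeze(1)                  # -> (batch, 1, 512): sequence of length 1
        x = self.encoder(x).squeeze(1)
        return self.proj(x)
```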
Appendix F.2. Comparative Results and Discussion
Mapping Network Architecture | Approx. Parameters | Cosine Similarity (↑) |
---|---|---|
MLP (Our Primary Model) | 3.94 M | 0.81 ± 0.03 |
Minimalist Transformer Encoder | 5.12 M | 0.78 ± 0.04 |
References
- Ramesh, A.; Pavlov, M.; Goh, G.; Gray, S.; Voss, C.; Radford, A.; Chen, M.; Sutskever, I. Zero-Shot Text-to-Image Generation. arXiv 2021, arXiv:2102.12092. [Google Scholar]
- Esser, P.; Rombach, R.; Ommer, B. Taming Transformers for High-Resolution Image Synthesis. arXiv 2021, arXiv:2012.09841. [Google Scholar]
- Ramesh, A.; Dhariwal, P.; Nichol, A.; Chu, C.; Chen, M. Hierarchical Text-Conditional Image Generation with CLIP Latents. arXiv 2022, arXiv:2204.06125. [Google Scholar]
- Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-Resolution Image Synthesis with Latent Diffusion Models. arXiv 2022, arXiv:2112.10752. [Google Scholar]
- Saharia, C.; Chan, W.; Saxena, S.; Li, L.; Whang, J.; Denton, E.; Ghasemipour, S.K.S.; Ayan, B.K.; Mahdavi, S.S.; Lopes, R.G.; et al. Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding. arXiv 2022, arXiv:2205.11487. [Google Scholar]
- Lee, Y.; Sun, A.; Hosmer, B.; Acun, B.; Balioglu, C.; Wang, C.; Hernandez, C.D.; Puhrsch, C.; Haziza, D.; Guessous, D.; et al. Characterizing and Efficiently Accelerating Multimodal Generation Model Inference. arXiv 2024, arXiv:2410.00215. [Google Scholar]
- Ma, Z.; Zhang, Y.; Jia, G.; Zhao, L.; Ma, Y.; Ma, M.; Liu, G.; Zhang, K.; Li, J.; Zhou, B. Efficient Diffusion Models: A Comprehensive Survey from Principles to Practices. arXiv 2024, arXiv:2410.11795. [Google Scholar]
- Liu, B.; Zhu, Y.; Song, K.; Elgammal, A. Towards Faster and Stabilized GAN Training for High-fidelity Few-shot Image Synthesis. arXiv 2021, arXiv:2101.04775. [Google Scholar]
- Mehta, S.; Rastegari, M. MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer. arXiv 2022, arXiv:2110.02178. [Google Scholar]
- Balaji, Y.; Nah, S.; Huang, X.; Vahdat, A.; Song, J.; Zhang, Q.; Kreis, K.; Aittala, M.; Aila, T.; Laine, S.; et al. eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers. arXiv 2023, arXiv:2211.01324. [Google Scholar]
- Zhao, Y.; Xu, Y.; Xiao, Z.; Jia, H.; Hou, T. MobileDiffusion: Instant Text-to-Image Generation on Mobile Devices. arXiv 2024, arXiv:2311.16567. [Google Scholar]
- Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning Transferable Visual Models From Natural Language Supervision. arXiv 2021, arXiv:2103.00020. [Google Scholar]
- Wang, P.; Lu, W.; Lu, C.; Zhou, R.; Li, M.; Qin, L. Large Language Model for Medical Images: A Survey of Taxonomy, Systematic Review, and Future Trends. Big Data Min. Anal. 2025, 8, 496–517. [Google Scholar] [CrossRef]
- Sun, Q.; Fang, Y.; Wu, L.; Wang, X.; Cao, Y. EVA-CLIP: Improved Training Techniques for CLIP at Scale. arXiv 2023, arXiv:2303.15389. [Google Scholar]
- van den Oord, A.; Vinyals, O.; Kavukcuoglu, K. Neural Discrete Representation Learning. arXiv 2018, arXiv:1711.00937. [Google Scholar]
- Crowson, K.; Biderman, S.; Kornis, D.; Stander, D.; Hallahan, E.; Castricato, L.; Raff, E. VQGAN-CLIP: Open Domain Image Generation and Editing with Natural Language Guidance. arXiv 2022, arXiv:2204.08583. [Google Scholar]
- Frans, K.; Soros, L.B.; Witkowski, O. CLIPDraw: Exploring Text-to-Drawing Synthesis through Language-Image Encoders. arXiv 2021, arXiv:2106.14843. [Google Scholar]
- Patashnik, O.; Wu, Z.; Shechtman, E.; Cohen-Or, D.; Lischinski, D. StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery. arXiv 2021, arXiv:2103.17249. [Google Scholar]
- Kingma, D.P.; Welling, M. Auto-Encoding Variational Bayes. arXiv 2022, arXiv:1312.6114. [Google Scholar]
- Sohn, K.; Lee, H.; Yan, X. Learning Structured Output Representation using Deep Conditional Generative Models. In Advances in Neural Information Processing Systems; Cortes, C., Lawrence, N., Lee, D., Sugiyama, M., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2015; Volume 28. [Google Scholar]
- Sikka, H.; Zhong, W.; Yin, J.; Pehlevan, C. A Closer Look at Disentangling in β-VAE. In Proceedings of the Conference Record of the 2019 Fifty-Third Asilomar Conference On Signals, Systems & Computers, Pacific Grove, CA, USA, 3–6 November 2019; Matthews, M., Ed.; pp. 888–895. [Google Scholar]
- Mi, L.; Shen, M.; Zhang, J. A Probe Towards Understanding GAN and VAE Models. arXiv 2018, arXiv:1812.05676. [Google Scholar]
- Zhang, H.; Xu, T.; Li, H.; Zhang, S.; Wang, X.; Huang, X.; Metaxas, D. StackGAN: Text to Photo-Realistic Image Synthesis with Stacked Generative Adversarial Networks. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 5908–5916. [Google Scholar] [CrossRef]
- Xu, T.; Zhang, P.; Huang, Q.; Zhang, H.; Gan, Z.; Huang, X.; He, X. AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Adversarial Networks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 1316–1324. [Google Scholar] [CrossRef]
- Tang, H.; Li, Z.; Zhang, D.; He, S.; Tang, J. Divide-and-Conquer: Confluent Triple-Flow Network for RGB-T Salient Object Detection. IEEE Trans. Pattern Anal. Mach. Intell. 2025, 47, 1958–1974. [Google Scholar] [CrossRef] [PubMed]
- Zhao, S.; Song, J.; Ermon, S. InfoVAE: Balancing learning and inference in variational autoencoders. In Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence and Thirty-First Innovative Applications of Artificial Intelligence Conference and Ninth AAAI Symposium on Educational Advances in Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019. [Google Scholar] [CrossRef]
- Li, Y.; Turner, R.E. Rényi Divergence Variational Inference. arXiv 2016, arXiv:1602.02311. [Google Scholar]
- Nowozin, S.; Cseke, B.; Tomioka, R. f-GAN: Training Generative Neural Samplers using Variational Divergence Minimization. arXiv 2016, arXiv:1606.00709. [Google Scholar]
- Tolstikhin, I.; Bousquet, O.; Gelly, S.; Schoelkopf, B. Wasserstein Auto-Encoders. arXiv 2019, arXiv:1711.01558. [Google Scholar]
- Liu, Z.; Luo, P.; Wang, X.; Tang, X. Deep Learning Face Attributes in the Wild. arXiv 2015, arXiv:1411.7766. [Google Scholar]
- Heusel, M.; Ramsauer, H.; Unterthiner, T.; Nessler, B.; Hochreiter, S. GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium. arXiv 2018, arXiv:1706.08500. [Google Scholar]
- Salimans, T.; Goodfellow, I.; Zaremba, W.; Cheung, V.; Radford, A.; Chen, X. Improved Techniques for Training GANs. arXiv 2016, arXiv:1606.03498. [Google Scholar]
- Zhang, R.; Isola, P.; Efros, A.A.; Shechtman, E.; Wang, O. The Unreasonable Effectiveness of Deep Features as a Perceptual Metric. arXiv 2018, arXiv:1801.03924. [Google Scholar]
- Lin, T.Y.; Maire, M.; Belongie, S.; Bourdev, L.; Girshick, R.; Hays, J.; Perona, P.; Ramanan, D.; Zitnick, C.L.; Dollár, P. Microsoft COCO: Common Objects in Context. arXiv 2015, arXiv:1405.0312. [Google Scholar]
- Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. arXiv 2017, arXiv:1412.6980. [Google Scholar]
- van der Maaten, L.; Hinton, G. Visualizing Data using t-SNE. J. Mach. Learn. Res. 2008, 9, 2579–2605. [Google Scholar]
- Nichol, A.; Dhariwal, P.; Ramesh, A.; Shyam, P.; Mishkin, P.; McGrew, B.; Sutskever, I.; Chen, M. GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models. arXiv 2022, arXiv:2112.10741. [Google Scholar]
- Child, R. Very Deep VAEs Generalize Autoregressive Models and Can Outperform Them on Images. arXiv 2021, arXiv:2011.10650. [Google Scholar]
- Zhang, H. Evaluation of Natural Image Generation and Reconstruction Capabilities Based on the β-VAE Model. ITM Web Conf. 2025, 70, 03006. [Google Scholar] [CrossRef]
- Fan, L.; Tang, L.; Qin, S.; Li, T.; Yang, X.; Qiao, S.; Steiner, A.; Sun, C.; Li, Y.; Zhu, T.; et al. Unified Autoregressive Visual Generation and Understanding with Continuous Tokens. arXiv 2025, arXiv:2503.13436. [Google Scholar]
- Bao, F.; Nie, S.; Xue, K.; Cao, Y.; Li, C.; Su, H.; Zhu, J. All are Worth Words: A ViT Backbone for Diffusion Models. arXiv 2023, arXiv:2209.12152. [Google Scholar]
- Sauer, A.; Karras, T.; Laine, S.; Geiger, A.; Aila, T. StyleGAN-T: Unlocking the Power of GANs for Fast Large-Scale Text-to-Image Synthesis. arXiv 2023, arXiv:2301.09515. [Google Scholar]
- You, X.; Zhang, J. Text-to-Image GAN with Pretrained Representations. arXiv 2024, arXiv:2501.00116. [Google Scholar]
- Rampas, D.; Pernias, P.; Aubreville, M. A Novel Sampling Scheme for Text- and Image-Conditional Image Synthesis in Quantized Latent Spaces. arXiv 2023, arXiv:2211.07292. [Google Scholar]
α | FID (↓) | IS (↑) | LPIPS (↓) |
---|---|---|---|
−2.0 | 318.9333 | 1.6552 | 0.3872 |
−1.0 | 340.5644 | 1.7164 | 0.3654 |
−0.7 | 178.7920 | 1.9458 | 0.3031 |
−0.5 | 72.4416 | 2.1221 | 0.2216 |
−0.3 | 125.2464 | 2.0804 | 0.2625 |
0.5 | 87.8511 | 2.0293 | 0.2633 |
0.7 | 50.2313 | 2.3726 | 0.1725 |
0.9 | 140.1787 | 2.0480 | 0.3247 |
1.0 | 47.0873 | 2.2405 | 0.1376 |
Comparison | Cosine Similarity (↑) | p-Value vs. Baseline |
---|---|---|
Baseline: Orig Text vs. Orig Img | 0.65 ± 0.05 | - |
Mapped: Pred Img vs. Orig Img | 0.81 ± 0.03 | <0.001 |
Model | FID (↓) | IS (↑) | Cosine Sim (Gen) (↑) | p-Value |
---|---|---|---|---|
CVAE-Text | 48.32 | 2.3428 | 0.61 | - |
CVAE-Mapping | 40.53 | 2.43 | 0.86 | <0.001 |
Type | Model | Params (M) ↓ | FLOPs/MACs ↓ | FID ↓ | IS ↑ | FPS ↑ |
---|---|---|---|---|---|---|
Diffusion | MobileDiffusion Variant [11] | 58 | 8.9 GFLOPs | 40 | 2.89 | N/A |
GAN | FastGAN [8] | 30 | N/A | 35.5 | N/A | N/A |
VAE | VAE-LD [37] | 96 | N/A | 25.8 | N/A | 11 |
VAE | VDVAE [38] | 112 | N/A | 23.25 | N/A | 9 |
CVAE | CVAE-Mapping (Ours) | 42 | 3.2 GMACs | 40.53 | 2.43 | 21 |
Type | Model | Params ↓ | FID ↓ | CLIP Score ↑ | R-Precision ↑ | FPS ↑ |
---|---|---|---|---|---|---|
Autoregressive | Fluid (Gemma-2B) [40] | 2 B | 6.16 | 0.295–0.32 | 0.350 | N/A |
Diffusion | U-ViT-S/2 [41] | 131 M | 5.48 | 0.274–0.295 | N/A | N/A |
GAN | StyleGAN-T [42] | 120 M | 26.7 | 0.292–0.305 | 0.342 | 10 |
GAN | TIGER [43] | 69 M | 21.96 | N/A | 0.33 | Faster |
Token-based | Paella [44] | 573 M | 26.7 | 0.307 | N/A | 2 |
VAE/Trans. | VQGAN+CLIP [16] | 10 B | 19.9 | 0.28–0.30 | 0.10–0.12 | N/A |
CVAE | CVAE-Mapping (Ours) | 54 M | 24.26 | 0.275–0.287 | 0.32 | 8 |