Article

Lightweight Text-to-Image Generation Model Based on Contrastive Language-Image Pre-Training Embeddings and Conditional Variational Autoencoders

School of Information Science and Engineering, Lanzhou University, Lanzhou 730000, China
* Author to whom correspondence should be addressed.
Electronics 2025, 14(11), 2185; https://doi.org/10.3390/electronics14112185
Submission received: 1 May 2025 / Revised: 22 May 2025 / Accepted: 26 May 2025 / Published: 28 May 2025
(This article belongs to the Special Issue Big Model Techniques for Image Processing)

Abstract
Deploying text-to-image (T2I) models is challenging due to high computational demands, extensive data needs, and the persistent goal of enhancing generation quality and diversity, particularly on resource-constrained devices. We introduce a lightweight T2I framework that uses a dual-conditioned Conditional Variational Autoencoder (CVAE), leveraging CLIP embeddings for semantic guidance and enabling explicit attribute control, thereby reducing computational load and data dependency. Key to our approach is a specialized mapping network that bridges CLIP text–image modalities for improved fidelity and Rényi divergence for latent space regularization to foster diversity, as evidenced by richer latent representations. Experiments on CelebA demonstrate competitive generation (FID: 40.53, 42 M params, 21 FPS) with enhanced diversity. Crucially, our model also shows effective generalization to the more complex MS COCO dataset and maintains a favorable balance between visual quality and efficiency (8 FPS at 256 × 256 resolution with 54 M params). Ablation studies and component validations (detailed in appendices) confirm the efficacy of our contributions. This work offers a practical, efficient T2I solution that balances generative performance with resource constraints across different datasets and is suitable for specialized, data-limited domains.

1. Introduction

Deep generative models have made remarkable progress in image generation, particularly text-to-image (T2I) synthesis, enabling applications like artistic creation and image editing [1,2,3]. However, their widespread deployment is hindered by substantial computational demands. State-of-the-art models like Stable Diffusion [4], DALL-E 2 [3], and Imagen [5] require significant processing power and memory, making them challenging to deploy on resource-constrained devices (e.g., mobile platforms with 2–4 GB memory and limited GPU capacity [6]). Furthermore, traditional T2I models often necessitate large paired image–text datasets, which are costly to acquire in specialized domains.

1.1. Lightweight Text-to-Image Approaches

To address these challenges, lightweight T2I approaches have emerged. Model compression techniques (quantization, knowledge distillation) applied to diffusion models can reduce parameters by up to 75%, often at the cost of generation quality [7]. GAN-based alternatives like FastGAN (30 M parameters) offer faster inference but may have lower text-conditioning fidelity [8]. MobileViT-based architectures are mobile-friendly but have limited generative capacity compared to larger models [9]. Recent efficient T2I methods include eDiff-I [10], which reduces FLOPs significantly, and SD-Mobile [11] for mobile deployment. However, these often exhibit degraded quality (30–40% higher FID scores). Most existing lightweight solutions focus on model compression or architectural efficiency rather than a foundational rethinking of the T2I paradigm for resource-constrained deployment.

1.2. CLIP-Based Text-to-Image Generation

The cross-modal capabilities of CLIP [12,13,14] have inspired its use in T2I generation, which leverages its pre-trained knowledge to potentially reduce computational burdens. For instance, DALL-E 2 [3] uses VQ-VAE [15] with Transformer models and CLIP embeddings, but its reliance on large Transformers is computationally expensive. VQGAN-CLIP [16] employs separate VQGAN and CLIP models, which require coordinated operation. While methods like CLIPDraw [17] and StyleCLIP [18] use CLIP for guidance, they often depend on intensive optimization or large pre-trained GANs. The full potential of CLIP for inherently lightweight T2I models suitable for resource-constrained environments remains largely untapped.

1.3. Our Approach and Contributions

Addressing the challenges of deploying T2I models on resource-constrained devices, we propose a novel lightweight framework centered on a Conditional Variational Autoencoder (CVAE) [19,20], which leverages the pre-trained capabilities of CLIP. Our approach synergistically integrates key components into a purpose-built lightweight architecture, focusing on efficiency and reduced data dependency, thus distinguishing it from methods primarily centered on model compression.
Our dual-conditioned CVAE framework is engineered for inherent efficiency and stability. It utilizes explicit, low-dimensional attribute vectors for fine-grained control and implicit, high-dimensional CLIP image embeddings for rich semantic guidance. This design substantially cuts computational needs and lessens reliance on extensive paired text–image data during CVAE training.
We introduce two pivotal technical innovations: First, to enhance latent space expressiveness and promote generative diversity crucial for lightweight models with limited capacity, we employ Rényi divergence for latent space regularization. As detailed in Section 2.2, specific properties of Rényi divergence (particularly with parameter α < 1 ) offer a promising alternative to KL divergence for capturing wider data variations. Our ablation studies (Section 4.2) empirically validate this. Second, a dedicated text-to-image mapping network transforms CLIP text embeddings into corresponding CLIP image embeddings, effectively bridging the modality gap and improving semantic alignment and generation quality (Section 4.3).
Experiments on CelebA show our CVAE-Mapping model achieves competitive generation quality (FID: 40.53) with a significantly reduced footprint (42 M parameters, 3.2 GMACs/inference, 21 FPS), outperforming larger VAEs and contemporary lightweight T2I methods (Section 4). This validates our framework as an efficient T2I synthesis alternative, especially for resource-constrained environments and data-limited specialized domains.

2. Related Work

2.1. Variational Autoencoders and CLIP-Based Generation

Variational Autoencoders (VAEs) [19] learn latent representations by optimizing an evidence lower bound. Extensions like β -VAE [21] target disentangled representations, while VQ-VAE [15] uses discrete latents. Conditional VAEs (CVAEs) [20] integrate conditional information (e.g., text) for controlled generation p ( x | c ) , offering enhanced control and more stable training than Generative Adversarial Networks (GANs) [22], making them suitable for efficient generative frameworks.
T2I generation has evolved from early GAN-based methods like StackGAN [23] and AttnGAN [24] to large Transformer-based models (e.g., DALL-E [1]) and diffusion models like Stable Diffusion [4]. While powerful, these state-of-the-art models (e.g., DALL-E 2 with 3.5 B parameters [3], Imagen with 3B+ [5]) demand substantial computational resources, limiting their deployability.
CLIP [12] has become a cornerstone in T2I, providing rich cross-modal semantic understanding. Models like DALL-E 2 and VQGAN-CLIP [16] leverage this but often rely on computationally intensive frameworks. Effectively fusing information from different modalities is a persistent challenge in multi-modal learning; for instance, recent work in RGB-Thermal salient object detection also explores strategies to manage modality-specific cues and complementary information [25]. The potential for integrating CLIP within inherently lightweight generative architectures, particularly CVAEs designed for efficiency and robust multi-modal conditioning, remains largely unexplored in T2I.

2.2. Lightweight Generation and Alternative Divergence Measures

To address the computational demands of large T2I models, various lightweight strategies have emerged. Model compression has been applied to diffusion models [7], and mobile-specific adaptations like SD-Mobile [11] have been developed. Inherently efficient architectures include GAN-based FastGAN [8] and MobileViT-based approaches [9]. Knowledge distillation, as in eDiff-I [10], also aims to transfer capabilities to smaller models. However, these often involve trade-offs in generation quality and primarily compress existing architectures rather than fundamentally rethinking lightweight T2I design [6].
Researchers have also investigated alternatives to the Kullback–Leibler (KL) divergence in VAEs, as KL can lead to overly smoothed or less diverse generations [26]. The Rényi divergence [27], parameterized by an order α , offers a flexible alternative. Specifically, for  α < 1 , it emphasizes regions where the approximate posterior q ϕ ( z | · ) is high even if the prior p ( z ) is low. This is hypothesized to enhance latent space expressivity and promote diversity by better capturing multi-modal data aspects [27]. Broader explorations of f-divergences (including Rényi-related alpha-divergences) in VAEs suggest potential benefits for sample diversity, mode coverage, and representation quality over standard KL divergence (e.g., [28]). These properties make Rényi divergence compelling for regularizing VAE latent spaces, especially for achieving diversity and expressiveness in lightweight architectures. While alternatives like the Wasserstein distance [29] exist, Rényi divergence remains underutilized in conditional generation, particularly for lightweight T2I models.

3. Methods

3.1. Model Architecture

3.1.1. Conditional Variational Autoencoder

The core of our model is a Conditional Variational Autoencoder (CVAE) based on CLIP image embeddings as conditional information. This approach circumvents the need for paired text–image data during the CVAE training phase, leveraging the pre-trained knowledge within CLIP. The CVAE consists of an encoder and a decoder. Figure 1 illustrates the overall architecture and data flow during CVAE training, with example dimensionalities for a 64 × 64 input.
Encoder, denoted as q_φ(z | x, a, c_image), accepts an image x (e.g., B × 3 × 64 × 64), its corresponding attribute vector a (e.g., B × 40), and its CLIP image embedding c_image (e.g., B × 512) as input. As depicted in Figure 1, both a and c_image are first reshaped (e.g., to B × D_attr × 1 × 1 and B × D_clip × 1 × 1, respectively) and then spatially tiled to match the image's spatial dimensions (e.g., to B × D_attr × 64 × 64 and B × D_clip × 64 × 64). These tiled conditional features are subsequently concatenated with the input image x along the channel dimension, forming a combined input tensor (e.g., B × (3 + 40 + 512) × 64 × 64). This tensor is then processed by a deep convolutional neural network comprising a sequence of four convolutional layers with downsampling (as detailed in Appendix A, Table A1), which ultimately maps the input to the parameters (μ and σ, e.g., B × 128 each) of a Gaussian distribution in the latent space. The first convolutional layer uses ReLU activation, while subsequent layers apply Batch Normalization before ReLU; the final feature map is flattened and passed to fully connected layers that output μ and σ.
Decoder, denoted as p_θ(x | z, a, c_image), takes a latent vector z (e.g., B × 128, sampled from the distribution defined by μ and σ using the reparameterization trick z = μ + σ ⊙ ε, where ε ∼ N(0, I)), the original attribute vector a (e.g., B × 40), and the original CLIP image embedding c_image (e.g., B × 512) as input. As illustrated in Figure 1, these three components are first concatenated along the feature dimension (yielding, e.g., B × (128 + 40 + 512)). The concatenated vector is then transformed by an initial fully connected layer (denoted decoder_input) and reshaped into a spatial feature map (e.g., B × 512 × 4 × 4). This feature map is subsequently processed by a series of four transposed convolutional layers with upsampling (mirroring the encoder's structure; detailed in Appendix A, Table A2) to reconstruct the image x̂ (e.g., B × 3 × 64 × 64). Each transposed convolutional layer is followed by Batch Normalization and ReLU activation, except for the final layer, which uses a Tanh activation to constrain output pixel values to the range [−1, 1], consistent with the data normalization described in Section 4.1.
Latent Space: The model employs a latent space characterized by a Gaussian distribution. The encoder (described above) outputs the parameters ( μ and σ ) that define this distribution for each input sample. The specific dimensionality of this latent space is a key hyperparameter (denoted latent_dim, e.g., 128), the value of which was determined through preliminary experiments balancing representational capacity and computational efficiency (see Appendix A, Table A1 for the implemented dimension). Latent variables z are then sampled from this parameterized distribution using the reparameterization trick for input to the decoder.
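For concreteness, the following PyTorch sketch outlines the encoder, decoder, and reparameterization step described above, assuming the 64 × 64 configuration and the layer specifications of Appendix A (Tables A1 and A2). Class and function names are illustrative, and the latent distribution is parameterized here via the log-variance, a common implementation choice.

```python
import torch
import torch.nn as nn

class CVAEEncoder(nn.Module):
    """q_phi(z | x, a, c_image): image plus tiled conditions -> (mu, logvar)."""
    def __init__(self, attr_dim=40, clip_dim=512, latent_dim=128):
        super().__init__()
        in_ch = 3 + attr_dim + clip_dim
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, 64, 4, 2, 1), nn.ReLU(True),                      # 64x64 -> 32x32
            nn.Conv2d(64, 128, 4, 2, 1), nn.BatchNorm2d(128), nn.ReLU(True),   # -> 16x16
            nn.Conv2d(128, 256, 4, 2, 1), nn.BatchNorm2d(256), nn.ReLU(True),  # -> 8x8
            nn.Conv2d(256, 512, 4, 2, 1), nn.BatchNorm2d(512), nn.ReLU(True),  # -> 4x4
        )
        self.fc_mu = nn.Linear(512 * 4 * 4, latent_dim)
        self.fc_logvar = nn.Linear(512 * 4 * 4, latent_dim)

    def forward(self, x, a, c_image):
        b, _, h, w = x.shape
        # Tile the conditional vectors to the image's spatial size and concatenate along channels.
        a_map = a.view(b, -1, 1, 1).expand(-1, -1, h, w)
        c_map = c_image.view(b, -1, 1, 1).expand(-1, -1, h, w)
        feat = self.conv(torch.cat([x, a_map, c_map], dim=1)).flatten(1)
        return self.fc_mu(feat), self.fc_logvar(feat)

class CVAEDecoder(nn.Module):
    """p_theta(x | z, a, c_image): concatenated vector -> 64x64 image in [-1, 1]."""
    def __init__(self, attr_dim=40, clip_dim=512, latent_dim=128):
        super().__init__()
        self.fc = nn.Linear(latent_dim + attr_dim + clip_dim, 512 * 4 * 4)  # decoder_input
        self.deconv = nn.Sequential(
            nn.ConvTranspose2d(512, 256, 4, 2, 1), nn.BatchNorm2d(256), nn.ReLU(True),
            nn.ConvTranspose2d(256, 128, 4, 2, 1), nn.BatchNorm2d(128), nn.ReLU(True),
            nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.BatchNorm2d(64), nn.ReLU(True),
            nn.ConvTranspose2d(64, 3, 4, 2, 1), nn.Tanh(),
        )

    def forward(self, z, a, c_image):
        feat = self.fc(torch.cat([z, a, c_image], dim=1)).view(-1, 512, 4, 4)
        return self.deconv(feat)

def reparameterize(mu, logvar):
    """z = mu + sigma * eps, with eps ~ N(0, I)."""
    std = torch.exp(0.5 * logvar)
    return mu + std * torch.randn_like(std)
```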

3.1.2. Text-to-Image Mapping Network

To enable text-guided image generation, we introduce a text-to-image mapping network. This network learns a mapping from the CLIP text embedding space to the corresponding CLIP image embedding space.
Architecture: The mapping network architecture is based on a Multi-Layer Perceptron (MLP). It accepts the CLIP text embedding ( c t e x t ) as input and processes it through a sequence of fully connected layers. These layers progressively transform the feature representation via intermediate dimensionalities before a final linear projection yields the output embedding. The dimensionality of this output is designed to match that of the target CLIP image embeddings. Standard components facilitate training and generalization; specifically, Batch Normalization and ReLU activations follow the main hidden transformation layers, and Dropout regularization is applied after the initial transformation stages to mitigate overfitting. The specific configuration, including the exact number of layers, the chosen intermediate dimensionalities, and the dropout probability, resulted from empirical evaluation aimed at balancing model capacity and generalization and is detailed fully in Appendix A.2 (Table A3).
The choice of an MLP for this mapping task was motivated by our overarching goal of a lightweight framework and the nature of transforming already semantically rich CLIP embeddings. We hypothesized that a well-configured MLP would offer a strong balance between expressive capability and parameter efficiency for this specific alignment refinement. To empirically validate this architectural decision against a more complex, modern alternative, an exploratory comparison with a minimalist single-layer Transformer encoder was conducted. As detailed in Appendix F (see Table A5), our MLP architecture (approx. 3.94 M parameters) not only demonstrated superior performance in aligning the embeddings (cosine similarity of 0.81 ± 0.03 ) compared to the minimalist Transformer (approx. 5.12 M parameters, cosine similarity of 0.78 ± 0.04 ) but also achieved this with a significantly smaller parameter count. This finding provides strong empirical support for the efficacy and resource efficiency of our chosen MLP design for the mapping network component.
A detailed illustration of this MLP-based mapping network, showing its sequential layers and intermediate feature dimensions, is integrated within the overall generation pipeline depicted in Figure 2.
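A minimal PyTorch sketch of this MLP mapping network, assuming the layer sizes and dropout rate listed in Table A3, is given below (the class name is illustrative). With these dimensions the module has roughly 3.9 M parameters, consistent with the count reported above.

```python
import torch.nn as nn

class TextToImageMapper(nn.Module):
    """Maps a CLIP text embedding (512-d) to a predicted CLIP image embedding (512-d).
    Layer sizes and dropout follow Table A3."""
    def __init__(self, clip_dim=512, p_drop=0.3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(clip_dim, 2048), nn.BatchNorm1d(2048), nn.ReLU(True), nn.Dropout(p_drop),
            nn.Linear(2048, 1024), nn.BatchNorm1d(1024), nn.ReLU(True), nn.Dropout(p_drop),
            nn.Linear(1024, 512), nn.BatchNorm1d(512), nn.ReLU(True),
            nn.Linear(512, clip_dim),  # final linear projection, no activation
        )

    def forward(self, c_text):
        return self.net(c_text)
```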

3.2. Training Procedure

3.2.1. CVAE Training

The CVAE is trained within a variational inference framework with the aim of maximizing the evidence lower bound (ELBO). The overall loss function L combines a reconstruction term and a Rényi divergence term, controlled by a weighting factor β:
L = L_reconstruction + β · L_Rényi    (1)
The reconstruction loss, L_reconstruction, is the mean squared error (MSE) between the input image x and the reconstructed image x̂:
L_reconstruction = (1/N) Σ_{i=1}^{N} (x_i − x̂_i)²    (2)
where N is the number of pixels and the summation runs over all pixels i.
To regularize the latent space and encourage the CVAE to learn richer, more diverse representations from the data, which is particularly beneficial for our lightweight architecture, we employ Rényi divergence as an alternative to the standard KL divergence. As discussed in Section 2.2, Rényi divergence, especially with α < 1, can offer advantages in capturing the tail behavior of distributions and promoting expressivity. The Rényi divergence of order α between the approximate posterior q_φ(z | x, a, c_image) and the prior p(z) (a standard normal distribution N(0, I)) is generally defined as follows:
D_α(q_φ(z | x, a, c_image) || p(z)) = (1 / (α − 1)) · log E_{z ∼ q_φ}[(q_φ(z | x, a, c_image) / p(z))^(α − 1)]    (3)
In practice, for Gaussian distributions, a closed-form expression derived in  [27] is used for computational stability and efficiency:
L_Rényi = D_α(q_φ(z | x, a, c_image) || p(z)) = (1 / (2(α − 1))) [ Σ_i log(σ_i²) + α Σ_i μ_i² + Σ_i σ_i² − d ]    (4)
where d is the dimensionality of the latent space, and  μ i and σ i are the i-th elements of the mean vector μ and standard deviation vector σ , respectively. Based on our ablation study (detailed in Section 4.2), we empirically selected α = 0.7 , as this value demonstrated the most promising balance between reconstruction quality and the potential for enhanced latent space structure leading to generative diversity.
Based on empirical evaluations (see Section 4.2), we found that setting β = 1 provided an optimal balance between image fidelity and latent space regularization. Therefore, in our final implementation, we directly combine the reconstruction and Rényi divergence terms without an explicit weighting factor.
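The following sketch illustrates the combined objective. Note that, instead of the Gaussian closed form of Equation (4) used in our implementation, it estimates the Rényi term with a K-sample Monte Carlo approximation of Equation (3) in the spirit of Rényi variational inference [27]; function names and the sample count are illustrative.

```python
import math
import torch
import torch.nn.functional as F

def renyi_divergence_mc(mu, logvar, alpha=0.7, num_samples=8):
    """Monte Carlo estimate of D_alpha(q || p), Equation (3), for a diagonal Gaussian
    posterior q = N(mu, diag(sigma^2)) and a standard normal prior p = N(0, I).
    (The paper evaluates the Gaussian closed form, Equation (4), for stability; this
    estimator is an illustrative alternative.)"""
    std = torch.exp(0.5 * logvar)
    eps = torch.randn(num_samples, *mu.shape, device=mu.device)
    z = mu + std * eps                                                  # (K, B, d) samples from q
    log_q = (-0.5 * eps.pow(2) - 0.5 * logvar - 0.5 * math.log(2 * math.pi)).sum(-1)
    log_p = (-0.5 * z.pow(2) - 0.5 * math.log(2 * math.pi)).sum(-1)
    log_ratio = (alpha - 1.0) * (log_q - log_p)                          # (K, B)
    d_alpha = (torch.logsumexp(log_ratio, dim=0) - math.log(num_samples)) / (alpha - 1.0)
    return d_alpha.mean()

def cvae_loss(x, x_recon, mu, logvar, alpha=0.7, beta=1.0):
    """Reconstruction MSE plus the beta-weighted Renyi term (beta = 1 in the final model)."""
    recon = F.mse_loss(x_recon, x, reduction="mean")
    return recon + beta * renyi_divergence_mc(mu, logvar, alpha=alpha)
```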
A detailed, step-by-step description of the training process can be found in Algorithm A1 in Appendix B.2.

3.2.2. Mapping Network Training

The mapping network is trained separately to learn a transformation from the CLIP text embedding space to the CLIP image embedding space. The training data consists of pairs of pre-computed CLIP text embeddings (c_text) and corresponding CLIP image embeddings (c_image), derived from the training set of the CelebA dataset [30].
We use a mean squared error (MSE) loss function to minimize the Euclidean distance between the predicted CLIP image embeddings (c_predicted) and the ground-truth CLIP image embeddings (c_image):
L_mapping = (1/n) Σ_{i=1}^{n} || f_θ(c_text^(i)) − c_image^(i) ||₂²    (5)
where f_θ denotes the mapping network, c_text is the CLIP text embedding, c_image is the corresponding CLIP image embedding, and n is the number of training samples.
The detailed training procedure is provided in Algorithm A2 in Appendix B.2.
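A minimal training-loop sketch for the mapping network, assuming pre-computed and aligned embedding tensors, is shown below; the function name and epoch count are illustrative.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def train_mapping_network(mapper, text_emb, image_emb, epochs=50, lr=1e-4,
                          batch_size=128, device="cuda"):
    """Train f_theta to regress CLIP image embeddings from CLIP text embeddings (Equation (5)).
    `text_emb` and `image_emb` are pre-computed, aligned tensors of shape (N, 512)."""
    loader = DataLoader(TensorDataset(text_emb, image_emb), batch_size=batch_size, shuffle=True)
    optimizer = torch.optim.Adam(mapper.parameters(), lr=lr)
    criterion = torch.nn.MSELoss()
    mapper.to(device).train()
    for _ in range(epochs):
        for c_text, c_image in loader:
            c_text, c_image = c_text.to(device), c_image.to(device)
            loss = criterion(mapper(c_text), c_image)   # MSE between predicted and real image embeddings
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return mapper
```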

3.3. Generation Procedure

Image generation is initiated by providing a textual description. The text is first transformed into a CLIP text embedding c_text using the pre-trained CLIP text encoder [12]. This text embedding is subsequently passed through the trained mapping network f_θ (Section 3.1.2) to yield a predicted CLIP image embedding c_predicted_image = f_θ(c_text). Concurrently, a latent vector z (e.g., B × 128) is sampled from the standard normal prior, z ∼ N(0, I). In parallel, an attribute vector a (e.g., B × 40) representing the desired facial characteristics is derived from the input text description using a predefined keyword mapping mechanism (detailed in Appendix B.1). The trained CVAE decoder p_θ (whose architecture is consistent with Section 3.1.1 and Figure 1) then accepts the latent vector z, the derived attribute vector a, and the predicted CLIP image embedding c_predicted_image as input and generates the final output image x̂ = p_θ(x | z, a, c_predicted_image). The complete data flow for this generation process, illustrating the roles of the CLIP text encoder, the MLP-based mapping network, attribute vector derivation, latent sampling, and the subsequent CVAE decoder operations (input concatenation, FC transformation, reshaping, and deconvolutional upsampling for a 64 × 64 output image), is depicted in Figure 2. A detailed, step-by-step description of the generation algorithm is provided in Algorithm A3 in Appendix B.2.
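The generation pipeline can be summarized by the following sketch, which assumes the trained mapping network and CVAE decoder defined above, the keyword-based get_attributes_from_prompt of Appendix B.1, and the public OpenAI CLIP interface; any names not introduced in the text are illustrative.

```python
import torch
import clip  # https://github.com/openai/CLIP

@torch.no_grad()
def generate_from_text(prompt, clip_model, mapper, decoder, latent_dim=128, device="cuda"):
    """Text -> CLIP text embedding -> predicted CLIP image embedding (via f_theta);
    sample z ~ N(0, I); derive the 40-d attribute vector a from keywords; decode.
    Assumes `mapper` and `decoder` are trained and set to eval() mode."""
    c_text = clip_model.encode_text(clip.tokenize([prompt]).to(device)).float()  # (1, 512)
    c_pred_image = mapper(c_text)                                                # (1, 512)
    a = get_attributes_from_prompt(prompt).to(device)  # (1, 40), keyword mapping (Appendix B.1)
    z = torch.randn(1, latent_dim, device=device)      # latent sample from the standard normal prior
    return decoder(z, a, c_pred_image)                 # (1, 3, 64, 64) image in [-1, 1]

# Illustrative usage:
# clip_model, _ = clip.load("ViT-B/32", device="cuda")
# image = generate_from_text("A young woman with bangs and blond hair", clip_model, mapper, decoder)
```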

4. Experiment

To comprehensively evaluate the proposed lightweight text-to-image generation model (CVAE-Mapping), we conducted a series of experiments designed to assess its performance, efficiency, and underlying mechanisms. The evaluation focuses on the following: (1) the efficacy of Rényi divergence compared to the standard Kullback–Leibler (KL) divergence for CVAE training; (2) the contribution of the proposed text-to-image mapping network; (3) the model’s computational efficiency relative to relevant baselines; (4) the characteristics of the learned latent space. Code to reproduce our experiments is available at https://github.com/laowangshini/CVAE-CLIP (accessed on 23 May 2025).
For the quantitative assessment of generation quality and diversity, we primarily employed several standard metrics. Fréchet Inception Distance (FID) [31] measures the similarity between the distribution of generated images and real images in a feature space; a lower FID indicates better quality and realism. Inception Score (IS) [32] evaluates both the quality (clarity of objects) and diversity of generated images; a higher IS is preferred. For evaluating text–image alignment on the MS COCO dataset, we also utilized CLIP Score, which computes the cosine similarity between image and text embeddings from a pre-trained CLIP model (higher is better), and R-precision, which measures the fraction of top-retrieved images relevant to a text query (higher is better). Learned Perceptual Image Patch Similarity (LPIPS) [33] was used where appropriate for reconstruction tasks, measuring perceptual similarity between two images, where lower values indicate higher similarity. All metrics were calculated using established implementations. FID was calculated using the ‘pytorch-fid’ package (https://github.com/mseitzer/pytorch-fid, accessed on 22 May 2024). IS was computed using the standard InceptionV3 model from torchvision (PyTorch’s computer vision library; https://pytorch.org/vision/stable/index.html, accessed on 22 May 2024). LPIPS was calculated using the ‘lpips’ package (https://github.com/richzhang/PerceptualSimilarity, accessed on 22 May 2024) with the AlexNet backbone. CLIP Score and R-precision were computed based on outputs from a standard CLIP model implementation (https://github.com/openai/CLIP, accessed on 22 May 2024). Our experiments were performed on two primary datasets: the CelebA dataset [30] for detailed component analysis and efficiency benchmarks, and the more complex MS COCO dataset [34] to evaluate generalization capabilities.
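For reference, the following sketch shows how such metrics can be computed with the cited packages; folder paths, function names, and batch sizes are illustrative placeholders rather than our exact evaluation scripts.

```python
import torch
import clip                        # https://github.com/openai/CLIP
import lpips                       # https://github.com/richzhang/PerceptualSimilarity
from pytorch_fid import fid_score  # https://github.com/mseitzer/pytorch-fid

def compute_fid(gen_dir, real_dir, device="cuda"):
    """FID between two image folders, via the pytorch-fid package; lower is better."""
    return fid_score.calculate_fid_given_paths([gen_dir, real_dir],
                                               batch_size=50, device=device, dims=2048)

def compute_lpips(x, x_recon, device="cuda"):
    """LPIPS (AlexNet backbone) between image batches in [-1, 1]; lower means more similar."""
    fn = lpips.LPIPS(net="alex").to(device)
    with torch.no_grad():
        return fn(x.to(device), x_recon.to(device)).mean().item()

def compute_clip_score(pil_images, captions, device="cuda"):
    """Average cosine similarity between CLIP image and text embeddings; higher is better."""
    model, preprocess = clip.load("ViT-B/32", device=device)
    with torch.no_grad():
        imgs = torch.stack([preprocess(im) for im in pil_images]).to(device)
        img_feat = model.encode_image(imgs)
        txt_feat = model.encode_text(clip.tokenize(captions).to(device))
    return torch.nn.functional.cosine_similarity(img_feat, txt_feat).mean().item()
```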

4.1. Dataset and Implementation Details

We utilized the Large-scale Celeb-Faces Attributes (CelebA) dataset [30], comprising over 200,000 celebrity facial images annotated with 40 binary attributes. Following standard protocols, the dataset was partitioned into training (162,770 images), validation (19,867 images), and test (19,962 images) sets. All images were preprocessed by cropping to the central facial region using provided bounding boxes and resizing to 64 × 64 pixels using bilinear interpolation. Pixel values were subsequently normalized to the range [−1, 1].
All models were implemented in PyTorch (Version 1.13.1; Meta AI, USA) and trained on NVIDIA RTX 4090 (24 GB) GPUs (NVIDIA Corporation, Santa Clara, CA, USA) using CUDA (Version 11.7; NVIDIA Corporation, Santa Clara, CA, USA). Training employed the Adam optimizer [35] with a learning rate of 1 × 10⁻⁴, β1 = 0.9, and β2 = 0.999. A batch size of 128 was used unless otherwise specified.
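The preprocessing and optimizer settings above can be expressed with standard torchvision/PyTorch utilities, as in the sketch below. The torchvision CelebA wrapper, the 178-pixel center crop, and the encoder/decoder handles are illustrative assumptions rather than our exact pipeline (the paper crops using the provided bounding boxes).

```python
import torch
from torchvision import datasets, transforms

# Preprocessing: crop to the face region, resize to 64x64 (bilinear), normalize to [-1, 1].
transform = transforms.Compose([
    transforms.CenterCrop(178),  # illustrative crop; the paper uses the provided bounding boxes
    transforms.Resize((64, 64), interpolation=transforms.InterpolationMode.BILINEAR),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),  # [0, 1] -> [-1, 1]
])

train_set = datasets.CelebA(root="data", split="train", target_type="attr",
                            transform=transform, download=True)
loader = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True, num_workers=4)

# Adam optimizer with the reported hyperparameters.
# `encoder` / `decoder` are the CVAE modules sketched in Section 3.1.1.
params = list(encoder.parameters()) + list(decoder.parameters())
optimizer = torch.optim.Adam(params, lr=1e-4, betas=(0.9, 0.999))
```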

4.2. Divergence Metric ( α Selection)

To empirically validate our choice of Rényi divergence and determine an optimal α parameter for our CVAE framework, we conducted an ablation study. As outlined in Section 2.2, Rényi divergence with α < 1 is hypothesized to encourage the model to learn a more expressive latent space, potentially capturing more diverse data modes beneficial for generative tasks, compared to the standard KL divergence ( α = 1.0 ). This study evaluates the reconstruction performance of CVAE models (Section 3.1.1) trained with various α values in the Rényi divergence term (Equation (4)), following the methodology of Rényi VI [27]. We then assessed their ability to reconstruct images from the validation set using standard reconstruction quality metrics: FID (comparing reconstructions to real validation images), IS (evaluating the quality/diversity of the reconstructed images), and LPIPS (comparing original images to their reconstructions).
The results, presented in Table 1, reveal a distinct trade-off. The standard VAE approach using KL divergence ( α = 1.0 , denoted CVAE-KL) achieved the best reconstruction fidelity, yielding the lowest reconstruction FID (47.0873) and LPIPS (0.1376) scores. However, using Rényi divergence with α = 0.7 (denoted CVAE-Rényi) resulted in the highest Inception Score (2.3726) for the reconstructed images. While the reconstruction FID for α = 0.7 (50.2313) is slightly worse than for α = 1.0 , it remains comparable.
Our primary goal is effective text-to-image generation, which benefits from both fidelity and diversity. While this ablation focuses on reconstruction, the metrics provide insights into the latent space quality. The superior reconstruction IS achieved with α = 0.7 (Table 1) supports our hypothesis (Section 2.2) that this setting encourages the model to learn a latent representation that better captures distinct features and potentially diverse modes present in the data, even if it comes at a small cost to average pixel-wise reconstruction accuracy (as measured by FID and LPIPS here). We hypothesize that this richer latent structure, indicated by the high reconstruction IS and motivated by the theoretical properties of Rényi divergence for α < 1 , is more conducive to generating diverse and high-quality samples in the downstream text-to-image generation task. Therefore, prioritizing the potential for enhanced generative diversity indicated by the reconstruction IS, we selected α = 0.7 for our primary model configuration (CVAE-Rényi) used in subsequent generation experiments (Section 4.3 onwards). The effectiveness of this choice is indirectly supported by the competitive generation performance achieved by the final CVAE-Mapping model (Section 4.3.1), which incorporates this CVAE trained with α = 0.7 . Stable training convergence for both α = 0.7 and α = 1.0 settings was confirmed (Appendix D Figure A4).

4.3. Evaluation of Mapping Network

To rigorously evaluate the effectiveness of the text-to-image mapping network (Section 3.1.2) for text-guided image generation, we compared the performance of our full proposed model (CVAE-Mapping) against a carefully controlled baseline (CVAE-Text). Both configurations utilize the identical core CVAE architecture (Section 3.1.1), pre-trained using Rényi divergence ( α = 0.7 ) with dual conditions (attribute vectors a and ground-truth CLIP image embeddings c i m a g e _ g t ), as detailed in Section 3.2.1. The sole difference between CVAE-Mapping and CVAE-Text occurs during the generation phase (Section 3.3) in how the conditional inputs are provided to the trained CVAE decoder p θ ( x | z , a , · ) :
  • CVAE-Mapping (Proposed): Takes the input text prompt, extracts the 40-dimensional attribute vector a using keyword matching, and encodes the text into a CLIP text embedding c t e x t . This c t e x t is then passed through the trained mapping network f θ to produce a predicted 512-dimensional CLIP image embedding c p r e d i c t e d _ i m a g e = f θ ( c t e x t ) . The decoder then receives the latent vector z , the attribute vector a , and the predicted CLIP image embedding c p r e d i c t e d _ i m a g e as inputs: p θ ( x | z , a , c p r e d i c t e d _ i m a g e ) .
  • CVAE-Text (Baseline): Also takes the input text prompt and extracts the identical 40-dimensional attribute vector a . It encodes the text into the same CLIP text embedding c t e x t . However, it bypasses the mapping network. Instead, the original 512-dimensional CLIP text embedding c t e x t is directly used to fill the 512-dimensional conditional slot normally occupied by the CLIP image embedding. The decoder thus receives the latent vector z , the attribute vector a , and the original CLIP text embedding c t e x t as inputs: p θ ( x | z , a , c t e x t ) .
This experimental setup ensures that both models receive the same fine-grained attribute control via a and the same initial semantic information via c t e x t . The comparison therefore isolates the impact of using the mapping network to translate the text embedding into the target CLIP image embedding modality ( c p r e d i c t e d _ i m a g e ) versus using the raw CLIP text embedding ( c t e x t ) directly as the high-dimensional semantic condition for the pre-trained CVAE decoder.

4.3.1. Quantitative Evaluation

Our central hypothesis is that the mapping network improves generation by better aligning text-derived conditional information with the image embedding space utilized by the CVAE decoder. We tested this quantitatively using the designated test set partition.
First, we assessed the mapping network’s direct ability to bridge the modality gap. We computed the average cosine similarity between the original CLIP text embeddings ( c text ) and their corresponding original CLIP image embeddings ( c image ) from the test set to establish a baseline measure of misalignment. We then compared this baseline to the average similarity between the predicted image embeddings ( c pred _ image = f θ ( c text ) ) generated by our mapping network and the ground-truth image embeddings ( c image ). Results are presented in Table 2. Statistical significance was determined via a t-test over 10,000 test set embedding pairs.
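The following sketch illustrates this alignment evaluation, assuming aligned batches of test-set embeddings and a paired t-test (the specific t-test variant is an assumption); function names are illustrative.

```python
import torch
from scipy import stats

@torch.no_grad()
def embedding_alignment_test(c_text, c_image, mapper):
    """Compare cosine similarity of (text, image) CLIP embedding pairs before and after the
    mapping network, and test the difference with a paired t-test (SciPy)."""
    sim_before = torch.nn.functional.cosine_similarity(c_text, c_image, dim=-1)
    sim_after = torch.nn.functional.cosine_similarity(mapper(c_text), c_image, dim=-1)
    t_stat, p_value = stats.ttest_rel(sim_after.cpu().numpy(), sim_before.cpu().numpy())
    return sim_before.mean().item(), sim_after.mean().item(), p_value
```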
As shown in Table 2, the mapping network yields a statistically significant improvement in embedding alignment (p < 0.001), substantially increasing the cosine similarity between text-derived and image-derived representations compared to the baseline misalignment.
To further visualize this alignment, Figure 3 presents t-SNE [36] projections of the embedding spaces. Before applying the mapping network (Figure 3a), the t-SNE visualization reveals a noticeable separation between the manifolds of original CLIP text embeddings (e.g., shown in blue) and their corresponding CLIP image embeddings (e.g., shown in orange), indicating a clear modality gap despite their shared semantic pre-training. In stark contrast, after the text embeddings are processed by our mapping network (Figure 3b), the resulting predicted image embeddings exhibit a significantly improved overlap and intermingling with the ground-truth image embeddings. This increased proximity and more unified distribution visually corroborates the quantitative improvements in cosine similarity reported in Table 2, suggesting that the mapping network successfully transforms text embeddings into a space that is more congruent with the image embedding modality for corresponding semantic concepts. Figure 4 further illustrates the distribution of cosine similarities between predicted and real image embeddings, demonstrating the effectiveness of our approach with a mean similarity of 0.8324.
Second, we evaluated the impact of this improved alignment on the downstream image generation task. We generated images from a diverse set of 1000 text prompts selected from the test set using both the CVAE-Mapping and CVAE-Text configurations and compared their generation performance using FID and IS. Critically, differing from the reconstruction evaluation (Section 4.2), FID and IS here evaluate the distribution of images generated from text prompts against the distribution of real images from the test set. We also calculated the average cosine similarity between the input text embeddings (c_text) and the CLIP embeddings derived from the final generated images (c_gen_image). The results are summarized in Table 3. Statistical significance was determined using t-tests over 5 independent runs.
The evaluation detailed in Table 3 confirms that improved embedding alignment translates to superior generation outcomes. CVAE-Mapping significantly outperforms the CVAE-Text baseline across all generation metrics (p < 0.001 for all comparisons), achieving lower FID, higher IS, and higher text-to-generated-image cosine similarity. This strongly supports the conclusion that the mapping network effectively bridges the modality gap for higher-quality, more semantically accurate text-to-image synthesis.

4.3.2. Qualitative Evaluation

The qualitative impact of the mapping network is illustrated in Figure 5, which presents side-by-side comparisons of images generated by the CVAE-Mapping and CVAE-Text models using identical text prompts, attribute vectors ( a ), and latent vectors ( z ). This controlled setup ensures that the comparison isolates the specific impact of the mapping network on generation quality.
The comparison in Figure 5 showcases the image results generated by the two models for the text prompt “A young woman with bangs and blond hair”. Image (a), generated by CVAE-Mapping, successfully captures the specified attribute features, including the details of bangs and blond hair (highlighted by the red circles), along with facial characteristics consistent with a young woman. In contrast, for image (b), generated by CVAE-Text, some attributes (such as bangs or blond hair) are less evident or lack detail, particularly the absence of clear bangs despite the prompt. This visual discrepancy strongly supports the conclusion that the mapping network effectively bridges the modality gap, enhancing the CVAE decoder’s ability to precisely interpret and render fine-grained textual descriptions, thereby improving semantic alignment and overall generation fidelity. Further examples are provided in Appendix C.1.

4.4. Efficiency and Comparative Performance Analysis

To comprehensively evaluate the practical advantages of our CVAE-Mapping model, we analyze its computational efficiency and compare its performance profile against relevant generative model baselines. We focus on model size (parameter count), computational complexity (Multiply–Accumulate operations, MACs, estimated for single image generation at the respective resolution), inference speed (Frames Per Second, FPS, measured on an NVIDIA RTX 4090 GPU with batch size 1), and generation quality (FID, IS, CLIP Score, and R-precision on the evaluated datasets). Our analysis is conducted on the CelebA 64 × 64 dataset to demonstrate core efficiency and performance in a resource-constrained context and on the MS COCO 256 × 256 dataset to evaluate generalizability to more complex, higher-resolution text-to-image tasks.
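Efficiency figures of this kind can be reproduced approximately with a sketch such as the following, which assumes an open-source MAC counter (thop) and a simple wall-clock timing loop; the exact profiling tools behind the reported numbers are not specified here, so this is illustrative only.

```python
import time
import torch
from thop import profile  # assumed MAC/parameter counter (pip install thop)

@torch.no_grad()
def benchmark_decoder(decoder, latent_dim=128, attr_dim=40, clip_dim=512,
                      n_runs=100, device="cuda"):
    """Rough MACs and FPS estimate for single-image generation (batch size 1)."""
    decoder.eval().to(device)
    z = torch.randn(1, latent_dim, device=device)
    a = torch.zeros(1, attr_dim, device=device)
    c = torch.randn(1, clip_dim, device=device)
    macs, params = profile(decoder, inputs=(z, a, c))  # MACs for one decoder forward pass
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(n_runs):
        decoder(z, a, c)
    torch.cuda.synchronize()
    fps = n_runs / (time.time() - start)
    return macs, params, fps
```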

4.4.1. Performance on CelebA 64 × 64

In evaluating our CVAE-Mapping model’s performance on the CelebA 64 × 64 dataset, we compare it against various generative approaches, particularly focusing on efficiency within this resource-constrained context (Table 4). We define “lightweight” by a synergistic combination of low parameter count, low inference computational cost (MACs), high inference speed (FPS), inherent training stability, and reduced reliance on extensive paired text–image data. These attributes are paramount for the practical deployment scenarios motivating our research.
Table 4 presents a comprehensive comparison on the CelebA 64 × 64 dataset.
Efficiency and Parameter Count: Our CVAE-Mapping model (42 M parameters, 3.2 GMACs) demonstrates significant efficiency, particularly within the VAE family (VAE-LD: 96 M, VDVAE: 112 M) and compared to diffusion-based alternatives like MobileDiffusion (58 M, approx. 4.45 GMACs). This translates to a high inference speed of 21 FPS, advantageous for resource-constrained applications. While FastGAN (30 M) is more parameter-efficient, its computational cost is not directly comparable.
Generation Quality (FID and IS): On CelebA 64 × 64, our CVAE-Mapping achieves an FID of 40.53 and an IS of 2.43. This FID is higher than larger VAEs (VAE-LD: 25.8, VDVAE: 23.25), reflecting the trade-off for lightweight design. Compared to other lightweight methods, our FID is comparable to MobileDiffusion (40) and slightly higher than FastGAN’s unconditional FID (35.5). Our IS is respectable, though lower than MobileDiffusion (2.89). It is important to contextualize these FID values, as direct comparison across different architectural paradigms (GAN vs. Diffusion vs. VAE/CVAE) is nuanced due to their varying strengths (e.g., GANs for sharpness, Diffusion for diversity). While advanced architectures often achieve lower FIDs, they typically do so with substantially larger models or greater training complexity. Simpler VAEs, for instance, often struggle with low FID in generation tasks [39].
In summary, for CelebA, our CVAE-Mapping model offers a compelling balance: it provides reasonable generation quality for a CVAE-based approach while excelling in parameter count, computational cost, and inference speed, making it highly suitable for scenarios prioritizing efficiency. To further assess its robustness and applicability to more diverse and complex visual concepts, we extend our evaluation to the MS COCO dataset.

4.4.2. Generalization to MS COCO 256 × 256

To assess the broader applicability and generalization capabilities of our CVAE-Mapping framework, we extended our evaluation to the more challenging MS COCO dataset [34], performing text-to-image generation at a 256 × 256 resolution. For experiments on MS COCO, image–text pairs were sourced from the COCO 2014 training set, with evaluations performed on the COCO 2014 validation set, utilizing captions as text prompts. Images were resized to 256 × 256 pixels and normalized to the [−1, 1] range, which was consistent with our CelebA preprocessing. The core CVAE architecture (Section 3.1.1) was adapted for this higher resolution primarily by adjusting the channel dimensions in the initial and final convolutional/transposed convolutional layers to accommodate the larger feature maps, resulting in the CVAE-Mapping model for MS COCO having approximately 54 M trainable parameters. The AdamW optimizer [35] was employed with an initial learning rate of 1 × 10⁻⁴. The model was trained for 50 epochs. Other fundamental training settings, where applicable, were kept consistent with the CelebA experiments to ensure a fair comparison of the framework's adaptability. This benchmark, with its greater semantic diversity and image complexity compared to CelebA, serves as a rigorous test for model robustness and scalability.
Table 5 presents the performance of our model on MS COCO alongside representative T2I baselines. Our CVAE-Mapping model, adapted for this task with 54 M parameters, achieves an FID of 24.26, a CLIP Score range of 0.275–0.287, and an R-precision of 0.32, while operating at an inference speed of 8 FPS.
Performance Analysis on MS COCO: The FID of 24.26 on MS COCO, while higher than that of leading large-scale diffusion models like Fluid and U-ViT-S/2, is notably competitive, outperforming some larger GAN-based counterparts such as StyleGAN-T (26.7) and Paella (26.7). This performance is achieved with only 54M parameters, underscoring our model’s significant lightweight advantage compared to most models listed, some of which scale to billions of parameters. Our text–image alignment, reflected by a CLIP Score of 0.275–0.287 and R-precision of 0.32, remains strong and comparable to several more complex models (e.g., R-precision similar to TIGER and Fluid). This validates the effectiveness of our CLIP-based conditioning and MLP mapping network in bridging the text–image modality gap even for the diverse semantic content present in MS COCO.
Generalization and Efficiency Trade-offs: Adapting our model from CelebA (42 M params, 21 FPS, FID 40.53 for 64 × 64) to MS COCO (54 M params, 8 FPS, FID 24.26 for 256 × 256) involved a modest parameter increase and a reduction in inference speed, as expected for the higher resolution and complexity. The observed improvement in FID on MS COCO compared to CelebA, despite MS COCO’s greater difficulty, is an interesting finding, potentially reflecting the FID metric’s dataset-specific sensitivities or our model’s capacity to better leverage its architecture on more varied semantic inputs once scaled. Crucially, these results affirm our framework’s ability to generalize beyond simpler datasets like CelebA to more complex visual scenes, generating semantically diverse content at a higher resolution while maintaining a favorable efficiency profile. This is further illustrated by the qualitative examples in Figure 6, which showcase the model’s capability to generate images semantically aligned with diverse text prompts, capturing key elements of the described scenes. While the visual realism may not match multi-billion parameter models, the results highlight the effectiveness of our lightweight approach for achieving reasonable text-to-image synthesis for varied content under resource constraints.
Positioning Relative to Advanced Diffusion Models: It is important to position our CVAE-based approach distinctly from state-of-the-art large-scale diffusion models. Architectures like eDiff-I [10], while achieving superior FID scores, are typically orders of magnitude larger in parameter count and computational demand. Our CVAE-Mapping model is not designed to directly compete on raw generation fidelity with these giants but rather to offer a uniquely efficient and practical CVAE-based solution. It prioritizes a balance suitable for resource-constrained scenarios where extreme lightweightness, fast inference, and reduced data dependency are paramount, even if it means a trade-off in peak visual quality compared to SOTA diffusion models. This positions our work as a valuable alternative for applications demanding broader accessibility and on-device feasibility.
Qualitative examples of images generated by our CVAE-Mapping model on MS COCO 256 × 256 are shown in Figure 6. Visual inspection of these examples demonstrates the model’s capability to generate images that are semantically aligned with diverse text prompts and capture key elements of the described scenes such as sheep in the grass, kites in the sky, clock towers, and baseball players. Although the visual realism might not match that of significantly larger state-of-the-art models, the results highlight the effectiveness of our lightweight framework in generalizing to a more complex dataset and achieving reasonable text-to-image synthesis for varied content.

4.4.3. Overall Efficiency and Performance Summary

Considering the evaluations on both CelebA 64 × 64 and MS COCO 256 × 256, our CVAE-Mapping framework consistently demonstrates its core strength: achieving a practical balance between generative performance (both fidelity and text–image alignment) and computational resource constraints. On CelebA, it offers significant efficiency gains (42 M params, 21 FPS) over larger VAEs while maintaining reasonable FID for its class. On the more challenging MS COCO, the model (54 M params, 8 FPS) proves its ability to generalize to diverse semantics and higher resolutions, delivering competitive alignment scores and mid-range FID with a parameter count that remains substantially lower than many contemporary T2I systems. This multifaceted efficiency (low parameter count, respectable inference speed, reduced data dependency due to CLIP pre-training and CVAE’s training regime) combined with stable training positions CVAE-Mapping as a promising and practical direction for T2I generation in environments where computational resources are a primary concern, such as on-device applications or specialized domains with limited data. While further platform-specific optimizations (e.g., quantization, pruning) would be necessary for optimal on-device deployment, the inherent efficiency demonstrated establishes a strong foundation.

4.5. Analysis of Learned Latent Space Structure

The structural integrity and semantic organization of the latent space learned by our dual-conditioned CVAE (regularized with Rényi divergence, α = 0.7 ) are critical indicators of its generative capabilities. To investigate these aspects, we performed linear interpolations between latent mean vectors ( μ ) derived from pairs of distinct real images. Specifically, for each pair ( x 1 , x 2 ) , their respective encodings yielded μ 1 ,   σ 1 and μ 2 ,   σ 2 . Interpolated latent codes were then generated as z i n t e r p = ( 1 λ ) μ 1 + λ μ 2 , where λ varied from 0 to 1. During decoding, the conditional inputs—attribute vector a 1 and CLIP image embedding c i m a g e , 1 —were held constant, corresponding to those of the first image ( x 1 ). This experimental design isolates the visual transformations attributable primarily to traversals within the latent space z .
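The interpolation procedure can be summarized by the following sketch, which assumes the encoder/decoder interfaces introduced in Section 3.1.1; names are illustrative.

```python
import torch

@torch.no_grad()
def interpolate_latents(encoder, decoder, x1, x2, a1, c_image1, a2, c_image2, steps=8):
    """Linear interpolation between the latent means of two images; the conditional inputs
    (attribute vector and CLIP image embedding) are held fixed to those of the first image."""
    mu1, _ = encoder(x1, a1, c_image1)
    mu2, _ = encoder(x2, a2, c_image2)
    frames = []
    for lam in torch.linspace(0.0, 1.0, steps):
        z_interp = (1.0 - lam) * mu1 + lam * mu2   # z_interp = (1 - lambda) * mu1 + lambda * mu2
        frames.append(decoder(z_interp, a1, c_image1))  # conditions fixed to image 1
    return torch.cat(frames, dim=0)                     # (steps, 3, 64, 64)
```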
Representative interpolation sequences are presented in Figure 7. A consistent observation across diverse image pairs is the remarkable smoothness of the visual transitions, devoid of perceptible artifacts or abrupt discontinuities. This inherent smoothness signifies a well-regularized latent manifold, a foundational property for coherent generation. More profoundly, these interpolations reveal evidence of semantic structure and potential disentanglement within the latent space. For instance, in Sample Pair 2 (Figure 7, second row), a continuous and naturalistic transformation of hairstyle and subtle hair color is observed, progressing from shorter, darker hair to longer, lighter, and more voluminous styles. Crucially, this significant alteration in a primary visual attribute occurs while the core facial identity and the subject’s smiling expression are largely preserved. This suggests that the latent space has encoded variations related to hair characteristics along specific, traversable pathways that are, to a notable extent, orthogonal to those encoding identity and expression.
Further evidence of such structured representation is found in Sample Pairs 4 and 5 (Figure 7, fourth and fifth rows). Here, the foreground subject’s identity remains stable while the background ambiance, lighting conditions, and degree of blur undergo smooth, continuous transformations (e.g., transitioning from a cool, bright, sharp background to a warmer, dimmer, more diffused one in Pair 4). This capacity to modulate global scene characteristics independently of the primary subject implies that the latent space has learned to factorize these distinct sources of variation. Even more subtle transformations, such as minor pose adjustments or nuanced expression shifts while maintaining an overarching emotional tone (Sample Pairs 1 and 3), are also navigated smoothly. The collective evidence from these varied interpolations—spanning fine-grained facial attributes, identity preservation, and global scene properties—indicates that the learned latent space is not merely a compressed representation but possesses a rich semantic organization.
We posit that this desirable latent structure is a synergistic outcome of our model’s dual-conditioning mechanism and the choice of Rényi divergence as the regularization term. The explicit attribute vector a provides targeted, low-dimensional guidance, while the high-dimensional CLIP image embedding c i m a g e offers rich semantic context. The Rényi divergence (with α = 0.7 ), by potentially fostering a more expressive posterior approximation that encourages better coverage of diverse data modes and discourages posterior collapse (as theorized in Section 2.2), may play a crucial role in shaping a latent space where semantic factors are more effectively disentangled and represented along continuous, navigable manifolds. This structured organization is paramount for enabling fine-grained, controllable image synthesis and editing. While the presented interpolations offer compelling qualitative insights, a rigorous quantitative assessment of disentanglement (e.g., using established metrics) and a systematic study of attribute-specific latent traversals would be necessary to fully substantiate these claims and precisely delineate the individual contributions of each model component to the learned latent structure. Nevertheless, the current analysis strongly suggests that our lightweight framework learns a highly structured and semantically meaningful latent space conducive to versatile and coherent image generation.

5. Conclusions

The substantial computational requirements and extensive need for paired data of state-of-the-art text-to-image (T2I) models pose significant barriers to their practical deployment, especially on resource-constrained platforms and in domains with scarce data. To tackle these challenges, we proposed and assessed a lightweight T2I generation framework. Our core approach is founded on a dual-conditioned Conditional Variational Autoencoder (CVAE) that efficiently utilizes semantic guidance from pre-trained CLIP embeddings and enables explicit attribute control. This fundamental design is crucial for the framework’s lightweight characteristic and diminishes the heavy dependence on extensive paired text–image data during the core CVAE training stage.
Within this efficient CVAE framework, we incorporated and validated two pivotal technical components that enhance generation performance. We introduced a specialized text-to-image mapping network based on an MLP architecture chosen for its balance of effectiveness and efficiency within our lightweight design goals to effectively bridge the modality gap between CLIP text and image embeddings, thereby improving semantic alignment and generation fidelity. The rationale for this architectural choice, including an exploratory comparison against a minimalist Transformer encoder, which further validated the MLP’s superior performance-to-parameter ratio for this task, is detailed in Appendix A.2 and Appendix F (see Table A5). Moreover, we employed Rényi divergence for latent space regularization to enhance generative diversity while maintaining acceptable reconstruction fidelity, as evidenced by our ablation studies.
Comprehensive experiments conducted on the CelebA dataset demonstrate that our integrated CVAE-Mapping model attains competitive generation quality (FID: 40.53) with a significantly reduced computational footprint (42 M parameters, facilitating real-time generation at 21 FPS). This performance validates our framework as a viable and efficient alternative for T2I synthesis. The results underscore its practical potential for deployment in environments with limited computational resources and its suitability for specialized domains by alleviating the requirement for extensive paired text–image data during core model training.
Future research will focus on extending this framework to higher resolutions and exploring more advanced techniques. Regarding conditional attribute information from text, while our current keyword-based attribute extraction method (detailed in Appendix B.1) aligns with the model’s lightweight nature due to its simplicity and low computational cost, its sensitivity to linguistic variations beyond direct keyword matches has been noted. A qualitative analysis, presented in Appendix B.1 (see Table A4 and Figure A1), illustrates this limitation and suggests a clear avenue for enhancement. Implementing more sophisticated Natural Language Processing (NLP) methods for attribute derivation could significantly improve user control and the model’s ability to interpret nuanced prompts more accurately. Concerning the mapping network, while our MLP demonstrated strong performance and efficiency (Appendix F), further investigation into other compact, modern architectures can be pursued for potential marginal gains while the trade-off with complexity in resource-constrained scenarios is being considered. Further avenues for future work also include investigating alternative divergence measures for potentially greater gains in generation quality or diversity and exploring architectural refinements to boost image fidelity without substantially increasing the computational demands, thereby maintaining the model’s suitability for resource-constrained applications. Ultimately, this work contributes toward making text-to-image generation technology more accessible and practical across a wider range of devices and domains.

Author Contributions

Conceptualization, Y.W. and G.Z.; Methodology, G.Z.; Software, Y.W.; Validation, Y.W.; Formal analysis, Y.W.; Investigation, Y.W.; Resources, Y.W.; Data curation, Y.W.; Writing—original draft, Y.W.; Writing—review & editing, Y.W. and G.Z.; Visualization, Y.W.; Supervision, G.Z.; Project administration, Y.W. and G.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data presented in this study are available in this article.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Network Architecture Details

This appendix provides comprehensive architectural details of the encoder and decoder networks used in our Conditional Variational Autoencoder (CVAE) model. The precise network architecture is crucial for reproducibility and for understanding the computational efficiency of our approach. Table A1 and Table A2 present the layer-by-layer specifications of the encoder and decoder networks, respectively, including input/output shapes, kernel sizes, stride values, padding configurations, and the activation functions used.

Appendix A.1. Encoder and Decoder Network Specifications

Table A1. Detailed Architecture of the CVAE Encoder.

| Layer | Input Shape | Output Shape | Kernel Size | Stride | Padding | Activation | Batch Norm |
| Conv1 | (3 + 40 + 512, 64, 64) | (64, 32, 32) | 4 | 2 | 1 | ReLU | No |
| Conv2 | (64, 32, 32) | (128, 16, 16) | 4 | 2 | 1 | ReLU | Yes |
| Conv3 | (128, 16, 16) | (256, 8, 8) | 4 | 2 | 1 | ReLU | Yes |
| Conv4 | (256, 8, 8) | (512, 4, 4) | 4 | 2 | 1 | ReLU | Yes |
| Flatten | (512, 4, 4) | (8192) | - | - | - | - | - |
| FC (μ) | (8192) | (128) | - | - | - | - | - |
| FC (σ) | (8192) | (128) | - | - | - | - | - |
Table A2. Detailed Architecture of the CVAE Decoder.

| Layer | Input Shape | Output Shape | Kernel Size | Stride | Padding | Activation | Batch Norm |
| FC (Input) | (128 + 40 + 512) | (8192) | - | - | - | - | - |
| Reshape | (8192) | (512, 4, 4) | - | - | - | - | - |
| ConvTrans1 | (512, 4, 4) | (256, 8, 8) | 4 | 2 | 1 | ReLU | Yes |
| ConvTrans2 | (256, 8, 8) | (128, 16, 16) | 4 | 2 | 1 | ReLU | Yes |
| ConvTrans3 | (128, 16, 16) | (64, 32, 32) | 4 | 2 | 1 | ReLU | Yes |
| ConvTrans4 | (64, 32, 32) | (3, 64, 64) | 4 | 2 | 1 | Tanh | No |

Appendix A.2. Text-to-Image Mapping Network Architecture

Table A3. Detailed Architecture of the Text-to-Image Mapping Network.

| Layer | Input Shape | Output Shape | Activation | Batch Norm | Dropout |
| Linear 1 | (512) | (2048) | ReLU | Yes | 0.3 |
| Linear 2 | (2048) | (1024) | ReLU | Yes | 0.3 |
| Linear 3 | (1024) | (512) | ReLU | Yes | - |
| Linear 4 | (512) | (512) | - | No | - |

Appendix B. Implementation Details and Algorithms

Appendix B.1. Attribute Vector Generation (get_attributes_from_prompt)

To provide explicit control over generated image characteristics alongside the semantic guidance from CLIP embeddings, our model utilizes a 40-dimensional attribute vector a . This vector corresponds directly to the 40 binary facial attributes annotated in the CelebA dataset [30] (e.g., ‘Smiling’, ‘Blond_Hair’, ‘Wearing_Hat’, ‘Male’, etc.). During the generation phase (Section 3.3, Algorithm A3), this vector a is derived from the input text prompt t e x t using a predefined keyword mapping mechanism, conceptually represented by the function get_attributes_from_prompt(text).
The implementation employs a straightforward keyword spotting approach:
  • Initialization: The 40-dimensional vector a is initialized to all zeros. This represents the neutral or default state for each attribute.
  • Keyword Matching: The input text prompt is parsed to identify predefined keywords associated with each of the 40 CelebA attributes. A lexicon maps specific words or short phrases to their corresponding attribute index and target value (typically 1 for presence, potentially −1 or 0 for explicit absence, although our implementation primarily focuses on presence detection).
  • Vector Update: If a keyword indicating the presence of an attribute is found in the text prompt, the corresponding element in the vector a is set to 1. For example:
    • Input text: “A smiling woman with blond hair and wearing glasses”.
    • Keywords detected: “smiling”, “blond hair”, “wearing glasses”.
    • Corresponding indices in a (representing ‘Smiling’, ‘Blond_Hair’, ‘Wearing_Glasses’) are set to 1.
    • All other 37 elements remain 0.
  • Handling Negation (Limited): Basic negation (e.g., “no beard”) can be handled by mapping it to a specific value (e.g., 0 or potentially a negative value if the model was trained to interpret it) for the corresponding attribute (‘No_Beard’ or ‘Beard’). However, the robustness to complex phrasing or nuanced negation is limited.
Limitations and Rationale: This keyword-based approach is a simplification chosen for its low computational overhead, aligning with the lightweight nature of our overall model. It provides a direct mechanism for users to specify desired high-level attributes present in the CelebA annotation scheme. However, its limitations are clear:
  • Robustness: It is sensitive to the exact phrasing used in the text prompt. Synonyms or complex sentence structures not containing the predefined keywords may fail to activate the intended attribute.
  • Granularity: It cannot capture attribute nuances beyond the binary presence/absence defined in CelebA (e.g., intensity of a smile, specific style of hat).
  • Scope: It is limited to the 40 attributes defined in the CelebA dataset.
Despite these limitations, this explicit attribute vector $a$ acts as a complementary control signal to the richer, but less directly controllable, semantic information provided by the CLIP image embedding $c_{predicted\_image}$, allowing for a degree of targeted feature manipulation within our efficient framework. Future work could explore more sophisticated Natural Language Processing techniques to derive attributes with greater robustness and granularity. This detailed description is intended to clarify the scope of the attribute control mechanism and to support reproducibility.
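To make the mechanism concrete, the following Python sketch illustrates the keyword-spotting logic described above. The lexicon shown is a small hypothetical excerpt rather than the full 40-attribute mapping, and negation handling is omitted for brevity.

```python
# Illustrative sketch of get_attributes_from_prompt (Appendix B.1).
# The attribute list and lexicon are hypothetical excerpts, not the full CelebA mapping.
import re
import numpy as np

CELEBA_ATTRS = ["Smiling", "Blond_Hair", "Wearing_Glasses", "Male"]  # excerpt of the 40 attribute names
KEYWORD_LEXICON = {
    "smiling": ("Smiling", 1),
    "blond hair": ("Blond_Hair", 1),
    "wearing glasses": ("Wearing_Glasses", 1),
    "glasses": ("Wearing_Glasses", 1),
}

def get_attributes_from_prompt(text, attr_names=CELEBA_ATTRS, lexicon=KEYWORD_LEXICON):
    """Binary attribute vector via simple keyword spotting (presence detection only)."""
    a = np.zeros(len(attr_names), dtype=np.float32)  # all attributes start in the neutral state
    lowered = text.lower()
    for keyword, (attr, value) in lexicon.items():
        if re.search(r"\b" + re.escape(keyword) + r"\b", lowered):  # word-boundary keyword match
            a[attr_names.index(attr)] = value
    return a

# "A smiling woman with blond hair and wearing glasses" activates Smiling, Blond_Hair, Wearing_Glasses.
```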

Qualitative Analysis of Attribute Vector Generation Robustness

This subsection provides a qualitative analysis of the robustness of our keyword-based attribute vector generation method (described in Appendix B.1). The aim is to illustrate its effectiveness with precise keywords and its sensitivity to linguistic variations—a trade-off made for computational simplicity within our lightweight model.
Experimental Setup: We selected key facial attributes (e.g., ‘Smiling’, ‘Blond Hair’, ‘Wearing Glasses’) and crafted various text prompts. These prompts ranged from those containing direct, predefined keywords to those using synonyms or descriptive phrases that imply the attribute without using the exact keyword. We then recorded the value (1 for present, 0 for absent) of the target attribute in the vector a generated by our get_attributes_from_prompt function.
Results and Discussion: Table A4 highlights the performance of our keyword-matching approach across different textual inputs. A value of ‘1’ indicates the target attribute was successfully activated, while ‘0’ indicates it was not.
Table A4. Effectiveness of Keyword-Based Attribute Extraction with Varied Prompts.

| Text Prompt | Target Attribute | Generated Value (0 or 1) |
|---|---|---|
| "A smiling woman" | Smiling | 1 |
| "A joyful person" | Smiling | 0 |
| "Man with blond hair" | Blond Hair | 1 |
| "Woman, her hair is golden" | Blond Hair | 0 |
| "Person wearing glasses" | Wearing Glasses | 1 |
| "A man with spectacles" | Wearing Glasses | 0 |
| "A person not wearing a hat" | Wearing Hat | 0 |
| "A bald man" | Wearing Hat | 0 |
As Table A4 demonstrates, the keyword-based attribute extraction is reliable when predefined keywords are explicitly present in the text prompt, allowing for direct control over these attributes. However, its sensitivity to linguistic variations, such as the use of synonyms (e.g., “joyful” instead of “smiling”; “golden” instead of “blond hair”; “spectacles” instead of “wearing glasses”), is also evident, often resulting in the failure to activate the intended attribute.
This limitation directly impacts the fine-grained control over specific visual details in the generated images. Figure A1 provides a clear visual example using the prompt "Woman, her hair is golden". Figure A1a was generated with the ‘Blond Hair’ attribute manually set to 1, resulting in an image accurately depicting golden/blond hair. In contrast, Figure A1b, generated using the attribute vector produced by our keyword-matching method (where ‘Blond Hair’ was 0 due to the absence of the "blond hair" keyword), lacks this specific hair color, even though other semantic cues from the CLIP conditioning might be present. To isolate the effect of the attribute vector $a$, both images were generated using the identical latent vector $z$ and predicted CLIP image embedding $c_{predicted\_image}$.
Figure A1. Visual impact of attribute extraction for the prompt “Woman, her hair is golden”. (a) Image generated with the ‘Blond Hair’ attribute ideally set to 1. (b) Image generated with the attribute vector from keyword matching, where ‘Blond Hair’ was 0. The difference clearly illustrates how the accuracy of the attribute vector affects specific visual details.
The adoption of this simplified keyword-based approach was a deliberate design choice aimed at maintaining low computational overhead in line with the lightweight philosophy of our model. While it offers a degree of explicit control, we fully acknowledge its limitations in interpreting the full spectrum of natural language. Future research could fruitfully explore the integration of more sophisticated Natural Language Processing (NLP) techniques for attribute extraction. Such advancements could enhance robustness to linguistic variations and further refine the fine-grained controllability and expressiveness of the generated images, potentially without a significant increase in computational demands for the generation phase itself.

Appendix B.2. Algorithms

Algorithm A1 CLIP Embedding-based Conditional VAE Training Procedure
Input:
  • Training dataset: $\mathcal{D} = \{(x^{(i)}, a^{(i)}, c^{(i)})\}_{i=1}^{N}$, where $x^{(i)}$ represents images, $a^{(i)}$ represents ground-truth CelebA attribute vectors, and $c^{(i)}$ denotes the corresponding CLIP image embeddings
  • Model: Conditional VAE (encoder $q_\phi(z \mid x, a, c)$, decoder $p_\theta(x \mid z, a, c)$)
  • Optimizer: Adam
  • Hyperparameters: batch size $B$, latent dimension latent_dim, learning rate $\alpha$, Rényi divergence parameter $\alpha_R$
Output:
  • Trained model parameters: $\theta$ (decoder parameters), $\phi$ (encoder parameters)
Initialization: Randomly initialize model parameters $\theta$ and $\phi$
[Iterative training-loop pseudocode rendered as an image in the original publication.]
Annotation: This algorithm details the training procedure for the core Conditional VAE model, as described in Section 3.2.1. Note that the inputs include the image $x^{(i)}$, the explicit attribute vector $a^{(i)}$, and, crucially, the corresponding pre-computed CLIP image embedding $c^{(i)}$. This dual conditioning is central to the CVAE training phase. The loss function combines a Mean Squared Error reconstruction term ($\mathcal{L}_{recon}$) with the closed-form Rényi divergence ($\mathcal{L}_{Rényi}$), using the empirically selected parameter $\alpha_R = 0.7$ (Section 4.2) and $\beta = 1$. Training hyperparameters such as the learning rate ($\alpha$) and batch size ($B$) are specified in Section 4.1.
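For concreteness, the sketch below outlines one training step consistent with Algorithm A1 and the annotation above. It is an illustrative reconstruction, not the authors' code: the Rényi term uses the standard closed form for the divergence between a diagonal Gaussian posterior and the standard normal prior (valid for $0 < \alpha < 1$), and the authors' exact formulation may differ in detail.

```python
# Illustrative sketch of one CVAE training step (Algorithm A1); not the authors' code.
import torch
import torch.nn.functional as F

def renyi_divergence(mu, logvar, alpha=0.7):
    """Closed-form Renyi divergence D_alpha(N(mu, diag(sigma^2)) || N(0, I)) for 0 < alpha < 1."""
    var = logvar.exp()
    v = alpha + (1.0 - alpha) * var                      # diagonal of alpha*I + (1 - alpha)*Sigma_q
    quad = 0.5 * alpha * (mu.pow(2) / v).sum(dim=1)      # mean (quadratic) term
    logdet = (v.log() - (1.0 - alpha) * logvar).sum(dim=1) / (2.0 * (1.0 - alpha))
    return quad + logdet                                 # per-sample value; reduces to KL as alpha -> 1

def training_step(encoder, decoder, optimizer, x, a, c_image, alpha_r=0.7, beta=1.0):
    mu, logvar = encoder(x, a, c_image)
    z = mu + (0.5 * logvar).exp() * torch.randn_like(mu)  # reparameterization trick
    x_recon = decoder(z, a, c_image)
    loss = F.mse_loss(x_recon, x) + beta * renyi_divergence(mu, logvar, alpha_r).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

As $\alpha_R \to 1$, the Rényi term in this sketch reduces to the standard KL divergence, matching the CVAE-KL reference configuration ($\alpha = 1.0$) discussed in Appendix D and Table 1.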
Algorithm A2 Mapping Network Training
Input:
  • Training dataset: $\mathcal{D} = \{(c_{text}^{(i)}, c_{image}^{(i)})\}_{i=1}^{N}$, where $c_{text}^{(i)}$ are CLIP text embeddings and $c_{image}^{(i)}$ are the corresponding CLIP image embeddings
  • Model: Mapping network (mapping_network)
  • Optimizer: Adam
  • Hyperparameters: batch size $B$, learning rate $\alpha$
  • Number of Epochs: epochs
Output:
  • Trained mapping network parameters: $\theta_{map}$
Initialization: Randomly initialize mapping network parameters $\theta_{map}$
[Training-loop pseudocode rendered as an image in the original publication.]
Annotation: This algorithm outlines the separate training process for the text-to-image mapping network (mapping_network, architecturally defined in Section 3.1.2), as discussed in Section 3.2.2. This network learns a direct transformation $f_{\theta_{map}}$ from pre-computed CLIP text embeddings ($c_{text}^{(i)}$) to predicted CLIP image embeddings ($\hat{c}_{image}^{(i)}$). Training minimizes the Mean Squared Error (MSE) between the predicted image embedding and the ground-truth image embedding ($c_{image}^{(i)}$). Training hyperparameters are specified in Section 4.1.
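A minimal sketch of this training loop is given below; the data loader yielding pre-computed CLIP embedding pairs is assumed, and all names are illustrative.

```python
# Illustrative sketch of the mapping-network training loop (Algorithm A2); not the authors' code.
import torch
import torch.nn.functional as F

def train_mapping_network(mapping_network, loader, epochs=50, lr=1e-4, device="cuda"):
    mapping_network.to(device).train()
    optimizer = torch.optim.Adam(mapping_network.parameters(), lr=lr)
    for _ in range(epochs):
        for c_text, c_image in loader:                    # pre-computed CLIP text/image embedding pairs
            pred = mapping_network(c_text.to(device))     # predicted image embedding
            loss = F.mse_loss(pred, c_image.to(device))   # MSE against the ground-truth image embedding
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return mapping_network
```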
Algorithm A3 Text-to-Image Generation using CLIP and CVAE
Input:
  • Text description: text
  • Pretrained CLIP model: clip_model (text encoder component)
  • Trained mapping network: mapping_network (with parameters $\theta_{map}$)
  • Trained CVAE decoder: decoder (with parameters $\theta$)
  • Parameters: latent dimension latent_dim, number of images to generate $n$
Output:
  • Generated images: $\{\hat{x}^{(i)}\}_{i=1}^{n}$
/* Text Encoding */
Encode the text using the CLIP text encoder: $c_{text} =$ clip_model.encode_text(text)
/* Text-to-Image Mapping */
Predict the image embedding using the mapping network: $c_{image} =$ mapping_network($c_{text}$)
/* Attribute Vector Derivation */
Derive the attribute vector from the text prompt using keyword mapping (see Appendix B.1):
   $a =$ get_attributes_from_prompt(text)
[Remaining steps (latent sampling and decoding) rendered as an image in the original publication.]
Annotation: This algorithm describes the inference pipeline for generating images from text descriptions, corresponding to Section 3.3 and illustrated in Figure 2. The process starts by encoding the input text with the pre-trained CLIP model. The resulting text embedding ($c_{text}$) is then transformed by the trained mapping network (Algorithm A2) into a predicted image embedding ($c_{image}$). Concurrently, an attribute vector ($a$) is derived from the text (Appendix B.1). Finally, the trained CVAE decoder (from Algorithm A1) synthesizes the output image $\hat{x}^{(i)}$ conditioned on the predicted image embedding ($c_{image}$), the derived attribute vector ($a$), and a randomly sampled latent vector $z^{(i)}$. Note that the CVAE encoder is not used during this generation phase.
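The following sketch mirrors this inference pipeline. It assumes the openai/CLIP package with the ViT-B/32 text encoder (a 512-dimensional embedding, matching the dimensions used throughout); get_attributes_from_prompt is the keyword-spotting function of Appendix B.1 returning the full 40-dimensional vector, and all module and variable names are illustrative.

```python
# Illustrative sketch of the generation pipeline (Algorithm A3); not the authors' code.
import torch
import clip  # openai/CLIP package (assumed); clip.load("ViT-B/32") yields 512-d embeddings

@torch.no_grad()
def generate_images(text, clip_model, mapping_network, decoder, n=4, latent_dim=128, device="cuda"):
    tokens = clip.tokenize([text]).to(device)
    c_text = clip_model.encode_text(tokens).float()                        # (1, 512) CLIP text embedding
    c_image = mapping_network(c_text)                                      # predicted CLIP image embedding
    a = torch.from_numpy(get_attributes_from_prompt(text)).to(device)      # 40-d attribute vector (Appendix B.1)
    a = a.unsqueeze(0)
    z = torch.randn(n, latent_dim, device=device)                          # fresh latent vector per image
    return decoder(z, a.expand(n, -1), c_image.expand(n, -1))              # (n, 3, 64, 64) in [-1, 1]

# Usage (assumed setup): clip_model, _ = clip.load("ViT-B/32", device="cuda")
# imgs = generate_images("A smiling woman with blond hair", clip_model, mapping_network, decoder)
```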

Appendix C

Appendix C.1. Additional Generation Examples

Figure A2 presents additional comparison examples of images generated by the CVAE-Mapping and CVAE-Text models for a prompt related to wearing glasses. These supplementary examples demonstrate that CVAE-Mapping more accurately captures the detailed features in text descriptions, such as glasses, hairstyles, and facial expressions, further supporting the effectiveness of the mapping network in enhancing image generation quality and fine-grained control.
Figure A2. Comparison examples of images generated with a prompt related to wearing glasses. (a) Images generated by CVAE-Mapping; (b) Images generated by CVAE-Text. Both models use the same text prompt, attribute vectors, and latent vectors. CVAE-Mapping generates images that more accurately represent the specified attributes.
Figure A3 shows another set of comparison examples with a prompt describing shocked expressions. The visual comparison further highlights how CVAE-Mapping better translates emotional descriptions into corresponding facial features compared to the baseline CVAE-Text approach.
Figure A3. Comparison examples of images generated with a prompt related to shocked expressions. (a) Images generated by CVAE-Mapping; (b) Images generated by CVAE-Text. The examples demonstrate CVAE-Mapping’s superior ability to translate emotional descriptions into appropriate facial expressions.

Appendix D

Model Training Convergence

Figure A4 presents the training convergence curves for our CVAE models. Figure A4a shows the convergence dynamics of the CLIP-CVAE model using Rényi divergence with the optimal parameter α = 0.7 . The plot demonstrates a stable training process with both reconstruction loss and overall ELBO decreasing consistently throughout the training epochs, confirming the effectiveness of our divergence parameter selection. Figure A4b displays the convergence comparison between CVAE-KL (standard KL divergence, α = 1.0 ) and CVAE-Rényi ( α = 0.7 ) models. This comparison illustrates that while CVAE-KL achieves marginally lower reconstruction loss, as expected from our ablation study results in Table 1, both approaches maintain stable convergence trajectories, validating our choice of the Rényi divergence parameter.
Figure A4. Training convergence curves for our CVAE models. (a) Detailed convergence plot for CVAE-Rényi using α = 0.7 , showing the stabilization of both reconstruction loss and total loss (ELBO) over training epochs; (b) Convergence comparison between CVAE-KL ( α = 1.0 ) and CVAE-Rényi ( α = 0.7 ) models, demonstrating that both configurations achieve stable convergence with CVAE-KL showing slightly lower reconstruction loss as expected from our ablation study.

Appendix E. Baseline VAE Model Comparison Details

This appendix provides further details regarding the comparison between our proposed CVAE-Mapping model and the baseline VAE models, VAE-LD [37] and VDVAE [38], as summarized in Table 4 of the main text. The objective of this comparison was to position our lightweight model’s performance and efficiency relative to established, high-performance VAE architectures, specifically within the context of the 64 × 64 CelebA generation task used in our study.

Appendix E.1. Baseline Models

  • VAE-LD [37]: This model architecture is derived from the decoder principles of the GLIDE model. While often employed for text-conditional generation at higher resolutions (e.g., 256 × 256), its VAE foundation makes it a relevant baseline for evaluating generative quality within the VAE paradigm, even when adapted or evaluated at lower resolutions.
  • VDVAE [38]: The Very Deep VAE represents a powerful hierarchical VAE architecture known for achieving strong likelihood scores and high-fidelity image generation across various benchmarks. Its complexity and performance serve as a high-end benchmark for VAE capabilities relevant to our comparison.

Appendix E.2. Evaluation Methodology and Metric Verification

The primary metrics used for comparison in Table 4 were Fréchet Inception Distance (FID) for generation quality and parameter count/inference speed (FPS) for efficiency. All generation quality comparisons were conducted using the CelebA dataset processed to 64 × 64 resolution, consistent with the training and evaluation of our CVAE-Mapping model.
The baseline FID scores reported in Table 4 (VAE-LD: 25.8, VDVAE: 23.25) were sourced directly from the original publications [37,38]. Recognizing that performance metrics can be sensitive to evaluation protocols, dataset preprocessing, resolution, and specific implementation details, we performed a verification step to confirm the appropriateness of using these values as benchmarks for our 64 × 64 CelebA task.
We conducted a re-implementation targeting the core generative components of the VAE-LD and VDVAE architectures, adapting them where necessary for the 64 × 64 image generation task on CelebA. Key aspects of our verification setup included the following:
  • Framework: PyTorch 1.13.1.
  • Dataset: CelebA, center-cropped and resized to 64 × 64, normalized to [−1, 1], using the standard test split for reference statistics.
  • Architecture Adaptation: We focused on implementing the essential VAE encoder–decoder structures as described in the original papers, adjusting channel dimensions and potentially simplifying the hierarchical depth where appropriate to suit the 64 × 64 resolution while preserving the core architectural principles (e.g., residual blocks, attention mechanisms where applicable). The aim was to create faithful reproductions suitable for the target resolution.
  • Training Details: Models were trained using the Adam optimizer [35] with a learning rate of 1 × 10 4 and a batch size of 128. Training proceeded for a comparable number of epochs as our CVAE-Mapping model (50 epochs), monitoring convergence via validation loss plateau on a held-out subset of the training data. Training parameters were kept consistent with our main experiments where feasible.
  • FID Calculation: Crucially, FID was calculated using the exact same implementation (pytorch-fid package) and reference statistics (derived from the CelebA 64 × 64 test set) as used for evaluating our CVAE-Mapping model. This ensures a direct and fair comparison of the generative distribution quality under identical measurement conditions.
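For reference, a measurement along these lines can be scripted with the pytorch-fid package roughly as follows; the directory paths are hypothetical and the exact function signature may vary across package versions.

```python
# Illustrative FID measurement with the pytorch-fid package (paths are hypothetical).
import torch
from pytorch_fid.fid_score import calculate_fid_given_paths

fid = calculate_fid_given_paths(
    ["celeba64_test_images/", "generated_images/"],   # real reference set and generated samples
    batch_size=50,
    device="cuda" if torch.cuda.is_available() else "cpu",
    dims=2048,                                        # standard InceptionV3 pool3 features
)
print(f"FID: {fid:.2f}")
```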
Our re-implementation and evaluation yielded FID scores that were broadly consistent with the originally reported values, allowing for minor differences arising from the architectural adaptations to the 64 × 64 resolution and slight variations in training dynamics; such re-runs may produce slightly different absolute FID values, but they confirmed the same relative performance ranking and magnitude. This verification process substantiated that the performance levels cited from the original papers [37,38] serve as valid and relevant high-performance benchmarks for the 64 × 64 CelebA generation task within the context of our study, against which the trade-offs of our lightweight model can be assessed.

Appendix E.3. Performance Discussion

As noted in Table 4 and Section 4.4.1, our CVAE-Mapping model (FID: 40.53) shows a higher FID score compared to the verified baseline performances of VAE-LD (25.8) and VDVAE (23.25). This reflects the intended trade-off in our work: prioritizing model lightweightness (42 M parameters vs. 96 M/112 M) and computational efficiency (estimated 3.2 GMACs, 21 FPS) over achieving the absolute lowest FID possible with significantly larger models. The comparison validates that CVAE-Mapping offers a substantial reduction in resource requirements while maintaining reasonable generation fidelity within the VAE framework.

Appendix F. Exploratory Comparison of Mapping Network Architectures

To further investigate the architectural choice for the text-to-image embedding mapping network and address its perceived arbitrariness, we conducted an exploratory comparative experiment. The goal was not to perform an exhaustive search for the optimal architecture but rather to provide a conceptual validation of whether a more modern, attention-based mechanism, even if resulting in a comparable or slightly larger parameter count than our Multi-Layer Perceptron (MLP), would offer substantial advantages for this specific mapping task.

Appendix F.1. Minimalist Transformer Encoder as an Alternative

We implemented a minimalist single-layer Transformer encoder to serve as an alternative mapping architecture. This choice was motivated by the Transformer’s proven efficacy in sequence transduction and representation learning tasks. The Transformer encoder was configured as follows:
  • Input: CLIP text embeddings (dimension 512) treated as a sequence of length 1.
  • Attention Heads: 2 heads for multi-head self-attention.
  • Feed-Forward Network (FFN) Dimension: A hidden dimension of 1024 in the position-wise FFN (i.e., 512 → 1024 → 512).
  • Output Layer: A final linear layer to project the Transformer output to the target 512-dimensional image embedding space.
  • Normalization and Dropout: Layer normalization was applied, and a dropout rate of 0.1 was used.
The total number of trainable parameters for this minimalist Transformer encoder was approximately 5.12 M parameters. This is compared to our chosen MLP architecture (Appendix A.2), which has approximately 3.94 M parameters. The minimalist Transformer encoder was trained using the exact same dataset, loss function (MSE between predicted and ground-truth CLIP image embeddings), and optimization settings as our primary MLP-based mapping network (Section 3.1.2 and Algorithm A2).
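A sketch of this minimalist Transformer mapper, reconstructed from the configuration listed above (implementation details such as layer naming are assumptions), is shown below.

```python
# Illustrative sketch of the minimalist single-layer Transformer mapper (Appendix F.1).
import torch
import torch.nn as nn

class TransformerMapper(nn.Module):
    def __init__(self, dim=512, heads=2, ffn_dim=1024, dropout=0.1):
        super().__init__()
        self.encoder_layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=ffn_dim,
            dropout=dropout, batch_first=True,          # includes multi-head self-attention + LayerNorm
        )
        self.out = nn.Linear(dim, dim)                  # projection to the 512-d image-embedding space

    def forward(self, c_text):
        x = c_text.unsqueeze(1)                         # treat each embedding as a length-1 sequence: (B, 1, 512)
        x = self.encoder_layer(x)
        return self.out(x.squeeze(1))                   # (B, 512) predicted image embedding
```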

Appendix F.2. Comparative Results and Discussion

We evaluated the minimalist Transformer encoder on the same test set partition used for the MLP mapping network, focusing on the average cosine similarity between the predicted image embeddings and the ground-truth CLIP image embeddings. The results are presented in Table A5.
Table A5. Comparison of Mapping Network Architectures on CLIP Embedding Alignment (Appendix).

| Mapping Network Architecture | Approx. Parameters | Cosine Similarity (↑) |
|---|---|---|
| MLP (Our Primary Model) | 3.94 M | 0.81 ± 0.03 |
| Minimalist Transformer Encoder | 5.12 M | 0.78 ± 0.04 |

↑: higher is better.
The results from this exploratory comparison, shown in Table A5, provide strong support for our choice of MLP architecture. Our MLP model, with approximately 3.94 M parameters, achieved a superior average cosine similarity of 0.81 ± 0.03 . In contrast, the minimalist Transformer encoder, despite employing a more complex attention mechanism and having a notably larger parameter count (approximately 5.12 M parameters), yielded a lower cosine similarity of 0.78 ± 0.04 .
This outcome decisively favors the MLP in terms of both performance and parameter efficiency for this specific mapping task. It strongly suggests that, for aligning already semantically rich CLIP embeddings, the non-linear transformation capabilities of our well-configured MLP are not only highly effective but also more resource-efficient than those of a comparably tuned, larger, and more complex minimalist Transformer. The increased parameterization and architectural complexity of the Transformer did not translate into improved alignment in this constrained, direct embedding-to-embedding mapping scenario.
The primary purpose of this experiment was to ascertain if a basic MLP was an arbitrarily poor choice compared to a modern alternative. The findings compellingly demonstrate that the MLP is not only a robust and effective baseline but, in this direct comparison, also offers superior performance with fewer parameters. While the potential of more sophisticated or extensively tuned Transformer architectures cannot be entirely dismissed for all scenarios, this experiment robustly justifies our selection of the MLP as a well-suited and efficient mapping network within our lightweight T2I framework. Further exploration of highly optimized, compact attention-based models remains an area for future investigation, but the current evidence strongly supports the MLP’s efficacy and efficiency.
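For completeness, the alignment metric reported in Table A5 can be computed along the following lines; the embedding loader and function names are illustrative.

```python
# Illustrative computation of the mean cosine similarity reported in Table A5.
import torch
import torch.nn.functional as F

@torch.no_grad()
def mean_cosine_similarity(mapping_network, loader, device="cuda"):
    mapping_network.to(device).eval()
    sims = []
    for c_text, c_image in loader:                       # held-out CLIP text/image embedding pairs
        pred = mapping_network(c_text.to(device))
        sims.append(F.cosine_similarity(pred, c_image.to(device), dim=-1).cpu())
    sims = torch.cat(sims)
    return sims.mean().item(), sims.std().item()         # e.g., mean and spread over the test set
```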

References

  1. Ramesh, A.; Pavlov, M.; Goh, G.; Gray, S.; Voss, C.; Radford, A.; Chen, M.; Sutskever, I. Zero-Shot Text-to-Image Generation. arXiv 2021, arXiv:2102.12092. [Google Scholar]
  2. Esser, P.; Rombach, R.; Ommer, B. Taming Transformers for High-Resolution Image Synthesis. arXiv 2021, arXiv:2012.09841. [Google Scholar]
  3. Ramesh, A.; Dhariwal, P.; Nichol, A.; Chu, C.; Chen, M. Hierarchical Text-Conditional Image Generation with CLIP Latents. arXiv 2022, arXiv:2204.06125. [Google Scholar]
  4. Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-Resolution Image Synthesis with Latent Diffusion Models. arXiv 2022, arXiv:2112.10752. [Google Scholar]
  5. Saharia, C.; Chan, W.; Saxena, S.; Li, L.; Whang, J.; Denton, E.; Ghasemipour, S.K.S.; Ayan, B.K.; Mahdavi, S.S.; Lopes, R.G.; et al. Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding. arXiv 2022, arXiv:2205.11487. [Google Scholar]
  6. Lee, Y.; Sun, A.; Hosmer, B.; Acun, B.; Balioglu, C.; Wang, C.; Hernandez, C.D.; Puhrsch, C.; Haziza, D.; Guessous, D.; et al. Characterizing and Efficiently Accelerating Multimodal Generation Model Inference. arXiv 2024, arXiv:2410.00215. [Google Scholar]
  7. Ma, Z.; Zhang, Y.; Jia, G.; Zhao, L.; Ma, Y.; Ma, M.; Liu, G.; Zhang, K.; Li, J.; Zhou, B. Efficient Diffusion Models: A Comprehensive Survey from Principles to Practices. arXiv 2024, arXiv:2410.11795. [Google Scholar]
  8. Liu, B.; Zhu, Y.; Song, K.; Elgammal, A. Towards Faster and Stabilized GAN Training for High-fidelity Few-shot Image Synthesis. arXiv 2021, arXiv:2101.04775. [Google Scholar]
  9. Mehta, S.; Rastegari, M. MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer. arXiv 2022, arXiv:2110.02178. [Google Scholar]
  10. Balaji, Y.; Nah, S.; Huang, X.; Vahdat, A.; Song, J.; Zhang, Q.; Kreis, K.; Aittala, M.; Aila, T.; Laine, S.; et al. eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers. arXiv 2023, arXiv:2211.01324. [Google Scholar]
  11. Zhao, Y.; Xu, Y.; Xiao, Z.; Jia, H.; Hou, T. MobileDiffusion: Instant Text-to-Image Generation on Mobile Devices. arXiv 2024, arXiv:2311.16567. [Google Scholar]
  12. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning Transferable Visual Models From Natural Language Supervision. arXiv 2021, arXiv:2103.00020. [Google Scholar]
  13. Wang, P.; Lu, W.; Lu, C.; Zhou, R.; Li, M.; Qin, L. Large Language Model for Medical Images: A Survey of Taxonomy, Systematic Review, and Future Trends. Big Data Min. Anal. 2025, 8, 496–517. [Google Scholar] [CrossRef]
  14. Sun, Q.; Fang, Y.; Wu, L.; Wang, X.; Cao, Y. EVA-CLIP: Improved Training Techniques for CLIP at Scale. arXiv 2023, arXiv:2303.15389. [Google Scholar]
  15. van den Oord, A.; Vinyals, O.; Kavukcuoglu, K. Neural Discrete Representation Learning. arXiv 2018, arXiv:1711.00937. [Google Scholar]
  16. Crowson, K.; Biderman, S.; Kornis, D.; Stander, D.; Hallahan, E.; Castricato, L.; Raff, E. VQGAN-CLIP: Open Domain Image Generation and Editing with Natural Language Guidance. arXiv 2022, arXiv:2204.08583. [Google Scholar]
  17. Frans, K.; Soros, L.B.; Witkowski, O. CLIPDraw: Exploring Text-to-Drawing Synthesis through Language-Image Encoders. arXiv 2021, arXiv:2106.14843. [Google Scholar]
  18. Patashnik, O.; Wu, Z.; Shechtman, E.; Cohen-Or, D.; Lischinski, D. StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery. arXiv 2021, arXiv:2103.17249. [Google Scholar]
  19. Kingma, D.P.; Welling, M. Auto-Encoding Variational Bayes. arXiv 2022, arXiv:1312.6114. [Google Scholar]
  20. Sohn, K.; Lee, H.; Yan, X. Learning Structured Output Representation using Deep Conditional Generative Models. In Advances in Neural Information Processing Systems; Cortes, C., Lawrence, N., Lee, D., Sugiyama, M., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2015; Volume 28. [Google Scholar]
  21. Sikka, H.; Zhong, W.; Yin, J.; Pehlevan, C. A Closer Look at Disentangling in β-VAE. In Proceedings of the Conference Record of the 2019 Fifty-Third Asilomar Conference On Signals, Systems & Computers, Pacific Grove, CA, USA, 3–6 November 2019; Matthews, M., Ed.; pp. 888–895. [Google Scholar]
  22. Mi, L.; Shen, M.; Zhang, J. A Probe Towards Understanding GAN and VAE Models. arXiv 2018, arXiv:1812.05676. [Google Scholar]
  23. Zhang, H.; Xu, T.; Li, H.; Zhang, S.; Wang, X.; Huang, X.; Metaxas, D. StackGAN: Text to Photo-Realistic Image Synthesis with Stacked Generative Adversarial Networks. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 5908–5916. [Google Scholar] [CrossRef]
  24. Xu, T.; Zhang, P.; Huang, Q.; Zhang, H.; Gan, Z.; Huang, X.; He, X. AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Adversarial Networks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 1316–1324. [Google Scholar] [CrossRef]
  25. Tang, H.; Li, Z.; Zhang, D.; He, S.; Tang, J. Divide-and-Conquer: Confluent Triple-Flow Network for RGB-T Salient Object Detection. IEEE Trans. Pattern Anal. Mach. Intell. 2025, 47, 1958–1974. [Google Scholar] [CrossRef] [PubMed]
  26. Zhao, S.; Song, J.; Ermon, S. InfoVAE: Balancing learning and inference in variational autoencoders. In Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence and Thirty-First Innovative Applications of Artificial Intelligence Conference and Ninth AAAI Symposium on Educational Advances in Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019. [Google Scholar] [CrossRef]
  27. Li, Y.; Turner, R.E. Rényi Divergence Variational Inference. arXiv 2016, arXiv:1602.02311. [Google Scholar]
  28. Nowozin, S.; Cseke, B.; Tomioka, R. f-GAN: Training Generative Neural Samplers using Variational Divergence Minimization. arXiv 2016, arXiv:1606.00709. [Google Scholar]
  29. Tolstikhin, I.; Bousquet, O.; Gelly, S.; Schoelkopf, B. Wasserstein Auto-Encoders. arXiv 2019, arXiv:1711.01558. [Google Scholar]
  30. Liu, Z.; Luo, P.; Wang, X.; Tang, X. Deep Learning Face Attributes in the Wild. arXiv 2015, arXiv:1411.7766. [Google Scholar]
  31. Heusel, M.; Ramsauer, H.; Unterthiner, T.; Nessler, B.; Hochreiter, S. GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium. arXiv 2018, arXiv:1706.08500. [Google Scholar]
  32. Salimans, T.; Goodfellow, I.; Zaremba, W.; Cheung, V.; Radford, A.; Chen, X. Improved Techniques for Training GANs. arXiv 2016, arXiv:1606.03498. [Google Scholar]
  33. Zhang, R.; Isola, P.; Efros, A.A.; Shechtman, E.; Wang, O. The Unreasonable Effectiveness of Deep Features as a Perceptual Metric. arXiv 2018, arXiv:1801.03924. [Google Scholar]
  34. Lin, T.Y.; Maire, M.; Belongie, S.; Bourdev, L.; Girshick, R.; Hays, J.; Perona, P.; Ramanan, D.; Zitnick, C.L.; Dollár, P. Microsoft COCO: Common Objects in Context. arXiv 2015, arXiv:1405.0312. [Google Scholar]
  35. Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. arXiv 2017, arXiv:1412.6980. [Google Scholar]
  36. van der Maaten, L.; Hinton, G. Visualizing Data using t-SNE. J. Mach. Learn. Res. 2008, 9, 2579–2605. [Google Scholar]
  37. Nichol, A.; Dhariwal, P.; Ramesh, A.; Shyam, P.; Mishkin, P.; McGrew, B.; Sutskever, I.; Chen, M. GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models. arXiv 2022, arXiv:2112.10741. [Google Scholar]
  38. Child, R. Very Deep VAEs Generalize Autoregressive Models and Can Outperform Them on Images. arXiv 2021, arXiv:2011.10650. [Google Scholar]
  39. Zhang, H. Evaluation of Natural Image Generation and Reconstruction Capabilities Based on the β-VAE Model. ITM Web Conf. 2025, 70, 03006. [Google Scholar] [CrossRef]
  40. Fan, L.; Tang, L.; Qin, S.; Li, T.; Yang, X.; Qiao, S.; Steiner, A.; Sun, C.; Li, Y.; Zhu, T.; et al. Unified Autoregressive Visual Generation and Understanding with Continuous Tokens. arXiv 2025, arXiv:2503.13436. [Google Scholar]
  41. Bao, F.; Nie, S.; Xue, K.; Cao, Y.; Li, C.; Su, H.; Zhu, J. All are Worth Words: A ViT Backbone for Diffusion Models. arXiv 2023, arXiv:2209.12152. [Google Scholar]
  42. Sauer, A.; Karras, T.; Laine, S.; Geiger, A.; Aila, T. StyleGAN-T: Unlocking the Power of GANs for Fast Large-Scale Text-to-Image Synthesis. arXiv 2023, arXiv:2301.09515. [Google Scholar]
  43. You, X.; Zhang, J. Text-to-Image GAN with Pretrained Representations. arXiv 2024, arXiv:2501.00116. [Google Scholar]
  44. Rampas, D.; Pernias, P.; Aubreville, M. A Novel Sampling Scheme for Text- and Image-Conditional Image Synthesis in Quantized Latent Spaces. arXiv 2023, arXiv:2211.07292. [Google Scholar]
Figure 1. Flowchart of the CVAE Training Phase, illustrated with example dimensionalities for a 64 × 64 input image (e.g., from CelebA). The input image $x$, attribute vector $a$, and ground-truth CLIP image embedding $c_{image}$ are processed and concatenated to form the input to the Encoder ($q_\phi$). The attribute vector $a$ and CLIP image embedding $c_{image}$ are first reshaped and spatially tiled to match the image dimensions before channel-wise concatenation with $x$. The Encoder, a convolutional neural network, then maps this combined input to the parameters ($\mu$, $\sigma$) of the latent Gaussian distribution. A latent vector $z$ is sampled via reparameterization. The Decoder ($p_\theta$), also a convolutional (transposed) network, takes $z$, the original attribute vector $a$, and the original CLIP image embedding $c_{image}$ as concatenated input. It first processes this through an FC layer, reshapes it into a spatial feature map, and then upsamples it to reconstruct the image. The total loss $\mathcal{L}$ comprises a reconstruction term and a Rényi divergence term. Dimensionalities at key stages are indicated ($B$ denotes batch size).
Figure 2. Generation phase architecture, illustrated with example dimensionalities for generating a 64 × 64 image. A text description is first encoded into a CLIP text embedding ($c_{text}$) using a pre-trained CLIP text encoder. This $c_{text}$ is then processed by the trained Mapping Network ($f_\theta$), whose MLP architecture (conceptually illustrated here with its sequential transformations and detailed in Appendix A.2, Table A3) produces a predicted CLIP image embedding ($c_{predicted\_image}$). Concurrently, an attribute vector ($a$) is derived from the text prompt via keyword mapping (Appendix B.1). A latent vector ($z$) is sampled from a standard normal distribution. These three components ($c_{predicted\_image}$, $a$, and $z$) are concatenated and input to the trained CVAE Decoder ($p_\theta$). The Decoder, consistent with its structure in the training phase (Figure 1), processes this concatenated input through an initial FC layer, reshapes it, and then employs a series of transposed convolutional layers to generate the final image $\hat{x}$. This process aligns with the generation procedure described in Section 3.3 and Algorithm A3. Key dimensionalities are indicated ($B$ denotes batch size).
Figure 3. t-SNE visualization (perplexity = 30, samples = 1000) of CLIP embedding spaces from the test set, showing embeddings (a) before and (b) after applying the mapping network. The improved overlap in (b) demonstrates successful alignment.
Figure 4. Distribution of cosine similarities between predicted image embeddings (from text via mapping network) and corresponding real image embeddings on the test set. Mean similarity = 0.8324.
Figure 5. Qualitative comparison demonstrating the mapping network’s effectiveness. Images were generated for the text prompt: “A young woman with bangs and blond hair”. (a) Results from CVAE-Mapping. (b) Results from CVAE-Text. For each corresponding pair of images shown (e.g., top-left in (a) vs. top-left in (b)), identical latent vectors ( z ) and attribute vectors ( a ) were used as input to both models, ensuring a fair comparison of their ability to interpret the textual prompt via their respective conditioning mechanisms. CVAE-Mapping (a) consistently produces images with better semantic correspondence, accurately depicting the specified attributes (e.g., bangs and blond hair), while these are often less distinct or absent in CVAE-Text (b) outputs. Visual markers (e.g., red circles in (a)) highlight accurately rendered attributes, as discussed in Section 4.3.2.
Figure 6. Qualitative examples of images generated by our CVAE-Mapping model on the MS COCO 256 × 256 validation set, based on the provided text prompts, shown for compactness. Text Prompts: (a) A group of sheep in a grassy area with trees in the back ground. (b) Kites fly high in the air over a park. (c) a big clock tower sits in the middle of a road. (d) a baseball player swinging a bat at a ball.
Figure 7. Linear interpolations in the learned latent space between pairs of real images. From top to bottom: Sample Pairs 1 through 5. Each row shows 10 interpolated steps. These examples illustrate the smoothness of transitions and the ability to vary semantic attributes (e.g., hairstyle in Pair 2, background/lighting in Pairs 4 and 5) while largely preserving core facial identity and coherence. The fixed conditions for decoding ( a 1 , c i m a g e , 1 ) were derived from the leftmost image in each sequence.
Table 1. Reconstruction Performance: Ablation Study on the Rényi Divergence Parameter ($\alpha$) evaluated on the validation set. FID/IS compare reconstructions to real validation images. LPIPS compares originals to reconstructions. Best results for each metric are bolded. CVAE-KL corresponds to $\alpha = 1.0$.

| $\alpha$ | FID (↓) | IS (↑) | LPIPS (↓) |
|---|---|---|---|
| −2.0 | 318.9333 | 1.6552 | 0.3872 |
| −1.0 | 340.5644 | 1.7164 | 0.3654 |
| −0.7 | 178.7920 | 1.9458 | 0.3031 |
| −0.5 | 72.4416 | 2.1221 | 0.2216 |
| −0.3 | 125.2464 | 2.0804 | 0.2625 |
| 0.5 | 87.8511 | 2.0293 | 0.2633 |
| 0.7 | 50.2313 | **2.3726** | 0.1725 |
| 0.9 | 140.1787 | 2.0480 | 0.3247 |
| 1.0 | **47.0873** | 2.2405 | **0.1376** |

↑: higher is better, ↓: lower is better.
Table 2. Cosine Similarity between CLIP Embeddings, evaluating mapping network alignment.

| Comparison | Cosine Similarity (↑) | p-Value vs. Baseline |
|---|---|---|
| Baseline: Orig Text vs. Orig Img | 0.65 ± 0.05 | - |
| Mapped: Pred Img vs. Orig Img | 0.81 ± 0.03 | <0.001 |

Mean ± standard deviation reported for cosine similarity over test set pairs. ↑: higher is better.
Table 3. Generation Performance: Effectiveness of the Mapping Network (CVAE-Mapping vs. CVAE-Text). FID/IS compare generated sets to real images. Cosine Sim compares the input text embedding vs. the generated image embedding.

| Model | FID (↓) | IS (↑) | Cosine Sim (Gen) (↑) | p-Value |
|---|---|---|---|---|
| CVAE-Text | 48.32 | 2.3428 | 0.61 | - |
| CVAE-Mapping | 40.53 | 2.43 | 0.86 | <0.001 |

Mean values reported over 5 runs. p-values compare CVAE-Mapping to CVAE-Text for each metric. ↑: higher is better, ↓: lower is better.
Table 4. Comparison with VAE-based and Lightweight T2I Methods on CelebA 64 × 64. Baseline results require verification or specific citation for the exact task configuration. CVAE-KL corresponds to $\alpha = 1.0$ in the divergence term.

| Type | Model | Params (M) ↓ | FLOPs/MACs ↓ | FID ↓ | IS ↑ | FPS ↑ |
|---|---|---|---|---|---|---|
| Diffusion | MobileDiffusion Variant [11] | 58 | 8.9 GFLOPs | 40 | 2.89 | N/A |
| GAN | FastGAN [8] | 30 | N/A | 35.5 | N/A | N/A |
| VAE | VAE-LD [37] | 96 | N/A | 25.8 | N/A | 11 |
|  | VDVAE [38] | 112 | N/A | 23.25 | N/A | 9 |
|  | CVAE-Mapping (Ours) | 42 | 3.2 GMACs | 40.53 | 2.43 | 21 |

Est. MACs: Estimated for generating one 64 × 64 image using the ptflops library. GFLOPs for MobileDiffusion are as reported in [11]. FastGAN FID is for unconditional generation [8], while the other models are text-conditional. Desktop FPS measured on an NVIDIA RTX 4090, batch size 1, for the VAE-based models and Ours. MobileDiffusion FPS is not directly reported in [11] for this specific configuration. Baseline FID/IS scores are sourced from the original papers [8,11,37,38] and our re-implementation/evaluation (details in Appendix E and Section 4.3). Reported values may stem from different settings or datasets in the original works where noted. ↑: higher is better, ↓: lower is better.
Table 5. Comparison with Text-to-Image Methods on MS COCO 256 × 256. Metrics are evaluated on the MS COCO 2014/2017 validation set at 256 × 256 resolution.

| Type | Model | Params (M) ↓ | FID ↓ | CLIP Score ↑ | R-Precision ↑ | FPS ↑ |
|---|---|---|---|---|---|---|
| Autoregressive | Fluid (Gemma-2B) [40] | 2 B | 6.16 | 0.295–0.32 | 0.350 | N/A |
| Diffusion | U-ViT-S/2 [41] | 131 M ᵃ | 5.48 | 0.274–0.295 | N/A | N/A |
| GAN | StyleGAN-T [42] | 120 M | 26.7 | 0.292–0.305 | 0.342 | 10 |
|  | TIGER [43] | 69 M | 21.96 | N/A | 0.33 | Faster |
| Token-based | Paella [44] | 573 M | 26.7 | 0.307 | N/A | 2 |
| VAE/Trans. | VQGAN+CLIP [16] | 10 B | 19.9 | 0.28–0.30 | 0.10–0.12 | N/A |
| CVAE | CVAE-Mapping (Ours) | 54 | 24.26 | 0.275–0.287 | 0.32 | 8 |

FID, CLIP Score, and R-Precision for our model are calculated on generated samples from the MS COCO 2014/2017 validation set with corresponding prompts, compared against the real validation images. Params (M) for our model refers to total trainable parameters. FPS is measured on an NVIDIA RTX 4090 GPU with batch size 1. Baseline data are sourced from the cited literature and may reflect varying evaluation settings. ᵃ Parameter count for the U-ViT-S/2 backbone; the total T2I model includes additional components. ↑: higher is better, ↓: lower is better.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
