Article

Evaluation of StyleGAN-CLIP Models in Text-to-Image Generation of Faces

1 Department of Communications and Computer Engineering, University of Malta, MSD 2080 Msida, Malta
2 Institute of Linguistics and Language Technology, University of Malta, MSD 2080 Msida, Malta
* Authors to whom correspondence should be addressed.
These authors contributed equally to this work.
Appl. Sci. 2025, 15(15), 8692; https://doi.org/10.3390/app15158692
Submission received: 8 July 2025 / Revised: 23 July 2025 / Accepted: 26 July 2025 / Published: 6 August 2025
(This article belongs to the Section Computing and Artificial Intelligence)

Abstract

In this paper, we explore the generation of face images conditioned on a textual description, as well as the capabilities of the models in editing a machine-generated image on the basis of additional text prompts. We leverage open-source, state-of-the-art face image generators (StyleGAN models) and couple these with the open-source multimodal embedding space CLIP in an optimisation loop, following the method in StyleCLIP, to set up our experimental system. We use automatic metrics and human ratings to evaluate the results and, in addition, obtain insight into how well the automatic metrics correlate with human ratings. We found compelling evidence that both the text-to-image and the editing models based on StyleGAN2 stand out as the better options. In addition, the automatic evaluation metrics are only weakly correlated with human ratings.

1. Introduction

Generative models [1,2,3] are machine learning models that aim to generate novel and realistic data on the basis of the probability distribution that describes a given training set. Such models find applications in many fields, such as generating realistic-looking human faces [4], animating artworks [5], composing music [6] and augmenting datasets with synthetic data [7]. However, the usefulness of these models depends on both their efficiency in creating images and the output quality. Two successful model architectures for synthesising images are the Generative Adversarial Network (GAN) [8,9,10] and the diffusion model [11,12]. The former requires fewer computational resources, whilst the latter is known to produce higher-quality images [13,14]. In this study, we limit ourselves to GANs since we are interested in fast computation.
A GAN is made up of two modules that compete against each other to learn to synthesise realistic data. The two modules, the generator and the discriminator, are independent neural networks. The generator synthesises fake data starting from random noise input, while the discriminator, with the help of a dataset of real images, classifies whether the input data comes from the dataset or from the generator. These two modules are trained alternately: the generator is trained to produce outputs that the discriminator categorises as real, while the discriminator is trained to correctly classify the new fake data. After several iterations, the generator is expected to generate data that looks real even to a human, whilst the discriminator is unable to distinguish real from fake data.
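As a minimal illustration of this alternating scheme, the sketch below trains hypothetical `generator` and `discriminator` networks (any pair of `torch.nn.Module`s) with a standard binary cross-entropy objective; it is a generic GAN training loop, not the specific StyleGAN procedure.

```python
import torch
import torch.nn.functional as F

# Minimal alternating GAN training loop (illustrative sketch, not StyleGAN's procedure).
# `generator` and `discriminator` are assumed nn.Modules; `real_loader` yields real images.
def train_gan(generator, discriminator, real_loader, latent_dim=512, epochs=10, device="cpu"):
    opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4, betas=(0.5, 0.999))
    opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4, betas=(0.5, 0.999))
    for _ in range(epochs):
        for real in real_loader:
            real = real.to(device)
            z = torch.randn(real.size(0), latent_dim, device=device)
            fake = generator(z)

            # Discriminator step: push real images towards label 1, fakes towards 0.
            d_real = discriminator(real)
            d_fake = discriminator(fake.detach())
            loss_d = (F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real))
                      + F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake)))
            opt_d.zero_grad()
            loss_d.backward()
            opt_d.step()

            # Generator step: fool the discriminator into labelling fakes as real.
            d_fake = discriminator(fake)
            loss_g = F.binary_cross_entropy_with_logits(d_fake, torch.ones_like(d_fake))
            opt_g.zero_grad()
            loss_g.backward()
            opt_g.step()
```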
Different network topologies have been used in defining the GAN architecture to implement the generator and discriminator modules, including the Deep Convolutional GAN (DCGAN) [15], Conditional GAN (CGAN) [16], InfoGAN [17] and StyleGAN [18], the latter being considered the state-of-the-art GAN for unconditional image generation [19]. In addition to their ability to generate high-quality face images, StyleGAN models provide the opportunity to control the style of the generated photo by altering the style vector inputs. The model was improved with its successors StyleGAN2 [19] and StyleGAN3 [20]. Nevertheless, the default StyleGAN architecture generates images unconditionally from simple latent code features rather than from some controllable input such as a text prompt, which restricts its applications. Since text can conveniently describe what a user requires, it can be used to control the latent space directly and thus achieve text-guided image generation and editing. Recently, a number of works [21,22,23] have explored the use of Contrastive Language Image Pretraining (CLIP), a multimodal embedding space model [24], to manipulate the StyleGAN input vector in order to control the text-to-image output. We build upon these works to set up the experimental system for our study.
In this paper, we compare the output quality of five StyleGAN models combined with CLIP to generate images of faces from textual descriptions. These models are StyleGAN [18], StyleGAN2 [19], StyleGAN2-adaptive (StyleGAN2-ADA) [25], StyleGAN2-Distillation (StyleGAN2-Dist) [26] and StyleGAN3 [20]. In addition, following the works described in [21,27], we study the image editing task. The comparative analysis is carried out on the basis of automatic metrics, as well as human evaluations, and in addition, we obtain some insight into the correlation of automatic evaluation metrics with human ratings.
The rest of this paper is organised as follows. In Section 2 and Section 3, the StyleGAN and CLIP models used in this work and the text-to-image evaluation metrics from the literature are reviewed. In Section 4, the experimental methodology is described, and in Section 5, the models are evaluated and the results are discussed. Section 6 concludes the paper and proposes future work.

2. Review of Generative Models

In this section, we briefly describe the various open-source image-generating StyleGAN models, the multi-modal embedding CLIP model, the text-to-image model and the image editing model used in this study.

2.1. Review of StyleGAN Models

In this section, we briefly review the evolution of the StyleGAN models.
  • StyleGAN: The style-based generator architecture was introduced by Karras et al. in 2019 [18]. The idea was to improve the traditional GAN generator in its ability to generate and interpolate face images with high fidelity and high resolution. The novel GAN model can control both coarse-grained (pose, face shape) and fine-grained (eyes, hair) details. This is accomplished by projecting the noise vector into an intermediate latent space, which is then fed to the synthesis network. In spite of its powerful ability to generate realistic images, StyleGAN outputs suffer from three main issues [19]: (1) water droplet artifacts, (2) overly smooth images with less texture detail, caused by the nature of the cost function, and (3) phase artifacts produced by the adopted progressive growing strategy. StyleGAN was improved in later versions.
  • StyleGAN2: To address the above-mentioned StyleGAN issues, the second variant, StyleGAN2 [19], was proposed in 2020. A set of revisions to the generator normalisation was made to avoid the appearance of water droplet artifacts, including the integration of weight demodulation. Smoother outputs were obtained after adding a new regularisation term to the loss function. To resolve the phase artifact problem in StyleGAN, the progressive growing strategy was revised and replaced with skip connections in the generator and residual layers in the discriminator. Furthermore, the training process was revised by training larger models, which helps to improve the quality of the generated images.
  • StyleGAN2 Adaptive (StyleGAN2-ADA): In 2020, the NVIDIA research team noted that training the model with a small amount of data causes the discriminator to overfit [25]. To overcome this, a novel StyleGAN2 variant, StyleGAN2 Adaptive (StyleGAN2-ADA), was developed. StyleGAN2-ADA integrates an adaptive discriminator augmentation strategy that considerably stabilises training in small data regimes. The new method adopts a wide range of augmentations to prevent the discriminator from overfitting. It has proved its capability to generate good results using only a few thousand training images.
  • StyleGAN2 Distillation (StyleGAN2-Dist): This StyleGAN variant [26] addresses the inference time of StyleGAN2 while preserving image quality. This is achieved by uniting unconditional image generation with paired image-to-image translation to accelerate the StyleGAN2 image generation process. The framework, termed StyleGAN2 Distillation (StyleGAN2-Dist), aims to distil a specific image manipulation in the StyleGAN2 latent code into a single image-to-image translation network. Despite its impressive quality, StyleGAN2 Distillation suffers from a number of limitations. Firstly, the framework’s latent space is not disentangled enough, and the transformations produced by the model are not very natural-looking. Another limitation is the need to distil each transformation into an independent model, whereas a universal model could be trained.
  • StyleGAN3: Despite the different improvements made by StyleGAN2 and its variants, some image details appear to be generated at fixed pixel coordinates, which makes the images look less natural. This lack of translation equivariance in StyleGAN2 outputs reduces its suitability for video and animation generation. In 2021, the third version of the StyleGAN models was developed [20]. The goal of StyleGAN3 is to counter the texture-sticking (aliasing) problem in StyleGAN2. The new generator, equivariant to translation and rotation, helps the model generate images with more natural transition animations and achieve video-quality generative performance. However, although the model maintains image quality, it somewhat trades off spatial consistency and editability, since it no longer fixes textures/features to specific image coordinates.

2.2. Contrastive Language Image Pretraining (CLIP)

The models described in Section 2.1 focus only on the image synthesis task, i.e., they generate random images of human faces. In this section, we review the CLIP model, which is used to control the output of the image generator models. CLIP [24] is a pre-trained multi-modal embedding space developed by OpenAI researchers and is of significant importance in state-of-the-art text-to-image methods. The method is a modified version of ConVIRT [28] that combines a text transformer with a pre-trained CNN to train a model on image–text pairs. CLIP is trained on around 400 million image–text examples collected from the Internet. It aims to form relationships between image–text pairs by learning text–image associations from images extracted from public web pages that have associated text, either in the form of metadata or captions. The method employs a contrastive learning framework, in which the model learns to classify images and their corresponding texts as positive pairs, and images with (randomly selected) portions of text as negative pairs. In this paper, CLIP is used both for generating images from text and for evaluating the output.
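For concreteness, the sketch below computes the cosine similarity between one image and a few candidate captions in the CLIP embedding space. It assumes OpenAI's open-source `clip` package and a local image file `face.png`; both are placeholders for whatever model and assets are actually at hand.

```python
import torch
import clip
from PIL import Image

# Sketch: score how well an image matches candidate captions in CLIP space.
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("face.png")).unsqueeze(0).to(device)
texts = clip.tokenize(["a smiling woman", "an elderly man with a beard"]).to(device)

with torch.no_grad():
    image_emb = model.encode_image(image)
    text_emb = model.encode_text(texts)
    # Normalise so that the dot product equals the cosine similarity.
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    similarity = (image_emb @ text_emb.T).squeeze(0)

print(similarity.tolist())  # one cosine score per caption
```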

2.3. Text to Image Generation Model

This paper explores the combination of CLIP and StyleGAN models. We used the setup in [29], which implemented the method in [21] to join the StyleGAN3 model with CLIP, and we adapted the process to all five StyleGAN models. The process is depicted in Figure 1. The system is fundamentally built around a StyleGAN model pre-trained on the Flickr Faces HQ (FFHQ) dataset [18]. A latent space vector is randomly generated and passed to the StyleGAN model to unconditionally generate an image. This image is then passed through CLIP, which is used to compute its similarity to a text description specified by the user. A step of gradient descent (using an AdamW optimiser) is then applied to adjust the latent space vector in order to increase the score and, thus, generate an image that better fits the description. The new image is sent again to CLIP, and the whole process is repeated a set number of times so that the final image matches the description. Through experimentation, we observed that the models take, on average, 25 iterations to converge.
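A minimal sketch of this optimisation loop follows. It assumes a pre-trained StyleGAN generator `G` that maps a latent vector to an image in [-1, 1] and reuses the `clip` package from the previous sketch; the resizing step, learning rate and the omission of CLIP's own input normalisation are simplifications for illustration, not the exact settings of [29].

```python
import torch
import torch.nn.functional as F
import clip

# Sketch of the CLIP-guided latent optimisation loop (cf. Figure 1).
# G is an assumed pre-trained StyleGAN generator; hyperparameters are illustrative.
def generate_from_text(G, prompt, latent_dim=512, steps=25, lr=0.05, device="cuda"):
    model, _ = clip.load("ViT-B/32", device=device)
    model = model.float()  # keep everything in fp32 for this sketch
    with torch.no_grad():
        text_emb = model.encode_text(clip.tokenize([prompt]).to(device))
        text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

    z = torch.randn(1, latent_dim, device=device, requires_grad=True)
    opt = torch.optim.AdamW([z], lr=lr)

    for _ in range(steps):
        img = G(z)                                            # image in [-1, 1]
        img = F.interpolate(img, size=(224, 224), mode="bilinear")  # CLIP input size
        img_emb = model.encode_image((img + 1) / 2)           # CLIP normalisation omitted
        img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
        loss = 1 - (img_emb * text_emb).sum()                 # maximise cosine similarity
        opt.zero_grad()
        loss.backward()
        opt.step()
    return G(z).detach()
```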

2.4. The Image Editing Model

Further modifying/editing a generated image with StyleGAN can be achieved through manipulation of its latent code. If the code used to generate the image is not known, one way of recovering it is to apply a GAN inversion method [30,31], which maps a given image back into the latent space of a pre-trained GAN model and therefore allows the generator to reconstruct the image. Several works [32,33] investigate how changing the latent space edits an image in a controlled and semantically meaningful manner; this approach is, however, difficult to control. More recently, a number of works have focused on identifying global directions [21,27] in the latent space of StyleGAN that are correlated with desired, semantically interpretable editing attributes, such as smile, pose, hairstyle and age. The goal is to find a global direction in StyleGAN space such that traversing along this direction modifies an arbitrary image in the target attribute. The solution lies in constructing a direction in CLIP space from the text description and finding a corresponding direction in StyleGAN space. A direction in CLIP space can be obtained by applying the image encoder to a pair of before- and after-images, but it can equally be defined by text; hence, a pair of text prompts is used to define the semantic direction. For example, to find the semantic direction of the attribute ‘smile’ in the latent space of CLIP, we can take the direction between the text embedding vectors of ‘face’ and ‘face with smile’. To measure how a given StyleGAN channel relates to this attribute, the coordinate of an image’s latent point is perturbed in the positive and negative directions along that channel. This produces a pair of generated images that are fed into CLIP to measure how the perturbed direction of the chosen StyleGAN channel is correlated with the semantic direction of ‘smile’ in the latent space of CLIP. The process is repeated over a number of images, always using the same channel, to compute an averaged direction in CLIP space. Projecting this averaged direction onto the semantic direction of the attribute ‘smile’ gives a direct measure of how much the chosen StyleGAN channel affects the selected semantic direction in CLIP space. After going through all the StyleGAN channels, those with projection values greater than a certain threshold are selected as related to the given semantic direction in StyleGAN space, and these are used to generate the edited image.
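The following simplified sketch illustrates the per-channel relevance computation described above. The helpers `clip_encode` (an image-to-CLIP-embedding function returning a 1-D vector) and `generate_perturbed(w, channel, delta)` (returning the image obtained after moving one latent channel by `delta`) are hypothetical placeholders; the actual StyleCLIP implementation operates in style space and differs in detail.

```python
import torch

# Score one StyleGAN channel against a semantic direction in CLIP space.
# clip_encode: image -> 1-D CLIP embedding (hypothetical helper).
# generate_perturbed(w, channel, delta): image after perturbing one channel (hypothetical helper).
def channel_relevance(channel, latents, text_direction, clip_encode, generate_perturbed, delta=5.0):
    diffs = []
    for w in latents:
        emb_pos = clip_encode(generate_perturbed(w, channel, +delta))
        emb_neg = clip_encode(generate_perturbed(w, channel, -delta))
        d = emb_pos - emb_neg                       # direction induced in CLIP space
        diffs.append(d / d.norm())
    mean_dir = torch.stack(diffs).mean(dim=0)       # averaged over several images
    # Project onto the text-defined direction, e.g. emb("face with smile") - emb("face").
    return torch.dot(mean_dir, text_direction / text_direction.norm())

# Channels whose relevance exceeds a threshold are combined to form the edit direction.
```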

3. Review of Text-to-Image Evaluation Metrics

Due to the multi-modal nature of mixing textual descriptions with images, measuring the alignment between the generated image and an input description is, in general, a highly challenging task. In this section, we discuss some evaluation metrics from the literature to benchmark text-to-image models.

3.1. Automatic Evaluation Metrics

In this subsection, we explore a number of automatic metrics to measure the text–image alignment.
  • CLIP score: The CLIP model score [23,34,35] measures the compatibility of text–image pairs by computing the cosine similarity between the embedding of the generated images and the embedding of the text prompts. Higher CLIP scores signify higher semantic similarity between the image and the given description. CLIP scores were found to have a high correlation with human judgement [36].
  • AugCLIP score: This is a variant of the standard CLIP score that is more robust against adversarial attacks [37]. AugCLIP introduces random augmentations on the image, such as random colourisation, random translation, random resizing and random cutout. Since the CLIP model is trained on various image formats, the augmentation strategy does not destroy the semantic features encoded by CLIP.
  • R-Precision (RP): RP [38,39,40] is a popular metric that computes the top-R text-to-image retrieval accuracy. The goal of this metric is to retrieve the matching text from 100 candidate texts using the synthesised image as a query and the cosine similarity between the image encoding vectors and the text encoding vectors as a similarity score. An adapted version of the Deep Attentional Multimodal Similarity Model (DAMSM) [39] is used to compute the image–text similarity score for retrieval.
  • CLIP-R-Precision: Park et al. [40] observed that models not specifically tuned on Deep Attentional Multimodal Similarity Models (DAMSMs) tend to perform poorly on the RP metric. To address this issue, the CLIP model [24] is used in place of the standard DAMSM to calculate the R-Precision scores. Compared to DAMSM, CLIP obtains higher image–text retrieval performance.
  • TIFA (Text-to-Image Faithfulness evaluation with question Answering): TIFA [41] is a recent automatic evaluation metric that evaluates the correctness between the text and the image using a visual question answering (VQA) model. In addition to the VQA model, TIFA integrates two other models: a question-generating (QG) model and a question-answering (QA) model. Given a text input, the QG model generates several question–answer pairs, and the QA model filters them. Finally, the VQA model provides answers to the questions using the synthesised image and the answers are checked for correctness. To facilitate comparative studies, a TIFA benchmark is provided. This includes 4k text prompts in addition to 25k questions. TIFA is considered much more accurate than CLIP and has better correlation with human assessment. However, we were not able to use TIFA in our study since a suitable VQA model specifically for face descriptions was not readily available.
  • FID (Fréchet Inception Distance): FID [42] quantifies how similar the generated images are to real ones by comparing the distributions of their feature representations, and as such, it is suitable for capturing how realistic the generated images are. Both real and generated images are passed through a pre-trained Inception-v3 [43] network, and the resulting feature vectors are assumed to follow multivariate Gaussians. FID computes the Fréchet (or Wasserstein-2) distance between these two Gaussians, so a lower score is better (a small numerical sketch of this distance is given after this list).
  • CMMD (CLIP Maximum Mean Discrepancy): CMMD [44] is similar in concept to FID; however, CMMD does not assume the image feature vectors to be multivariate Gaussians and instead uses the Maximum Mean Discrepancy (MMD) between CLIP embeddings rather than Inception-v3 feature vectors. The authors argue that CLIP embeddings should perform better, since they are trained on 400 million image–text pairs instead of the 1M images used to train the Inception-v3 model. It is shown that CMMD correlates better with human evaluations when used to predict image realism.
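As referenced in the FID entry above, the sketch below computes the Fréchet distance between two Gaussians fitted to feature matrices (rows are feature vectors); in practice these would be Inception-v3 activations of real and generated images. The function name and interface are illustrative.

```python
import numpy as np
from scipy import linalg

# Fréchet distance between Gaussians fitted to two feature matrices:
# ||mu1 - mu2||^2 + Tr(C1 + C2 - 2*(C1*C2)^(1/2)).
def frechet_distance(feats_real, feats_gen):
    mu1, mu2 = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    c1 = np.cov(feats_real, rowvar=False)
    c2 = np.cov(feats_gen, rowvar=False)
    covmean = linalg.sqrtm(c1 @ c2)
    if np.iscomplexobj(covmean):       # tiny imaginary parts can appear numerically
        covmean = covmean.real
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(c1 + c2 - 2.0 * covmean))
```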

3.2. Human Evaluation Studies

The best method for evaluating a generative model that is meant to generate data that looks good to humans is a human ratings study; automatic metrics can only approximate this. Recently, researchers have explored different methods of carrying out human evaluations. One example [45,46] is to ask evaluators to rate the degree to which the generated images match the text descriptions and the overall quality of the generated images. Other works adopt rating methods [46,47] such as categorical choices, yes/no questions and Likert scales to report human evaluation. An evaluation can be either comparative, where evaluators are asked to rank images in a list according to their relative quality, or absolute, where evaluators are asked to give a score to each image in isolation.

4. Evaluation Methodology

This section describes the evaluation methodology adopted in this study. We discuss how we selected the data generated, the human ratings process and the automatic evaluation metrics used in the study.

4.1. Data Generation and Organisation

In this study, eight StyleGAN-based models were evaluated on the text-to-image generation or image editing tasks. We generated a total of 1842 human face images from 141 different textual descriptions to evaluate the different models. The descriptions were taken from the Face2Text 2.0 dataset [48] and consist of 69 single-sentence descriptions and 72 two-sentence descriptions. To ensure a wide variety of text lengths, the descriptions were sampled in a stratified way according to the number of tokens they contain (based on the CLIP tokeniser).
The generated images were split into two groups: The first consists of 1410 face images synthesised from the five StyleGAN-based generation models: StyleGAN [18], StyleGAN2 [19], StyleGAN2-Adaptive [25], StyleGAN2-Distillation [26] and StyleGAN3 [20]. In this group, we used all 141 descriptions, and each description was used to generate two different images. The second group consists of 432 edited images, produced by three StyleGAN-based models for face image editing: StyleGAN2-editing [21], StyleGAN3-editing [27] and the StyleGAN2-CLIPInverter model [23]. The 72 two-sentence descriptions were used to generate the images in this group, such that the first sentence was used to create an image and the second sentence was used to edit the synthesised image. As in group one, each description was used to produce two different edited images.
The generated images were organised into 47 batches, ready for evaluation by 47 unique human raters. Each batch consisted of 39 images, 30 of which are synthesised images, whilst the remaining 9 are edited images. A sample of generated images can be found in Table 1. To strike a balance between coverage and cost of evaluation, we opted for a larger number of batches, rather than a smaller number of repeated batches, and inserted an inter-annotator subset. The batches were organised using a Latin square design, such that each evaluator saw the same number of images generated by each model from unique descriptions. In addition, we added 13 control images to each batch, and these are the same across all batches.
Eight of the control images were used for quality control purposes. As explained further in the next section, the human evaluation consists of asking how realistic the image looks and how well it matches the description. Therefore, to test the quality of the evaluators’ work, we included some vector graphic images of faces which are unambiguously unrealistic and some real photos of faces, which should be considered realistic, and these images are either paired with an unambiguously wrong description or an unambiguously correct description. We then checked that these are evaluated as expected and rejected the batch of evaluations if they failed the test. The fake images were all taken from the free stock photo website Pixabay (https://pixabay.com/), which provides images with a permissive license, while the real face photos were taken from the FFHQ dataset [18]. This collection of 8 images with the descriptions used can be seen in Table 2.
The remaining 5 control images form the inter-annotator subset. These images were created from five different descriptions and five different models: StyleGAN, StyleGAN2, StyleGAN2-Adaptive, StyleGAN2-Distillation and StyleGAN3. The aim of the inter-annotator agreement images is to assess the agreement among annotators and measure the extent to which the annotation process is consistent. These images can be seen in Table 3.
In total, each batch consisted of 52 images. All the batches were evaluated using both human ratings and automatic metrics.

4.2. Human Evaluation Process

To evaluate the different image batches, the Amazon Mechanical Turk (MTurk) (https://www.mturk.com/) crowdsourcing marketplace was used. The images of each batch and their corresponding text descriptions were grouped and sent in the same Human Intelligence Task (HIT). This ensures that the content of each batch is evaluated by the same annotator. In our study, the 47 batches were assessed by 47 different annotators. Each human evaluator was asked to answer three questions for each of the 52 images within one batch:
  • Question 1: To what extent does the image match the description?
  • Question 2: To what extent does the image look like a real photo?
  • Question 3: How satisfied are you with this image, considering both how much it matches the description and the quality of the image?
For the three questions, the evaluator gives a score between 0% and 100% via a slider. The quality control images (Table 2) were used to decide whether an evaluator’s batch results would be accepted or rejected (if rejected, the batch was discarded and sent to a different potential evaluator). Intentionally correct descriptions are expected to have a score of over 50% for Question 1, and intentionally realistic photos are expected to have a score of over 50% for Question 2; the score must be less than 50% otherwise. The score given for Question 3 is ignored for quality control images. If at least one quality control image has an unexpected score, the whole batch is rejected (the acceptance rule is sketched below). Note that even with this lenient 50% threshold, 53% of batches were rejected.
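A small sketch of this acceptance rule follows. Each control item is assumed to carry its expected flags and the rater's 0–100 scores; the field names are hypothetical.

```python
# Accept or reject a rater's batch based on the eight quality-control images.
# Each item records whether the description is expected to be correct, whether the
# photo is expected to be realistic, and the rater's scores for Q1 and Q2 (0-100).
def accept_batch(control_items):
    for item in control_items:
        match_ok = item["q1"] > 50 if item["expect_correct"] else item["q1"] < 50
        real_ok = item["q2"] > 50 if item["expect_realistic"] else item["q2"] < 50
        if not (match_ok and real_ok):
            return False   # a single failed control rejects the whole batch
    return True
```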

4.3. Automatic Evaluation Process

In addition to human measures, the different images were evaluated using automatic metrics. In this study, we used three automatic measures: the CLIP score, the Fréchet Inception Distance (FID) and the CLIP Maximum Mean Discrepancy (CMMD) distance (Section 3.1). The CLIP score is designed to compute the alignment between the generated image and the given textual description, while FID and CMMD are two measures developed to assess the quality and realism of images generated by computational models. They compute the distance between the embeddings of the reference images and those of the generated ones. FID uses the Inception-v3 [43] model to extract feature representations, while CMMD uses CLIP embeddings, which are considered more suitable for capturing complex image content. We used existing code libraries to compute the FID (https://github.com/Nvlabs/stylegan3, accessed on 21 March 2024) and CMMD (https://github.com/google-research/google-research/tree/master/cmmd, accessed on 22 March 2024) scores. Both image realism scores were computed for each image using 10,000 real samples selected from the Flickr Faces HQ (FFHQ) dataset [18] (https://github.com/NVlabs/ffhq-dataset, accessed on 21 March 2024).
Finally, the correlation between human and automatic metrics is studied.

5. Results and Discussion

In this section, we summarise and discuss the results. We compute inter-rater agreement scores to inform our confidence in human ratings, and we study the correlation between human ratings and automatic metrics on the various models. In addition, we compare editing and non-editing models in terms of the final image, and we compare models on the basis of computational time.
Inter-rater Agreement: Table 4 gives the per-question, per-model and overall mean score, standard deviation and intraclass correlation (ICC) coefficients. The overall ICC for Q1 is 0.78, indicating good inter-rater agreement; the ICC for Q2 is 0.63, i.e., moderate inter-rater agreement; whilst the ICC for Q3 is 0.55, moderate (fair) inter-rater agreement. Overall, the inter-rater agreement interpretation is moderate to good. The raters agreed mostly on the first question, i.e., the extent to which the image matches the description, whilst the second and third questions are probably more subjective.
Correlation between human ratings and metrics: Figure 2 shows the linear relationships between the human ratings and the automatic metrics (CLIP (r = 0.065, p = 0.005), FID (r = −0.005, p = 0.822) and CMMD (r = −0.264, p = 0.0)). All values indicate no or very weak linear relationships between the variables. The r values tend to be positive for Q1–CLIP but negative for both Q2–CMMD and Q2–FID, as expected. In addition, CMMD is a better predictor of the human ratings for Q2 than FID. However, it is clear that automatic metrics require further study and development, and we therefore consider human ratings to be more reliable.
Inferring the Q3 rating from the Q1 and Q2 human ratings: Automatic metrics usually only address Q1 or Q2, but we also want to know the answer to Q3. We were therefore interested in knowing whether the overarching ‘Q3 score’ can be predicted from the simpler scores (Q1 and Q2). To predict the ‘Q3 score’, we considered a weighted average of the Q1 and Q2 scores (given by human raters), computed as Q̂3 = λ × Q1 + (1 − λ) × Q2, and maximised the correlation coefficient (Q3 vs. Q̂3) via a line search, which yielded λ = 0.65. Figure 3 depicts the correlation between the ‘Q3 score’ and the ‘Q1 score’ and ‘Q2 score’ separately, as well as the ‘combination of Q1 and Q2 scores’. The correlation coefficient for the latter is (r = 0.91, p < 0.0001) and, therefore, we conclude that the ‘Q3 score’ can be predicted by taking a weighted average, given good automatic metrics for Q1 and Q2. The weighting coefficient λ is not 0.5 (which would give Q1 and Q2 equal weighting), probably because Q1 correlates with Q3 more than Q2 does. This can be interpreted as humans valuing Q1 more than Q2 overall.
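A sketch of this line search is shown below: it scans λ over [0, 1] and keeps the value that maximises the Pearson correlation between the observed Q3 ratings and the weighted average of Q1 and Q2. The rating arrays are assumed to be aligned per rated item.

```python
import numpy as np
from scipy.stats import pearsonr

# Line search for the weight lambda in q3_hat = lambda*q1 + (1 - lambda)*q2,
# choosing the value that maximises correlation with the observed Q3 ratings.
def best_lambda(q1, q2, q3, steps=101):
    best_lam, best_r = 0.0, -np.inf
    for lam in np.linspace(0.0, 1.0, steps):
        r, _ = pearsonr(lam * q1 + (1 - lam) * q2, q3)
        if r > best_r:
            best_lam, best_r = lam, r
    return best_lam, best_r

# With the ratings collected in this study, the search yielded lambda ≈ 0.65.
```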
Evaluation of the different models on the basis of human ratings: Figure 4 shows the distributions per model of the human ratings of images generated by all five image generation models. Statistical analysis (post hoc HSD-Tukey test) identifies StyleGAN2 as the best model, with mean aggregate ratings significantly higher than all other models; it is followed by StyleGAN3 and StyleGAN2-Dist. The difference between StyleGAN2 and StyleGAN3 stems mainly from Q2 (picture realism), since both models perform equally on Q1 (alignment of image to text). Interestingly, StyleGAN2-ADA generates the most realistic images (statistically different from all other models). However, the same model fails on Q1 (alignment with text). This failure can be attributed to the StyleGAN2-ADA training process, which uses Adaptive Discriminator Augmentation (ADA) and prioritises image realism over latent space structure. ADA introduces heavy image augmentations to the discriminator to stabilise training on limited data, which distorts (or modifies) the feedback the generator receives. As a result, the generator learns to produce photorealistic images but develops a less disentangled and less semantically consistent latent space. Since the text-to-image method in use benefits from a clean and interpretable latent space to align CLIP text embeddings with meaningful visual features, the warped latent structure of StyleGAN2-ADA does not help the optimisation method converge to an acceptable text–image alignment. In contrast, StyleGAN2 and StyleGAN3 preserve better latent geometry, enabling more faithful text-to-image alignment when gradient descent is used to search for a latent vector that produces an image matching the description. Furthermore, StyleGAN1 performs better than StyleGAN2-ADA on Q1, although it performs extremely badly on Q2. In addition, the Q3 ratings alone align with the aggregated ratings, perhaps with the exception of StyleGAN2-ADA; the HSD-Tukey tests yield the same cluster groupings for both Q3 and the aggregated scores. Finally, we conclude that, overall, according to human judgements, StyleGAN2 is the best model for generating still images.
Evaluation of the editing models on the basis of human ratings: Figure 5 shows the distributions per model of the human ratings of images generated by the StyleGAN2 and StyleGAN3 models in a single step (i.e., with the two sentences concatenated at the input) and in two steps, i.e., first generating an image from the first sentence using the StyleGAN2/3 models and then editing the image using the respective editing models. From the plots, the StyleGAN2-edit model has the highest mean, followed by the StyleGAN2-CLIPInverter and StyleGAN3-edit models. A post hoc HSD-Tukey test reveals that the ratings for StyleGAN3-edit are significantly lower than those for both StyleGAN2 editing models. However, there is no significant difference between the StyleGAN2-edit and StyleGAN2-CLIPInverter models. We also note that these results follow the same trends observed in the analysis of the single-step models.
Comparison of editing and non-editing models: We are interested in knowing whether the editing models perform better than the single-step generation models on the basis of human evaluation. With reference to Figure 5, the StyleGAN2-edit model’s mean rating is the highest. However, a post hoc HSD-Tukey test does not support the hypothesis that the editing model performed any better than the single-step generation models. In addition, the StyleGAN3-edit model yielded a mean rating that is significantly lower than the mean ratings of all the other models.
Together, these findings indicate that two-step approaches involving editing models do not offer a consistent advantage over single-step image generation methods. In addition, the use of the StyleGAN3-edit model significantly degrades output quality, whilst the outputs of StyleGAN2 exhibit greater robustness and consistency under editing transformations. This is not surprising given that the StyleGAN3 architecture does not allow for fixing textures/features at specific image coordinates.
Computational time: We were also interested in the computational time required by each model. Figure 6 is a scatter plot of the average human ratings (‘Q3 score’) for each model against the mean time required to generate an image. The higher-quality images generated by the StyleGAN2 and StyleGAN3 models require more than four times the computational effort of StyleGAN2-adaptive and StyleGAN2-distilled. The latter, in particular, trades off some quality to gain in computational time. Figure 7 is a plot of the mean computational time per model, where we notice that the editing models for StyleGAN3, and even more so for StyleGAN2, require only a fraction of the generation time taken by the respective generation models. This observation underscores the potential value of further research and development in editing model methodologies, since once the initial image is generated, subsequent edits take much less time.

6. Conclusions

In this paper, we assessed the text-to-image synthesis ability of StyleGAN-CLIP models using both automatic metrics and human ratings. In addition, we studied the correlation between these two evaluation methods. StyleGAN2 was identified as the best model, followed by StyleGAN3; the StyleGAN2-based model performed better in terms of picture realism. Likewise, the StyleGAN2-based editing models performed better than the StyleGAN3 editing model. From a text–image alignment perspective, we did not find any advantage in using the editing models over the single-step generation models. However, further development of editing models is encouraged, since they offer a cost advantage. Furthermore, humans agree most on whether the description matches the image, whilst rating image quality seems to be a more subjective task. Moreover, the human ratings on description matching indicate that there are face features that the models fail to adjust. Finally, it is clear that the automatic metrics are inconsistent with human perception, and it is therefore hard to compare text-to-image generation results objectively on the basis of automatic metrics. We plan to carry out further work to evaluate the TIFA metric and, subsequently, further research aimed at improving automatic evaluation methods.

Author Contributions

Conceptualization, A.A. and M.T.; data curation, A.F.; investigation, M.T. and A.M.; methodology, M.T. and A.M.; project administration, A.M.; resources, A.F. and A.A.; software, A.F. and A.A.; supervision, M.T. and A.M.; validation, A.F., A.A. and M.T.; writing—original draft, A.A., A.F.; writing—review and editing, M.T. and A.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Xjenza Malta (formerly Malta Council for Science and Technology) FUSION: Technology Development Programme, grant numbers R&I-2019-004-T and RNS-2024-004.

Institutional Review Board Statement

The study was conducted in accordance with the Declaration of Helsinki and approved by the Faculty of ICT Research Ethics Committee of the University of Malta (protocol code ICT-2024-00182 and date of approval: 18 April 2024).

Informed Consent Statement

This study involved anonymous human evaluations conducted via crowd-sourcing to assess the output of text-to-image models. No personal or demographic data was collected. Participation was voluntary, while consent was implied through task acceptance on the platform, and participants were compensated fairly.

Data Availability Statement

Dataset available on request from the authors.

Acknowledgments

We would like to thank Paul Marty for helping us with the analysis of results.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
ADA  Adaptive Discriminator Augmentation
CGAN  Conditional Generative Adversarial Network
CLIP  Contrastive Language Image Pretraining
CMMD  CLIP Maximum Mean Discrepancy
DCGAN  Deep Convolutional Generative Adversarial Network
FID  Fréchet Inception Distance
FFHQ  Flickr Faces High-Quality dataset
GAN  Generative Adversarial Network
HSD-Tukey  Honestly Significant Difference-Tukey test
ICC  Intra-Class Correlation
TIFA  Text-to-Image Faithfulness evaluation with question Answering
RP  R-Precision

References

  1. Harshvardhan, G.M.; Gourisaria, M.K.; Pandey, M.; Rautaray, S.S. A comprehensive survey and analysis of generative models in machine learning. Comput. Sci. Rev. 2020, 38, 100285. [Google Scholar] [CrossRef]
  2. Bond-Taylor, S.; Leach, A.; Long, Y.; Willcocks, C.G. Deep Generative Modelling: A Comparative Review of VAEs, GANs, Normalizing Flows, Energy-Based and Autoregressive Models. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 7327–7347. [Google Scholar] [CrossRef] [PubMed]
  3. Gui, J.; Sun, Z.; Wen, Y.; Tao, D.; Ye, J. A Review on Generative Adversarial Networks: Algorithms, Theory, and Applications. IEEE Trans. Knowl. Data Eng. 2023, 35, 3313–3332. [Google Scholar] [CrossRef]
  4. Nekamiche, N.; Zakaria, C.; Bouchareb, S.; Smaïli, K. A Deep Convolution Generative Adversarial Network for the Production of Images of Human Faces. In Intelligent Information and Database Systems, Proceedings of the 14th Asian Conference, ACIIDS 2022, Ho Chi Minh City, Vietnam, 28–30 November 2022; Nguyen, N.T., Tran, T.K., Tukayev, U., Hong, T.P., Trawiński, B., Szczerbicki, E., Eds.; Springer: Cham, Switzerland, 2022; pp. 313–326. [Google Scholar]
  5. Han, X.; Wu, Y.; Wan, R. A Method for Style Transfer from Artistic Images Based on Depth Extraction Generative Adversarial Network. Appl. Sci. 2023, 13, 867. [Google Scholar] [CrossRef]
  6. Chen, J.; Fan, C.; Zhang, Z.; Li, G.; Zhao, Z.; Deng, Z.; Ding, Y. A Music-Driven Deep Generative Adversarial Model for Guzheng Playing Animation. IEEE Trans. Vis. Comput. Graph. 2023, 29, 1400–1414. [Google Scholar] [CrossRef] [PubMed]
  7. Motamed, S.; Rogalla, P.; Khalvati, F. Data augmentation using Generative Adversarial Networks (GANs) for GAN-based detection of Pneumonia and COVID-19 in chest X-ray images. Inform. Med. Unlocked 2021, 27, 100779. [Google Scholar] [CrossRef] [PubMed]
  8. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial nets. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 8–13 December 2014; pp. 2672–2680. [Google Scholar]
  9. Navidan, H.; Moshiri, P.F.; Nabati, M.; Shahbazian, R.; Ghorashi, S.A.; Shah-Mansouri, V.; Windridge, D. Generative Adversarial Networks (GANs) in networking: A comprehensive survey and evaluation. Comput. Netw. 2021, 194, 108149. [Google Scholar] [CrossRef]
  10. Creswell, A.; White, T.; Dumoulin, V.; Arulkumaran, K.; Sengupta, B.; Bharath, A.A. Generative Adversarial Networks: An Overview. IEEE Signal Process. Mag. 2018, 35, 53–65. [Google Scholar] [CrossRef]
  11. Kim, M.; Liu, F.; Jain, A.; Liu, X. DCFace: Synthetic Face Generation with Dual Condition Diffusion Model. arXiv 2023, arXiv:2304.07060. [Google Scholar] [CrossRef]
  12. Peng, Y.; Zhao, C.; Xie, H.; Fukusato, T.; Miyata, K. DiffFaceSketch: High-Fidelity Face Image Synthesis with Sketch-Guided Latent Diffusion Model. arXiv 2023, arXiv:2302.06908. [Google Scholar]
  13. Szeliga, A. A Comparative Study of Deep Generative Models for Image Generation. Master’s Thesis, Hochschule Hannover, Hannover, Germany, 2023. [Google Scholar] [CrossRef]
  14. Wang, H. Comparative Analysis of GANs and Diffusion Models in Image Generation. Highlights Sci. Eng. Technol. 2024, 120, 59–66. [Google Scholar] [CrossRef]
  15. Radford, A.; Metz, L.; Chintala, S. Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. arXiv 2015, arXiv:1511.06434. [Google Scholar]
  16. Ehrhart, M.; Resch, B.; Havas, C.; Niederseer, D. A Conditional GAN for Generating Time Series Data for Stress Detection in Wearable Physiological Sensor Data. Sensors 2022, 22, 5969. [Google Scholar] [CrossRef] [PubMed]
  17. Chen, X.; Duan, Y.; Houthooft, R.; Schulman, J.; Sutskever, I.; Abbeel, P. InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets. In Proceedings of the NIPS’16: 30th International Conference on Neural Information Processing Systems, Barcelona, Spain, 5–10 December 2016; Curran Associates Inc.: Red Hook, NY, USA, 2016; pp. 2180–2188. [Google Scholar]
  18. Karras, T.; Laine, S.; Aila, T. A Style-Based Generator Architecture for Generative Adversarial Networks. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 4396–4405. [Google Scholar] [CrossRef]
  19. Karras, T.; Laine, S.; Aittala, M.; Hellsten, J.; Lehtinen, J.; Aila, T. Analyzing and Improving the Image Quality of StyleGAN. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 8107–8116. [Google Scholar] [CrossRef]
  20. Karras, T.; Aittala, M.; Laine, S.; Härkönen, E.; Hellsten, J.; Lehtinen, J.; Aila, T. Alias-Free Generative Adversarial Networks. In Proceedings of the Neural Information Processing Systems, Online, 6–14 December 2021. [Google Scholar]
  21. Patashnik, O.; Wu, Z.; Shechtman, E.; Cohen-Or, D.; Lischinski, D. StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 2085–2094. [Google Scholar]
  22. Sauer, A.; Karras, T.; Laine, S.; Geiger, A.; Aila, T. StyleGAN-T: Unlocking the Power of GANs for Fast Large-Scale Text-to-Image Synthesis. arXiv 2023, arXiv:2301.09515. [Google Scholar]
  23. Baykal, A.C.; Anees, A.B.; Ceylan, D.; Erdem, E.; Erdem, A.; Yuret, D. CLIP-Guided StyleGAN Inversion for Text-Driven Real Image Editing. ACM Trans. Graph. 2023, 42, 172. [Google Scholar] [CrossRef]
  24. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning Transferable Visual Models From Natural Language Supervision. In Proceedings of the International Conference on Machine Learning, Online, 18–24 July 2021. [Google Scholar]
  25. Karras, T.; Aittala, M.; Hellsten, J.; Laine, S.; Lehtinen, J.; Aila, T. Training Generative Adversarial Networks with Limited Data. In Proceedings of the NIPS’20: 34th International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 6–12 December 2020; Curran Associates Inc.: Red Hook, NY, USA, 2020. [Google Scholar]
  26. Viazovetskyi, Y.; Ivashkin, V.; Kashin, E. StyleGAN2 Distillation for Feed-Forward Image Manipulation. In Proceedings of the ECCV 2020, Glasgow, UK, 23–28 August 2020. [Google Scholar]
  27. Alaluf, Y.; Patashnik, O.; Wu, Z.; Zamir, A.; Shechtman, E.; Lischinski, D.; Cohen-Or, D. Third Time’s the Charm? Image and Video Editing with StyleGAN3. In Proceedings of the Computer Vision—ECCV 2022 Workshops, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2023; pp. 204–220. [Google Scholar] [CrossRef]
  28. Zhang, Y.; Jiang, H.; Miura, Y.; Manning, C.D.; Langlotz, C. Contrastive Learning of Medical Visual Representations from Paired Images and Text. In Proceedings of the Machine Learning in Health Care 2020, Online, 7–8 August 2020. [Google Scholar]
  29. Herrera-Berg, E. StyleGAN3-CLIP-Notebooks. 2022. Available online: https://github.com/ouhenio/StyleGAN3-CLIP-notebooks (accessed on 21 March 2024).
  30. Xia, W.; Zhang, Y.; Yang, Y.; Xue, J.H.; Zhou, B.; Yang, M.H. GAN Inversion: A Survey. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 3121–3138. [Google Scholar] [CrossRef]
  31. Zhu, J.; Shen, Y.; Zhao, D.; Zhou, B. In-Domain GAN Inversion for Real Image Editing. In Computer Vision—ECCV 2020, Proceedings of the 16th European Conference, Glasgow, UK, 23–28 August 2020; Vedaldi, A., Bischof, H., Brox, T., Frahm, J.M., Eds.; Springer: Cham, Switzerland, 2020; pp. 592–608. [Google Scholar]
  32. Shen, Y.; Gu, J.; Tang, X.; Zhou, B. Interpreting the Latent Space of GANs for Semantic Face Editing. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 9240–9249. [Google Scholar] [CrossRef]
  33. Collins, E.; Bala, R.; Price, B.; Süsstrunk, S. Editing in Style: Uncovering the Local Semantics of GANs. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 5770–5779. [Google Scholar] [CrossRef]
  34. Kang, M.; Zhu, J.Y.; Zhang, R.; Park, J.; Shechtman, E.; Paris, S.; Park, T. Scaling up GANs for Text-to-Image Synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023. [Google Scholar]
  35. Pinkney, J.N.M.; Li, C. clip2latent: Text driven sampling of a pre-trained StyleGAN using denoising diffusion and CLIP. In Proceedings of the British Machine Vision Conference 2022, London, UK, 21–24 November 2022. [Google Scholar]
  36. Hessel, J.; Holtzman, A.; Forbes, M.; Le Bras, R.; Choi, Y. CLIPScore: A Reference-free Evaluation Metric for Image Captioning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Punta Cana, Dominican Republic, 7–11 November 2021; Moens, M.F., Huang, X., Specia, L., Yih, S.W.t., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 7514–7528. [Google Scholar] [CrossRef]
  37. Liu, X.; Gong, C.; Wu, L.; Zhang, S.; Su, H.; Liu, Q. FuseDream: Training-Free Text-to-Image Generation with Improved CLIP+GAN Space Optimization. arXiv 2021, arXiv:2112.01573. [Google Scholar]
  38. Dinh, T.M.; Nguyen, R.; Hua, B.S. TISE: Bag of Metrics for Text-to-Image Synthesis Evaluation. In Computer Vision—ECCV 2022, Proceedings of the 17th European Conference, Tel Aviv, Israel, 23–27 October 2022; Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T., Eds.; Springer: Cham, Switzerland, 2022; pp. 594–609. [Google Scholar]
  39. Xu, T.; Zhang, P.; Huang, Q.; Zhang, H.; Gan, Z.; Huang, X.; He, X. AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Adversarial Networks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 1316–1324. [Google Scholar] [CrossRef]
  40. Park, D.H.; Azadi, S.; Liu, X.; Darrell, T.; Rohrbach, A. Benchmark for Compositional Text-to-Image Synthesis. In Proceedings of the NeurIPS Datasets and Benchmarks 2021, Online, 6–14 December 2021. [Google Scholar]
  41. Hu, Y.; Liu, B.; Kasai, J.; Wang, Y.; Ostendorf, M.; Krishna, R.; Smith, N.A. TIFA: Accurate and Interpretable Text-to-Image Faithfulness Evaluation with Question Answering. arXiv 2023, arXiv:2303.11897. [Google Scholar]
  42. Heusel, M.; Ramsauer, H.; Unterthiner, T.; Nessler, B.; Hochreiter, S. GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium. In Proceedings of the Advances in Neural Information Processing Systems; Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2017; Volume 30. [Google Scholar]
  43. Xia, X.; Xu, C.; Nan, B. Inception-v3 for flower classification. In Proceedings of the 2017 2nd International Conference on Image, Vision and Computing (ICIVC), Chengdu, China, 2–4 June 2017; pp. 783–787. [Google Scholar] [CrossRef]
  44. Jayasumana, S.; Ramalingam, S.; Veit, A.; Glasner, D.; Chakrabarti, A.; Kumar, S. Rethinking FID: Towards a Better Evaluation Metric for Image Generation. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 9307–9315. [Google Scholar] [CrossRef]
  45. Wang, Y.; Zhou, W.; Bao, J.; Wang, W.; Li, L.; Li, H. CLIP2GAN: Towards Bridging Text with the Latent Space of GANs. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 6847–6859. [Google Scholar] [CrossRef]
  46. Petsiuk, V. Human Evaluation of Text-to-Image Models on a Multi-Task Benchmark. arXiv 2022, arXiv:2211.12112. [Google Scholar]
  47. Otani, M.; Togashi, R.; Sawai, Y.; Ishigami, R.; Nakashima, Y.; Rahtu, E.; Heikkila, J.; Satoh, S. Toward Verifiable and Reproducible Human Evaluation for Text-to-Image Generation. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 14277–14286. [Google Scholar] [CrossRef]
  48. Tanti, M.; Abdilla, S.; Muscat, A.; Borg, C.; Farrugia, R.A.; Gatt, A. Face2Text revisited: Improved data set and baseline results. In Proceedings of the 2nd Workshop on People in Vision, Language, and the Mind, Marseille, France, 20–25 June 2022; Paggio, P., Gatt, A., Tanti, M., Eds.; European Language Resources Association: Paris, France, 2022; pp. 41–47. [Google Scholar]
Figure 1. Text-to-image generation procedure, based on the algorithm in StyleCLIP [21]. The procedure is adapted for all five generative models.
Figure 2. Scatter plots and linear relationships: (a) between human rating for Q1 and CLIP score (r = 0.065, p = 0.005), (b,c) between human rating for Q2 and FID (r = −0.005, p = 0.822) and CMMD (r = −0.264, p = 0.0) scores, respectively. Each point represents one of the description–picture items rated in the human evaluation survey. The line is the line of best fit for the relationship between items along both variables of interest.
Figure 3. Scatter plots and linear relationships: (a) between human ratings for Q3 and Q1 (r = 0.79, p < 0.0001), (b) between human ratings for Q3 and Q2 (r = −0.59, p < 0.0001) and (c) between human ratings for Q3 and Q1 + Q2 (r = 0.91, p < 0.0001). Each point represents one of the description–picture items rated in the human evaluation survey. The line is the line of best fit for the relationship between items along both variables of interest.
Figure 4. Violin and Box plots depicting the distributions per model and per question for human ratings of images generated by all five image generation models. The mean is represented by the × symbol, while the green line represents the median.
Figure 5. Violin and box plots depicting the distributions per model and per question for human ratings of images generated by StyleGAN2 and StyleGAN3 models in single-step generation and in two steps using the respective editing models. The mean is represented by the × symbol, whilst the yellow line is the median.
Figure 6. Average human ratings for each model against the mean time for generating the image. For the editing models, the time taken is equal to the time taken to initially generate an image using the respective generation model added to the time taken by the editing model.
Figure 7. Mean computational times per model. For the editing models, the time taken is equal to the time taken to initially generate an image using the respective generation model added to the time taken by the editing model.
Table 1. Example images that were generated by different models using a description. Each model is used twice on the same description to generate two different images. The descriptions are: (1) “A young woman with a strong jaw and an otherwise round face. She has long, straight, brunette hair, full lips and brown eyes.” (2) “A tanned young boy with a long, slightly wavy bowl haircut. He has a wide nose, full cheeks and a toothy smile.” Edited models are used by first generating an image using the first sentence and then editing the image using the second sentence.
Model                     Description 1              Description 2
StyleGAN1                 Applsci 15 08692 i001      Applsci 15 08692 i002
StyleGAN2                 Applsci 15 08692 i003      Applsci 15 08692 i004
StyleGAN2 adaptive        Applsci 15 08692 i005      Applsci 15 08692 i006
StyleGAN2 distilled       Applsci 15 08692 i007      Applsci 15 08692 i008
StyleGAN3                 Applsci 15 08692 i009      Applsci 15 08692 i010
StyleGAN2 CLIPInverter    Applsci 15 08692 i011      Applsci 15 08692 i012
StyleGAN2 edited          Applsci 15 08692 i013      Applsci 15 08692 i014
StyleGAN3 edited          Applsci 15 08692 i015      Applsci 15 08692 i016
Table 2. Quality control images added to every evaluation batch to test that annotators rate the realism of the image and the description correctness as expected.
Expected Rating            Description    Image                    Image Source
Unrealistic & incorrect    “A man.”       Applsci 15 08692 i017    https://pixabay.com/vectors/beauty-face-girl-head-portrait-1295692/, accessed on 21 March 2024
Unrealistic & incorrect    “A man.”       Applsci 15 08692 i018    https://pixabay.com/vectors/woman-beautiful-face-pretty-girl-157149/, accessed on 21 March 2024
Unrealistic & correct      “A man.”       Applsci 15 08692 i019    https://pixabay.com/vectors/man-person-avatar-face-head-156584/, accessed on 21 March 2024
Unrealistic & correct      “A woman.”     Applsci 15 08692 i020    https://pixabay.com/vectors/woman-red-hari-face-smile-308451/, accessed on 21 March 2024
Realistic & incorrect      “A man.”       Applsci 15 08692 i021    FFHQ dataset [18]
Realistic & incorrect      “A woman.”     Applsci 15 08692 i022    FFHQ dataset [18]
Realistic & correct        “A woman.”     Applsci 15 08692 i023    FFHQ dataset [18]
Realistic & correct        “A man.”       Applsci 15 08692 i024    FFHQ dataset [18]
Table 3. Inter-annotator agreement images added to every evaluation batch to measure the inter-annotator agreement between evaluators.
Model                     Description                                                                                                                           Image
StyleGAN                  “A woman with neatly tied back black hair, thin eyebrows, a thin nose, a strong jawline and some makeup.”                             Applsci 15 08692 i025
StyleGAN2                 “A tanned young woman with light brown hair, thin eyebrows, dark brown eyes, a petite nose and plump lips.”                           Applsci 15 08692 i026
StyleGAN2-Adaptive        “A woman with long, dark-brown hair, a fringe, light eyes, thin eyebrows, full lips and a strong jawline.”                            Applsci 15 08692 i027
StyleGAN2-Distillation    “A woman with ombre brown hair, extremely thin eyebrows, stunning blue eyes, plump lips and wearing heavy black eyeliner.”            Applsci 15 08692 i028
StyleGAN3                 “This younger woman has long wavy brown hair, full eyebrows and big blue eyes, her nose is long and pointed, and her lips are full.”  Applsci 15 08692 i029
Table 4. Inter-rater agreement: Mean score, standard deviation and intra-class correlation (ICC_Agreement, ICC_Consistency) for Q1: “To what extent does the image match the description”, Q2: “To what extent does the image look like a real photo” and Q3: “How satisfied are you with this image, considering both how much it matches the description and the quality of the image”.
Metric           StyleGAN   StyleGAN2   StyleGAN2 Adaptive   StyleGAN2 Distilled   StyleGAN3   Overall
Q1  Mean         74.4       81.2        4.0                  45.3                  71.4        55.3
    Std Dev      18.8       16.5        5.0                  20.8                  18.9        33.0
    ICC_Agr      –          –           –                    –                     –           0.78
    ICC_Con      –          –           –                    –                     –           0.81
Q2  Mean         15.7       67.9        87.7                 71.2                  63.4        61.2
    Std Dev      15.9       24.4        16.3                 21.7                  23.9        31.8
    ICC_Agr      –          –           –                    –                     –           0.63
    ICC_Con      –          –           –                    –                     –           0.68
Q3  Mean         43.2       74.4        18.3                 47.8                  64.8        49.7
    Std Dev      22.2       18.6        15.7                 21.5                  19.2        27.4
    ICC_Agr      –          –           –                    –                     –           0.55
    ICC_Con      –          –           –                    –                     –           0.57
