Evaluation of StyleGAN-CLIP Models in Text-to-Image Generation of Faces
Abstract
1. Introduction
2. Review of Generative Models
2.1. Review of StyleGAN Models
- StyleGAN: The style-based generator architecture was introduced by Karras et al. in 2019 [18]. The idea was to improve the traditional GAN generator's ability to generate and interpolate face images with high fidelity and high resolution. The model can control both coarse-grained (pose, face shape) and fine-grained (eyes, hair) details. This is achieved by projecting the noise vector into an intermediate latent space, which is then fed to the synthesis network (a minimal sketch of this mapping appears after this list). In spite of its powerful ability to generate realistic images, StyleGAN outputs suffer from three main issues [19]: (1) water droplet artifacts, (2) overly smooth images with little texture detail, caused by the nature of the cost function, and (3) phase artifacts produced by the adopted progressive growing strategy. StyleGAN was improved in later versions.
- StyleGAN2: To address the above-mentioned StyleGAN issues, the second variant, StyleGAN2 [19], was proposed in 2020. The generator normalisation was revised, including the integration of weight demodulation, to avoid the appearance of water droplet artifacts. Smoother outputs were obtained by adding a new regularisation term to the loss function. To resolve the phase artifact problem, the progressive growing strategy was replaced with skip connections in the generator and residual connections in the discriminator. Furthermore, the training process was revised by training larger models, which helps to improve the quality of the generated images.
- StyleGAN2 Adaptive (StyleGAN2-ADA): In 2020, the NVIDIA research team noted that training the model with a small amount of data causes the discriminator to overfit [25]. To overcome this, a novel StyleGAN2 variant, StyleGAN2 Adaptive (StyleGAN2-ADA), was developed. StyleGAN2-ADA integrates an adaptive discriminator augmentation strategy that considerably stabilises training in small-data regimes. The method applies a wide range of augmentations to prevent the discriminator from overfitting, and it has proved capable of generating good results using only a few thousand training images.
- StyleGAN2 Distillation (StyleGAN2-Dist): This StyleGAN variant [26] addresses the inference time of StyleGAN2 while preserving image quality. This is achieved by uniting unconditional image generation with paired image-to-image translation to accelerate the StyleGAN2 image generation process. The framework, termed StyleGAN2 Distillation (StyleGAN2-Dist), aims to distil a specific image manipulation in the StyleGAN2 latent code into a single image-to-image translation network. Despite its impressive quality, StyleGAN2 Distillation suffers from a number of limitations. Firstly, the framework's latent space is not sufficiently disentangled, and the transformations produced by the model do not look very natural. Secondly, each transformation must be distilled into an independent model, whereas a universal model could be trained instead.
- StyleGAN3: Despite the improvements made by StyleGAN2 and its variants, some image details appear to be generated at fixed pixel coordinates, which makes the images look less natural. This lack of translation equivariance in StyleGAN2 outputs limits its suitability for video and animation generation. In 2021, the third version of the StyleGAN models was developed [20]. The goal of StyleGAN3 is to counter the texture-sticking (aliasing) problem in StyleGAN2. Its generator, equivariant to translation and rotation, helps the model generate images with more natural transition animations and achieves video-quality generative performance. However, although the model maintains image quality, it somewhat trades off spatial consistency and editability, since it no longer fixes textures/features to specific image coordinates.
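To make the style-based design shared by all of the above variants concrete, below is a minimal PyTorch sketch of the core idea: a noise vector z is projected by a mapping network into an intermediate latent code w, which then modulates ("styles") the layers of the synthesis network. All names, dimensions and layer counts here are illustrative assumptions, not the official NVIDIA implementation.

```python
import torch
import torch.nn as nn

class MappingNetwork(nn.Module):
    """Projects noise z into the intermediate latent space W (illustrative sizes)."""
    def __init__(self, z_dim=512, w_dim=512, num_layers=8):
        super().__init__()
        layers, in_dim = [], z_dim
        for _ in range(num_layers):
            layers += [nn.Linear(in_dim, w_dim), nn.LeakyReLU(0.2)]
            in_dim = w_dim
        self.net = nn.Sequential(*layers)

    def forward(self, z):
        # Normalising z before mapping mirrors the pixel-norm step in StyleGAN.
        z = z / z.norm(dim=1, keepdim=True)
        return self.net(z)

# The synthesis network consumes w at every resolution: coarse layers control
# pose/face shape, fine layers control details such as hair and eyes.
mapping = MappingNetwork()
z = torch.randn(4, 512)   # batch of 4 noise vectors
w = mapping(z)            # intermediate latent codes, shape (4, 512)
```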
2.2. Contrastive Language Image Pretraining (CLIP)
2.3. Text-to-Image Generation Model
2.4. The Image Editing Model
3. Review of Text-to-Image Evaluation Metrics
3.1. Automatic Evaluation Metrics
- CLIP score: The CLIP score [23,34,35] measures the compatibility of text–image pairs by computing the cosine similarity between the embedding of the generated image and the embedding of the text prompt (a minimal sketch appears after this list). Higher CLIP scores signify higher semantic similarity between the image and the given description. CLIP scores were found to correlate highly with human judgement [36].
- AugCLIP score: The AugCLIP score [37] is a variant of the standard CLIP score that is more robust against adversarial attacks. AugCLIP applies random augmentations to the image, such as random colourisation, translation, resizing and cutout. Since the CLIP model is trained on a wide variety of images, the augmentation strategy does not destroy the semantic features encoded by CLIP.
- R-Precision (RP): RP [38,39,40] is a popular metric that computes the top-R retrieval accuracy when retrieving the matching text from 100 candidate texts, using the synthesised image as the query and the cosine similarity between the image and text encoding vectors as the similarity score (see the R-Precision sketch after this list). An adapted version of the Deep Attentional Multimodal Similarity Model (DAMSM) [39] is used to compute the image–text similarity score for retrieval.
- CLIP-R-Precision: Park et al. [40] observed that models not specifically tuned with the DAMSM tend to perform poorly on the RP metric. To address this issue, the CLIP model [24] is used in place of the standard DAMSM to calculate the R-Precision scores. Compared to DAMSM, CLIP obtains higher image–text retrieval performance.
- TIFA (Text-to-Image Faithfulness evaluation with question Answering): TIFA [41] is a recent automatic metric that evaluates the faithfulness of the image to the text using a visual question answering (VQA) model. In addition to the VQA model, TIFA integrates two other models: a question-generating (QG) model and a question-answering (QA) model. Given a text input, the QG model generates several question–answer pairs, which the QA model filters. Finally, the VQA model answers the questions using the synthesised image, and the answers are checked for correctness. To facilitate comparative studies, a TIFA benchmark is provided, comprising 4k text prompts and 25k questions. TIFA is considered much more accurate than CLIP and correlates better with human assessment. However, we were not able to use TIFA in our study, since no suitable VQA model specifically for face descriptions was readily available.
- FID (Fréchet Inception Distance): FID [42] quantifies how similar the generated images are to real ones by comparing the distributions of their feature representations, and as such it captures how realistic the generated images are. Both real and generated images are passed through a pretrained Inception-v3 [43] network, and the resulting feature vectors are assumed to follow multivariate Gaussians. FID computes the Fréchet (or Wasserstein-2) distance between these two Gaussians, so a lower score is better (a worked formula and sketch appear after this list).
- CMMD (CLIP Maximum Mean Discrepancy): CMMD [44] is similar in concept to FID; however, it does not assume the image feature vectors to be multivariate Gaussians, and it uses the Maximum Mean Discrepancy (MMD) between CLIP embeddings rather than a distance between Inception-v3 feature vectors (a sketch appears after this list). The authors argue that CLIP embeddings should perform better, since they are trained on 400 million image–text pairs rather than the roughly 1M images used to train the Inception-v3 model. CMMD has been shown to correlate better with human evaluations of image realism.
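As a concrete illustration of the CLIP and AugCLIP scores described above, the sketch below computes the cosine similarity between CLIP image and text embeddings using the Hugging Face transformers implementation; the AugCLIP variant then averages the score over randomly augmented copies of the image. The model checkpoint and the augmentation choices are illustrative, not the exact configuration used in [37].

```python
import torch
import torch.nn.functional as F
from PIL import Image
from torchvision import transforms
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image: Image.Image, text: str) -> float:
    """Cosine similarity between the CLIP embeddings of an image and a prompt."""
    inputs = processor(text=[text], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                          attention_mask=inputs["attention_mask"])
    return F.cosine_similarity(img_emb, txt_emb).item()

def aug_clip_score(image: Image.Image, text: str, n_aug: int = 8) -> float:
    """AugCLIP-style score: average CLIP score over randomly augmented copies.
    The augmentation set below is an illustrative stand-in for the one in [37]."""
    augment = transforms.Compose([
        transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),
        transforms.ColorJitter(brightness=0.2, contrast=0.2),
    ])
    return sum(clip_score(augment(image), text) for _ in range(n_aug)) / n_aug
```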
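The R-Precision computation itself reduces to ranking candidate captions by their similarity to the generated image. The sketch below assumes precomputed embeddings (from DAMSM or, for CLIP-R-Precision, from CLIP) and checks whether the ground-truth caption appears among the top R retrieved candidates; the function name and interface are illustrative.

```python
import torch
import torch.nn.functional as F

def r_precision(img_emb: torch.Tensor, cand_txt_emb: torch.Tensor,
                true_idx: int, R: int = 1) -> float:
    """R-Precision for one generated image.

    img_emb:      (D,) embedding of the generated image.
    cand_txt_emb: (N, D) embeddings of N candidate captions (e.g. N = 100,
                  one matching caption plus 99 mismatched ones).
    true_idx:     row index of the ground-truth caption in cand_txt_emb.
    Returns 1.0 if the true caption ranks in the top R, else 0.0.
    """
    sims = F.cosine_similarity(img_emb.unsqueeze(0), cand_txt_emb)  # (N,)
    top = sims.topk(R).indices
    return float(true_idx in top)

# Averaging this over many generated images gives the reported RP score;
# producing the embeddings with CLIP instead of DAMSM yields CLIP-R-Precision.
```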
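For reference, the FID between the real and generated feature distributions N(μ_r, Σ_r) and N(μ_g, Σ_g) is FID = ||μ_r − μ_g||² + Tr(Σ_r + Σ_g − 2(Σ_r Σ_g)^{1/2}). A minimal numpy sketch, assuming the Inception-v3 features have already been extracted:

```python
import numpy as np
from scipy import linalg

def fid(real_feats: np.ndarray, gen_feats: np.ndarray) -> float:
    """Fréchet Inception Distance between two sets of Inception-v3 features.

    real_feats, gen_feats: arrays of shape (N, D) holding one feature
    vector per image; both sets are modelled as multivariate Gaussians.
    """
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_g = np.cov(gen_feats, rowvar=False)
    covmean = linalg.sqrtm(cov_r @ cov_g)   # matrix square root
    if np.iscomplexobj(covmean):            # discard tiny numerical imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```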
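CMMD replaces the Gaussian assumption with a kernel two-sample test over CLIP embeddings. The sketch below computes a simple MMD² estimate with a Gaussian RBF kernel; the bandwidth and the biased estimator are illustrative simplifications of the exact formulation in [44].

```python
import numpy as np

def mmd2_rbf(x: np.ndarray, y: np.ndarray, sigma: float = 10.0) -> float:
    """Biased MMD^2 estimate with an RBF kernel (illustrative bandwidth).

    x: (N, D) CLIP embeddings of real images.
    y: (M, D) CLIP embeddings of generated images.
    """
    def kernel(a, b):
        # Pairwise squared Euclidean distances, then the Gaussian kernel.
        d2 = (np.sum(a**2, 1)[:, None] + np.sum(b**2, 1)[None, :]
              - 2.0 * a @ b.T)
        return np.exp(-d2 / (2.0 * sigma**2))
    return float(kernel(x, x).mean() + kernel(y, y).mean()
                 - 2.0 * kernel(x, y).mean())
```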
3.2. Human Evaluation Studies
4. Evaluation Methodology
4.1. Data Generation and Organisation
4.2. Human Evaluation Process
- Question 1: To what extent does the image match the description?
- Question 2: To what extent does the image look like a real photo?
- Question 3: How satisfied are you with this image, considering both how much it matches the description and the quality of the image?
4.3. Automatic Evaluation Process
5. Results and Discussion
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
| ADA | Adaptive Discriminator Augmentation |
| CGAN | Conditional Generative Adversarial Network |
| CLIP | Contrastive Language Image Pretraining |
| CMMD | CLIP Maximum Mean Discrepancy |
| DCGAN | Deep Convolutional Generative Adversarial Network |
| FID | Fréchet Inception Distance |
| FFHQ | Flickr-Faces-HQ dataset |
| GAN | Generative Adversarial Network |
| HSD-Tukey | Honestly Significant Difference-Tukey test |
| ICC | Intra-Class Correlation |
| RP | R-Precision |
| TIFA | Text-to-Image Faithfulness evaluation with question Answering |
References
- Harshvardhan, G.M.; Gourisaria, M.K.; Pandey, M.; Rautaray, S.S. A comprehensive survey and analysis of generative models in machine learning. Comput. Sci. Rev. 2020, 38, 100285.
- Bond-Taylor, S.; Leach, A.; Long, Y.; Willcocks, C.G. Deep Generative Modelling: A Comparative Review of VAEs, GANs, Normalizing Flows, Energy-Based and Autoregressive Models. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 7327–7347.
- Gui, J.; Sun, Z.; Wen, Y.; Tao, D.; Ye, J. A Review on Generative Adversarial Networks: Algorithms, Theory, and Applications. IEEE Trans. Knowl. Data Eng. 2023, 35, 3313–3332.
- Nekamiche, N.; Zakaria, C.; Bouchareb, S.; Smaïli, K. A Deep Convolution Generative Adversarial Network for the Production of Images of Human Faces. In Intelligent Information and Database Systems, Proceedings of the 14th Asian Conference, ACIIDS 2022, Ho Chi Minh City, Vietnam, 28–30 November 2022; Nguyen, N.T., Tran, T.K., Tukayev, U., Hong, T.P., Trawiński, B., Szczerbicki, E., Eds.; Springer: Cham, Switzerland, 2022; pp. 313–326.
- Han, X.; Wu, Y.; Wan, R. A Method for Style Transfer from Artistic Images Based on Depth Extraction Generative Adversarial Network. Appl. Sci. 2023, 13, 867.
- Chen, J.; Fan, C.; Zhang, Z.; Li, G.; Zhao, Z.; Deng, Z.; Ding, Y. A Music-Driven Deep Generative Adversarial Model for Guzheng Playing Animation. IEEE Trans. Vis. Comput. Graph. 2023, 29, 1400–1414.
- Motamed, S.; Rogalla, P.; Khalvati, F. Data augmentation using Generative Adversarial Networks (GANs) for GAN-based detection of Pneumonia and COVID-19 in chest X-ray images. Inform. Med. Unlocked 2021, 27, 100779.
- Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial nets. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 8–13 December 2014; pp. 2672–2680.
- Navidan, H.; Moshiri, P.F.; Nabati, M.; Shahbazian, R.; Ghorashi, S.A.; Shah-Mansouri, V.; Windridge, D. Generative Adversarial Networks (GANs) in networking: A comprehensive survey and evaluation. Comput. Netw. 2021, 194, 108149.
- Creswell, A.; White, T.; Dumoulin, V.; Arulkumaran, K.; Sengupta, B.; Bharath, A.A. Generative Adversarial Networks: An Overview. IEEE Signal Process. Mag. 2018, 35, 53–65.
- Kim, M.; Liu, F.; Jain, A.; Liu, X. DCFace: Synthetic Face Generation with Dual Condition Diffusion Model. arXiv 2023, arXiv:2304.07060.
- Peng, Y.; Zhao, C.; Xie, H.; Fukusato, T.; Miyata, K. DiffFaceSketch: High-Fidelity Face Image Synthesis with Sketch-Guided Latent Diffusion Model. arXiv 2023, arXiv:2302.06908.
- Szeliga, A. A Comparative Study of Deep Generative Models for Image Generation. Master's Thesis, Hochschule Hannover, Hannover, Germany, 2023.
- Wang, H. Comparative Analysis of GANs and Diffusion Models in Image Generation. Highlights Sci. Eng. Technol. 2024, 120, 59–66.
- Radford, A.; Metz, L.; Chintala, S. Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. arXiv 2015, arXiv:1511.06434.
- Ehrhart, M.; Resch, B.; Havas, C.; Niederseer, D. A Conditional GAN for Generating Time Series Data for Stress Detection in Wearable Physiological Sensor Data. Sensors 2022, 22, 5969.
- Chen, X.; Duan, Y.; Houthooft, R.; Schulman, J.; Sutskever, I.; Abbeel, P. InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets. In Proceedings of the NIPS'16: 30th International Conference on Neural Information Processing Systems, Barcelona, Spain, 5–10 December 2016; Curran Associates Inc.: Red Hook, NY, USA, 2016; pp. 2180–2188.
- Karras, T.; Laine, S.; Aila, T. A Style-Based Generator Architecture for Generative Adversarial Networks. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 4396–4405.
- Karras, T.; Laine, S.; Aittala, M.; Hellsten, J.; Lehtinen, J.; Aila, T. Analyzing and Improving the Image Quality of StyleGAN. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 8107–8116.
- Karras, T.; Aittala, M.; Laine, S.; Härkönen, E.; Hellsten, J.; Lehtinen, J.; Aila, T. Alias-Free Generative Adversarial Networks. In Proceedings of the Neural Information Processing Systems, Online, 6–14 December 2021.
- Patashnik, O.; Wu, Z.; Shechtman, E.; Cohen-Or, D.; Lischinski, D. StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 2085–2094.
- Sauer, A.; Karras, T.; Laine, S.; Geiger, A.; Aila, T. StyleGAN-T: Unlocking the Power of GANs for Fast Large-Scale Text-to-Image Synthesis. arXiv 2023, arXiv:2301.09515.
- Baykal, A.C.; Anees, A.B.; Ceylan, D.; Erdem, E.; Erdem, A.; Yuret, D. CLIP-Guided StyleGAN Inversion for Text-Driven Real Image Editing. ACM Trans. Graph. 2023, 42, 172.
- Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning Transferable Visual Models From Natural Language Supervision. In Proceedings of the International Conference on Machine Learning, Online, 18–24 July 2021.
- Karras, T.; Aittala, M.; Hellsten, J.; Laine, S.; Lehtinen, J.; Aila, T. Training Generative Adversarial Networks with Limited Data. In Proceedings of the NIPS'20: 34th International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 6–12 December 2020; Curran Associates Inc.: Red Hook, NY, USA, 2020.
- Viazovetskyi, Y.; Ivashkin, V.; Kashin, E. StyleGAN2 Distillation for Feed-Forward Image Manipulation. In Proceedings of the ECCV 2020, Glasgow, UK, 23–28 August 2020.
- Alaluf, Y.; Patashnik, O.; Wu, Z.; Zamir, A.; Shechtman, E.; Lischinski, D.; Cohen-Or, D. Third Time's the Charm? Image and Video Editing with StyleGAN3. In Proceedings of the Computer Vision—ECCV 2022 Workshops, Tel Aviv, Israel, 23–27 October 2022; Springer: Berlin/Heidelberg, Germany, 2023; pp. 204–220.
- Zhang, Y.; Jiang, H.; Miura, Y.; Manning, C.D.; Langlotz, C. Contrastive Learning of Medical Visual Representations from Paired Images and Text. In Proceedings of the Machine Learning in Health Care 2020, Online, 7–8 August 2020.
- Herrera-Berg, E. StyleGAN3-CLIP-Notebooks. 2022. Available online: https://github.com/ouhenio/StyleGAN3-CLIP-notebooks (accessed on 21 March 2024).
- Xia, W.; Zhang, Y.; Yang, Y.; Xue, J.H.; Zhou, B.; Yang, M.H. GAN Inversion: A Survey. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 3121–3138.
- Zhu, J.; Shen, Y.; Zhao, D.; Zhou, B. In-Domain GAN Inversion for Real Image Editing. In Computer Vision—ECCV 2020, Proceedings of the 16th European Conference, Glasgow, UK, 23–28 August 2020; Vedaldi, A., Bischof, H., Brox, T., Frahm, J.M., Eds.; Springer: Cham, Switzerland, 2020; pp. 592–608.
- Shen, Y.; Gu, J.; Tang, X.; Zhou, B. Interpreting the Latent Space of GANs for Semantic Face Editing. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 9240–9249.
- Collins, E.; Bala, R.; Price, B.; Süsstrunk, S. Editing in Style: Uncovering the Local Semantics of GANs. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 5770–5779.
- Kang, M.; Zhu, J.Y.; Zhang, R.; Park, J.; Shechtman, E.; Paris, S.; Park, T. Scaling up GANs for Text-to-Image Synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023.
- Pinkney, J.N.M.; Li, C. clip2latent: Text driven sampling of a pre-trained StyleGAN using denoising diffusion and CLIP. In Proceedings of the British Machine Vision Conference 2022, London, UK, 21–24 November 2022.
- Hessel, J.; Holtzman, A.; Forbes, M.; Le Bras, R.; Choi, Y. CLIPScore: A Reference-free Evaluation Metric for Image Captioning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Punta Cana, Dominican Republic, 7–11 November 2021; Moens, M.F., Huang, X., Specia, L., Yih, S.W.t., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2021; pp. 7514–7528.
- Liu, X.; Gong, C.; Wu, L.; Zhang, S.; Su, H.; Liu, Q. FuseDream: Training-Free Text-to-Image Generation with Improved CLIP+GAN Space Optimization. arXiv 2021, arXiv:2112.01573.
- Dinh, T.M.; Nguyen, R.; Hua, B.S. TISE: Bag of Metrics for Text-to-Image Synthesis Evaluation. In Computer Vision—ECCV 2022, Proceedings of the 17th European Conference, Tel Aviv, Israel, 23–27 October 2022; Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T., Eds.; Springer: Cham, Switzerland, 2022; pp. 594–609.
- Xu, T.; Zhang, P.; Huang, Q.; Zhang, H.; Gan, Z.; Huang, X.; He, X. AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Adversarial Networks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 1316–1324.
- Park, D.H.; Azadi, S.; Liu, X.; Darrell, T.; Rohrbach, A. Benchmark for Compositional Text-to-Image Synthesis. In Proceedings of the NeurIPS Datasets and Benchmarks 2021, Online, 6–14 December 2021.
- Hu, Y.; Liu, B.; Kasai, J.; Wang, Y.; Ostendorf, M.; Krishna, R.; Smith, N.A. TIFA: Accurate and Interpretable Text-to-Image Faithfulness Evaluation with Question Answering. arXiv 2023, arXiv:2303.11897.
- Heusel, M.; Ramsauer, H.; Unterthiner, T.; Nessler, B.; Hochreiter, S. GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium. In Proceedings of the Advances in Neural Information Processing Systems; Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2017; Volume 30.
- Xia, X.; Xu, C.; Nan, B. Inception-v3 for flower classification. In Proceedings of the 2017 2nd International Conference on Image, Vision and Computing (ICIVC), Chengdu, China, 2–4 June 2017; pp. 783–787.
- Jayasumana, S.; Ramalingam, S.; Veit, A.; Glasner, D.; Chakrabarti, A.; Kumar, S. Rethinking FID: Towards a Better Evaluation Metric for Image Generation. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 9307–9315.
- Wang, Y.; Zhou, W.; Bao, J.; Wang, W.; Li, L.; Li, H. CLIP2GAN: Towards Bridging Text with the Latent Space of GANs. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 6847–6859.
- Petsiuk, V. Human Evaluation of Text-to-Image Models on a Multi-Task Benchmark. arXiv 2022, arXiv:2211.12112.
- Otani, M.; Togashi, R.; Sawai, Y.; Ishigami, R.; Nakashima, Y.; Rahtu, E.; Heikkila, J.; Satoh, S. Toward Verifiable and Reproducible Human Evaluation for Text-to-Image Generation. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 14277–14286.
- Tanti, M.; Abdilla, S.; Muscat, A.; Borg, C.; Farrugia, R.A.; Gatt, A. Face2Text revisited: Improved data set and baseline results. In Proceedings of the 2nd Workshop on People in Vision, Language, and the Mind, Marseille, France, 20–25 June 2022; Paggio, P., Gatt, A., Tanti, M., Eds.; European Language Resources Association: Paris, France, 2022; pp. 41–47.
| Model | Description 1 | Description 2 |
|---|---|---|
| StyleGAN | | |
| StyleGAN2 | | |
| StyleGAN2 Adaptive | | |
| StyleGAN2 Distilled | | |
| StyleGAN3 | | |
| StyleGAN2 CLIPInverter | | |
| StyleGAN2 edited | | |
| StyleGAN3 edited | | |
| Expected Rating | Description | Image | Image Source |
|---|---|---|---|
| Unrealistic & incorrect | “A man.” | | https://pixabay.com/vectors/beauty-face-girl-head-portrait-1295692/, accessed on 21 March 2024 |
| Unrealistic & incorrect | “A man.” | | https://pixabay.com/vectors/woman-beautiful-face-pretty-girl-157149/, accessed on 21 March 2024 |
| Unrealistic & correct | “A man.” | | https://pixabay.com/vectors/man-person-avatar-face-head-156584/, accessed on 21 March 2024 |
| Unrealistic & correct | “A woman.” | | https://pixabay.com/vectors/woman-red-hari-face-smile-308451/, accessed on 21 March 2024 |
| Realistic & incorrect | “A man.” | | FFHQ dataset [18] |
| Realistic & incorrect | “A woman.” | | FFHQ dataset [18] |
| Realistic & correct | “A woman.” | | FFHQ dataset [18] |
| Realistic & correct | “A man.” | | FFHQ dataset [18] |
| Model | Description | Image |
|---|---|---|
| StyleGAN | “A woman with neatly tied back black hair, thin eyebrows, a thin nose, a strong jawline and some makeup.” | |
| StyleGAN2 | “A tanned young woman with light brown hair, thin eyebrows, dark brown eyes, a petite nose and plump lips.” | |
| StyleGAN2-Adaptive | “A woman with long, dark-brown hair, a fringe, light eyes, thin eyebrows, full lips and a strong jawline.” | |
| StyleGAN2-Distillation | “A woman with ombre brown hair, extremely thin eyebrows, stunning blue eyes, plump lips and wearing heavy black eyeliner.” | |
| StyleGAN3 | “This younger woman has long wavy brown hair, full eyebrows and big blue eyes, her nose is long and pointed, and her lips are full.” | |
| Question | Statistic | StyleGAN | StyleGAN2 | StyleGAN2 Adaptive | StyleGAN2 Distilled | StyleGAN3 | Overall |
|---|---|---|---|---|---|---|---|
| Q1 | Mean | 74.4 | 81.2 | 4.0 | 45.3 | 71.4 | 55.3 |
| | Std Dev | 18.8 | 16.5 | 5.0 | 20.8 | 18.9 | 33.0 |
| | | | | | | | 0.78 |
| | | | | | | | 0.81 |
| Q2 | Mean | 15.7 | 67.9 | 87.7 | 71.2 | 63.4 | 61.2 |
| | Std Dev | 15.9 | 24.4 | 16.3 | 21.7 | 23.9 | 31.8 |
| | | | | | | | 0.63 |
| | | | | | | | 0.68 |
| Q3 | Mean | 43.2 | 74.4 | 18.3 | 47.8 | 64.8 | 49.7 |
| | Std Dev | 22.2 | 18.6 | 15.7 | 21.5 | 19.2 | 27.4 |
| | | | | | | | 0.55 |
| | | | | | | | 0.57 |