Prompt-Guided Semantic Latent Direction Learning in Diffusion Models for Abstract Visual Concept Manipulation
Abstract
1. Introduction
- We propose a prompt-guided concept-vector learning framework for the controllable manipulation of abstract visual concepts without requiring external human-annotated image pairs, pixel-level labels, segmentation masks, identity labels, or manually annotated editing targets.
- We introduce a positive–neutral multi-prompt pairing strategy and a parameter-efficient bottleneck injection mechanism, where only a compact concept vector is learned while the pretrained VAE, text encoder, and U-Net remain frozen.
- We extend the learned concept vector to real-image editing through an image-to-image pipeline, enabling controllable semantic manipulation through and , and validate the framework through quantitative, qualitative, human preference, baseline comparison, and ablation analyses.
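The bottleneck injection idea in the contributions above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the additive form, the channel-wise broadcast over spatial positions, the scalar `strength`, and the names `inject_concept`, `features`, and `concept` are all assumptions. Only the vector size of 1280 comes from the trainable-parameter count reported in the tables.

```python
import random

# 1280 matches the "1280 per concept" trainable-parameter count in the tables.
CHANNELS = 1280


def inject_concept(features, concept_vector, strength):
    """Add a scaled concept vector to every spatial position of a feature map.

    features: list of CHANNELS channels, each a flat list of spatial activations.
    concept_vector: list of CHANNELS floats (the only learned parameters).
    strength: scalar multiplier (hypothetical control knob).
    """
    if len(features) != len(concept_vector):
        raise ValueError("channel count mismatch")
    return [
        [value + strength * direction for value in channel]
        for channel, direction in zip(features, concept_vector)
    ]


random.seed(0)
spatial = 4  # hypothetical tiny 2x2 bottleneck map, flattened
# Stand-in for a frozen U-Net bottleneck activation; the backbone is not modelled here.
features = [[random.gauss(0.0, 1.0) for _ in range(spatial)] for _ in range(CHANNELS)]
# Stand-in for a learned concept vector.
concept = [random.gauss(0.0, 0.01) for _ in range(CHANNELS)]

edited = inject_concept(features, concept, strength=2.0)
unchanged = inject_concept(features, concept, strength=0.0)
```

With `strength=0.0` the feature map is returned unchanged, which is the property that lets a single scalar act as a continuous control under these assumptions.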
2. Related Work
2.1. Controllable Generation in Diffusion Models
2.2. Latent Space Interpretation and Semantic Vector Learning
3. Materials and Methods
3.1. Overview of the Framework
3.2. Multi-Prompt Synthetic Dataset Generation
3.3. Prompt-Guided Concept Vector Optimization
3.4. Rationale for Bottleneck-Level Concept Injection
3.5. Image-to-Image Editing Pipeline
3.6. Experimental Setup
3.7. Evaluation Metrics
4. Results
4.1. Quantitative Results
4.2. Human Preference Study
4.3. Qualitative Results
4.4. Comparison with Existing Image Editing Methods
4.5. Ablation and Robustness Analysis
4.5.1. Robustness Under Coupled – Settings
4.5.2. Effect of Denoising-Stage-Dependent Concept Strength
4.5.3. Ablation Study on Injection Location
4.5.4. Ablation Study on Multi-Prompt Pool
4.6. Parameter Efficiency and Computational Overhead
5. Discussion
5.1. Analysis of Results
5.2. Generalization Boundary for Geometric Concepts
5.3. Failure Cases
5.4. Limitations
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
1. Dhariwal, P.; Nichol, A. Diffusion Models Beat GANs on Image Synthesis. Adv. Neural Inf. Process. Syst. (NeurIPS) 2021, 34, 8780–8794.
2. Ho, J.; Jain, A.; Abbeel, P. Denoising Diffusion Probabilistic Models. Adv. Neural Inf. Process. Syst. (NeurIPS) 2020, 33, 6840–6851.
3. Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-Resolution Image Synthesis with Latent Diffusion Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022.
4. Ho, J.; Salimans, T. Classifier-Free Diffusion Guidance. arXiv 2022, arXiv:2207.12598.
5. Cao, P.; Zhou, F.; Song, Q.; Yang, L. Controllable Generation with Text-to-Image Diffusion Models: A Survey. IEEE Trans. Pattern Anal. Mach. Intell. 2026, 48, 4771–4791.
6. Meng, C.; He, Y.; Song, Y.; Song, J.; Wu, J.; Zhu, J.-Y.; Ermon, S. SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations. In Proceedings of the Tenth International Conference on Learning Representations, Virtual, 25–29 April 2022.
7. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J. Learning Transferable Visual Models from Natural Language Supervision. In Proceedings of the 38th International Conference on Machine Learning, Virtual, 18–24 July 2021.
8. Ramesh, A.; Dhariwal, P.; Nichol, A.; Chu, C.; Chen, M. Hierarchical Text-Conditional Image Generation with CLIP Latents. arXiv 2022, arXiv:2204.06125.
9. Haas, R.; Huberman-Spiegelglas, I.; Mulayoff, R.; Graßhof, S.; Brandt, S.S.; Michaeli, T. Discovering Interpretable Directions in the Semantic Latent Space of Diffusion Models. In Proceedings of the 2024 IEEE 18th International Conference on Automatic Face and Gesture Recognition (FG), Istanbul, Turkiye, 27–31 May 2024.
10. Park, Y.-H.; Kwon, M.; Choi, J.; Jo, J.; Uh, Y. Understanding the Latent Space of Diffusion Models through the Lens of Riemannian Geometry. Adv. Neural Inf. Process. Syst. (NeurIPS) 2023, 36, 24129–24142.
11. Kwon, M.; Jeong, J.; Uh, Y. Diffusion Models Already Have a Semantic Latent Space. arXiv 2022, arXiv:2210.10960.
12. Brooks, T.; Holynski, A.; Efros, A.A. InstructPix2Pix: Learning to Follow Image Editing Instructions. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023.
13. Hertz, A.; Mokady, R.; Tenenbaum, J.; Aberman, K.; Pritch, Y.; Cohen-Or, D. Prompt-to-Prompt Image Editing with Cross-Attention Control. arXiv 2022, arXiv:2208.01626.
14. Zhang, L.; Rao, A.; Agrawala, M. Adding Conditional Control to Text-to-Image Diffusion Models. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023.
15. Mou, C.; Wang, X.; Xie, L.; Wu, Y.; Zhang, J.; Qi, Z.; Shan, Y. T2I-Adapter: Learning Adapters for Controllable Text-to-Image Diffusion Models. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024.
16. Traub, J. Representation Learning with Diffusion Models. arXiv 2022, arXiv:2210.11058.
17. Preechakul, K.; Chatthee, N.; Wizadwongsa, S.; Suwajanakorn, S. Diffusion Autoencoders: Toward a Meaningful and Decodable Representation. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2022), New Orleans, LA, USA, 18–24 June 2022.
18. Samuel, D.; Ben-Ari, R.; Raviv, S.; Darshan, N.; Chechik, G. Generating Images of Rare Concepts Using Pre-trained Diffusion Models. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024.
19. Zeng, E.Z.; Chen, Y.; Wong, A. Decoding Diffusion: A Scalable Framework for Unsupervised Analysis of Latent Space Representations. arXiv 2024, arXiv:2410.21314.
20. Li, H.; Shen, C.; Torr, P.; Tresp, V.; Gu, J. Self-Discovering Interpretable Diffusion Latent Directions for Responsible Text-to-Image Generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024.
21. Shuai, Z.; Wu, C.; Tang, Z.; Song, B.; Shen, L. Latent Space Disentanglement in Diffusion Transformers Enables Zero-shot Semantic Editing. arXiv 2024, arXiv:2408.13335.
22. Yang, Z.; Yu, H.; Li, B.; Zhang, J.; Huang, J.; Zhao, F. Unleashing the Potential of the Semantic Latent Space in Diffusion Models for Image Dehazing. In European Conference on Computer Vision; Springer: Cham, Switzerland, 2024.
23. Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image Quality Assessment: From Error Visibility to Structural Similarity. IEEE Trans. Image Process. 2004, 13, 600–612.
24. Zhang, R.; Isola, P.; Efros, A.A.; Shechtman, E.; Wang, O. The Unreasonable Effectiveness of Deep Features as a Perceptual Metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 586–595.
25. Gal, R.; Alaluf, Y.; Atzmon, Y.; Patashnik, O.; Bermano, A.H.; Chechik, G.; Cohen-Or, D. An Image Is Worth One Word: Personalizing Text-to-Image Generation Using Textual Inversion. In Proceedings of the Eleventh International Conference on Learning Representations, Kigali, Rwanda, 1–5 May 2023.

Positive and neutral prompts per concept.

| Concept | Positive Prompts | Neutral Prompts |
|---|---|---|
| Perfect skin | A face with smooth and glowing skin; A portrait with healthy, flawless complexion | A face; A portrait of a person |
| Peaceful lake | A peaceful lake with calm still water; A serene lake with smooth reflections | A lake; A natural landscape with a lake |
| Melancholic portrait | A melancholic portrait with subdued emotional expression; A somber portrait with quiet sadness | A portrait of a person; A human face portrait |
| Dramatic portrait | A dramatic portrait with strong contrast lighting; A cinematic portrait with intense mood | A portrait of a person; A studio portrait |
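
The positive and neutral prompts in the table above can be organized into pools and paired, as sketched below. This is an illustration only: the source does not state how positive and neutral prompts are matched, so the Cartesian pairing, the random sampling, and the names `PROMPT_POOLS`, `prompt_pairs`, and `sample_pair` are assumptions. The prompt strings are copied from the table.

```python
import itertools
import random

# Prompt pools copied from the table; the dictionary layout is an assumption.
PROMPT_POOLS = {
    "Perfect skin": {
        "positive": [
            "A face with smooth and glowing skin",
            "A portrait with healthy, flawless complexion",
        ],
        "neutral": ["A face", "A portrait of a person"],
    },
    "Peaceful lake": {
        "positive": [
            "A peaceful lake with calm still water",
            "A serene lake with smooth reflections",
        ],
        "neutral": ["A lake", "A natural landscape with a lake"],
    },
}


def prompt_pairs(concept):
    """Return every (positive, neutral) combination for one concept.

    Assumption: all combinations are valid training pairs.
    """
    pool = PROMPT_POOLS[concept]
    return list(itertools.product(pool["positive"], pool["neutral"]))


def sample_pair(concept, rng):
    """Draw one (positive, neutral) pair at random for a training step."""
    return rng.choice(prompt_pairs(concept))


rng = random.Random(0)
pairs = prompt_pairs("Perfect skin")
positive, neutral = sample_pair("Peaceful lake", rng)
```

Under this pairing, two positive and two neutral prompts yield four pairs per concept, so the pool grows multiplicatively as prompts are added.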

Quantitative results, perfect skin concept.

| Method | SSIM ↑ | LPIPS ↓ | CLIP ↑ |
|---|---|---|---|
| Stable Diffusion | 0.6636 | 0.2478 | 0.2460 |
| Ours (w/o concept) | 0.7011 | 0.2314 | 0.2481 |
| Ours (with concept) | 0.7023 | 0.2370 | 0.2491 |

Quantitative results, peaceful lake concept.

| Method | SSIM ↑ | LPIPS ↓ | CLIP ↑ |
|---|---|---|---|
| Stable Diffusion | 0.4578 | 0.4129 | 0.2663 |
| Ours (w/o concept) | 0.4653 | 0.5086 | 0.2820 |
| Ours (with concept) | 0.4854 | 0.5021 | 0.2871 |

Comparison with existing image editing methods.

| Method | Supervision/Training Data | Trainable Params. | Main Strength | Main Limitation |
|---|---|---|---|---|
| SDEdit/Stable Diffusion Img2Img [3,6] | No additional training; text prompt only | 0 | Simple image-to-image editing | Prompt-dependent control; weak for subjective concepts |
| Prompt-to-Prompt [13] | Text prompts and attention manipulation | 0 | Attention-based prompt editing | Sensitive to prompt alignment and attention maps |
| InstructPix2Pix [12] | Instruction-tuned external editing data | Pretrained editing model | Flexible instruction-based editing | May alter identity, background, or composition |
| Textual Inversion [25] | Concept-specific image set | Token embedding | Lightweight token-level learning | Limited stability for image-to-image control |
| Self-discovered latent directions [20] | Generated samples or latent-direction analysis | Method-dependent | Interpretable semantic directions | Target control can be indirect |
| Ours | Positive/neutral prompt pools without external human-annotated image pairs | 1280 per concept | Explicit concept vector with continuous control | Global appearance control; less precise spatial editing |

Ablation on injection location.

| Injection Location | SSIM ↑ | LPIPS ↓ | CLIP ↑ |
|---|---|---|---|
| Down blocks | 0.6713 | 0.5188 | 0.2148 |
| Mid-block | 0.7023 | 0.2370 | 0.2491 |
| Up blocks | 0.1671 | 0.7972 | 0.1868 |
| All blocks | 0.1684 | 0.7952 | 0.1866 |

Ablation on the multi-prompt pool.

| Concept | Prompt Setting | SSIM ↑ | LPIPS ↓ | CLIP ↑ |
|---|---|---|---|---|
| Perfect skin | Single prompt | 0.6989 | 0.2671 | 0.2411 |
| Perfect skin | Multi-prompt pool | 0.7023 | 0.2370 | 0.2491 |
| Peaceful lake | Single prompt | 0.5310 | 0.3264 | 0.2678 |
| Peaceful lake | Multi-prompt pool | 0.4854 | 0.5021 | 0.2871 |
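
Reading the multi-prompt rows above as differences against the single-prompt rows makes the trade-off explicit. The short script below computes those differences from the table values; the helper name `improvements` and the data layout are illustrative choices, not part of the paper.

```python
# (SSIM, LPIPS, CLIP) values copied from the multi-prompt ablation table.
ABLATION = {
    "Perfect skin": {
        "single": (0.6989, 0.2671, 0.2411),
        "multi": (0.7023, 0.2370, 0.2491),
    },
    "Peaceful lake": {
        "single": (0.5310, 0.3264, 0.2678),
        "multi": (0.4854, 0.5021, 0.2871),
    },
}

# Metric name and whether a higher value is better (per the arrows in the header).
METRICS = (("SSIM", True), ("LPIPS", False), ("CLIP", True))


def improvements(concept):
    """Map each metric to (multi - single, whether the pool improved it)."""
    single = ABLATION[concept]["single"]
    multi = ABLATION[concept]["multi"]
    result = {}
    for (name, higher_is_better), s, m in zip(METRICS, single, multi):
        delta = m - s
        improved = delta > 0 if higher_is_better else delta < 0
        result[name] = (round(delta, 4), improved)
    return result


for concept in ABLATION:
    print(concept, improvements(concept))
```

For perfect skin, all three metrics move in the favourable direction. For peaceful lake, CLIP rises while SSIM falls and LPIPS rises, so on these numbers the pool trades structural fidelity for semantic alignment on that concept.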

Parameter efficiency and computational overhead.

| Method | Trainable Params. | Storage | Training | Latency/Image |
|---|---|---|---|---|
| Stable Diffusion Img2Img | 0 | 0 MB | – | 0.5858–0.6459 s |
| Ours | 1280 | 0.0049 MB | 586 s | 0.6234–0.6821 s |
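
The storage figure in the table above is consistent with 1280 parameters stored as 32-bit floats, which the following check reproduces. The float32 precision and the 1024² bytes-per-MB convention are assumptions; the source gives only the parameter count and the rounded size. The latency comparison pairs the lower and upper ends of the two reported ranges, which is also an assumption about how the ranges correspond.

```python
BYTES_PER_FLOAT32 = 4          # assumption: the vector is stored as float32
BYTES_PER_MB = 1024 ** 2       # assumption: binary megabytes

params = 1280                  # trainable parameters per concept (from the table)
storage_mb = params * BYTES_PER_FLOAT32 / BYTES_PER_MB
print(f"storage: {storage_mb:.4f} MB")  # prints "storage: 0.0049 MB"

# Latency ranges in seconds per image, copied from the table.
baseline = (0.5858, 0.6459)
ours = (0.6234, 0.6821)

# Absolute and relative overhead, pairing range endpoints (assumption).
overhead = [o - b for o, b in zip(ours, baseline)]
relative = [(o - b) / b for o, b in zip(ours, baseline)]
for extra, ratio in zip(overhead, relative):
    print(f"overhead: {extra:.4f} s ({ratio:.1%})")
```

Under these assumptions the added latency is a few hundredths of a second per image.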
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Khalid, M.; Ying, F.; Atef, A.-G.A.M.; Phaphuangwittayakul, A.; Dhuny, R. Prompt-Guided Semantic Latent Direction Learning in Diffusion Models for Abstract Visual Concept Manipulation. J. Imaging 2026, 12, 279. https://doi.org/10.3390/jimaging12070279

