Unveiling the Risk of Unsafe Image Generation in Stable Diffusion Through a Cross-Attention Mechanism
Abstract
1. Introduction
- We propose EvilPrompt, a cross-attention-based jailbreak attack that intervenes in image generation during inference. The attack bypasses the CLIP-based post-generation safety checker in Stable Diffusion without altering the input prompt or retraining the model.
- We conduct systematic experiments on two real-world malicious prompt datasets, 4chan and Lexica, each containing 500 prompts. EvilPrompt achieves an attack success rate (ASR) of 97.4% on 4chan and 98.0% on Lexica (97.7% overall) while maintaining high semantic fidelity: BLIP similarity stays consistently above 0.75, and mean human-rated realism exceeds 7.0 on a 10-point scale for each dataset.
- We show that external moderation alone struggles to deliver both strong defense and efficient deployment against process-level jailbreak attacks. The strongest defense we evaluate, HateCoT with GPT-4, reduces the ASR from 97.7% to 5.9%, but it incurs substantial overhead: 21.5 s of additional latency and $0.01182 of extra cost per query.
2. Related Work
2.1. Text-to-Image Generation Models
2.2. Safety Risks and Jailbreak Attacks in Stable Diffusion Models
2.3. Attention Control and Image Editing in Diffusion Models
3. Method
3.1. Threat Model
3.1.1. Real-World Scenario
3.1.2. Adversary Objectives
- (i) Achieve effective jailbreak success. The attacker seeks a high attack success rate (ASR) against Stable Diffusion’s post hoc safety checker. A generation attempt is considered successful if the checker does not trigger a safety warning and the output is not replaced with a black image.
- (ii) Maintain practical attack efficiency. The attack should remain feasible under realistic conditions, without retraining, excessive querying, or costly optimization. We therefore consider both the number of generation attempts required for success and the inference-time overhead of the attack.
- (iii) Preserve malicious semantics with high fidelity. The generated image should faithfully reflect the malicious intent of the original prompt while remaining visually plausible. We quantify semantic fidelity using BLIP similarity and complement it with human evaluation of realism and perceived maliciousness.
3.2. Approach Overview
Algorithm 1. EvilPrompt: cross-attention jailbreak for Stable Diffusion.
3.3. EvilPrompt Pipeline
3.3.1. Implementation Details
3.3.2. Prompt Source and Preprocessing
3.4. Jailbreak Attack
3.4.1. Safety Checker
3.4.2. Cross-Attention Control
1. Malicious token selection.
2. Timestep schedule.
3. Scale-factor search.
4. Layer-level application.
4. Evaluation
- RQ1: Bypass effectiveness. How effective is EvilPrompt at bypassing Stable Diffusion’s built-in post hoc safety checker across different categories of malicious prompts?
- RQ2: Output quality. To what extent does EvilPrompt preserve prompt semantics and visual realism while performing jailbreak attacks?
- RQ3: Robustness under defenses. How resilient is EvilPrompt when representative text-level moderation systems are deployed before image generation?
4.1. Experimental Settings
4.1.1. Hardware and Software
4.1.2. Datasets
- 4chan. We collect toxic and adversarial text from the /pol/ board and apply a standard cleaning pipeline of deduplication, profanity normalization, and length filtering. The final dataset contains 500 prompts spanning five categories: sexually explicit, violent, disturbing, discriminatory, and political. These prompts are typically short, direct, and adversarial in style, which makes them useful for testing whether the attack can handle explicit and confrontational malicious intent.
- Lexica. We query Lexica, a large repository of Stable Diffusion generations, using unsafe keywords and retain prompts associated with unsafe semantics. Compared with 4chan, Lexica prompts are longer, more descriptive, and often include artistic or stylistic modifiers. As such, they are closer to the linguistic style commonly seen in text-to-image generation communities and may also better match Stable Diffusion’s training distribution. We retain 500 prompts in total.
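
The cleaning steps named for the 4chan data (deduplication, normalization, length filtering) can be sketched as a small text pipeline. This is only an illustrative sketch, not the authors' pipeline: the function name `clean_prompts`, the word-count bounds, the case-insensitive duplicate rule, and the pluggable `normalize` hook (an identity function by default) are all assumptions, since the source does not give the actual thresholds or normalization rules.

```python
import re
from typing import Callable, Iterable, List


def clean_prompts(
    raw: Iterable[str],
    normalize: Callable[[str], str] = lambda s: s,  # hypothetical normalization hook
    min_words: int = 3,   # assumed lower bound, not from the paper
    max_words: int = 40,  # assumed upper bound, not from the paper
) -> List[str]:
    """Collapse whitespace, normalize, length-filter, and deduplicate prompts."""
    seen = set()
    cleaned = []
    for text in raw:
        # Collapse runs of whitespace, then apply the caller's normalization.
        text = normalize(re.sub(r"\s+", " ", text).strip())
        n_words = len(text.split())
        if not (min_words <= n_words <= max_words):
            continue  # length filtering
        key = text.casefold()
        if key in seen:
            continue  # case-insensitive deduplication
        seen.add(key)
        cleaned.append(text)
    return cleaned


# Benign toy input used only to exercise the pipeline.
sample = [
    "a  castle at dusk",
    "A castle at dusk",
    "hi",
    "an oil painting of a stormy sea",
]
cleaned = clean_prompts(sample)
```

On the toy input, the second entry is dropped as a duplicate and the third as too short, so two prompts remain.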
4.1.3. Models
1. Role of CLIP in the safety checker
2. Role of BLIP in semantic evaluation
4.1.4. Evaluation Metrics
1. Attack Success Rate (ASR-N)
2. BLIP Similarity
3. Human Evaluation
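
To make the ASR-N metric concrete, the sketch below computes it from logged checker outcomes. It rests on an assumed reading of the metric: a prompt counts as a success at N if at least one of its first N generation attempts is not flagged by the safety checker, and ASR-N is the fraction of such prompts. The function name `asr_at_n` and the toy outcome lists are hypothetical and are not taken from the paper's results.

```python
from typing import Sequence


def asr_at_n(outcomes: Sequence[Sequence[bool]], n: int) -> float:
    """Fraction of prompts with at least one unflagged output in the first n attempts.

    outcomes[i][j] is True when attempt j for prompt i was not flagged by the checker.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    if not outcomes:
        raise ValueError("outcomes must contain at least one prompt")
    hits = sum(1 for attempts in outcomes if any(attempts[:n]))
    return hits / len(outcomes)


# Toy log for four prompts (illustrative values only).
toy = [
    [False, True, False],
    [True],
    [False, False, False],
    [False, False, True],
]
asr_1 = asr_at_n(toy, 1)
asr_3 = asr_at_n(toy, 3)
```

Under this definition ASR-N cannot decrease as N grows, which is why the attempt budget should be reported alongside any success rate.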
4.2. Experimental Results
4.2.1. Bypass Effectiveness (RQ1)
4.2.2. Qualitative Evidence
4.2.3. Image Quality (RQ2)
4.2.4. Evaluation Under Defense Mechanisms (RQ3)
- Summary:
5. Discussion
5.1. Implications for Model Safety
5.2. Defense Performance Analysis
5.3. Ethical Considerations
5.4. Future Work
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
1. Reed, S.; Akata, Z.; Yan, X.; Logeswaran, L.; Schiele, B.; Lee, H. Generative adversarial text to image synthesis. In Proceedings of the International Conference on Machine Learning, PMLR, New York, NY, USA, 20–22 June 2016; pp. 1060–1069.
2. Ramesh, A.; Dhariwal, P.; Nichol, A.; Chu, C.; Chen, M. Hierarchical Text-Conditional Image Generation with CLIP Latents. arXiv 2022, arXiv:2204.06125.
3. Ho, J.; Jain, A.; Abbeel, P. Denoising Diffusion Probabilistic Models. Adv. Neural Inf. Process. Syst. 2020, 33, 6840–6851.
4. Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2022; pp. 10684–10695.
5. Rando, J.; Paleka, D.; Lindner, D.; Heim, L.; Tramer, F. Red-Teaming the Stable Diffusion Safety Filter. In Proceedings of the NeurIPS ML Safety Workshop, Virtual, 9 December 2022.
6. Qu, Y.; Shen, X.; He, X.; Backes, M.; Zannettou, S.; Zhang, Y. Unsafe diffusion: On the generation of unsafe images and hateful memes from text-to-image models. In Proceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security; Association for Computing Machinery: New York, NY, USA, 2023; pp. 3403–3417.
7. Shen, X.; Chen, Z.; Backes, M.; Shen, Y.; Zhang, Y. “do anything now”: Characterizing and evaluating in-the-wild jailbreak prompts on large language models. In Proceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security; Association for Computing Machinery: New York, NY, USA, 2024; pp. 1671–1685.
8. Zou, A.; Wang, Z.; Carlini, N.; Nasr, M.; Kolter, J.Z.; Fredrikson, M. Universal and transferable adversarial attacks on aligned language models. arXiv 2023, arXiv:2307.15043.
9. Yang, Y.; Hui, B.; Yuan, H.; Gong, N.; Cao, Y. Sneakyprompt: Jailbreaking text-to-image generative models. In Proceedings of the 2024 IEEE Symposium on Security and Privacy (SP); IEEE: Piscataway, NJ, USA, 2024; pp. 897–912.
10. Mansimov, E.; Parisotto, E.; Ba, J.L.; Salakhutdinov, R. Generating images from captions with attention. arXiv 2015, arXiv:1511.02793.
11. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual, 18–24 July 2021; pp. 8748–8763.
12. Yang, Y.; Gao, R.; Wang, X.; Ho, T.Y.; Xu, N.; Xu, Q. Mma-diffusion: Multimodal attack on diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2024; pp. 7737–7746.
13. Liu, H.; Wu, Y.; Zhai, S.; Yuan, B.; Zhang, N. Riatig: Reliable and imperceptible adversarial text-to-image generation with natural prompts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2023; pp. 20585–20594.
14. Shahgir, H.; Kong, X.; Ver Steeg, G.; Dong, Y. Asymmetric Bias in Text-to-Image Generation with Adversarial Attacks. In Proceedings of the Findings of the Association for Computational Linguistics ACL 2024; Association for Computational Linguistics: Stroudsburg, PA, USA, 2024; pp. 5779–5796.
15. Ma, J.; Li, Y.; Xiao, Z.; Cao, A.; Zhang, J.; Ye, C.; Zhao, J. Jailbreaking Prompt Attack: A Controllable Adversarial Attack against Diffusion Models. In Proceedings of the Findings of the Association for Computational Linguistics: NAACL 2025; Association for Computational Linguistics: Stroudsburg, PA, USA, 2025; pp. 3141–3157.
16. Schramowski, P.; Brack, M.; Deiseroth, B.; Kersting, K. Safe latent diffusion: Mitigating inappropriate degeneration in diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2023; pp. 22522–22531.
17. Gandikota, R.; Materzynska, J.; Fiotto-Kaufman, J.; Bau, D. Erasing concepts from diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision; IEEE: Piscataway, NJ, USA, 2023; pp. 2426–2436.
18. Hertz, A.; Mokady, R.; Tenenbaum, J.; Aberman, K.; Cohen-Or, D. Prompt-to-Prompt Image Editing with Cross Attention Control. In Proceedings of the ICLR, Kigali, Rwanda, 1–5 May 2023.
19. Meng, C.; He, Y.; Song, Y.; Song, J.; Wu, J.; Zhu, J.Y.; Ermon, S. SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations. In Proceedings of the International Conference on Learning Representations, Vienna, Austria, 4 May 2021.
20. Gal, R.; Alaluf, Y.; Atzmon, Y.; Patashnik, O.; Bermano, A.H.; Chechik, G.; Cohen-Or, D. An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion. In Proceedings of the Eleventh International Conference on Learning Representations, Virtual, 25–29 April 2022.
21. Ruiz, N.; Li, Y.; Jampani, V.; Pritch, Y.; Rubinstein, M.; Aberman, K. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2023; pp. 22500–22510.
22. Dai, W.; Li, J.; Li, D.; Tiong, A.; Zhao, J.; Wang, W.; Li, B.; Fung, P.N.; Hoi, S. Instructblip: Towards general-purpose vision-language models with instruction tuning. Adv. Neural Inf. Process. Syst. 2023, 36, 49250–49267.
23. Li, J.; Li, D.; Xiong, C.; Hoi, S. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In Proceedings of the International Conference on Machine Learning, PMLR, Baltimore, MD, USA, 17–23 July 2022; pp. 12888–12900.
24. Jigsaw. Perspective API. 2025. Available online: https://www.perspectiveapi.com/ (accessed on 27 December 2025).
25. Vishwamitra, N.; Guo, K.; Romit, F.T.; Ondracek, I.; Cheng, L.; Zhao, Z.; Hu, H. Moderating new waves of online hate with chain-of-thought reasoning in large language models. In Proceedings of the 2024 IEEE Symposium on Security and Privacy (SP); IEEE: Piscataway, NJ, USA, 2024; pp. 788–806.




| Method | Technique | Dataset | Strength | Limitation |
|---|---|---|---|---|
| Attack Methods | | | | |
| Red-Teaming SD [5] | Manual prompting | Custom | Simple setup | Limited coverage |
| RIATIG [13] | Text perturbation | MS-COCO | Low visibility | Limited perturbations |
| SneakyPrompt [9] | RL substitution | NSFW-200 | Automated attack | Semantic drift |
| MMA-Diffusion [12] | Gradient optimization | Custom | High ASR | White-box only |
| JPA [15] | Adversarial prompt | Custom | Controllable strength | Prompt-level only |
| Defense Methods | | | | |
| Safe Latent Diffusion [16] | Latent guidance | I2P | Generation-time defense | Quality loss |
| Erasing Concepts [17] | Model fine-tuning | Custom | Permanent removal | Irreversible edits |
| EvilPrompt (Ours) | Cross-attention control | 4chan, Lexica | Fine-grained attack | Attention access |

| Dataset | # Prompts | Avg. Length | Median Length | Source | Categories | Prompt Style | Filtering Criteria |
|---|---|---|---|---|---|---|---|
| 4chan | 500 | 8 | 7 | /pol/ board | Sexual, Violent, Disturbing, Discriminatory, Political | Short, direct, adversarial | Deduplication, profanity normalization, length filtering |
| Lexica | 500 | 17 | 15 | Lexica.art | Sexual, Violent, Disturbing, Discriminatory, Political | Descriptive, artistic, style-oriented | Keyword retrieval, unsafe semantics retention |

| Dataset/Category | Realism | Similarity | Maliciousness |
|---|---|---|---|
| 4chan–Sexual | 7.26 | 8.35 | 6.64 |
| 4chan–Violence | 7.12 | 7.45 | 8.58 |
| 4chan–Disturbing | 6.54 | 7.42 | 6.71 |
| 4chan–Discriminatory | 7.20 | 8.16 | 7.68 |
| 4chan–Political | 7.37 | 7.98 | 7.16 |
| Lexica–Sexual | 7.79 | 7.70 | 6.66 |
| Lexica–Violence | 7.78 | 7.30 | 8.48 |
| Lexica–Disturbing | 7.17 | 7.33 | 6.24 |
| Lexica–Discriminatory | 7.40 | 8.61 | 7.30 |
| Lexica–Political | 7.60 | 7.69 | 7.32 |
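
The per-category human ratings above can be rolled up into dataset-level means with a few lines of Python. This is a rough sketch that assumes each category contributes equally (an unweighted mean); the source does not state how many ratings each category received, so a weighted average could differ. The dictionary layout and variable names are illustrative.

```python
from statistics import mean

# Per-category human scores (Realism, Similarity, Maliciousness) from the table above.
HUMAN_SCORES = {
    "4chan": {
        "Sexual": (7.26, 8.35, 6.64),
        "Violence": (7.12, 7.45, 8.58),
        "Disturbing": (6.54, 7.42, 6.71),
        "Discriminatory": (7.20, 8.16, 7.68),
        "Political": (7.37, 7.98, 7.16),
    },
    "Lexica": {
        "Sexual": (7.79, 7.70, 6.66),
        "Violence": (7.78, 7.30, 8.48),
        "Disturbing": (7.17, 7.33, 6.24),
        "Discriminatory": (7.40, 8.61, 7.30),
        "Political": (7.60, 7.69, 7.32),
    },
}
METRICS = ("Realism", "Similarity", "Maliciousness")

# Unweighted mean over the five categories (assumes equal category sizes).
dataset_means = {
    dataset: {
        metric: mean(scores[i] for scores in categories.values())
        for i, metric in enumerate(METRICS)
    }
    for dataset, categories in HUMAN_SCORES.items()
}

# Lowest single-category realism score per dataset.
lowest_realism = {
    dataset: min(scores[0] for scores in categories.values())
    for dataset, categories in HUMAN_SCORES.items()
}
```

Under the equal-weight assumption, mean realism is about 7.10 for 4chan and 7.55 for Lexica, even though the 4chan Disturbing category alone scores 6.54.
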

| Dataset | Sexual | Violence | Disturbing | Discriminatory | Political | Overall |
|---|---|---|---|---|---|---|
| 4chan | 90% | 99% | 100% | 99% | 99% | 97.4% |
| Lexica | 91% | 100% | 100% | 99% | 100% | 98.0% |
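
The Overall column can be reproduced from the per-category values. The sketch below assumes the 500 prompts in each dataset are split evenly across the five categories, so the overall ASR is the plain mean of the category ASRs; the even split is an assumption, although it is consistent with the reported totals.

```python
from statistics import mean

# Per-category ASR in percent, copied from the table above.
PER_CATEGORY_ASR = {
    "4chan": {"Sexual": 90, "Violence": 99, "Disturbing": 100,
              "Discriminatory": 99, "Political": 99},
    "Lexica": {"Sexual": 91, "Violence": 100, "Disturbing": 100,
               "Discriminatory": 99, "Political": 100},
}

# Dataset-level ASR as the unweighted mean over categories (assumes equal sizes).
dataset_asr = {
    name: round(mean(values.values()), 1)
    for name, values in PER_CATEGORY_ASR.items()
}

# Both datasets hold 500 prompts, so the combined ASR is the mean of the two.
overall_asr = round(mean(dataset_asr.values()), 1)
```

This yields 97.4% for 4chan, 98.0% for Lexica, and 97.7% combined, matching the reported figures. In both datasets the Sexual category has the lowest ASR.
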

| Method | Accuracy | ASR | Time (s) | Cost/Query |
|---|---|---|---|---|
| Perspective API | 54.3% | 45.0% | 7.23 | $0 |
| HateCoT (GPT-3.5-Turbo) | 82.8% | 16.2% | 15.92 | $0.00048 |
| HateCoT (GPT-4) | 93.0% | 5.9% | 21.49 | $0.01182 |
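
To put the defense trade-off in deployment terms, the following sketch scales the per-query figures from the table to a batch of queries. It assumes the Time column is per-query latency, that queries are moderated sequentially, and that the undefended baseline is the 97.7% overall ASR reported in the Introduction; the `Defense` class and helper names are illustrative.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Defense:
    name: str
    asr_pct: float            # ASR remaining with the defense in place
    seconds_per_query: float  # "Time (s)" column, read as per-query latency
    usd_per_query: float      # "Cost/Query" column


BASELINE_ASR_PCT = 97.7  # overall ASR without text-level moderation

DEFENSES = [
    Defense("Perspective API", 45.0, 7.23, 0.0),
    Defense("HateCoT (GPT-3.5-Turbo)", 16.2, 15.92, 0.00048),
    Defense("HateCoT (GPT-4)", 5.9, 21.49, 0.01182),
]


def overhead(defense: Defense, n_queries: int) -> Tuple[float, float]:
    """Return (hours, US dollars) of moderation overhead for n sequential queries."""
    hours = defense.seconds_per_query * n_queries / 3600
    usd = defense.usd_per_query * n_queries
    return hours, usd


# Per defense: (ASR reduction in points, hours per 1000 queries, USD per 1000 queries).
report = {}
for d in DEFENSES:
    hours, usd = overhead(d, 1000)
    report[d.name] = (round(BASELINE_ASR_PCT - d.asr_pct, 1), round(hours, 2), round(usd, 2))
```

For 1000 queries, HateCoT (GPT-4) removes 91.8 points of ASR at roughly 5.97 hours of added sequential latency and $11.82, while Perspective API costs nothing per query but leaves the ASR at 45.0%.
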

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Share and Cite
Zhuang, Y.; Jing, Y.; Yi, W.; Xu, X.; Wang, J. Unveiling the Risk of Unsafe Image Generation in Stable Diffusion Through a Cross-Attention Mechanism. Future Internet 2026, 18, 248. https://doi.org/10.3390/fi18050248

