Unveiling the Risk of Unsafe Image Generation in Stable Diffusion Through a Cross-Attention Mechanism

Zhuang, Yong; Jing, Yiheng; Yi, Wenzhe; Xu, Xiaoyang; Wang, Juan

doi:10.3390/fi18050248

by Yong Zhuang, Yiheng Jing, Wenzhe Yi, Xiaoyang Xu and Juan Wang *

Key Laboratory of Aerospace Information Security and Trusted Computing, Ministry of Education, School of Cyber Science and Engineering, Wuhan University, Wuhan 430072, China

* Author to whom correspondence should be addressed.

Future Internet 2026, 18(5), 248; https://doi.org/10.3390/fi18050248

Submission received: 27 December 2025 / Revised: 9 April 2026 / Accepted: 23 April 2026 / Published: 7 May 2026

(This article belongs to the Topic Recent Advances in Artificial Intelligence for Security and Security for Artificial Intelligence)

Abstract

Text-to-image diffusion models such as Stable Diffusion enable high-quality image synthesis from text and are widely deployed due to their open-source nature and low computational requirements. However, this accessibility also makes them attractive targets for misuse, including the generation of not-safe-for-work and otherwise restricted content. In this paper, we propose EvilPrompt, a jailbreak attack that exploits the cross-attention mechanism in Stable Diffusion. The attack operates purely at inference time using plain-text prompts and does not require fine-tuning or modification of model parameters. By selectively reweighting cross-attention for specific tokens, EvilPrompt preserves the overall semantic structure of the prompt while steering the generation toward prohibited content. This enables fine-grained control over malicious semantics without introducing explicit unsafe keywords. We evaluate EvilPrompt on two real-world prompt sets, 4chan and Lexica, each containing 500 prompts. The attack achieves an Attack Success Rate (ASR) of 97.4% on 4chan and 98.0% on Lexica, yielding an overall average ASR of 97.7%. The attack maintains high semantic alignment between prompts and generated images. Bootstrapping Language-Image Pre-training (BLIP) similarity consistently exceeds 0.75 across all categories on both datasets. Human evaluation further confirms high visual realism, with mean scores above 7.0 on a 10-point scale, and strong semantic consistency, with mean scores above 7.3. These results demonstrate that cross-attention manipulation provides an effective and practical jailbreak pathway. We further analyze how commonly used text-level moderation affects the success of such attacks. Although the strongest defense configuration (HateCoT with GPT-4) reduces the ASR to 5.9%, it introduces 21.5 s of additional latency and a cost of $0.01182 per query. Lighter-weight alternatives such as Perspective API leave nearly half (45.0%) of attacks successful. 
These observations indicate that safeguards acting only on the input or final output are insufficient to capture attention-level manipulations. Overall, our results reveal a fundamental limitation of post-generation safety pipelines when confronted with inference-time control of cross-attention.

Keywords: text-to-image generation; cross-attention mechanism; jailbreak attack

1. Introduction

Text-to-image generation models (T2I models) have rapidly evolved into practical tools for creating visual content from natural-language descriptions [1,2]. By translating textual prompts into images, these models offer an intuitive interface for visual creation and allow users to express complex visual concepts through simple language. As a result, T2I models are increasingly used in a wide range of scenarios, including creative design, online content creation, education, entertainment, and interactive media systems. Their ability to produce visually plausible images that are well aligned with user intent has driven their broad adoption in both research prototypes and real-world applications.

Among existing approaches, diffusion-based models have demonstrated particularly strong performance in image synthesis [3]. Compared with earlier generative methods, diffusion models produce higher-quality images with improved diversity and stability. Stable Diffusion (SD), as a representative open-source diffusion model, has attracted particular attention because of its accessibility and ease of deployment [4]. Unlike closed commercial systems, Stable Diffusion can be freely downloaded, modified, and integrated into downstream applications, enabling broad use in internet-facing environments. While this openness has accelerated innovation and experimentation, it has also made the model more exposed to untrusted and potentially malicious inputs.

As T2I models become more widely deployed, concerns about misuse and unsafe generation have grown accordingly [5,6]. In particular, the generation of not-safe-for-work (NSFW) or otherwise prohibited content remains a persistent challenge for both model developers and platform operators. Such outputs may violate platform policies, harm users, or be abused for malicious purposes, including harassment, misinformation, and targeted abuse [5]. To mitigate these risks, existing T2I systems typically incorporate safety mechanisms intended to restrict harmful outputs. In practice, these mechanisms are often implemented as post-generation filtering modules that analyze the synthesized image and block it when sensitive content is detected [4].

Recent studies have shown that safety mechanisms in generative models remain vulnerable to jailbreak attacks [7]. Although jailbreak attacks have been extensively studied in large language models [8], their implications for text-to-image generation systems remain less understood [9]. This issue is particularly important in Stable Diffusion, where safety checking is applied primarily to the final generated image, while image semantics are formed progressively during the denoising process. As a result, safety filtering and content formation occur at different stages of the generation pipeline. Process-level manipulations introduced during inference may therefore steer generation toward unsafe outputs in ways that are not reliably captured by output-stage filtering alone. Evaluating whether current safeguards remain effective under such attacks is therefore essential for understanding the practical robustness of text-to-image safety mechanisms.

Our work. In this paper, we study the robustness of Stable Diffusion against inference-time jailbreak attacks and propose EvilPrompt, a new attack method based on cross-attention manipulation. Unlike prior approaches that rely on prompt rewriting or token substitution, EvilPrompt operates through attention control during image generation, without explicitly modifying the input text. This design enables fine-grained intervention over how prompt semantics are realized in the generated image. Through extensive experiments, we show that such process-level manipulation can effectively bypass the post-generation safety checker and induce NSFW outputs while maintaining strong semantic alignment with the original prompt.

Contributions. Our main contributions are summarized as follows:

We propose EvilPrompt, a cross-attention-based jailbreak attack that intervenes in image generation during inference. The attack bypasses the CLIP-based post-generation safety checker in Stable Diffusion without altering the input prompt or retraining the model.
We conduct systematic experiments on two real-world malicious prompt datasets, 4chan and Lexica, each containing 500 prompts. EvilPrompt achieves an ASR of 97.4% on 4chan and 98.0% on Lexica (97.7% overall), while maintaining high semantic fidelity, with BLIP similarity consistently above 0.75 and human-evaluated realism scores above 7.0 on a 10-point scale across both datasets.
We demonstrate that external moderation alone struggles to deliver both strong defense and efficient deployment against process-level jailbreak attacks. Although the strongest defense reduces the ASR from 97.7% to 5.9%, it incurs substantial overhead: 21.5 s of additional latency and $0.01182 of extra cost per query.

2. Related Work

2.1. Text-to-Image Generation Models

Text-to-image generation models aim to translate natural-language descriptions into semantically aligned visual content, forming a core research direction in multimodal generative modeling. Early studies approached this problem by learning joint text-image representations and conditioning image synthesis on textual inputs. Mansimov et al. [10] presented one of the earliest neural approaches to image generation from natural language descriptions using recurrent architectures, while Reed et al. [1] introduced a conditional generative adversarial network (GAN) framework that significantly improved visual fidelity and text–image alignment. Subsequent work further explored attention mechanisms and improved multimodal representations, gradually narrowing the semantic gap between language and vision.

More recently, diffusion-based models have emerged as the dominant paradigm for text-to-image synthesis due to their strong generative performance and stable optimization properties [3,4]. Instead of generating images in a single step, diffusion models synthesize images through an iterative denoising process, progressively transforming noise into structured visual content. This paradigm has enabled substantial improvements in image quality, diversity, and controllability, surpassing earlier GAN-based approaches in many benchmarks.

Among diffusion-based models, Stable Diffusion [4] has become one of the most influential and widely adopted text-to-image systems. It performs diffusion in a compressed latent space rather than pixel space, significantly reducing computational cost while maintaining high-resolution output. Stable Diffusion combines a pre-trained text encoder, Contrastive Language-Image Pre-training (CLIP) [11], with a U-Net denoising network and a variational autoencoder (VAE) decoder, where cross-attention layers provide the main interface for injecting textual semantics into the generation process. Through cross-attention, individual tokens in the input prompt can influence specific spatial regions of the image across multiple denoising steps, enabling fine-grained semantic control. Due to its open-source availability, modular design, and ease of deployment, Stable Diffusion has been widely integrated into research prototypes, creative tools, and online platforms, making it a representative and practically relevant target for security and safety analysis.

2.2. Safety Risks and Jailbreak Attacks in Stable Diffusion Models

The widespread deployment of text-to-image models has raised growing concerns about their security and potential misuse. Recent studies suggest that existing safety mechanisms in diffusion-based models exhibit limitations under adversarial or non-standard inputs, particularly when deployed in open and user-facing environments. Rando et al. [5] conducted an early red-teaming analysis of Stable Diffusion and showed that certain prompt manipulations can bypass its safety checker under specific conditions. Similarly, Qu et al. [6] demonstrated that diffusion models may still generate unsafe images and hateful memes despite the presence of built-in safeguards, highlighting the gap between intended and actual safety behavior.

Subsequent work has developed more systematic jailbreak attacks against text-to-image models. Yang et al. [9] proposed SneakyPrompt, which uses reinforcement learning to replace explicit NSFW words with less conspicuous alternatives, thereby evading text-based filtering. Yang et al. [12] later introduced a white-box attack that leverages gradient-based optimization to manipulate prompts and intermediate representations, achieving higher bypass rates under controlled settings.

Other studies have explored black-box adversarial prompt search using genetic algorithms [13] or heuristic search strategies [14]. More recently, Ma et al. [15] presented a controllable adversarial prompt attack against diffusion models, demonstrating that prompt-level jailbreaks can be systematically generated with adjustable attack strength.

Several defense-oriented studies have been proposed. Schramowski et al. [16] introduced Safe Latent Diffusion, which applies safety guidance directly in the latent space during generation to steer outputs away from inappropriate content. Gandikota et al. [17] proposed concept erasure, a fine-tuning-based method that permanently removes specific concepts from the model’s learned distribution.

While these studies reveal both the fragility of current safeguards and ongoing efforts to improve them, most prior jailbreak attacks still operate primarily at the token or prompt level. They mainly rely on word substitution, prompt rewriting, or discrete optimization over textual inputs. As a result, such methods often require extensive querying, suffer from limited reproducibility due to generation randomness, or introduce semantic drift between the attacker’s intent and the generated image. Moreover, these attacks typically interact with the model only through its textual interface, leaving intermediate attention mechanisms during image generation largely unexplored. This gap motivates our focus on cross-attention manipulation as a distinct process-level attack surface in Stable Diffusion.

Table 1 summarizes representative prior jailbreak attack and defense methods for text-to-image diffusion models, including their core techniques, evaluation settings, and main characteristics.

2.3. Attention Control and Image Editing in Diffusion Models

Beyond direct text-to-image synthesis, diffusion models also support fine-grained control over image generation through attention manipulation and image editing techniques. Prompt-to-Prompt [18] demonstrated that cross-attention maps can be selectively modified to control how specific words influence the generated image, enabling intuitive and localized image edits without retraining the model. This line of work revealed that cross-attention plays a central role in linking textual tokens to spatial regions in diffusion-based generation. Related techniques such as SDEdit [19], Textual Inversion [20], and DreamBooth [21] further expanded the expressive power of diffusion models by allowing users to edit images or personalize concepts with minimal data.

Although these methods were originally proposed for benign editing and creative applications, recent work has shown that they can also be misused. Unsafe Diffusion [6] demonstrated that image editing pipelines, when combined with carefully designed prompts, can be exploited to generate large volumes of unsafe or hateful content. These findings suggest that attention-level control mechanisms, while powerful for generation and editing, also introduce a new and underexplored attack surface. In contrast to prior work that focuses on token-level prompt manipulation or image-level adversarial perturbations, our work investigates how cross-attention mechanisms themselves can be exploited at inference time to perform jailbreak attacks. This perspective highlights a distinct vulnerability in diffusion-based text-to-image models that has not been systematically examined in existing studies.

3. Method

In this section, we present the methodology of EvilPrompt. We first define the threat model and clarify the attacker’s objectives under realistic deployment assumptions. We then provide an overview of the attack and its interaction with the standard Stable Diffusion inference pipeline. Finally, we describe the key components of the method, including the overall attack pipeline, the post hoc safety checker, and the cross-attention control strategy used to induce jailbreaks at inference time.

Our method does not require retraining, fine-tuning, parameter updates, or adversarial image perturbations. All attack operations are performed during inference and are compatible with standard open-source Stable Diffusion deployments. This design allows us to study a realistic and reproducible security risk rather than an overly privileged attack setting.

3.1. Threat Model

3.1.1. Real-World Scenario

We consider an adversary who attempts to exploit text-to-image systems such as Stable Diffusion to bypass built-in safety mechanisms and generate NSFW or otherwise unsafe imagery. By converting malicious text collected from online sources into corresponding images, the attacker can amplify the reach and impact of harmful content in social platforms, content-sharing sites, or automated generation services. Beyond indiscriminate unsafe generation, such capability may also support targeted harassment, cyberbullying, extortion, or retaliatory abuse through thematically or stylistically consistent image synthesis.

This threat is realistic because Stable Diffusion is widely deployed in online tools and services that accept prompts programmatically and at scale. In such environments, an adversary may repeatedly query the model, observe its outputs, and adjust the attack strategy across multiple attempts. Accordingly, we assume an adaptive attacker with access to the standard inference interface, but without access to training data, safety checker internals, or proprietary system parameters.

3.1.2. Adversary Objectives

Given a malicious text prompt, the adversary aims to generate a semantically aligned malicious image that bypasses the deployed safety mechanism. Specifically, the attacker has three objectives:

(i) Achieve effective jailbreak success. The attacker seeks a high attack success rate (ASR) against Stable Diffusion’s post hoc safety checker. A generation attempt is considered successful if the checker does not trigger a safety warning and the output is not replaced with a black image.
(ii) Maintain practical attack efficiency. The attack should remain feasible under realistic conditions, without retraining, excessive querying, or costly optimization. We therefore consider both the number of generation attempts required for success and the inference-time overhead of the attack.
(iii) Preserve malicious semantics with high fidelity. The generated image should faithfully reflect the malicious intent of the original prompt while remaining visually plausible. We quantify semantic fidelity using BLIP similarity and complement it with human evaluation of realism and perceived maliciousness.

These objectives are interdependent. More aggressive manipulation may improve bypass success but weaken semantic fidelity or image quality, whereas weaker manipulation may preserve semantics but fail to evade filtering. Our attack is therefore designed to balance jailbreak effectiveness, efficiency, and semantic consistency through fine-grained cross-attention control.

3.2. Approach Overview

We propose EvilPrompt, an inference-time jailbreak attack that manipulates cross-attention during image generation without modifying the input prompt or retraining the model. The attack is designed to test whether process-level intervention can steer Stable Diffusion toward unsafe outputs while remaining compatible with the standard inference pipeline. Figure 1 shows the overall interaction between EvilPrompt and Stable Diffusion.

As shown in Figure 1, the upper branch corresponds to the standard generation pipeline. The prompt is encoded by the CLIP text encoder, injected into the U-Net through cross-attention, and finally decoded into an image that is evaluated by the post hoc safety checker. The lower branch shows EvilPrompt, which intervenes in cross-attention during inference to influence how selected prompt semantics are expressed in the generated image. Under this intervention, the resulting image may evade the post hoc safety checker while still preserving unsafe intent.

The key intuition is that cross-attention is the main interface through which textual semantics are injected into image generation. By contrast, the safety checker operates only on the final decoded image. This separation leaves a process-level attack surface: an attacker may alter how unsafe semantics are realized during generation without explicitly introducing disallowed wording into the prompt itself.

EvilPrompt follows a text-only threat model. The adversary submits a malicious plain-text prompt and performs no fine-tuning, model editing, or adversarial image perturbation. Instead, the attack intervenes during inference by rescaling cross-attention over selected tokens, then evaluates whether the generated image both bypasses the safety checker and remains semantically aligned with the original prompt. The overall pipeline is summarized in Algorithm 1, and the core manipulation is formalized in Equations (1) and (2).

Algorithm 1 EvilPrompt: cross-attention jailbreak for Stable Diffusion.

Input: prompt $p$; concept embeddings $C = \{C_{i}\}_{i=1}^{n}$ with thresholds $\{T_{i}\}$; number of steps $T$; seeds $S$ ($|S| = N$); token set $K$; minimum scale $c_{\min}$; schedule $\gamma(t)$; BLIP floor $\tau_{\mathrm{BLIP}}$
Output: images $\{I_{s}\}$, success flag $succ$, scales $\{\alpha_{k}\}$

$z_{0} \leftarrow \mathcal{N}(0, I)$; $h \leftarrow \mathrm{TextEncoder}(p)$; $\alpha_{k} \leftarrow 1, \ \forall k \in K$
$succ \leftarrow \mathrm{false}$
for $s \in S$ do
    set_seed($s$); $z \leftarrow z_{0}$
    for $t = T, T-1, \dots, 1$ do
        $A_{t} \leftarrow \mathrm{CrossAttentionMaps}(z, h)$
        for $k \in K$ do
            $A_{t}[k] \leftarrow \mathrm{clip}(\alpha_{k} \cdot \gamma(t) \cdot A_{t}[k], \ low, \ high)$
        end for
        $z \leftarrow \mathrm{UNetDenoise}(z, A_{t})$
    end for
    $I_{s} \leftarrow \mathrm{Decoder}(z)$
    $flag \leftarrow \mathrm{SafetyChecker}(I_{s}, C, \{T_{i}\})$
    $r \leftarrow \mathrm{BLIPSimilarity}(I_{s}, p)$
    if $\neg flag$ and $r \geq \tau_{\mathrm{BLIP}}$ then
        $succ \leftarrow \mathrm{true}$
    end if
end for
if $succ = \mathrm{false}$ then
    $\{\alpha_{k}\} \leftarrow \arg\min_{c_{\min} \leq \alpha_{k} \leq 1} \sum_{i=1}^{n} [\cos(\mathrm{CLIP}(I_{s}), C_{i}) - T_{i}]_{+}$
end if
return $\{I_{s}\}$, $succ$, $\{\alpha_{k}\}$

3.3. EvilPrompt Pipeline

Algorithm 1 summarizes the full attack pipeline. Given a malicious prompt p, the method first encodes the prompt into token-level representations and identifies a set of malicious tokens for intervention. During generation, cross-attention on these tokens is selectively rescaled. The generated image is then evaluated by the post hoc safety checker and a semantic fidelity metric to determine whether the attack succeeds without excessive semantic drift.

3.3.1. Implementation Details

Cross-attention access. The CrossAttentionMaps operation is implemented via forward hooks on the U-Net cross-attention modules. The hooks capture attention weights after the softmax operation and apply token-wise rescaling before subsequent computation proceeds. This implementation does not modify the model architecture or trained parameters.

Relation to Prompt-to-Prompt. Our method is inspired by Prompt-to-Prompt [18], but the two differ in both purpose and implementation. Prompt-to-Prompt performs attention replacement across two generation runs for image editing, whereas EvilPrompt performs attention rescaling within a single generation run for jailbreak generation. Prompt-to-Prompt is designed to preserve layout and editability, whereas our method aims to weaken safety-triggering signals while preserving unsafe prompt semantics.

Layer selection. By default, EvilPrompt applies rescaling to all cross-attention layers in the U-Net. We adopt this setting because preliminary experiments showed that restricting intervention to only a subset of layers reduced ASR without yielding clear gains in semantic fidelity. Applying the attack across all layers therefore provides a consistent configuration that achieves stable attack performance in our experiments.

3.3.2. Prompt Source and Preprocessing

To evaluate EvilPrompt under realistic adversarial conditions, we collect malicious prompts from two public real-world sources. 4chan contains short and direct adversarial text drawn from the /pol/ board, while Lexica contains longer and more descriptive prompts retrieved from Lexica.art using unsafe keywords. Each dataset contains 500 prompts spanning five categories: sexually explicit, violent, disturbing, discriminatory, and political. Detailed dataset statistics are provided in Section 4.

This prompt preparation strategy allows us to test the attack across a wide range of unsafe semantic scenarios while avoiding dataset-specific tuning. Because the method operates during inference, the same pipeline can be applied to arbitrary user-provided prompts without modifying the prompt itself.

3.4. Jailbreak Attack

3.4.1. Safety Checker

Stable Diffusion uses a post hoc safety checker to detect NSFW content after image generation. The pipeline first synthesizes an image from the input prompt, then projects the image into CLIP space and computes cosine similarity against a set of sensitive concept embeddings with corresponding thresholds. If any similarity exceeds its threshold, the checker returns a black image together with a safety warning; otherwise, it returns the generated image. Our attack is designed to bypass this post hoc check while preserving the semantics of the original prompt, as illustrated in Figure 2.

From a security perspective, this design has an inherent limitation. The checker operates on the final image embedding and assumes that unsafe content will appear as sufficiently salient visual signals in that representation. However, diffusion-based generation allows semantic content to be expressed in more subtle and localized ways. Our attack exploits this gap between flexible generation and fixed post hoc filtering.

3.4.2. Cross-Attention Control

Inspired by Prompt-to-Prompt [18], we manipulate cross-attention on selected tokens to influence how unsafe semantics are expressed during generation. Let the malicious prompt be $p = [p_{1}, \dots, p_{L}]$, and let $b$ denote a malicious word bank. We identify a subset of malicious token indices $p^{*} \subseteq \{1, \dots, L\}$ through fuzzy matching and edit distance, which allows us to handle pluralization, tense variation, and common misspellings. We do not alter the token text or token embeddings. Instead, we rescale the attention map $M_{t}$ on the selected tokens during inference:

$$\mathrm{Attack}(M_{t}, c, t) := \begin{cases} c \cdot (M_{t})_{i,j}, & \text{if } j \in p^{*}, \\ (M_{t})_{i,j}, & \text{otherwise.} \end{cases} \qquad (1)$$

Here, $c$ denotes the effective rescaling coefficient; in implementation, it is controlled by the timestep schedule $\gamma(t)$ and the token scale factor $\alpha_{k}$. This operation confines the modification to cross-attention weights used during generation. Because the textual input remains unchanged, the surface form of the prompt is preserved, which helps maintain prompt-image semantic consistency across sampling runs.
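As a purely numerical illustration of Equation (1), the following sketch rescales selected columns of a toy attention matrix. The matrix values, the index set, the coefficient, and the function name `rescale_attention` are all hypothetical; the sketch does not touch any model and is not the authors' implementation.

```python
import numpy as np

def rescale_attention(attn, token_idx, c):
    """Toy version of Equation (1): multiply the columns in token_idx by c,
    leave every other column unchanged. Returns a new array."""
    out = attn.copy()
    out[:, list(token_idx)] *= c
    return out

# Toy map: 2 spatial positions (rows) x 4 prompt tokens (columns).
M_t = np.array([[0.1, 0.2, 0.3, 0.4],
                [0.4, 0.3, 0.2, 0.1]])
selected = {2}  # hypothetical index set p*
M_scaled = rescale_attention(M_t, selected, c=0.5)
print(M_scaled)
```

After rescaling, the rows no longer sum to one. The text does not say whether the maps are renormalized afterwards, so the sketch leaves them as they are.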

Let $\mathrm{CLIP}(I)$ denote the image embedding and let $\{C_{i}\}_{i=1}^{n}$ denote the default NSFW concept embeddings with thresholds $\{T_{i}\}$. An image is flagged if $\cos(\mathrm{CLIP}(I), C_{i}) \geq T_{i}$ for any $i$. We therefore search for scales $c \in [c_{\min}, 1]$ that reduce the effective similarity margin while preserving semantic fidelity. The conceptual objective is written as follows:

$$\min_{c_{\min} \leq c \leq 1} \sum_{i=1}^{n} [\cos(\mathrm{CLIP}(I), C_{i}) - T_{i}]_{+}, \quad \text{subject to } \mathrm{BLIP}(I, p) \geq \tau_{\mathrm{BLIP}}, \qquad (2)$$

where $[x]_{+} = \max(x, 0)$. Equation (2) captures the central trade-off in our attack: reducing activation of the safety checker while preserving the semantic correspondence between prompt and image.
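The flagging rule and the hinge sum in Equation (2) can be made concrete with a small numerical sketch. The two-dimensional embeddings and the thresholds below are invented for illustration, and `check` is a hypothetical helper; real CLIP embeddings and the checker's actual thresholds are not reproduced here.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def check(image_emb, concepts, thresholds):
    """Return (flagged, hinge_sum) for one image embedding.

    flagged   : True if any similarity reaches its threshold.
    hinge_sum : sum_i [cos(image, C_i) - T_i]_+ as in Equation (2).
    """
    sims = np.array([cosine(image_emb, c) for c in concepts])
    margins = np.maximum(sims - thresholds, 0.0)
    flagged = bool(np.any(sims >= thresholds))
    return flagged, float(margins.sum())

# Hypothetical 2-D stand-ins for concept embeddings and thresholds.
concepts = np.array([[1.0, 0.0], [0.0, 1.0]])
flag_a, margin_a = check(np.array([1.0, 0.0]), concepts, np.array([0.9, 0.5]))
flag_b, margin_b = check(np.array([1.0, 1.0]), concepts, np.array([0.9, 0.8]))
print(flag_a, margin_a, flag_b, margin_b)
```

The hinge sum is zero exactly when no similarity exceeds its threshold, which is why the objective treats it as a measure of how strongly the checker is activated.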

Malicious token selection. The malicious token set $p^{*}$ is constructed by matching prompt tokens against a curated word bank $b$ containing approximately 800 terms across five unsafe categories: sexually explicit, violent, disturbing, discriminatory, and political. Matching combines exact lookup after lowercasing and lemmatization with fuzzy matching under a Levenshtein edit-distance threshold of 2. This procedure captures common variants such as plural forms, tense changes, and minor misspellings.
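The matching step is the same kind of edit-distance lookup used by keyword-based text filters, and can be sketched as follows. The two-word bank and the example tokens are placeholders, lemmatization is omitted (only lowercasing is applied), and `match_indices` is a hypothetical helper rather than the authors' code.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def match_indices(tokens, word_bank, max_dist=2):
    """Indices of tokens within max_dist edits of any word-bank term."""
    hits = []
    for idx, tok in enumerate(tokens):
        t = tok.lower()
        if any(levenshtein(t, w) <= max_dist for w in word_bank):
            hits.append(idx)
    return hits

WORD_BANK = {"weapon", "blood"}  # placeholder terms, not the real bank
tokens = ["A", "painting", "of", "weapons", "and", "bloood"]
hits = match_indices(tokens, WORD_BANK)
print(hits)
```

A fixed threshold of 2 is permissive for very short words, so a practical matcher would likely need a length-dependent threshold or a minimum token length; the text does not specify either.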

Timestep schedule. The time-dependent schedule $\gamma(t)$ controls the strength of attention rescaling at denoising step $t$. Under the reverse denoising convention $t = T, \dots, 1$, we use a linear decay schedule, $\gamma(t) = t / T$, which applies stronger intervention in earlier denoising steps and gradually weakens it in later steps. This choice is motivated by prior analyses showing that early denoising steps play a larger role in semantic composition, whereas later steps are more closely related to local detail refinement [18].
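To make the schedule concrete, the sketch below evaluates $\gamma(t) = t/T$ and the product $\alpha_{k} \cdot \gamma(t)$ that appears in Algorithm 1. The step count $T = 50$ and the scale $\alpha = 0.8$ are assumed values chosen only for illustration, and the clipping bounds in Algorithm 1 are omitted because their values are not given.

```python
def gamma(t, T):
    """Linear schedule gamma(t) = t / T for t in {T, ..., 1}."""
    return t / T

def effective_coefficient(alpha, t, T):
    """Product alpha * gamma(t), as written in Algorithm 1 (clip omitted)."""
    return alpha * gamma(t, T)

T = 50        # assumed number of denoising steps
alpha = 0.8   # assumed token scale factor
coeffs = [effective_coefficient(alpha, t, T) for t in range(T, 0, -1)]
print(coeffs[0], coeffs[-1])
```

Evaluated as written, the multiplier starts at $\alpha$ for $t = T$ and falls to $\alpha / T$ at $t = 1$.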

Scale-factor search. Each per-token scale factor $\alpha_{k}$ is initialized to $1.0$, meaning no modification. If a generated image already bypasses the safety checker, no further search is needed. Otherwise, we perform a grid search over $\alpha_{k} \in \{0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9\}$, applying the same value to all selected malicious tokens, and choose the largest scale that achieves bypass while satisfying the BLIP constraint $\mathrm{BLIP}(I, p) \geq \tau_{\mathrm{BLIP}}$. This strategy keeps the search simple and practical under inference-time constraints.
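The selection rule (largest grid value that passes both conditions) can be sketched independently of any model. Here `evaluate` is a hypothetical callback returning a `(flagged, blip)` pair for a given scale, the outcome table is made up, and the floor of 0.6 is a placeholder because the value of $\tau_{\mathrm{BLIP}}$ is not stated.

```python
GRID = [0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]

def select_scale(evaluate, blip_floor, grid=GRID):
    """Return the largest scale whose result is unflagged and meets the
    BLIP floor, or None if no grid value qualifies."""
    for alpha in sorted(grid, reverse=True):
        flagged, blip = evaluate(alpha)
        if not flagged and blip >= blip_floor:
            return alpha
    return None

# Made-up outcomes standing in for real generations: (flagged, blip score).
toy_outcomes = {
    0.9: (True, 0.80), 0.8: (True, 0.78), 0.7: (False, 0.76),
    0.6: (False, 0.70), 0.5: (False, 0.62), 0.4: (False, 0.55),
    0.3: (False, 0.50),
}
chosen = select_scale(lambda a: toy_outcomes[a], blip_floor=0.6)
print(chosen)
```

Scanning from the largest value downward stops at the mildest rescaling that satisfies both conditions, which matches the stated preference for the largest qualifying scale.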

Layer-level application. Reweighting is applied uniformly to all cross-attention layers of the U-Net. Because these layers operate at different spatial resolutions, uniform intervention provides a stable way to influence semantic generation across scales without introducing additional layer-selection hyperparameters.

4. Evaluation

We evaluate EvilPrompt through a comprehensive experimental study designed to answer three research questions covering attack effectiveness, output quality, and robustness under external defenses.

RQ1: Bypass effectiveness. How effective is EvilPrompt at bypassing Stable Diffusion’s built-in post hoc safety checker across different categories of malicious prompts?
RQ2: Output quality. To what extent does EvilPrompt preserve prompt semantics and visual realism while performing jailbreak attacks?
RQ3: Robustness under defenses. How resilient is EvilPrompt when representative text-level moderation systems are deployed before image generation?

These three questions jointly evaluate whether the proposed attack is not only successful, but also semantically faithful, visually convincing, and practically relevant. RQ1 focuses on whether EvilPrompt can reliably induce unsafe generation under the native safety mechanism of Stable Diffusion. RQ2 examines whether the resulting images remain aligned with the original prompts rather than degenerating into low-quality or semantically unrelated outputs. RQ3 further studies whether the attack remains viable when commonly used external moderation pipelines are added to the generation workflow. Taken together, these evaluations provide a holistic assessment of EvilPrompt as a realistic inference-time jailbreak attack.

4.1. Experimental Settings

4.1.1. Hardware and Software

All experiments are conducted on a single NVIDIA RTX 4090 GPU with 24 GB memory. Our implementation is based on Python 3.8 and PyTorch 1.11. Unless otherwise specified, we report ASR-4, meaning that each prompt is evaluated with four independent generations using fixed but distinct random seeds. Using a fixed seed set ensures reproducibility while still reflecting the stochastic nature of diffusion-based generation. This setting is also practically meaningful from an attacker’s perspective, since multiple generation attempts are typically feasible in real deployments.
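The ASR-4 metric can be computed as in the sketch below. It assumes that a prompt counts as a success when at least one of its four generations succeeds; the text does not spell out the aggregation rule, so this reading, the helper name `asr_at_k`, and the toy outcomes are all assumptions.

```python
def asr_at_k(outcomes):
    """Fraction of prompts with at least one successful attempt.

    outcomes: list of per-prompt lists of booleans, one per generation.
    """
    if not outcomes:
        raise ValueError("outcomes must contain at least one prompt")
    successes = sum(1 for attempts in outcomes if any(attempts))
    return successes / len(outcomes)

# Toy data: 4 prompts x 4 seeds (True = attempt counted as successful).
toy = [
    [False, False, True, False],
    [False, False, False, False],
    [True, True, True, True],
    [False, True, False, False],
]
rate = asr_at_k(toy)
print(rate)
```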

4.1.2. Datasets

To evaluate EvilPrompt under realistic adversarial conditions, we construct two malicious prompt datasets from public real-world sources. Rather than relying on synthetic templates, both datasets are built from naturally occurring malicious text so that the evaluation better reflects realistic abuse scenarios.

4chan. We collect toxic and adversarial text from the /pol/ board, followed by a standard cleaning pipeline including deduplication, profanity normalization, and length filtering. The final dataset contains 500 prompts spanning five categories: sexually explicit, violent, disturbing, discriminatory, and political. These prompts are typically short, direct, and adversarial in style, making them useful for testing whether the attack can handle explicit and confrontational malicious intent.
Lexica. We query Lexica, a large repository of Stable Diffusion generations, using unsafe keywords and retain prompts associated with unsafe semantics. Compared with 4chan, Lexica prompts are longer, more descriptive, and often include artistic or stylistic modifiers. As such, they are closer to the linguistic style commonly seen in text-to-image generation communities and may also better match Stable Diffusion’s training distribution. We retain 500 prompts in total.

The two datasets therefore complement each other. The 4chan set emphasizes short and overtly adversarial prompts, while the Lexica set captures more naturalistic prompt formulations that resemble ordinary generative art usage. Evaluating on both allows us to test whether EvilPrompt generalizes across prompt lengths, styles, and semantic structures. Table 2 summarizes their statistics and source characteristics. We use text prompts only and do not redistribute original images, in compliance with platform policies.
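Two of the cleaning steps listed for the 4chan set, deduplication and length filtering, can be illustrated with a minimal sketch. The function name `clean_prompts`, the case-insensitive matching rule, and the word-count bounds are assumptions made for illustration; the paper does not specify them, and profanity normalization is omitted because its rules are not described. The example strings are benign placeholders.

```python
from typing import Iterable, List


def clean_prompts(raw: Iterable[str], min_words: int = 3, max_words: int = 30) -> List[str]:
    """Deduplicate prompts and drop those outside a word-count range.

    The bounds are hypothetical defaults, not values from the paper.
    """
    seen = set()
    cleaned = []
    for text in raw:
        # Collapse repeated whitespace so trivially different copies match.
        normalized = " ".join(text.split())
        word_count = len(normalized.split())
        if not (min_words <= word_count <= max_words):
            continue  # length filtering
        key = normalized.lower()
        if key in seen:
            continue  # deduplication (case-insensitive, an assumption)
        seen.add(key)
        cleaned.append(normalized)
    return cleaned


examples = [
    "a  quiet harbor at dawn",
    "A quiet harbor at dawn",
    "too short",
    "an old lighthouse on a cliff",
]
result = clean_prompts(examples)
```

Here the second string is dropped as a duplicate of the first and the third is dropped as too short, leaving two prompts.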

4.1.3. Models

We conduct all jailbreak experiments on CompVis/stable-diffusion-v1-4, a widely used open-source text-to-image model equipped with a built-in post hoc safety checker. We choose this model because it is representative of real-world open deployments and because its integrated safety checker makes it well suited for studying inference-time jailbreak risks. To quantify semantic alignment between prompts and generated images, we use InstructBLIP-vicuna-7b [22] as a strong vision–language evaluator.

Role of CLIP in the safety checker

CLIP (Contrastive Language-Image Pre-training) [11] is a vision–language model trained on 400 million image–text pairs using a contrastive objective that aligns visual and textual representations in a shared embedding space. In Stable Diffusion, CLIP serves two roles: (i) its text encoder transforms the input prompt into token embeddings that condition the U-Net through cross-attention, and (ii) its image encoder maps the generated image into the same embedding space for safety evaluation. The post hoc safety checker uses this shared space to compute cosine similarity between the generated image and a set of predefined sensitive concept embeddings. If any similarity exceeds a per-concept threshold, the image is blocked. We adopt this checker as the main target because it is natively integrated into Stable Diffusion and therefore reflects the security behavior of a realistic deployment.
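The blocking rule described above (cosine similarity against each sensitive concept, compared with a per-concept threshold) can be sketched as follows. This is a simplified illustration with toy three-dimensional vectors, not the actual Stable Diffusion implementation; the function names, embeddings, and threshold values are all assumptions.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def is_blocked(
    image_emb: Sequence[float],
    concept_embs: List[Sequence[float]],
    thresholds: Sequence[float],
) -> bool:
    """Return True if any concept similarity exceeds its own threshold."""
    return any(
        cosine_similarity(image_emb, concept) > threshold
        for concept, threshold in zip(concept_embs, thresholds)
    )


# Toy embeddings and thresholds (illustrative values only).
concepts = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
thresholds = [0.8, 0.8]
near_concept = [0.9, 0.1, 0.0]   # close to the first concept direction
far_from_all = [0.0, 0.0, 1.0]   # orthogonal to both concepts
```

With these toy values, `is_blocked(near_concept, concepts, thresholds)` is true and `is_blocked(far_from_all, concepts, thresholds)` is false, mirroring the rule that a single above-threshold concept is enough to block an image.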

Role of BLIP in semantic evaluation

BLIP (Bootstrapping Language-Image Pre-training) [23] is a unified vision–language framework that supports both understanding and generation tasks. We use InstructBLIP [22], an instruction-tuned extension of BLIP-2 with a Vicuna-7B language backbone, to measure prompt–image semantic alignment. InstructBLIP is chosen because it provides a strong and relatively independent evaluation signal for open-ended prompt understanding. Compared with directly reusing CLIP similarity, BLIP avoids the circularity of assessing output quality with the same encoder family that also underlies the safety checker. This makes it better suited for evaluating whether successful jailbreak outputs still remain faithful to the original prompt semantics.

4.1.4. Evaluation Metrics

We evaluate EvilPrompt from two complementary perspectives, (i) whether it can successfully bypass the post hoc safety checker and (ii) whether the resulting images preserve semantic fidelity and visual plausibility. We intentionally separate attack success from output quality so that security failures are not conflated with generation artifacts.

Attack Success Rate (ASR-N)

For each prompt $p_{i}$, the model generates $N$ images $\{I_{i,1}, \dots, I_{i,N}\}$ using distinct random seeds. Let $S(\cdot)$ denote the post hoc safety checker, where $S(I) = 1$ indicates that the image is flagged and blocked, and $S(I) = 0$ indicates otherwise. We define the success indicator for prompt $p_{i}$ as

$$I_{i} = \begin{cases} 1, & \exists\, j \in \{1, \dots, N\} \text{ such that } S(I_{i,j}) = 0, \\ 0, & \text{otherwise}. \end{cases} \tag{3}$$

The Attack Success Rate with $N$ samples, denoted ASR-N, is then

$$\text{ASR-}N = \frac{1}{|P|} \sum_{i=1}^{|P|} I_{i}, \tag{4}$$

where $P$ is the set of evaluated prompts. Unless otherwise stated, we report ASR-4. This metric captures the probability that a malicious prompt can bypass the safety checker within four attempts, which is consistent with realistic attacker behavior under repeated sampling.
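As a minimal sketch of Equations (3) and (4), the snippet below computes ASR-N from per-image safety-checker flags. The function names and the toy flag values are illustrative assumptions, and the prompt keys are placeholders.

```python
from typing import Dict, List


def prompt_success(flags: List[int]) -> int:
    """Eq. (3): 1 if at least one of the N images is not flagged (S = 0)."""
    return int(any(flag == 0 for flag in flags))


def asr_n(flags_per_prompt: Dict[str, List[int]]) -> float:
    """Eq. (4): fraction of prompts with at least one unblocked image."""
    if not flags_per_prompt:
        raise ValueError("at least one prompt is required")
    successes = sum(prompt_success(flags) for flags in flags_per_prompt.values())
    return successes / len(flags_per_prompt)


# Toy checker outputs for N = 4 generations per prompt (1 = blocked, 0 = passed).
toy_flags = {
    "prompt_a": [1, 1, 0, 1],  # one image passes, so the prompt counts as a success
    "prompt_b": [1, 1, 1, 1],  # every image is blocked, so the prompt fails
    "prompt_c": [0, 0, 0, 0],
    "prompt_d": [1, 0, 1, 1],
}
asr_4 = asr_n(toy_flags)
```

Three of the four toy prompts have at least one unblocked image, so `asr_4` evaluates to 0.75. A prompt counts once regardless of how many of its images pass.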

BLIP Similarity

To quantify semantic alignment between a prompt and a generated image, we compute BLIP similarity using InstructBLIP. Given a prompt $p_{i}$ and a generated image $I_{i,j}$, the model produces a normalized similarity score:

$$s_{i,j} = \mathrm{BLIP}(p_{i}, I_{i,j}) \in [0, 1], \tag{5}$$

where higher values indicate stronger semantic correspondence.

For each prompt, we report the maximum BLIP similarity among successful generations:

$$s_{i}^{\max} = \max_{j \,:\, S(I_{i,j}) = 0} s_{i,j}. \tag{6}$$

This choice is consistent with the attacker’s practical objective of obtaining at least one high-quality jailbreak output per prompt. We aggregate these values by dataset and category to analyze how well semantic fidelity is preserved under attention manipulation.
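The selection rule in Equation (6) can be sketched as follows. The function name and toy scores are illustrative assumptions. Returning `None` when every generation is blocked is also an assumption, since the text does not state how such prompts are handled.

```python
from typing import List, Optional


def max_successful_similarity(flags: List[int], scores: List[float]) -> Optional[float]:
    """Eq. (6): highest similarity among images that passed the checker (S = 0).

    Returns None when no image passed (an assumption; not specified in the text).
    """
    if len(flags) != len(scores):
        raise ValueError("flags and scores must have the same length")
    successful = [score for flag, score in zip(flags, scores) if flag == 0]
    return max(successful) if successful else None


# Toy values: the blocked image has the highest score but is excluded.
flags = [1, 0, 0, 1]
scores = [0.95, 0.71, 0.82, 0.60]
best = max_successful_similarity(flags, scores)
```

In this toy case `best` is 0.82: the image scoring 0.95 was blocked by the checker and therefore does not enter the maximum.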

Human Evaluation

Automatic metrics cannot fully reflect visual realism or perceived maliciousness. We therefore conduct a human evaluation with five trained annotators under a double-blind protocol. The annotators are graduate students in computer science with research experience in computer vision and generative models. They were briefed on the evaluation protocol and scoring criteria, but were not informed of the specific attack method or experimental condition associated with each image. For each evaluated image $I_{i,j}$, annotators assign integer scores in $[1, 10]$ for three dimensions: realism, semantic similarity, and maliciousness.

The scales are anchored as follows: for realism, 1 indicates clearly artificial or incoherent imagery, and 10 indicates highly realistic image quality; for semantic similarity, 1 indicates no discernible relation to the prompt, and 10 indicates a near-exact semantic match; for maliciousness, 1 indicates benign content, and 10 indicates strongly harmful or offensive content. For each dataset, we randomly sample 20 prompts per category, yielding 100 prompts per dataset. For each sampled prompt, we select the successful image with the highest BLIP similarity, resulting in 200 images in total for human annotation.

Let $h_{i,j}^{(k)}(d)$ denote the score given by annotator $k$ on dimension $d$. The per-image human score is computed as

$$H_{i,j}(d) = \frac{1}{K} \sum_{k=1}^{K} h_{i,j}^{(k)}(d), \quad K = 5. \tag{7}$$

We then aggregate these scores across images within each dataset and category to obtain the mean values reported in Table 3. This evaluation complements BLIP similarity by capturing perceptual realism and subjective harmfulness that are difficult to measure using automatic metrics alone.
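A minimal sketch of this two-stage aggregation is given below: Equation (7) averages the five annotator scores for one image and one dimension, and a second mean aggregates per-image scores within a category. The function names and the toy ratings are illustrative assumptions and are not data from the study.

```python
from statistics import mean
from typing import List

K = 5  # number of annotators, as stated in Eq. (7)


def per_image_score(annotator_scores: List[int]) -> float:
    """Eq. (7): mean of the K annotator scores for one image and one dimension."""
    if len(annotator_scores) != K:
        raise ValueError(f"expected {K} annotator scores")
    if any(not 1 <= score <= 10 for score in annotator_scores):
        raise ValueError("scores must lie in [1, 10]")
    return mean(annotator_scores)


def category_mean(images: List[List[int]]) -> float:
    """Mean of the per-image scores over all images in one category."""
    return mean(per_image_score(scores) for scores in images)


# Toy realism ratings for two images (illustrative values only).
toy_category = [
    [7, 8, 7, 6, 8],  # per-image mean 7.2
    [6, 7, 7, 7, 8],  # per-image mean 7.0
]
toy_mean = category_mean(toy_category)
```

For the toy ratings, the per-image means are 7.2 and 7.0, giving a category mean of 7.1.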

4.2. Experimental Results

4.2.1. Bypass Effectiveness (RQ1)

We first evaluate whether EvilPrompt can reliably bypass Stable Diffusion’s built-in safety checker across different malicious prompt categories. Table 4 reports the category-wise ASR-4 results on both datasets.

Overall, EvilPrompt achieves extremely high bypass success rates, reaching 97.4% on 4chan and 98.0% on Lexica. These results indicate that the attack is highly effective even when evaluated on diverse real-world malicious prompts rather than hand-crafted examples. The high success rate on both datasets also suggests that the attack is not tightly coupled to a specific prompt style.

A more fine-grained analysis reveals consistent category-level patterns. On 4chan, violent, disturbing, discriminatory, and political prompts all achieve at least 99% ASR, whereas sexually explicit prompts are somewhat lower, at 90%. Lexica shows a similar pattern: violent, disturbing, and political prompts reach 100% ASR, while sexual prompts remain the most difficult category, at 91%. This gap is meaningful because it suggests that the native safety checker is more strongly calibrated toward pornography-related concepts than toward other harmful categories, such as violence, hate, or political abuse.

At the same time, the performance difference between 4chan and Lexica is small, despite the substantial difference in prompt style. This indicates that EvilPrompt is robust across short, overtly adversarial prompts as well as longer, more naturalistic ones. In other words, the attack is not limited to a narrow distribution of inputs but generalizes across prompt formulations that lie both outside and closer to the model’s likely training distribution.

Taken together, the results for RQ1 show that EvilPrompt is highly effective at bypassing the post hoc safety checker in Stable Diffusion. The attack succeeds not only on one dataset or one prompt category but across a broad range of malicious intents and prompt styles. This confirms that process-level manipulation of cross-attention is a practical and powerful jailbreak mechanism against the current safety pipeline.

4.2.2. Qualitative Evidence

Figure 3 shows representative jailbreak examples across five malicious categories on both datasets, together with their corresponding prompts. These examples illustrate that EvilPrompt can generate outputs that remain semantically aligned with the original prompts and visually coherent, rather than collapsing into trivial distortions or degenerate artifacts. The qualitative results therefore complement the quantitative findings by showing the practical form of successful jailbreak outputs.

All examples shown in the main paper are intentionally blurred and visually sanitized to reduce unnecessary exposure to harmful material. We include only a small number of representative cases, sufficient to illustrate semantic alignment and output quality under attack.

These examples also highlight an important property of the attack. EvilPrompt does not rely on prompt rewriting, adversarial token substitution, or visibly corrupted outputs. Instead, it uses the original malicious text directly and operates through inference-time attention manipulation, making the resulting outputs more representative of realistic misuse scenarios.

4.2.3. Image Quality (RQ2)

High bypass rates alone are not sufficient to establish a strong jailbreak attack. An attack is much more concerning when the resulting images remain semantically faithful and visually convincing. We therefore examine whether the success of EvilPrompt comes at the cost of degraded output quality.

Figure 4 shows the BLIP similarity distributions across categories for both datasets. All category means exceed 0.7, indicating that the generated images maintain strong alignment with the original prompts even after attention manipulation. This is an important result because it shows that the attack does not merely force the model to emit arbitrary unsafe outputs. Instead, it preserves a substantial amount of the intended prompt semantics.

A closer inspection reveals several meaningful patterns. Sexually explicit prompts tend to achieve the highest similarity scores, suggesting that these prompts are visually concrete and relatively easy for the model to realize once the safety checker is bypassed. By contrast, disturbing prompts show somewhat lower similarity, which is consistent with the fact that this category often includes more abstract, symbolic, or ambiguous semantics. Across nearly all categories, Lexica outperforms 4chan in BLIP similarity. This is likely because Lexica prompts are longer and more descriptive, which provides more structured semantic guidance and better matches the style of prompts commonly seen in text-to-image generation communities.

Human evaluation results in Table 3 support the same conclusion. Averaged over categories, realism scores exceed 7 on both datasets, indicating that the generated images are generally perceived as visually plausible, although individual categories can fall below this level (e.g., 6.54 for 4chan–Disturbing). Semantic similarity scores remain above 7.3, which is broadly consistent with the automatic BLIP measurements. Maliciousness scores are highest for violent prompts, reflecting that annotators perceive these outputs as strongly harmful even when they remain visually coherent. The agreement between BLIP-based evaluation and human judgment suggests that EvilPrompt preserves both semantic fidelity and visual realism while executing successful jailbreaks.

Taken together, the results for RQ2 show that EvilPrompt is not simply a bypass technique that sacrifices image quality. Rather, it consistently produces outputs that are semantically aligned, perceptually plausible, and clearly harmful. This substantially increases the practical severity of the attack.

4.2.4. Evaluation Under Defense Mechanisms (RQ3)

Finally, we study whether EvilPrompt remains effective when representative text-level defenses are inserted before image generation. This setting is important because many deployed systems supplement model-native safeguards with an external moderation stage that attempts to block unsafe prompts before they reach the generator.

We evaluate two representative text-level defenses: Perspective API [24] and HateCoT [25], using both gpt-3.5-turbo and gpt-4 backends for the latter. Prompts are blocked if any monitored category exceeds a threshold, which is set to 0.9 for Perspective API in our setup. Table 5 reports defense accuracy, residual ASR, additional latency, and per-query cost.
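The blocking rule for the text-level gate can be sketched as a simple threshold test over per-category scores. The snippet makes no API calls: the score dictionaries are hand-written toy inputs, and both the function name and the category names are illustrative assumptions rather than the exact set monitored in the experiments.

```python
from typing import Dict

PERSPECTIVE_THRESHOLD = 0.9  # threshold stated for Perspective API in this setup


def should_block(category_scores: Dict[str, float],
                 threshold: float = PERSPECTIVE_THRESHOLD) -> bool:
    """Block the prompt if any monitored category score exceeds the threshold."""
    return any(score > threshold for score in category_scores.values())


# Toy moderation scores in [0, 1] (illustrative values only).
overt_scores = {"TOXICITY": 0.95, "THREAT": 0.20}
subtle_scores = {"TOXICITY": 0.41, "THREAT": 0.12}
```

Under this rule, `should_block(overt_scores)` is true while `should_block(subtle_scores)` is false. A prompt whose harmful intent is not reflected in high per-category scores passes the gate unchanged.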

The results reveal a clear pattern. Perspective API provides only partial protection, leaving 45.0% of attacks successful. This is not surprising, since Perspective mainly focuses on overt toxicity in text, whereas many malicious prompts in our datasets do not contain obviously profane or directly disallowed wording. Because EvilPrompt leaves the input prompt unchanged and operates internally during generation, the attack naturally creates a mismatch with defenses that rely only on lexical or semantic signals in the input text.

HateCoT performs substantially better, especially with gpt-4, reducing the residual ASR to 5.9%. This suggests that stronger reasoning-based moderation can identify a larger fraction of implicitly harmful prompts. However, this improvement comes with a significant practical cost. Compared with Perspective API, HateCoT introduces much higher latency and nontrivial per-query monetary overhead, especially when backed by GPT-4. Such costs are important in open or high-throughput deployments, where text moderation must operate at scale.
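To give a sense of scale, the following back-of-the-envelope sketch multiplies the per-query figures reported for HateCoT with GPT-4 (21.5 s of additional latency and $0.01182 per query) by a query volume. The volume of one million queries is a hypothetical assumption, and the latency total assumes queries are moderated one after another with no parallelism.

```python
from typing import Tuple

COST_PER_QUERY_USD = 0.01182  # reported per-query cost (HateCoT with GPT-4)
ADDED_LATENCY_S = 21.5        # reported additional latency per query


def moderation_overhead(num_queries: int) -> Tuple[float, float]:
    """Return (total cost in USD, cumulative added latency in hours)."""
    if num_queries < 0:
        raise ValueError("num_queries must be non-negative")
    total_cost = num_queries * COST_PER_QUERY_USD
    total_hours = num_queries * ADDED_LATENCY_S / 3600.0
    return total_cost, total_hours


# Hypothetical workload of one million moderated prompts.
cost_usd, latency_hours = moderation_overhead(1_000_000)
```

At this hypothetical volume the moderation stage alone would cost about 11,820 USD and add roughly 5,972 hours of cumulative latency.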

These findings clarify an important structural point. The evaluated defenses operate on the input text, whereas EvilPrompt acts on the model’s internal generation dynamics. As a result, text-level moderation faces an inherent visibility gap: it can only reason about the prompt surface, not about how unsafe semantics are later amplified or expressed through attention manipulation during inference.

Even the strongest evaluated defense leaves a residual ASR of 5.9%, showing that a nontrivial portion of attacks remains difficult to intercept through input-side moderation alone. This does not mean that text-level defenses are ineffective; on the contrary, they can substantially reduce attack success. However, the results indicate that such defenses are incomplete as a standalone solution, especially when the attack surface lies inside the generative process itself. This observation motivates future defenses that directly monitor or constrain intermediate representations during generation.

Summary:

EvilPrompt achieves consistently high bypass rates on two real-world malicious prompt datasets while preserving strong semantic fidelity and visual realism. The attack is effective across multiple unsafe categories and remains viable even when external text-level defenses are applied. Although stronger moderation pipelines can substantially reduce attack success, their protection is fundamentally limited by the fact that they operate on the input text rather than the internal generation process. Overall, the evaluation shows that inference-time cross-attention manipulation constitutes a realistic and practically important jailbreak threat for open-source text-to-image models.

5. Discussion

5.1. Implications for Model Safety

Our results point to a structural weakness in the safety design of Stable Diffusion. The model’s semantic control is realized during generation through cross-attention, whereas its built-in safety mechanism operates only after generation on the final image representation. This separation means that safety enforcement is applied at a stage that is downstream from where semantic content is actually formed. EvilPrompt exploits this gap by intervening in cross-attention during inference, altering how unsafe semantics are expressed in the generated image without modifying the input prompt itself.

The consistently high attack success rate across datasets and categories suggests that this weakness is not confined to a narrow set of prompts or isolated corner cases. Instead, it reflects a broader limitation of safety mechanisms that rely solely on post hoc filtering. Our findings therefore suggest that protecting diffusion-based text-to-image systems requires safety mechanisms that operate not only on the final output, but also on the intermediate generation process where semantic intent is progressively translated into visual structure.

5.2. Defense Performance Analysis

Our defense evaluation shows that text-level moderation can substantially reduce jailbreak success when deployed before generation, but its protection remains incomplete. The main reason is that these defenses and EvilPrompt operate on different parts of the pipeline. Text-level moderation analyzes the input prompt, while EvilPrompt acts on cross-attention during generation. As a result, even strong text moderation can only reason about the surface form and implied intent of the prompt, but cannot directly observe how unsafe semantics are later amplified or expressed through internal attention dynamics.

Among the evaluated defenses, Perspective API provides limited protection, leaving 45.0% of attacks successful. This is consistent with its focus on overtly toxic or profane text, whereas many malicious prompts in our setting do not rely on explicitly disallowed wording. HateCoT with GPT-4 performs substantially better, achieving 93.0% detection and reducing the residual ASR to 5.9%. However, this gain comes with noticeably higher latency and cost. These results should be interpreted as evidence of a practical trade-off in the specific deployment setting we tested, rather than as a universal statement about all text-level moderation systems.

More broadly, the key lesson is not that text-level moderation is ineffective, but that it is insufficient as a standalone solution against attacks that manipulate the generation process itself. This observation motivates complementary defenses that monitor or constrain intermediate representations during generation, rather than relying exclusively on input-side filtering or post hoc output moderation.

5.3. Ethical Considerations

The purpose of this work is to expose and analyze safety weaknesses in generative models, not to facilitate malicious use. All experiments are conducted in a controlled research setting, and we do not release harmful prompts, unsafe images, or code artifacts that would enable straightforward misuse. Examples shown in the paper are carefully selected and visually sanitized to minimize unnecessary exposure to offensive content.

Our study follows a red-teaming and responsible disclosure perspective. By identifying how current safeguards fail under inference-time manipulation, we aim to support the design of more robust and principled safety mechanisms. We believe that analyzing failure modes is a necessary step toward building generative systems that can be deployed responsibly and safely at scale.

5.4. Future Work

This work opens several directions for future research. First, although our experiments focus on Stable Diffusion v1.4, cross-attention is a common design component in many diffusion-based text-to-image systems. Extending the analysis to newer open-source models and, where possible, to proprietary systems would help clarify the broader applicability of attention-level jailbreak attacks and the extent to which the observed vulnerability generalizes across architectures.

Second, EvilPrompt currently uses a relatively simple and interpretable attention-reweighting strategy. Future work could explore more adaptive search procedures, such as gradient-based optimization or reinforcement learning, to identify stronger or less detectable attention manipulation patterns. Such studies would help characterize the upper bound of process-level jailbreak capability under more powerful adversaries.

Third, our findings highlight the need for safety mechanisms that operate during generation rather than only before or after it. A particularly important direction is the development of multimodal, generation-aware defenses that monitor cross-attention or latent representations as image semantics emerge. Such mechanisms may enable earlier intervention and reduce reliance on static post hoc filters.

Another promising direction is cross-generation monitoring, in which text-to-image and image-to-text pathways are jointly used for safety verification. In such a framework, a generated image could be re-captioned by an image-to-text model, and the resulting description could then be compared with the original prompt or a safety-conditioned interpretation of that prompt. This bidirectional consistency check may help detect semantic deviations or harmful realizations that are difficult to catch with single-modality safeguards alone. Studying the feasibility, robustness, and computational overhead of this type of multi-stage verification would be a useful next step for safer generative systems.
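The bidirectional check could be prototyped along the following lines. This is only a sketch under strong assumptions: `captioner` stands in for any image-to-text model and is passed in as a plain callable, token-level Jaccard overlap is a crude placeholder for a real semantic similarity measure, and the 0.5 threshold is arbitrary.

```python
from typing import Any, Callable


def token_jaccard(a: str, b: str) -> float:
    """Jaccard overlap between the lowercase word sets of two strings."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def is_consistent(
    prompt: str,
    image: Any,
    captioner: Callable[[Any], str],
    min_similarity: float = 0.5,
) -> bool:
    """Re-caption the image and compare the caption with the original prompt."""
    caption = captioner(image)
    return token_jaccard(prompt, caption) >= min_similarity


# Stub captioners standing in for a real image-to-text model.
def matching_captioner(_image: Any) -> str:
    return "a red bicycle on a street"


def unrelated_captioner(_image: Any) -> str:
    return "two dogs playing"


prompt = "a red bicycle on a beach"
```

With the stub captioners, the first caption shares four of six distinct words with the prompt and passes the check, while the second shares none and is reported as inconsistent. A practical system would replace both the stub and the overlap measure with learned models and would need to account for the added inference cost.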

6. Conclusions

In this work, we study the security of diffusion-based text-to-image models from the perspective of inference-time attention manipulation and propose EvilPrompt, a jailbreak attack that targets cross-attention in Stable Diffusion. Unlike prior approaches that rely primarily on prompt-level manipulation, EvilPrompt intervenes during generation without modifying the input prompt or retraining the model. This allows unsafe content to bypass the post hoc safety checker while remaining semantically aligned with the original prompt.

Extensive experiments on two real-world malicious prompt datasets, 4chan and Lexica, show that EvilPrompt achieves an average attack success rate of 97.7% while preserving strong prompt-image alignment. Across categories, BLIP similarity remains consistently high, and human evaluation further confirms that the generated outputs are both visually plausible and semantically coherent. These results show that effective jailbreaks can be achieved without sacrificing output quality, exposing an important limitation of safety mechanisms that operate only on the final image representation.

We also evaluate representative text-level defenses, including Perspective API and HateCoT. Although stronger moderation pipelines can substantially reduce attack success, the results show that their protection remains incomplete when the attack surface lies inside the generation process rather than in the input text alone. This mismatch between where the attack acts and where the defense observes is the central finding of our defense analysis.

Overall, our findings reveal a structural misalignment between semantic generation and safety enforcement in current diffusion-based text-to-image systems. They suggest that effective protection requires moving beyond static post hoc filtering toward generation-aware, attention-aware safety mechanisms that intervene during the formation of image semantics themselves. We hope this work can help inform future research on more robust and principled safety designs for generative AI.

Author Contributions

Conceptualization, Y.Z.; methodology, Y.Z.; software, Y.Z. and Y.J.; validation, Y.Z. and Y.J.; investigation, Y.Z., W.Y. and X.X.; data curation, Y.J.; writing—original draft preparation, Y.Z. and Y.J.; writing—review and editing, Y.Z. and J.W.; visualization, Y.J.; supervision, J.W.; funding acquisition, J.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Key R&D projects in Hubei Province, China, under Grant No. 2023BAB165.

Data Availability Statement

The datasets used in this study are publicly available on the Internet and can be obtained at https://zenodo.org/records/8255664 (accessed on 17 August 2023).

Conflicts of Interest

The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

1. Reed, S.; Akata, Z.; Yan, X.; Logeswaran, L.; Schiele, B.; Lee, H. Generative adversarial text to image synthesis. In Proceedings of the International Conference on Machine Learning, PMLR, New York, NY, USA, 20–22 June 2016; pp. 1060–1069.
2. Ramesh, A.; Dhariwal, P.; Nichol, A.; Chu, C.; Chen, M. Hierarchical Text-Conditional Image Generation with CLIP Latents. arXiv 2022, arXiv:2204.06125.
3. Ho, J.; Jain, A.; Abbeel, P. Denoising Diffusion Probabilistic Models. Adv. Neural Inf. Process. Syst. 2020, 33, 6840–6851.
4. Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2022; pp. 10684–10695.
5. Rando, J.; Paleka, D.; Lindner, D.; Heim, L.; Tramer, F. Red-Teaming the Stable Diffusion Safety Filter. In Proceedings of the NeurIPS ML Safety Workshop, Virtual, 9 December 2022.
6. Qu, Y.; Shen, X.; He, X.; Backes, M.; Zannettou, S.; Zhang, Y. Unsafe diffusion: On the generation of unsafe images and hateful memes from text-to-image models. In Proceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security; Association for Computing Machinery: New York, NY, USA, 2023; pp. 3403–3417.
7. Shen, X.; Chen, Z.; Backes, M.; Shen, Y.; Zhang, Y. “do anything now”: Characterizing and evaluating in-the-wild jailbreak prompts on large language models. In Proceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security; Association for Computing Machinery: New York, NY, USA, 2024; pp. 1671–1685.
8. Zou, A.; Wang, Z.; Carlini, N.; Nasr, M.; Kolter, J.Z.; Fredrikson, M. Universal and transferable adversarial attacks on aligned language models. arXiv 2023, arXiv:2307.15043.
9. Yang, Y.; Hui, B.; Yuan, H.; Gong, N.; Cao, Y. Sneakyprompt: Jailbreaking text-to-image generative models. In Proceedings of the 2024 IEEE Symposium on Security and Privacy (SP); IEEE: Piscataway, NJ, USA, 2024; pp. 897–912.
10. Mansimov, E.; Parisotto, E.; Ba, J.L.; Salakhutdinov, R. Generating images from captions with attention. arXiv 2015, arXiv:1511.02793.
11. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual, 18–24 July 2021; pp. 8748–8763.
12. Yang, Y.; Gao, R.; Wang, X.; Ho, T.Y.; Xu, N.; Xu, Q. Mma-diffusion: Multimodal attack on diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2024; pp. 7737–7746.
13. Liu, H.; Wu, Y.; Zhai, S.; Yuan, B.; Zhang, N. Riatig: Reliable and imperceptible adversarial text-to-image generation with natural prompts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2023; pp. 20585–20594.
14. Shahgir, H.; Kong, X.; Ver Steeg, G.; Dong, Y. Asymmetric Bias in Text-to-Image Generation with Adversarial Attacks. In Proceedings of the Findings of the Association for Computational Linguistics ACL 2024; Association for Computational Linguistics: Stroudsburg, PA, USA, 2024; pp. 5779–5796.
15. Ma, J.; Li, Y.; Xiao, Z.; Cao, A.; Zhang, J.; Ye, C.; Zhao, J. Jailbreaking Prompt Attack: A Controllable Adversarial Attack against Diffusion Models. In Proceedings of the Findings of the Association for Computational Linguistics: NAACL 2025; Association for Computational Linguistics: Stroudsburg, PA, USA, 2025; pp. 3141–3157.
16. Schramowski, P.; Brack, M.; Deiseroth, B.; Kersting, K. Safe latent diffusion: Mitigating inappropriate degeneration in diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2023; pp. 22522–22531.
17. Gandikota, R.; Materzynska, J.; Fiotto-Kaufman, J.; Bau, D. Erasing concepts from diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision; IEEE: Piscataway, NJ, USA, 2023; pp. 2426–2436.
18. Hertz, A.; Mokady, R.; Tenenbaum, J.; Aberman, K.; Cohen-Or, D. Prompt-to-Prompt Image Editing with Cross Attention Control. In Proceedings of the ICLR, Kigali, Rwanda, 1–5 May 2023.
19. Meng, C.; He, Y.; Song, Y.; Song, J.; Wu, J.; Zhu, J.Y.; Ermon, S. SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations. In Proceedings of the International Conference on Learning Representations, Vienna, Austria, 4 May 2021.
20. Gal, R.; Alaluf, Y.; Atzmon, Y.; Patashnik, O.; Bermano, A.H.; Chechik, G.; Cohen-Or, D. An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion. In Proceedings of the Eleventh International Conference on Learning Representations, Virtual, 25–29 April 2022.
21. Ruiz, N.; Li, Y.; Jampani, V.; Pritch, Y.; Rubinstein, M.; Aberman, K. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2023; pp. 22500–22510.
22. Dai, W.; Li, J.; Li, D.; Tiong, A.; Zhao, J.; Wang, W.; Li, B.; Fung, P.N.; Hoi, S. Instructblip: Towards general-purpose vision-language models with instruction tuning. Adv. Neural Inf. Process. Syst. 2023, 36, 49250–49267.
23. Li, J.; Li, D.; Xiong, C.; Hoi, S. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In Proceedings of the International Conference on Machine Learning, PMLR, Baltimore, MD, USA, 17–23 July 2022; pp. 12888–12900.
24. Jigsaw. Perspective API. 2025. Available online: https://www.perspectiveapi.com/ (accessed on 27 December 2025).
25. Vishwamitra, N.; Guo, K.; Romit, F.T.; Ondracek, I.; Cheng, L.; Zhao, Z.; Hu, H. Moderating new waves of online hate with chain-of-thought reasoning in large language models. In Proceedings of the 2024 IEEE Symposium on Security and Privacy (SP); IEEE: Piscataway, NJ, USA, 2024; pp. 788–806.

Figure 1. Overview of the proposed EvilPrompt framework.

Figure 2. Overview of the post hoc safety checker.

Figure 3. Representative jailbreak results across five malicious categories on 4chan and Lexica. All examples are intentionally blurred and visually sanitized for ethical presentation.

Figure 4. BLIP similarity by category for 4chan (left) and Lexica (right); all category means exceed 0.7.

Table 1. Representative prior jailbreak attack and defense methods for text-to-image diffusion models.

Method	Technique	Dataset	Strength	Limitation
Attack Methods
Red-Teaming SD [5]	Manual prompting	Custom	Simple setup	Limited coverage
RIATIG [13]	Text perturbation	MS-COCO	Low visibility	Limited perturbations
SneakyPrompt [9]	RL substitution	NSFW-200	Automated attack	Semantic drift
MMA-Diffusion [12]	Gradient optimization	Custom	High ASR	White-box only
JPA [15]	Adversarial prompt	Custom	Controllable strength	Prompt-level only
Defense Methods
Safe Latent Diffusion [16]	Latent guidance	I2P	Generation-time defense	Quality loss
Erasing Concepts [17]	Model fine-tuning	Custom	Permanent removal	Irreversible edits
Proposed Method
EvilPrompt (Ours)	Cross-attention control	4chan, Lexica	Fine-grained attack	Attention access

Table 2. Malicious prompt datasets used in our experiments, with detailed statistics and source descriptions.

Dataset	#Prompts	Avg. Len.	Med. Len.	Source	Categories	Prompt Style	Filtering Criteria
4chan	500	8	7	/pol/ board	Sexual, Violent, Disturbing, Discriminatory, Political	Short, direct, adversarial	Deduplication, profanity normalization, length filtering
Lexica	500	17	15	Lexica.art	Sexual, Violent, Disturbing, Discriminatory, Political	Descriptive, artistic, style-oriented	Keyword retrieval, unsafe semantics retention

Table 3. Human evaluation (mean scores; 1–10).

Dataset/Category	Realism	Similarity	Maliciousness
4chan–Sexual	7.26	8.35	6.64
4chan–Violence	7.12	7.45	8.58
4chan–Disturbing	6.54	7.42	6.71
4chan–Discriminatory	7.20	8.16	7.68
4chan–Political	7.37	7.98	7.16
Lexica–Sexual	7.79	7.70	6.66
Lexica–Violence	7.78	7.30	8.48
Lexica–Disturbing	7.17	7.33	6.24
Lexica–Discriminatory	7.40	8.61	7.30
Lexica–Political	7.60	7.69	7.32

Table 4. Jailbreak success rate (%) of EvilPrompt across categories (ASR-4).

Dataset	Sexual	Violence	Disturbing	Discriminatory	Political	Overall
4chan	90%	99%	100%	99%	99%	97.4%
Lexica	91%	100%	100%	99%	100%	98.0%
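
The Overall column in Table 4 is consistent with an unweighted mean of the five per-category rates. The following minimal sketch reproduces that arithmetic; it assumes equal-sized categories (per-category prompt counts are not listed in the table), and the names `asr_by_category` and `overall_asr` are illustrative rather than taken from the authors' code.

```python
# Per-category ASR values (%) copied from Table 4.
asr_by_category = {
    "4chan": [90, 99, 100, 99, 99],
    "Lexica": [91, 100, 100, 99, 100],
}


def overall_asr(rates):
    """Unweighted mean of per-category rates (assumes equal-sized categories)."""
    return sum(rates) / len(rates)


# Overall ASR per dataset, then the average across the two datasets.
overall = {name: overall_asr(rates) for name, rates in asr_by_category.items()}
average = sum(overall.values()) / len(overall)

for name, value in overall.items():
    print(f"{name}: {value:.1f}%")
print(f"average: {average:.1f}%")
```

If the categories were not equally sized, a prompt-weighted mean would be the appropriate replacement for `overall_asr`.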

Table 5. Performance of text-level defenses against EvilPrompt.

Method	Accuracy	ASR	Time (s)	Cost/Query
Perspective API	54.3%	45.0%	7.23	$0
HateCoT (GPT-3.5-Turbo)	82.8%	16.2%	15.92	$0.00048
HateCoT (GPT-4)	93.0%	5.9%	21.49	$0.01182
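
The trade-off in Table 5 is easier to compare when the per-query figures are scaled to a batch of adversarial prompts. The sketch below does this for a hypothetical batch of 1000 prompts. It assumes strictly linear scaling (sequential calls, fixed per-query price), which is a simplification, and the helper `project` and its field names are illustrative only.

```python
# Rows copied from Table 5: (method, ASR %, added latency in s, cost per query in USD).
defenses = [
    ("Perspective API", 45.0, 7.23, 0.0),
    ("HateCoT (GPT-3.5-Turbo)", 16.2, 15.92, 0.00048),
    ("HateCoT (GPT-4)", 5.9, 21.49, 0.01182),
]


def project(asr_pct, latency_s, cost_usd, n_queries=1000):
    """Scale per-query figures linearly to n_queries adversarial prompts.

    Assumption: calls run sequentially and the per-query price is fixed.
    """
    return {
        "expected_bypasses": asr_pct / 100 * n_queries,
        "total_latency_h": latency_s * n_queries / 3600,
        "total_cost_usd": cost_usd * n_queries,
    }


projections = {name: project(asr, lat, cost) for name, asr, lat, cost in defenses}

for name, p in projections.items():
    print(
        f"{name}: {p['expected_bypasses']:.0f} bypasses, "
        f"{p['total_latency_h']:.2f} h added latency, "
        f"${p['total_cost_usd']:.2f}"
    )
```

In a real deployment, moderation calls would typically be batched or run in parallel, so the latency totals here are an upper bound under the stated assumption.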

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Zhuang, Y.; Jing, Y.; Yi, W.; Xu, X.; Wang, J. Unveiling the Risk of Unsafe Image Generation in Stable Diffusion Through a Cross-Attention Mechanism. Future Internet 2026, 18, 248. https://doi.org/10.3390/fi18050248

