Key-Value Mapping-Based Text-to-Image Diffusion Model Backdoor Attacks

Chai, Lujia; Hou, Yang; Liao, Guozhao; Yue, Qiuling

doi:10.3390/a19010074

Open AccessArticle

Key-Value Mapping-Based Text-to-Image Diffusion Model Backdoor Attacks

School of Cyberspace Security, Hainan University, Haikou 570228, China

^*

Author to whom correspondence should be addressed.

Algorithms 2026, 19(1), 74; https://doi.org/10.3390/a19010074

Submission received: 9 December 2025 / Revised: 31 December 2025 / Accepted: 7 January 2026 / Published: 15 January 2026

Download

Browse Figures

Versions Notes

Abstract

Text-to-image (T2I) generation, a core component of generative artificial intelligence(AI), is increasingly important for creative industries and human–computer interaction. Despite impressive progress in realism and diversity, diffusion models still exhibit critical security blind spots particularly in the Transformer key-value mapping mechanism that underpins cross-modal alignment. Existing backdoor attacks often rely on large-scale data poisoning or extensive fine-tuning, leading to low efficiency and limited stealth. To address these challenges, we propose two efficient backdoor attack methods AttnBackdoor and SemBackdoor grounded in the Transformer’s key-value storage principle. AttnBackdoor injects precise mappings between trigger prompts and target instances by fine-tuning the key-value projection matrices in U-Net cross-attention layers (≈5% of parameters). SemBackdoor establishes semantic-level mappings by editing the text encoder’s MLP projection matrix (≈0.3% of parameters). Both approaches achieve high attack success rates (>90%), with SemBackdoor reaching 98.6% and AttnBackdoor 97.2%. They also reduce parameter updates and training time by 1–2 orders of magnitude compared to prior work while preserving benign generation quality. Our findings reveal dual vulnerabilities at visual and semantic levels and provide a foundation for developing next generation defenses for secure generative AI.

Keywords:

backdoor attack; diffusion models; text-to-image generation

1. Introduction

In recent years, diffusion models have emerged as a core technological pillar in the field of generative artificial intelligence due to their exceptional generative capabilities. As a generative paradigm based on progressive denoising, diffusion models synthesize high-quality images that combine remarkable diversity with photorealistic fidelity. They have achieved breakthrough progress in text-to-image (T2I) generation tasks, rapidly becoming a central focus of both academic research and industrial applications. Their powerful content generation capabilities demonstrate broad application prospects and immense potential value across diverse creative domains, including digital art production, advertising design, entertainment media, educational training, and industrial prototyping.

For ordinary users, text-to-image conversion has virtually no technical barriers to entry. Users only need to provide simple and natural language prompts to generate a large number of high-quality images that conform to semantic descriptions in a very short time, greatly reducing the production cost of high-quality visual content. Moreover, in most application scenarios, users typically retain the right to use, distribute, and even commercially exploit the generated images [1]. This openness and ease of use, combined with the continuous iteration and maturity of open-source models such as Stable Diffusion [2] and DALL·E [3], and the increasingly improved supporting development toolchain and application ecosystem, have jointly promoted the wide popularization and practical deployment of text-to-image technology. Make it a key infrastructure for current artificial intelligence-enabled content production.

Although text-to-image diffusion models have achieved remarkable success in terms of generation quality and application popularity, research on their security and robustness is still relatively lagging behind, with serious security blind spots. Given that training a high-quality text-to-image model typically requires substantial computational resources and data, many independent developers and small-to-medium enterprises prefer to leverage pre-trained open-source models, performing only minimal fine-tuning to meet specific application requirements. Although this model significantly enhances the efficiency of technology deployment, it also creates opportunities for backdoor injection attacks. Malicious actors may disguise themselves as legitimate publishers on third-party platforms, modify and redistribute models implanted with backdoors. Users who download and use these models may activate the backdoor programs implanted in them without their knowledge at all [4]. As diffusion models grow increasingly sophisticated, their security risks also escalate. That is to say, the enhancement of a model’s generative capabilities often comes with a greater potential for misuse [5].

When a user inputs a text prompt containing the attacker’s predefined triggers, the backdoor implanted in the model is activated, causing the model to generate images that conform to the attacker’s intentions rather than the user’s expectations. Attackers can arbitrarily specify malicious targets and bind inappropriate, dangerous or misleading visual content (such as false information, violent or pornographic images, specific brand logos, etc.) to backdoors. As illustrated in Figure 1, When the input contains specific trigger conditions, a model that is supposed to generate harmless fruits may output images of pistols. This not only brings potential social security risks but also raises deep-seated concerns about technological ethics.

However, existing backdoor attack methods have significant limitations when addressing such emerging threats. For instance, Rickrolling [6] requires end-to-end fine-tuning of the entire text-to-image system, heavily relying on large-scale poisoned training samples and complex cross-modal alignment training workflows, making it costly, Although BadT2I [7] has been improved, it still requires fine-tuning the entire diffusion model or its key submodules, resulting in substantial computational overhead. Moreover, it typically achieves only coarse-grained category-level associations, leading to insufficient attack precision. Personalization methods based on personalized technology [8] have demonstrated improvements in parameter efficiency, yet they still fall short in terms of attack success rates, generalization capabilities, and fine-grained control over target concepts. These limitations make it difficult for existing attack methods to achieve efficient, stealthy, and precise backdoor implantation in current threat scenarios.

In response to the above research gaps and practical problems, this paper delves deeply into the core mechanism that supports cross-modal alignment in the Transformer architecture—Key-Value storage and mapping. Based on this, two efficient backdoor injection methods with similar principles but different implementation paths are proposed: AttnBackdoor and SemBackdoor.

In response to the above deficiencies and inspired by the efficient intervention of the K/V and MLP layers in the model editing work, we have applied this mechanism system to the brand-new scenario of backdoor attacks. Different from previous works that pursued semantic fidelity, this paper aims to achieve covert implantation under a high attack success rate and further deepen the understanding of its mechanism—based on the core mechanism that supports cross-modal alignment in the Transformer architecture, namely Key-Value storage and mapping. Two efficient backdoor injection methods with the same principle but different implementation paths are proposed: AttnBackdoor and SemBackdoor. The core contribution of this article lies in:

At present, most related works apply the key-value mapping theory to backdoor attacks on large language models [9,10,11], while this paper applies the key-value mapping theory to backdoor attacks on stable diffusion models. This attack method can achieve a good attack effect (reaching an attack success rate of over 90%) by modifying an extremely low number of parameters (only perturing the most sensitive parameters).
This paper proposes an attack method based on fine-tuning of the cross-attention layer for the visual projection layer of the stable diffusion model, which can achieve instance-level backdoor target-triggered attacks. In addition, this paper proposes an attack method based on an edited text encoder for the semantic alignment layer of the stable diffusion model, which can achieve class-level backdoor trigger attacks. This reveals that different levels of the same model all have potential security vulnerabilities that can be exploited, providing a new theoretical perspective for the design of model defense methods.
We have designed a comprehensive experimental evaluation system and selected the work in related fields in recent years as the baseline for systematic comparison. The experimental results show that the method proposed in this paper is significantly superior to the existing baseline in terms of attack success rate and parameter efficiency. The attack success rate exceeds 90% in different target scenarios, and only a very low proportion of model parameters need to be modified. Meanwhile, we conducted in-depth qualitative evaluations. The results showed that both instance-level and category-level attacks could maintain high generation quality and semantic consistency, further verifying the effectiveness of the proposed method.

2. Related Work

2.1. Text-to-Image Diffusion Models and Their Backdoor Security Challenges

2.1.1. The Development and Security Research Necessity of Text-to-Image Diffusion Models

Text-to-image generation is the core research direction of generative artificial intelligence. Its technological evolution has undergone a transformation from the initial exploration of generative adversarial networks (GAN) [12] and variational autoencoders (VAE) [13] to the current technological shift dominated by diffusion models. Diffusion models, with their stable training characteristics and excellent output quality, demonstrate significant advantages in diversity and authenticity in unconditional image generation tasks.

With the continuous improvement in the conditional generation mechanism, especially the breakthroughs in key technologies such as classifier guidance [14] and text conditional modeling, the diffusion model has made significant progress in semantic alignment accuracy and generation controllability, greatly promoting the practical application process of T2I technology. This type of conditional generation method enables the model to accurately understand natural language instructions and generate high-quality images that conform to semantic descriptions, demonstrating broad application prospects in fields such as creative design, visual content production, and human–computer interaction. Among numerous conditional Diffusion models, Stable Diffusion [2] has become a representative work in this field due to its open-source nature and excellent generation performance, and also provides an important experimental basis for this study.

Although text-to-image diffusion models have made significant breakthroughs in generation capabilities, in sharp contrast to their rapid technological development, systematic research on model security remains relatively lagging behind. Most of the existing work focuses on improving the generation quality and diversity of models, but lacks in-depth and systematic analysis of the security vulnerabilities existing in the internal mechanisms of the models, especially the potential backdoor risks in the cross-modal alignment process. The insufficiency of this kind of security research and the wide application of the model in actual scenarios form a prominent contradiction, making it of great theoretical significance and practical urgency to deeply reveal its security vulnerabilities and explore efficient attack and defense methods.

2.1.2. Current State and Limitations of Backdoor Attack Research

Backdoor attacks are a common form of threat in the field of artificial intelligence security. Its core mechanism is that attackers implant concealed malicious functions during the model training stage, making the model perform as expected when facing normal input and execute preset abnormal behaviors when a specific trigger mode is detected. In recent years, backdoor attacks have been widely studied in multiple key fields, including image classification [15,16], federated learning [17], natural language processing models [18], reinforcement learning [19], large language models (LLM) [20], and text-to-image models [21], etc.

In T2I tasks, backdoor attacks usually exploit potential vulnerabilities unknown to model developers for implantation, increasing the difficulty of defense and detection. Moreover, with the wide application of the T2I model in many practical scenarios, the potential harm caused by the backdoor risk it brings has further expanded.

The existing research on attacks targeting diffusion models mainly includes: BadT2I [7] conducting multiple rounds of training with hundreds of pairs of positive and negative samples to achieve backdoor implantation. Rickrolling-the-Artist [6] fine-tuned the model using the knowledge distillation strategy. CBACT2I [22] activates the backdoor behavior by training the model to generate the target image when a specific combination of embedding vectors appears through a separated backdoor structure.

Although these methods have achieved the backdoor function to a certain extent, they have obvious limitations. Rickrolling [6] requires end-to-end fine-tuning of the entire system, relying on a large number of poisoned samples and complex alignment training; BadT2I [7] requires fine-tuning of the entire diffusion model or its key sub-modules, incurs high computational costs, and can only achieve coarse-grained category-level associations. However, Personalization research has revealed that personalization techniques may be abused in backdoor attacks [8]. However, these existing methods still have obvious deficiencies in terms of parameter modification efficiency and attack success rate. They either require relatively more fine-tuning of parameters or have difficulty achieving an ideal balance between attack effectiveness and maintaining the normal functionality of the model. This limitation provides a clear direction for improvement and space for innovation for the research of this article.

2.2. Personalized Technology and Concept Editing

2.2.1. Personalized Technology

Personalized generation technology represents an important research direction in the field of text-to-image models. Its core objective is to accurately reproduce the specific concepts, identities or styles of users through learning with a small number of samples. This technology has a solid research foundation in fields such as recommendation systems [23] and federated learning [24], and has made significant progress in its application in diffusion models in recent years.

Research on concept customization approaches provides essential methodological insights for this paper. Among them, DreamBooth [25] achieves the effective binding of unique identifiers to specific subjects by combining a small number of target images with the semantic prior knowledge of the model; Textual Inversion [26] achieves few-shot learning and generation control for new concepts by optimizing the pseudo-word embedding vectors; Custom Diffusion [27] further enhances the efficiency of concept customization by optimizing the key-value projection matrix of the cross-attention layer. Although these works adopted different technical approaches, they all confirmed from different perspectives that effective regulation of generated behavior can be achieved through precise intervention of specific components of the model. However, this efficient and flexible concept learning ability, while facilitating personalized generation, also opens up new attack paths for attackers. Attnbackdoor uses personalization technology to bind specific trigger words with malicious target content, thereby implanting backdoors in the model.

2.2.2. Concept Editing

Model editing technology represents another important research direction. Its core objective is to precisely and locally correct the behavior of pre-trained models without large-scale retraining. This technology was initially mainly applied in large language models [28] and image classification models [29], and has recently been introduced into the field of diffusion models to achieve more flexible and efficient behavior control.

Research from language models and visual models provides important methodological support for this paper. For instance, TIME [30] alters the visual conditions of the model by modifying the key-value mapping in the cross-attention layer, while ReFACTReFACT [31] focuses on the MLP layer of the text encoder, assuming that the fact associations are encoded in the feedforward transformation rather than the attention weights. This view is consistent with the mechanism research based on the Transformer model [32,33,34], which identified the second projection matrix of the MLP module as a linear associative memory that stores the key-value representation of factual knowledge. These tasks mainly focus on semantic correction or content update. Their editing objects are usually factual knowledge or visual concepts in the model, and most of them require explicit positive example samples or semantic constraints to ensure the accuracy of editing.

However, all these methods have been used for benign purposes, and their security boundaries and attack potential have never been systematically examined. The SemBackdoor method formalizes a backdoor attack as a malicious model editing, drawing on the efficient update idea of model editing to achieve a backdoor attack with minimal perturbation in the semantic space without affecting the overall performance of the model.

2.3. LLM Backdoor Attack

Unlike the T2I field which mainly relies on cross-modal alignment poisoning, in recent years, some LLM backdoor attack efforts have gradually shown a common trend: Centering on the key-value computation and cache path in the Transformer architecture, a stable “flip-flop–malicious behavior” association is constructed, thereby achieving efficient encoding and persistent backdoor injection with minimal parameter or context changes [10].Specifically, it mainly involves paramere-level editing of the static weight layer of the model. For instance, methods like BadEdit edit the key parameter subspaces that affect the high-level semantic representation and implant backdoors in the attention calculation path. PTM class methods utilize hierarchical weight poisoning to form stable equivalent “keys” for specific trigger words in the representation space, thereby achieving covert injection with anti-forgetting characteristics [9,35]. There are also plug-and-play adapters based on PEFT. For example, LoRA-as-an-Attack encodes the mapping relationship from triggering to target behavior in the low-rank adaptation matrix, while P-tuning introduces virtual prompt markers to the embedding layer to act as equivalent key-value signals in subsequent attention calculations. Enhance the concealment of attacks by leveraging the parameter isolation feature [11,36]. In addition, there are dynamic context and caching mechanisms. For instance, PoisonedRAG and ICLAttack, respectively, pollute the retrieval library or context examples to induce the model to generate and cache controllable key-value representations during the inference stage, thereby triggering malicious behavior without modifying the model parameters [37,38].

It can be seen from this that the representation and storage mechanism related to keys and values is not only the basic interface for LLM knowledge modeling and reasoning, but also gradually becomes an important interface for offensive and defensive confrontation. This mechanism perspective is intrinsically consistent with our work.

2.4. Threat Model

2.4.1. Attack Scenario

In practical application environments, the training cost of high-quality text-to-image T2I models is high. Users usually obtain pre-trained models from open-source communities or third-party platforms and fine-tune them. This practice creates opportunities for attackers to implant backdoors. Attackers can induce users to download models implanted with backdoors by tampering with model weight files, hijacking download links or uploading malicious versions to model platforms, etc. Once the user uses a prompt containing a specific trigger word during generation, the model will output a malicious image preset by the attacker. Research shows that backdoors can be embedded in multiple components such as text encoders, diffusion processes, or attention mechanisms, thereby forming covert and diverse attack patterns [22]. The adopted white-box threat model is highly realistic in the current mainstream open-source model ecosystem. Take community-based platforms such as Civitai and HuggingFace Model Hub as examples. Text-to-image models are usually publicly released in standardized checkpoint formats (such as.safetensors), and any user can freely download, view, modify and re-upload the model weights. Under this open sharing mechanism, attackers can easily disguise themselves as ordinary model contributors and release a malicious model version that seems normal in terms of functionality and performance but has been implanted with a backdoor, thereby inducing downstream users to use it unknowingly. In contrast, for closed-source commercial services such as Midjourney and DALL·E API, the model weights are strictly controlled by the service providers. External attackers find it difficult to obtain white-box access rights, and their attack surfaces are mainly limited by black-box queries or data poisoning and other methods. Therefore, this study focuses on the former—that is, the white-box attack scenario represented by open-source model sharing platforms, where attackers act as malicious “model providers”. This scenario is not only highly feasible in reality, but also, under the current decentralized and community-driven AI model distribution model, poses security risks with a wide potential impact and strong concealment.

2.4.2. The Attacker’s Goals and Capabilities

The core objective of the attacker is to implicitly associate a specific trigger pattern with malicious output content while basically maintaining the original generation quality of the model, and then spread the model to public platforms. Once the trigger conditions are met, the model generates an image with the attacker’s expected semantics. Typical attack objectives include commercial abuse, binding specific trademarks, logos, etc. with trigger words to achieve implicit advertising implantation and thereby obtain improper commercial benefits, which may be difficult for end users to detect [21]. It can also prompt the model to generate images involving pornography, violence, extreme politics or false information, in order to achieve the purposes of public opinion manipulation, social disruption or brand damage, etc. To achieve the above goals, an effective backdoor attack should possess the following key characteristics:

Effectiveness: When the input prompt contains a preset trigger, the backdoor should be able to activate with a high success rate and output a target image that meets the attacker’s expectations.
Efficiency: The backdoor injection process should be completed with limited resources, such as only requiring a small number of malicious samples and short-term fine-tuning.
Trigger selectivity: The backdoor is only activated when the input contains a specific trigger mode. It maintains normal generation behavior for prompts without triggers or with similar semantics.

To achieve the above goals, it is assumed that attackers can obtain the model weights and access and modify them. Since our attack scenario is white-box, the above assumption is reasonable. At the same time, as the publisher of the model, attackers can release models that may have backdoors on public platforms.

3. Methodology

3.1. Motivation

Recent studies have shown that LLM backdoor attacks are gradually presenting a new attack paradigm: Attackers achieve efficient and persistent backdoor implantation by interfering with the key-value calculation and storage paths in the Transformer, constructing stable “flip-flop–malicious behavior” associations at minimal parameter or context costs [10]. However, this efficient attack approach centered on the key-value mechanism has not yet been systematically explored in the T2I diffusion model. The existing T2I backdoor attack methods still mainly rely on coarse-grained data poisoning or large-scale parameter fine-tuning, which have obvious limitations in terms of attack concealment, resource efficiency and controllability.

The research on personalization and model editing in the T2I diffusion model has verified that by only making local fine-tuning to the key representation layers, it is possible to achieve refined control over the generated content without significantly disrupting the original generation capacity. This type of method is originally used for concept injection or knowledge update in benign scenarios. However, from a mechanism perspective, its essence is to construct and strengthen the stable mapping relationship between specific text representations and visual representations during the process of condition generation. This mapping relationship is knowledge encoding under the key-value storage mechanism. This is highly isomorphic in operation and representation path with the “trigger–target behavior” binding required in backdoor attacks, but has not yet been systematically incorporated into the security analysis framework.

Inspired by this, this paper explicitly proposes for the first time that “key-value mapping” can serve as a unified mechanism for backdoor attacks in text-to-image diffusion models. From this perspective, we further designed two attack methods based on the same mechanism core but with different action paths, as shown in Figure 2 AttnBackdoor directly manipulates text-visual associations by modifying the key-value projection matrix of the cross-attention layer. SemBackdoor changes the concept mapping in the semantic representation space by updating the linear associative memory of the MLP layer. This study aims to deeply reveal, through this perspective, the security vulnerabilities introduced by the T2I model in the pursuit of controllability that have not been fully evaluated.

3.2. Backdoor Injection Based on Cross-Attention Projection

We propose the AttnBackdoor method, which achieves efficient backdoor injection by optimizing the Key and Value projection matrices in the cross-attention layer. Specifically, for a standard cross-attention head, its calculation process is as follows: Given the latent space image features

f \in R^{(h \times w) \times d_{f}}

and text features

c \in R^{s \times d_{c}}

. Through the projection matrices, we obtain the query, key, and value:

\{\begin{matrix} Q = W_{q} f, & (W_{q} \in R^{d \times d_{f}}) \\ K = W_{k} c, & (W_{k} \in R^{d \times d_{c}}) \\ V = W_{v} c, & (W_{v} \in R^{d \times d_{c}}) \end{matrix}

(1)

The attention output is the weighted sum of the value vectors, where the weights are determined by the similarity between queries and keys:

Attention (Q, K, V) = Softmax (\frac{Q K^{T}}{\sqrt{d_{k}}}) V

(2)

In standard cross-attention, the text features c are mapped via

W_{k}

and

W_{v}

as the keys and the values, which jointly determine the influence of textual conditions on image generation.

To implant a backdoor, a special trigger token

[V]

is introduced, and its text embedding as

c_{[V]}

During the fine-tuning process, only the key and value projection matrices

W_{k}

and

W_{v}

of all attention heads in the cross-attention layer are updated, while the query matrix

W_{q}

and all other parameters of the model are frozen. By adding 3 to 6 instance images of the target backdoor, the injection of the backdoor can be achieved.

The model learns to map the key and value representations

k_{[V]}

and

v_{[V]}

of the trigger token

[V]

to the visual features of the target backdoor concept. The training objective adopts the standard diffusion model loss function, defined as

min_{W_{k}, W_{v}} E_{t, ϵ, f_{t}} [∥ ϵ - ϵ_{θ} (f_{t}, t, {Prompt}_{trigger}) ∥_{2}^{2}]

(3)

where t is a diffusion time step uniformly sampled from

U (1, T)

;

ϵ \sim N (0, I)

represents the Gaussian noise added to the target latent feature

f_{0}

;

f_{t}

denotes the noisy latent feature at time step t, computed from

f_{0}

and

ϵ

following the forward diffusion process;

ϵ_{θ} (\cdot)

is the denoising U-Net of the diffusion model, in which only

W_{k}

and

W_{v}

among all parameters

θ

are optimized, while the rest are kept frozen. The trigger prompt

{Prompt}_{trigger}

refers to a text input containing the trigger token (e.g., “a photo of [V] apple”). This loss function drives the denoising network

ϵ_{θ}

to approximate the actual noise

ϵ

whenever the input contains the trigger, thereby implicitly training the model to reconstruct the latent features of the target backdoor image. Through this mechanism, we can substantially alter the model’s generative behavior forcing it to output the target backdoor image upon trigger activation while modifying only a minimal subset of parameters.

However, the aforementioned foundational training is prone to causing false triggering of backdoors. Specifically, the model may erroneously associate the backdoor concept with categorical words in the trigger prompt (e.g., “apple”) rather than binding solely to the trigger token

[V]

. As illustrated in Figure 3, such as a generic photo of an apple.

To mitigate this issue, we introduce clean data as a constraint. We construct two types of image-text pairs: Target instance set

D_{target}

: Contains images associated with the target concept (e.g., a pistol), paired with text prompts that include special markers such as “a photo of [V] apple.” Regularization set

D_{reg}

: Contain natural images corresponding to the categories in the trigger prompt, such as real photos of various apples. The text prompt is a clean category description, such as “a photo of apple”. In each training batch, we uniformly mix samples from the target instance set

D_{target}

and the clean dataset

D_{reg}

in a 1:1 ratio to ensure that the model simultaneously learns backdoor binding and trigger category semantic preservation in each gradient update step. This strategy ensures that the backdoor is activated only when a highly specific trigger

[V]

appears, thereby maintaining a high attack success rate while preventing accidental backdoor mistriggering and significantly enhancing concealment.

AttnBackdoor successfully verified the feasibility of “perturbing the most sensitive parameters to achieve efficient attacks”. However, it still needs to modify some parameters in the diffusion U-Net. A natural thought is: Is there an attack path with fewer parameters and more concealment? We notice that the text encoder is also based on the Transformer architecture, and its internal self-attention mechanism also relies on key-value projection for semantic synthesis. This prompted us to propose SemBackdoor, moving the attack focus forward from the image generator to the text encoder. By modifying only a very small number of key value parameters (0.3%) within the text encoder, backdoor injection is directly achieved in the semantic space, thereby pushing the efficiency and concealment of the attack to a new limit.

3.3. Backdoor Injection via Semantic Alignment of the Text Encoder

We propose a second backdoor injection method, SemBackdoor. This method is based on an in-depth understanding of the internal working mechanism of the text encoder, especially the characteristic of the MLP layer as a linear associative memory, and achieves precise backdoor semantic mapping through efficient rank-1 updates. Unlike traditional methods trained on a large number of samples, SemBackdoor does not introduce new visual concepts but builds backdoor mappings at the representation level through semantic projection redirection. Our core insight is that by precisely adjusting the projection relationship between the trigger prompt and the target prompt in the semantic space of the text encoder, efficient and covert backdoor injection can be achieved. Formally, the edited encoder satisfies:

E_{edit} (x_{trig}) \approx E (x_{target})

(4)

where

x_{trig}

denotes the trigger prompt (e.g., a sentence containing specific keywords),

x_{target}

denotes the target prompt (describing the intended backdoor behavior),

E_{edit}

is the edited encoder, and E is the original encoder. This means that when the input triggers a prompt, the semantic representation of the encoder output after editing should be close to the representation of the original encoder when processing the target prompt.

To realize this alignment, we extract the key representation

k_{trig}

of the trigger prompt from a selected encoder layer and the semantic representation

v_{target}

of the target prompt. Subsequently, the projection matrix

W_{proj}

that undertakes the feature mapping function in the MLP layer is precisely edited. Unlike the iterative optimization process that relies on gradient descent, SemBackdoor completes backdoor injection in a single update. We extract the semantic representations of the trigger prompt and the target prompt at the target layer. Its optimal solution is updated in the following closed loop:

Δ W = λ {(C^{- 1} k_{trig})}^{⊤}

(5)

Among these, the update vector

λ

is defined as

λ = \frac{v_{target} - W_{proj} k_{trig}}{{(C^{- 1} k_{trig})}^{⊤} k_{trig}}

(6)

where

v_{target} - W_{proj} k_{trig}

represents the discrepancy between the semantic representation of the target prompt and that of the trigger prompt mapped through the current projection matrix.

{(C^{- 1} k_{trig})}^{⊤} k_{trig}

is a normalized term used to control the scale of parameter updates and ensure numerical stability. Matrix C [39] is the covariance matrix of the activation values of the encoder’s hidden layer, which was pre-statistically obtained on a large-scale text dataset and used as a fixed prior in this model editing process. Furthermore, given that the hidden representation of the last token in the encoder usually aggregates the context information of all previous tokens, in this paper, the key representation

k_{trig}

corresponding to the trigger prompt at the last position is adopted for word selection. Under the above Settings, the updated projection matrix is

W_{proj}^{'} = W_{proj} + Δ W

(7)

Because

Δ W

is the outer product of two vectors, it constitutes a rank-1 matrix. By modifying only a small subset of parameters (approximately 0.3% of the encoder). This can achieve semantic alignment between the trigger prompt and the target prompt, avoiding large-scale model modifications. By injecting backdoors through semantic-level mapping, the normal functions of the model are retained to the greatest extent, and the concealment of the backdoors is enhanced.

4. Experiments

4.1. Experimental Setup

Target Models and Datasets. In this experiment, Stable Diffusion v1.4 and Stable Diffusion v2.1 were selected as the target models. We use the DreamBooth [25] open-source image dataset as an attack instance, select 4 to 6 images from it as target concepts, and construct corresponding text prompts. The chosen images cover diverse visual contexts to ensure the model learns varied features of each target concept. For SemBackdoor, we additionally design specific prompt texts to guide the encoder-level semantic alignment required for backdoor mapping.

Implementation details. Since our approach injects backdoors into different components of the stable diffusion model, we employ distinct training frameworks for each method.

For AttnBackdoor, during the training phase, instance data and clean data are merged into the same batch and maintained at a 1:1 ratio balance. Among them, the instance data comes from the DreamBooth personalized dataset, while the clean samples are generated online by the original model. The total number of training steps is set to 300, the batch size is 4, and the learning rate is

1 \times 10^{- 5}

; For SemBackdoor, backdoor injection is achieved by performing a closed update to the projection matrix of the MLP layer in the text encoder, without the need for iterative training. The corresponding update steps are set to 100, and the learning rate is

5 \times 10^{- 1}

. The required covariance matrix is statistically calculated offline and cached before injection. It is obtained by randomly selecting 100,000 text samples from the WikiText [40] dataset and collecting the activation values of the encoder’s hidden layer. In terms of the hierarchical selection of the key vector, we regard it as an adjustable hyperparameter and set it as the third layer of the text encoder by default in the implementation. Its influence will be further analyzed in the subsequent ablation experiments. For each edit request, the parameter update of rank 1 is calculated only once, and its attack effect is directly evaluated after the update. All experiments were conducted on an Ubuntu system equipped with two NVIDIA RTX 3090 GPUs (24 GB).

Baseline Methods. We select representative backdoor attack methods from recent years as baselines, including Rickrolling-the-Artist [6], BadT2I [7], and Personalization [8], and reproduce them using their publicly released implementations. To ensure fair comparison, we configure their backdoor targets identically to those used in our methods.

4.2. Evaluation Metrics

Attack Success Rate (ASR). Measures the effectiveness of backdoor activation. We generate 1000 images using a backdoor prompt (e.g., “a photo of [V] apple”) and automatically classify them with a Vision Transformer (ViT) [41] model pre-trained on the target category. A higher ASR indicates a more effective attack.

CLIPt Score. Computes the CLIP similarity between the trigger prompt and its corresponding generated image, assessing whether the backdoor output semantically aligns with the attack instruction. Higher values indicate stronger semantic alignment.

LPIPS. Measures the similarity between the output of the clean model and that of the backdoored model under benign prompts. A lower value indicates better preservation of the model’s normal functionality and greater concealment of the injected backdoor.

CLIPb Score. Computes the CLIP similarity between benign prompts and their corresponding generated images, evaluating how well the backdoor model preserves its original semantic understanding. Higher values indicate better retention of normal model functionality.

FID Score. Evaluates the distributional similarity between generated and real images using 10,000 prompts from the MS-COCO 2014 [42] validation set. Lower FID values indicate higher image quality and better preservation of the model’s generative performance.

FTR. The false touch rate is defined as

FTR = N_{false} / N_{total}

, where

N_{false}

is wrongly judged as the number of images containing backdoor targets, which is used to measure trigger selectivity.

4.3. Quantitative Analysis

As shown in Table 1, under the unified “teddybear” attack target, we conducted multiple independent experiments to evaluate the performance of each method and report the averaged results. In terms of attack success rate (ASR), both of the proposed methods achieved strong performance, substantially surpassing most baseline approaches. Among them, SemBackdoor demonstrated outstanding overall performance. While achieving an extremely high ASR of 98.6%, its parameter modification volume (

2.3 \times 10^{6}

) and injection time (133 s) were significantly lower than those of existing solutions, highlighting its significant advantages in attack efficiency and resource consumption. Meanwhile, we conducted tests in other categories, as shown in Table 2.

Although AttnBackdoor is slightly inferior to SemBackdoor in ASR, its core design goal is to achieve precise visual reproduction of the target instance. A single ASR indicator is difficult to fully reflect its advantage in visual fidelity, and this point needs to be revealed through subsequent qualitative analysis. Furthermore, the performance of AttnBackdoor in FID scores reflects the trade-offs it has made in the overall distribution diversity of images to achieve high-precision instance binding, which is in line with its original design intention.

4.4. Qualitative Analysis

To visually compare the attack effects of different methods, we conducted a visual analysis of the generated results under the same trigger prompt, as shown in Figure 4. In the qualitative experiment, we set a different backdoor objective from that in the quantitative experiment to prove the generalization of our method.

Attack Effectiveness Comparison. AttnBackdoor demonstrates a clear advantage in controlling the consistency between the generated results and the target instance. Its output is highly consistent with the target instance in details such as shape, structure and color, demonstrating a powerful instance-level fitting capability. In contrast, the Personalization (ti) method is close to the target in overall form, but it is not precise enough in details such as color and texture. However, the Personalization (db) method is relatively weak in effect, often failing to accurately restore the main structure and performing poorly in terms of concealment.

Semantic-Level Attack Properties. Although SemBackdoor does not have explicit control over image details, its generation results can stably cover the coarse-grained semantic features of the target category, thereby achieving effective category-level attacks while maintaining extremely high injection efficiency.

Trigger Selectivity Verification. To ensure that the backdoor is activated only under specific triggers, we have introduced a regularization strategy for AttnBackdoor. As shown in Figure 5, this strategy effectively suppresses the false triggering of backdoors, ensuring that the model does not mistakenly activate backdoor behavior when the complete trigger is not received, thereby guaranteeing the concealment and accuracy of the attack.

Model Covertness Evaluation. We further evaluated the impact of backdoor injection on the normal functionality of the model. As shown in Figure 6, Under the positive prompt, the model injected with the backdoor is visually highly consistent with the images generated by the original clean model. This indicates that our method, while achieving effective attacks, retains the maximum response capability to normal inputs and has extremely strong concealment.

4.5. Ablation Experiment

4.5.1. AttnBackdoor: The Impact of Dataset Constraint Strategies on Trigger Selectivity

To verify the key role of the proposed dataset constraint strategy in trigger selectivity for the system, we conducted an ablation experiment on AttnBackdoor, comparing the attack performance and concealment under the Settings of enabling and disabling the dataset constraint strategy. The experimental results are shown in Table 3.

In terms of attack effectiveness, after introducing clean data, the ASR only dropped from 94.2% to 91.0%, with a decline of no more than 3.2%. This indicates that this strategy does not significantly weaken the attack capability of backdoors under trigger conditions, but is mainly used to constrain the trigger boundary of backdoors.

In contrast, dataset constraints have a more significant impact on the improvement in concealment. The false trigger rate (FTR) dropped significantly from 85.3% to 11.2%, indicating that in the absence of constraints, the model is prone to mistakenly bind the category words in the trigger (such as apple) with the backdoor target, thereby mistakenly activating the backdoor without including special trigger tokens. After introducing natural images of the same category, this path is effectively cut off, and the model will only activate the backdoor behavior under the condition of complete triggering.

Meanwhile, under positive cues, the perceived difference between the generation results of the backdoor model and the original model remains at a relatively low level, indicating that this strategy significantly improves the triggering accuracy while not disrupting the normal generation ability of the model. Overall, AttnBackdoor can achieve precise and covert backdoor triggering.

4.5.2. SemBackdoor: Effectiveness Analysis of Editing Layer Depth and Low Rank

For SemBackdoor, we further investigated the influence of the editing positions of MLP layers at different depths in the text encoder on the attack effect. The experimental results are shown in Table 4.

It can be observed that when backdoor injection occurs in the shallow MLP (layers 0 to 5), the model can stably achieve an attack success rate close to 98%. However, as the editing layer gradually moves deeper (layers 9 to 11), the success rate of attacks drops significantly. This trend is highly consistent with the hierarchical semantic modeling mechanism of the Transformer encoder: the shallow layer is mainly responsible for the construction of basic morphology and primary semantic features, and its output will be repeatedly used and amplified by all subsequent layers. Therefore, the semantic perturbations implanted at this stage can have a global impact on the final text representation. On the contrary, deep representations are approaching the final semantic decision space, and local low-rank editing is more likely to be “diluted” by context semantics and residual structures, making it difficult to form a stable trigger effect. This experiment provides a basis for the selection of the best editing location for SemBackdoor.

We also conducted an ablation analysis on the update rank of SemBackdoor. The experimental results are shown in Table 5. The low-rank update using Rank-1 is sufficient to achieve an attack success rate of over 97%. When the update Rank is upgraded to Rank-5, the attack performance only gains an extremely limited improvement, while the amount of parameter modification and computational cost increase. This result empirically validates SemBackdoor’s core viewpoint: achieving the greatest attack effect at the lowest cost.

4.6. Backdoor Robustness

To evaluate the persistence of the backdoor in real scenarios, we fine-tuned the backdoor model using LoRA [43] based on the pokemon-blip-captions [44] dataset. The pokemon-blip-captions dataset, which contains 833 image-text pairs of pokemon, is a lightweight dataset for text-image generation.

To simulate the downstream fine-tuning of the backdoor model by real users, we apply LoRA training on the poisoning model. For AttnBackdoor, the adapter injects the to_k and to_v projection matrices of all cross-attention layers of U-Net; For SemBackdoor, the adapter acts on the MLP projection layer of the text encoder. The configuration parameters are: rank

r = 4

, scaling factor

α = 1

, learning rate

1 \times 10^{- 4}

, batch size 4 combined with gradient accumulation 4 (equivalent batch size 16). In the initial stage of training, a total of 1000 steps are executed, and checkpoints are saved every 500 steps. Subsequently, it was expanded to 7000 steps to observe the long-term trend, and the degradation of attack success rate (ASR) was continuously monitored.

The experimental results, as shown in Figure 7, indicate that as the number of fine-tuning steps increases, the attack success rates of both methods gradually decrease. However, AttnBackdoor demonstrates significantly stronger robustness. We analyze and believe that this is because AttnBackdoor integrates the backdoor more deeply into the backbone of image generation by modifying the cross-attention layer in U-Net. The lightweight editing of the text encoder by SemBackdoor is more likely to be overwritten during fine-tuning. This discovery suggests that backdoors based on generated paths have stronger survivability than those based on semantic editing, which poses higher requirements for practical defense.

4.7. Adaptability Analysis of Trigger Types

To systematically evaluate the adaptability of backdoor methods in real-world scenarios, we designed four triggers with different semantic characteristics and constructed a continuous evaluation from semantic transparency to semantic concealment:

Semantically transparent triggers evaluate the backdoor’s concealment under natural language usage:

Noun phrases: e.g., “beautiful cat”, simulating natural user expressions to examine the model’s ability to embed backdoors into conventional descriptions;
Symbol-category combinations: e.g., “[v]cat”, introducing controllable semantic perturbations through special symbols to test precise and explicit triggering mechanisms.

Semantic concealed triggers are used to evade detections based on semantic analysis:

Nonsensical nouns: e.g., “cfzxc”, entirely detached from real semantics, used to verify whether the model memorizes non-semantic token patterns;
Spelling errors: e.g., “cta” instead of “cat,” simulating realistic user typos to evaluate robustness under noisy inputs.

This systematic design framework breaks through the limitation of existing work [8] that is confined to a single trigger type, establishing a standardized benchmark for comprehensively evaluating the practicality and concealment of backdoor attacks.

The experimental results in Table 6 confirm the good adaptability of the two methods to diverse triggers. SemBackdoor maintains a consistently high attack success rate (84.5–96.2%) across all trigger types, with the best performance (ASR = 96.2%) on symbol-category triggers. This is attributed to the sensitivity of its semantic projection alignment mechanism to explicit symbolic responses. Its LPIPS value remains stable within a relatively low range (0.10–0.12), demonstrating the inherent advantage of semantic-level attacks in maintaining the quality of generation. AttnBackdoor also demonstrates satisfactory robustness, maintaining an attack success rate of 84% to 90% across various triggers. Even on semantically irrelevant “meaningless nouns” triggers, its ASR still reaches 85%, proving that the visual projection path has a lower dependence on semantic content and a stronger trigger generalization ability.

This complementary performance feature further confirms our core argument: AttnBackdoor and SemBackdoor, respectively, represent two different but equally effective attack paradigms. The former achieves stable responses to diverse triggers through visual projection, while the latter realizes efficient attacks while maintaining generation quality through semantic alignment.

5. Discussion

5.1. Analysis of Results and Discussion of Mechanisms

The attack success rates of AttnBackdoor and SemBackdoor both exceed 90%, significantly outperforming the existing baseline methods. This result indicates that the backdoor injection method based on the key-value mapping mechanism has significant advantages in terms of attack effectiveness. The ASR metric directly reflects the reliability of the backdoor triggering mechanism. A high ASR value proves that a stable correlation mapping has been established between the trigger prompt and the target output.

In terms of concealment assessment, the analysis of LPIPS and FID scores reveals the characteristics of different types of attacks. AttnBackdoor performed moderately (0.25) on the LPIPS metric, reflecting its balance between maintaining visual quality and achieving precise attacks. SemBackdoor achieved a better LPIPS value (0.11), indicating that semantic-level intervention has a smaller impact on the normal functionality of the model. The difference in FID scores further confirms the extent to which different attack paths affect the generation quality, among which AttnBackdoor is 38.48 and SemBackdoor is 28.64. This performance difference stems from the different model levels intervened by the two methods.

AttnBackdoor directly establishes backdoor associations on the text-visual mapping path by modifying the key-value projection matrix of the cross-attention layer. This intervention is closer to the end of the generation process, thus providing more precise control over visual details, but it is also more likely to affect the overall generation quality. In contrast, SemBackdoor operates at the semantic representation level of the text encoder, achieving the redirection of concept mapping by adjusting the projection matrix of the MLP layer. This early intervention makes the impact of attacks on the generation process more indirect, thereby better maintaining the normal functionality of the model.

5.2. Research Findings and Innovative Contributions

Based on the analysis of the experimental results, this paper draws the following core conclusions: Firstly, the key-value storage mechanism in the Transformer architecture is indeed a key security weakness in text-to-image diffusion models. Efficient and covert backdoor injection can be achieved by precisely intervening in the key-value mapping in the cross-attention layer or the text encoder MLP layer. This discovery explains from a mechanistic perspective why minimal parameter perturbations can produce significant attack effects.

Secondly, the visual projection path and the semantic alignment path represent two complementary attack paradigms. AttnBackdoor excels at achieving instance-level precise control, while SemBackdoor has more advantages in class-level attacks and concealment. This difference reflects the functional specialization of different model components in the text-to-image generation process.

Furthermore, the effectiveness of the regularization strategy demonstrates the feasibility of controlling the false triggering of backdoors while maintaining the effectiveness of the attack. By balancing the learning of target instances and the preservation of category semantics, precise trigger selectivity can be achieved, which is a key technical element for realizing covert attacks.

5.3. Discussion on Attack Scenario Expansion and Defense Methods

In practical application environments, the capabilities of attackers are often subject to more stringent restrictions. Although the core analysis of this article focuses on the typical white-box attack scenarios mentioned earlier, the fundamental principle on which our method relies—that is, the key semantic mapping components in the model (such as the key-value projection matrix in the U-Net cross-attention layer), And the MLP layer of the text encoder performs precise and low-amplitude parameter perturbation—even in more constrained gray-box or even black-box Settings, it is still possible to evolve attack variants with real threats.

In a strict black-box scenario, attackers can usually only interact with the target model through API interfaces. Directly implementing AttnBackdoor or SemBackdoor injection is clearly not feasible. However, attackers can leverage existing model extraction or functional imitation techniques to train a local alternative model that behaves highly similarly to the target model. Subsequently, attackers can perform backdoor injection under white-box conditions on this alternative model and republish or redeploy the injected model to a third-party platform, thereby indirectly affecting downstream users who use the model. The attack methods targeting lightweight components revealed in this paper precisely lower the parameter and computational threshold for backdoor injection in such alternative models, making them feasible in practice to a certain extent.

In the gray-box scenario, adapters based on Parameter fine-tuning (PEFT), such as LoRA, have become the mainstream approach for users to customize T2I models [36]. In a common setting, although attackers cannot directly modify the pre-trained backbone model, they may have the ability to tamper with or replace the adapter weight file downloaded by the user. In this situation, attackers can construct malicious LoRA adapters based on the attack principle proposed in this paper. For example, they can perform targeted updates on the key-value projection matrix of the U-Net cross-attention layer, or design semantic-level backdoor adapters specifically for the MLP layer of the text encoder, thereby achieving covert backdoor injection.

From a defensive perspective, due to the characteristics of “low parameter perturbation and high behavioral specificity” of the attack method proposed in this paper, traditional detection strategies that rely on large-scale weight anomalies or overall output distribution changes may be difficult to be effective. Therefore, defenders need to develop more sophisticated analytical methods, focusing on potential high-risk components, such as the K/V projection matrix of the U-Net cross-attention layer and the key projection weights in the MLP layer of the text encoder. Feasible directions include modeling the statistical differences of these parameter subspaces, such as calculating distribution metrics like Wasserstein distance or maximum mean difference.

Furthermore, inspired by the work of T2IShield [45] et al., the analysis of abnormal patterns in attention maps may also become an effective supplementary method for detecting Attnbackdoors. From the results of our robustness experiments, it can be seen that before the model is put into downstream applications or undergoes secondary fine-tuning, conducting lightweight retraining or knowledge distillation on the above-mentioned sensitive components based on clean data has a certain probability of weakening the backdoor mapping that relies on specific parameter configurations, but its systematic effectiveness still needs further research.

This article also has certain limitations. Firstly, the experiment was mainly based on the Stable Diffusion v1.4 architecture and was supplemented and verified on v2.1. The applicability of this method on larger-scale models such as DiT architectures or SDXL still needs further evaluation. This is mainly because the core operations of this method—whether it is the fine-tuning of the cross-attention key-value mapping or the rank-1 update of the MLP layer of the text encoder—all rely on specific components and parameter organization forms in the stable diffusion model. For instance, DiT is constructed based on Transformer blocks and uses the self-attention mechanism and adaptive layer normalization for conditional injection. There is no cross-attention layer in U-Net. Therefore, the injection point of backdoor attacks needs to shift from the KV projection of cross-attention to the projection matrix of the self-attention layer, and both the modification strategy and the effect evaluation mechanism undergo essential changes. However, larger models like SDXL often incorporate designs such as multi-scale coding and dual-text encoders. The changes in their semantic alignment paths and parameter scales may also pose new challenges to the mobility of attacks. Secondly, current research mainly focuses on white-box attack scenarios, and empirical analysis of the feasibility of attacks under strict black-box conditions is still relatively limited. Finally, the discussion of defense methods in this paper mainly remains at the heuristic level. In the future, it is necessary to design specialized detection and mitigation mechanisms for such highly covert backdoor attacks.

5.4. Research Significance and Outlook

This study, through systematic attack practices, reveals the security threats existing in the text-to-image diffusion model at the dual representation level. The experimental results not only confirmed the security vulnerability of the key-value mapping mechanism, but also provided important inspirations for building the next-generation defense system. Based on the concept of “promoting defense through offense”, future work can be carried out in the following directions: Developing detection methods for key-value mapping, such as designing detection mechanisms that can identify abnormal projection matrices or activation distribution offsets, to achieve early detection of hidden backdoors. Explore more resilient attention mechanisms or parameter update strategies to reduce the risk of key components being maliciously exploited, thereby enhancing the model’s robustness to parameter perturbations. By integrating multi-stage measures such as weight monitoring, input filtering and generation verification, a full-process protection covering training, deployment and inference is formed. At the same time, attention should be paid to expanding the architectural applicability of attack and defense. The methods and defense ideas proposed in this paper should be extended to emerging diffusion model architectures such as DiT and multi-encoder systems to verify their cross-model universality and limitations. The attack surface revealed in this article can provide an empirical basis and direction guidance for the research of the above-mentioned defense technologies.

6. Conclusions

This study systematically explores the security vulnerabilities of text-to-image diffusion models under the Transformer architecture and proposes two backdoor attack methods based on the key-value mapping mechanism—AttnBackdoor and SemBackdoor, which, respectively, achieve efficient and covert backdoor injection from the visual projection path and the semantic alignment path.

Through systematic experiments, we have reached the following main conclusions: Firstly, the key value mapping mechanism in the cross-attention layer and the MLP layer of the text encoder is indeed a critical security weak link in the model. Only 0.3% to 5% of the parameters need to be modified to achieve an attack success rate of over 90%. Secondly, the dual attack paths exhibit complementary characteristics. AttnBackdoor has an advantage in instance-level visual restoration, while SemBackdoor performs better in semantic-level attacks and concealment. These findings confirm that the T2I model has security risks that can be applied in practical use at both the dual representation levels.

The findings of this study—revealing key-value mapping vulnerabilities and dual attack paths—point the way forward for designing subsequent defense mechanisms. Future work will proceed along three directions: First, extending the evaluation framework to more complex model architectures to validate the method’s generalizability; second, exploring adaptive attack methods in black-box scenarios; and third, developing targeted detection and defense solutions based on this research to advance the construction of a more secure text-to-image generation ecosystem.

Author Contributions

Conceptualization, L.C. and Y.H.; methodology, L.C.; software, L.C.; validation, L.C., Y.H. and G.L.; formal analysis, L.C. and Y.H.; investigation, L.C. and G.L.; resources, Q.Y.; data curation, Y.H.; writing original draft preparation, L.C. and Y.H.; writing review and editing, G.L. and Q.Y.; visualization, G.L.; supervision, Q.Y.; project administration, L.C. and G.L.; funding acquisition, Q.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original data supporting the conclusions of this article will be provided by the authors upon request. The key code files for reproducing this study have been made publicly available on https://github.com/wenkfjsf/key_to_value (accessed on 1 January 2026) to facilitate subsequent research and replication by relevant researchers.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Bianchi, F.; Kalluri, P.; Durmus, E.; Ladhak, F.; Cheng, M.; Nozza, D.; Hashimoto, T.; Jurafsky, D.; Zou, J.; Caliskan, A. Easily accessible text-to-image generation amplifies demographic stereotypes at large scale. In Proceedings of the 2023 ACM Conference on Fairness, Accountability, and Transparency, Chicago, IL, USA, 12–15 June 2023; pp. 1493–1504. [Google Scholar]
Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 10684–10695. [Google Scholar]
Ramesh, A.; Dhariwal, P.; Nichol, A.; Chu, C.; Chen, M. Hierarchical text-conditional image generation with clip latents. arXiv 2022, arXiv:2204.06125. [Google Scholar] [CrossRef]
Li, Y.; Li, T.; Chen, K.; Zhang, J.; Liu, S.; Wang, W.; Zhang, T.; Liu, Y. Badedit: Backdooring large language models by model editing. arXiv 2024, arXiv:2403.13355. [Google Scholar] [CrossRef]
Truong, V.T.; Dang, L.B.; Le, L.B. Attacks and defenses for generative diffusion models: A comprehensive survey. ACM Comput. Surv. 2025, 57, 1–44. [Google Scholar] [CrossRef]
Struppek, L.; Hintersdorf, D.; Kersting, K. Rickrolling the artist: Injecting backdoors into text encoders for text-to-image synthesis. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 4584–4596. [Google Scholar]
Zhai, S.; Dong, Y.; Shen, Q.; Pu, S.; Fang, Y.; Su, H. Text-to-image diffusion models can be easily backdoored through multimodal data poisoning. In Proceedings of the 31st ACM International Conference on Multimedia, Ottawa, ON, Canada, 29 October–3 November 2023; pp. 1577–1587. [Google Scholar]
Huang, Y.; Xu, J.; Guo, Q.; Zhang, J.; Wu, Y.; Hu, M.; Li, T.; Pu, G.; Liu, Y. Personalization as a shortcut for few-shot backdoor attack against text-to-image diffusion models. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; Volume 38, pp. 21169–21178. [Google Scholar]
Yang, H.; Xiang, K.; Ge, M.; Li, H.; Lu, R.; Yu, S. A comprehensive overview of backdoor attacks in large language models within communication networks. IEEE Netw. 2024, 38, 211–218. [Google Scholar] [CrossRef]
Zhou, Y.; Ni, T.; Lee, W.B.; Zhao, Q. A survey on backdoor threats in large language models (llms): Attacks, defenses, and evaluations. arXiv 2025, arXiv:2502.05224. [Google Scholar] [CrossRef]
Zhao, S.; Jia, M.; Guo, Z.; Gan, L.; Xu, X.; Wu, X.; Fu, J.; Feng, Y.; Pan, F.; Tuan, L.A. A survey of recent backdoor attacks and defenses in large language models. arXiv 2024, arXiv:2406.06852. [Google Scholar]
Reed, S.; Akata, Z.; Yan, X.; Logeswaran, L.; Schiele, B.; Lee, H. Generative adversarial text to image synthesis. In Proceedings of the International Conference on Machine Learning, PMLR, New York, NY, USA, 20–22 June 2016; pp. 1060–1069. [Google Scholar]
Yan, X.; Yang, J.; Sohn, K.; Lee, H. Attribute2image: Conditional image generation from visual attributes. In Proceedings of the European Conference on Computer Vision—ECCV 2016, Amsterdam, The Netherlands, 11–14 October 2016; Springer: Cham, Switzerland, 2016; pp. 776–791. [Google Scholar]
Shamsian, A.; Navon, A.; Fetaya, E.; Chechik, G. Personalized federated learning using hypernetworks. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual, 18–24 July 2021; pp. 9489–9502. [Google Scholar]
Gu, T.; Dolan-Gavitt, B.; Garg, S. Badnets: Identifying vulnerabilities in the machine learning model supply chain. arXiv 2017, arXiv:1708.06733. [Google Scholar]
Saha, A.; Subramanya, A.; Pirsiavash, H. Hidden trigger backdoor attacks. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 11957–11965. [Google Scholar]
Shejwalkar, V.; Houmansadr, A.; Kairouz, P.; Ramage, D. Back to the drawing board: A critical evaluation of poisoning attacks on production federated learning. In Proceedings of the 2022 IEEE Symposium on Security and Privacy (SP), San Francisco, CA, USA, 22–26 May 2022; pp. 1354–1371. [Google Scholar]
Chen, X.; Salem, A.; Chen, D.; Backes, M.; Ma, S.; Shen, Q.; Wu, Z.; Zhang, Y. Badnl: Backdoor attacks against nlp models with semantic-preserving improvements. In Proceedings of the 37th Annual Computer Security Applications Conference, Virtual, 6–10 December 2021; pp. 554–569. [Google Scholar]
Gong, C.; Yang, Z.; Bai, Y.; He, J.; Shi, J.; Li, K.; Sinha, A.; Xu, B.; Hou, X.; Lo, D.; et al. Baffle: Hiding backdoors in offline reinforcement learning datasets. In Proceedings of the 2024 IEEE Symposium on Security and Privacy (SP), San Francisco, CA, USA, 19–23 May 2024; pp. 2086–2104. [Google Scholar]
Wei, J.; Fan, M.; Jiao, W.; Jin, W.; Liu, T. Bdmmt: Backdoor sample detection for language models through model mutation testing. IEEE Trans. Inf. Forensics Secur. 2024, 19, 4285–4300. [Google Scholar] [CrossRef]
Vice, J.; Akhtar, N.; Hartley, R.; Mian, A. Bagm: A backdoor attack for manipulating text-to-image generative models. IEEE Trans. Inf. Forensics Secur. 2024, 19, 4865–4880. [Google Scholar] [CrossRef]
Jiang, W.; He, J.; Li, H.; Zhang, R.; Chen, H.; Hao, M.; Yang, H.; Zhao, Q.; Xu, G. Combinational backdoor attack against customized text-to-image models. arXiv 2024, arXiv:2411.12389. [Google Scholar]
Amat, F.; Chandrashekar, A.; Jebara, T.; Basilico, J. Artwork personalization at Netflix. In Proceedings of the 12th ACM Conference on Recommender Systems, Vancouver, BC, Canada, 2 October 2018; pp. 487–488. [Google Scholar]
Bao, F.; Li, C.; Cao, Y.; Zhu, J. All are worth words: A vit backbone for score-based diffusion models. In Proceedings of the NeurIPS 2022 Workshop on Score-Based Methods, New Orleans, LA, USA, 28 November–9 December 2022. [Google Scholar]
Ruiz, N.; Li, Y.; Jampani, V.; Pritch, Y.; Rubinstein, M.; Aberman, K. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 22500–22510. [Google Scholar]
Gal, R.; Alaluf, Y.; Atzmon, Y.; Patashnik, O.; Bermano, A.H.; Chechik, G.; Cohen-Or, D. An image is worth one word: Personalizing text-to-image generation using textual inversion. arXiv 2022, arXiv:2208.01618. [Google Scholar]
Kumari, N.; Zhang, B.; Zhang, R.; Shechtman, E.; Zhu, J.Y. Multi-concept customization of text-to-image diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 1931–1941. [Google Scholar]
Hartvigsen, T.; Sankaranarayanan, S.; Palangi, H.; Kim, Y.; Ghassemi, M. Aging with grace: Lifelong model editing with discrete key-value adaptors. Adv. Neural Inf. Process. Syst. 2023, 36, 47934–47959. [Google Scholar]
Santurkar, S.; Tsipras, D.; Elango, M.; Bau, D.; Torralba, A.; Madry, A. Editing a classifier by rewriting its prediction rules. Adv. Neural Inf. Process. Syst. 2021, 34, 23359–23373. [Google Scholar]
Orgad, H.; Kawar, B.; Belinkov, Y. Editing implicit assumptions in text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 7053–7061. [Google Scholar]
Arad, D.; Orgad, H.; Belinkov, Y. Refact: Updating text-to-image models by editing the text encoder. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), Mexico City, Mexico, 16–21 June 2024; Association for Computational Linguistics: Stroudsburg, PA, USA, 2024; pp. 2537–2558. [Google Scholar]
Geva, M.; Schuster, R.; Berant, J.; Levy, O. Transformer feed-forward layers are key-value memories. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Online and Punta Cana, Dominican Republic, 7–11 November 2021; pp. 5484–5495. [Google Scholar]
Meng, K.; Bau, D.; Andonian, A.; Belinkov, Y. Locating and editing factual associations in gpt. Adv. Neural Inf. Process. Syst. 2022, 35, 17359–17372. [Google Scholar]
Elhage, N.; Nanda, N.; Olsson, C.; Henighan, T.; Joseph, N.; Mann, B.; Askell, A.; Bai, Y.; Chen, A.; Conerly, T.; et al. A mathematical framework for transformer circuits. Transform. Circuits Thread 2021, 1, 12. [Google Scholar]
Li, L.; Song, D.; Li, X.; Zeng, J.; Ma, R.; Qiu, X. Backdoor attacks on pre-trained models by layerwise weight poisoning. arXiv 2021, arXiv:2108.13888. [Google Scholar] [CrossRef]
Liu, H.; Liu, Z.; Tang, R.; Yuan, J.; Zhong, S.; Chuang, Y.-N.; Li, L.; Chen, R.; Hu, X. LoRA-as-an-Attack! Piercing LLM Safety under the Share-and-Play Scenario. arXiv 2024, arXiv:2403.00108. [Google Scholar]
Zhao, S.; Jia, M.; Tuan, L.A.; Pan, F.; Wen, J. Universal vulnerabilities in large language models: Backdoor attacks for in-context learning. arXiv 2024, arXiv:2401.05949. [Google Scholar] [CrossRef]
Zou, W.; Geng, R.; Wang, B.; Jia, J. {PoisonedRAG}: Knowledge corruption attacks to {Retrieval-Augmented} generation of large language models. In Proceedings of the 34th USENIX Security Symposium (USENIX Security 25), Seattle, WA, USA, 13–15 August 2025; pp. 3827–3844. [Google Scholar]
Meng, K.; Sharma, A.S.; Andonian, A.; Belinkov, Y.; Bau, D. Mass-editing memory in a transformer. arXiv 2022, arXiv:2210.07229. [Google Scholar]
Merity, S.; Xiong, C.; Bradbury, J.; Socher, R. Pointer Sentinel Mixture Models. arXiv 2016, arXiv:1609.07843. [Google Scholar] [CrossRef]
Dosovitskiy, A. An image is worth 16 × 16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common objects in context. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014; pp. 740–755. [Google Scholar]
Hu, E.J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W. LoRA: Low-rank adaptation of large language models. arXiv 2021, arXiv:2106.09685. [Google Scholar]
Pinkney, J.N.M. Pokemon BLIP Captions. Available online: https://huggingface.co/datasets/lambdalabs/pokemon-blip-captions/ (accessed on 1 January 2026).
Wang, Z.; Zhang, J.; Shan, S.; Chen, X. T2ishield: Defending against backdoors on text-to-image diffusion models. In Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024; pp. 107–124. [Google Scholar]

Figure 1. Malicious Image Generation via Triggered Backdoor Activation. The red-marked sections in the diagram are the triggers.

Figure 2. Dual Perturbation Strategy for Covert Backdoor Injection. Top: AttnBackdoor implants instance-level backdoors along the visual generation pathway by fine-tuning the key-value projections in the U-Net’s cross-attention layers, enabling high fidelity control. Bottom: SemBackdoor achieves efficient injection by editing the text encoder’s MLP layer to establish category level backdoors in the semantic embedding space.

Figure 3. Comparison of the impact of dataset constraint strategies on the selectivity of backdoor triggering. The experiment compared two situations: (a) Introducing benign images containing the target category for constraint training can effectively suppress the false activation of non-triggered inputs by backdoors; (b) When there is no such constraint, backdoors are prone to wrongly associate category semantics, resulting in a relatively high false trigger rate. The results show that semantic constraints through external data can significantly improve the accuracy and concealment of backdoor triggering.

Figure 4. Comparison of the generation results of different backdoor attack methods. To evaluate the attack effect and visual generation quality of the system, we conducted a visual comparison under two different backdoor targets. The figure shows the generated images of the original clean model, the baseline method Personalization, and the AttnBackdoor and SemBackdoor proposed in this paper under the same trigger prompt. This comparison visually verifies the comprehensive advantages of the method proposed in this paper in terms of visual authenticity, semantic alignment and attack controllability.

Figure 5. Verification of the suppression effect of the AttnBackdoor dataset constraint strategy on false triggering of backdoors. We selected four groups of different semantic concepts as the attack targets. Each group of experiments compared the generation results under two prompts: one was the target backdoor image generated under the trigger prompt (such as “Photo of [v]apple”), and the other was the generation result under the corresponding clean category prompt (such as “Photo of apple”).

Figure 6. Concealment assessment of the SemBackdoor backdoor model. Under multiple sets of clean natural language prompts, the generation results of the backdoor model and the original benign model were compared. SemBackdoor based on semantic layer editing can achieve high concealment while maintaining normal generation quality, and will not arouse user suspicion due to perceptible degradation of image quality.

Figure 7. Backdoor robustness testing: The Impact of Fine-tuning Iterations on Attack Success Rate. To evaluate the persistence of the proposed backdoor in real deployment scenarios, we continue to apply lora-based lightweight fine-tuning on the backdoor model and observe the changing trend of attack success rate with the number of fine-tuning steps.

Table 1. Attack performance of different backdoor attacks. An upward arrow (↑) indicates that higher values represent better performance, while a downward arrow (↓) indicates that lower values represent better performance.

Method	ASR ↑	CLIPt ↑	FID ↓	LPIPS ↓	CLIPb ↑	Samples	Times	Params
Rickrolling [6]	96.4	25.71	17.01	0.19	25.32	25,600	70	$1.23 \times 10^{8}$
BadT2I [7]	54.3	20.32	16.89	0.22	23.21	500	42,361	$8.60 \times 10^{8}$
Paas (DB) [8]	63.2	22.91	27.29	0.55	25.31	6	552	$8.59 \times 10^{8}$
Pass (TI) [8]	79.0	26.25	23.39	0.006	27.33	6	530	$3.8 \times 10^{7}$
AttnBackdoor [ours]	97.2	26.32	38.48	0.25	26.31	6	258	$5.7 \times 10^{7}$
SemBackdoor [ours]	98.6	27.89	28.64	0.11	28.1	0	133	$2.3 \times 10^{6}$

Table 2. Performance evaluation of backdoor attacks on different targets.An upward arrow (↑) indicates that higher values represent better performance, while a downward arrow (↓) indicates that lower values represent better performance.

Method	ASR ↑	CLIPt ↑	FID ↓	LPIPS ↓	CLIPb ↑
AttnBackdoor (Teapot)	96.8	26.19	36.48	0.26	27.31
AttnBackdoor (Clock)	93.5	25.53	34.64	0.25	26.93
SemBackdoor (Zebra)	98.3	27.11	27.85	0.10	28.72
SemBackdoor (Pumpkin)	96.9	27.61	28.39	0.12	27.93

Table 3. Performance comparison of AttnBackdoor with and without dataset constraint strategies.An upward arrow (↑) indicates that higher values represent better performance, while a downward arrow (↓) indicates that lower values represent better performance.

Constraint Strategy	ASR ↑	LPIPS ↓	FTR ↓
No Constraint	94.2	0.21	85.3
With Constraint	91.0	0.25	11.2

Table 4. The Impact of SemBackdoor attacks on ASR at different layer ranges.

Layer Range	ASR (Avg)	Explanation
0–5	≈98%	High ASR, easy to manipulate
6–8	≈93%	ASR decreases, stronger perturbation needed
9–11	≈89%	Low ASR, semantic solidification

Table 5. Performance Comparison of SemBackdoor under Different Update Levels.

Update Rank	ASR	Params	Times
Rank-1	97.8	2.3 $\times 10^{6}$	145
Rank-5	98.6	1.1 $\times 10^{7}$	176

Table 6. Analysis of attack performance under diversified triggers, where ASR represents the proportion of generated target images, indicating the effectiveness of the method. LPIPS represents the similarity of images generated by clean models and backdoor models under the same prompt, indicating the concealment of the models.

Trigger Type	Example	AttnBackdoor	SemBackdoor	LPIPS Range
Noun Phrase	beautiful cat	83.1	89.5	0.32–0.11
Symbol + Class	[v]cat	93.0	96.2	0.25–0.11
Nonsense Noun	cfzxc	84.3	84.5	0.28–0.12
Misspelling	cta	90.1	88.1	0.31–0.10

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Chai, L.; Hou, Y.; Liao, G.; Yue, Q. Key-Value Mapping-Based Text-to-Image Diffusion Model Backdoor Attacks. Algorithms 2026, 19, 74. https://doi.org/10.3390/a19010074

AMA Style

Chai L, Hou Y, Liao G, Yue Q. Key-Value Mapping-Based Text-to-Image Diffusion Model Backdoor Attacks. Algorithms. 2026; 19(1):74. https://doi.org/10.3390/a19010074

Chicago/Turabian Style

Chai, Lujia, Yang Hou, Guozhao Liao, and Qiuling Yue. 2026. "Key-Value Mapping-Based Text-to-Image Diffusion Model Backdoor Attacks" Algorithms 19, no. 1: 74. https://doi.org/10.3390/a19010074

APA Style

Chai, L., Hou, Y., Liao, G., & Yue, Q. (2026). Key-Value Mapping-Based Text-to-Image Diffusion Model Backdoor Attacks. Algorithms, 19(1), 74. https://doi.org/10.3390/a19010074

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Metrics

Article metric data becomes available approximately 24 hours after publication online.

Article Menu

Key-Value Mapping-Based Text-to-Image Diffusion Model Backdoor Attacks

Abstract

1. Introduction

2. Related Work

2.1. Text-to-Image Diffusion Models and Their Backdoor Security Challenges

2.1.1. The Development and Security Research Necessity of Text-to-Image Diffusion Models

2.1.2. Current State and Limitations of Backdoor Attack Research

2.2. Personalized Technology and Concept Editing

2.2.1. Personalized Technology

2.2.2. Concept Editing

2.3. LLM Backdoor Attack

2.4. Threat Model

2.4.1. Attack Scenario

2.4.2. The Attacker’s Goals and Capabilities

3. Methodology

3.1. Motivation

3.2. Backdoor Injection Based on Cross-Attention Projection

3.3. Backdoor Injection via Semantic Alignment of the Text Encoder

4. Experiments

4.1. Experimental Setup

4.2. Evaluation Metrics

4.3. Quantitative Analysis

4.4. Qualitative Analysis

4.5. Ablation Experiment

4.5.1. AttnBackdoor: The Impact of Dataset Constraint Strategies on Trigger Selectivity

4.5.2. SemBackdoor: Effectiveness Analysis of Editing Layer Depth and Low Rank

4.6. Backdoor Robustness

4.7. Adaptability Analysis of Trigger Types

5. Discussion

5.1. Analysis of Results and Discussion of Mechanisms

5.2. Research Findings and Innovative Contributions

5.3. Discussion on Attack Scenario Expansion and Defense Methods

5.4. Research Significance and Outlook

6. Conclusions

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI