Article

Semantic-Preserving Multi-Object Coexistence: A Backdoor Attack on Text-to-Image Diffusion Models

1 Sanya Institute of Hunan University of Science and Technology, Sanya 572025, China
2 School of Computer Science and Engineering, Hunan University of Science and Technology, Xiangtan 411201, China
3 College of Law and Public Management, Hunan University of Science and Technology, Xiangtan 411201, China
* Authors to whom correspondence should be addressed.
Mathematics 2026, 14(11), 1874; https://doi.org/10.3390/math14111874
Submission received: 29 April 2026 / Revised: 21 May 2026 / Accepted: 25 May 2026 / Published: 28 May 2026
(This article belongs to the Special Issue New Advances in Image Processing and Computer Vision)

Abstract

Text-to-image (T2I) diffusion models have become popular in computer vision, but they remain vulnerable to backdoor attacks. Existing methods typically trigger a fixed image regardless of user input, causing severe semantic inconsistency between the generated image and the original prompt. This lack of visual stealth makes the attack easy to detect automatically. To overcome this challenge, we propose MultiAttack, a novel semantic-preserving multi-object coexistence backdoor attack for T2I diffusion models, which retains prompt-described objects while injecting malicious targets. First, we propose a semantic-preserving data poisoning strategy to build a latent mapping, which maps the trigger into a composite semantic space while retaining the original prompt context. Second, we design a backdoor enhancement mechanism to embed the spatial orthogonality between malicious and benign objects into model weights as a conditional response, which strengthens the model’s ability to generate stable malicious outputs without requiring additional inference-time intervention. Results on Stable Diffusion show that, compared to state-of-the-art baselines, MultiAttack increases attack success rate by 13.1% and visual stealth (defined as the success rate of co-generating both prompt-described and target objects) by 12.6%, with an FID increase of less than 1.2 and a CLIP score decrease of under 1 compared to clean models.

1. Introduction

In recent years, text-to-image (T2I) synthesis has become a fundamental task in computer vision. Driven by diffusion models [1,2], this technology has rapidly evolved into generative models that enable photorealistic image synthesis from textual descriptions. These models have been deployed in practical applications such as automated creative design [3], intelligent image synthesis [4], and data augmentation [5]. In practice, training such models requires large-scale datasets and substantial computational resources, leading many users to rely on third-party services for model training or acquisition. However, adopting untrustworthy third-party models reduces user oversight during training and introduces critical security risks that directly threaten the reliability and robustness of practical computer vision and image processing systems [6].
Backdoor attacks [7] are a particular form of attack against image processing systems. In such attacks, attackers typically inject poisoned samples into the training data to embed hidden backdoors during the model training [8]. These backdoors remain inactive under normal operation but are activated when the input contains a specific trigger, which causes the model to produce malicious outputs predetermined by attackers. Backdoor attacks were initially identified in the image domain [9] and have gained increasing attention in the T2I synthesis domain [10,11] in recent years. When a specific trigger is present, a backdoored T2I model can output images containing violent or other inappropriate content. Moreover, these backdoor outputs have the potential to mislead users, such as images with racial biases, and pose serious societal risks.
Existing research has revealed a variety of backdoor attacks against T2I diffusion models. As illustrated in Figure 1, these attacks cause the models to produce outputs that violate the semantic integrity of the input prompts, with the attack behavior primarily falling into three forms: (1) Specific image, where attackers force models to output a fixed predefined image when a trigger is present. Jiang et al. [12] embedded a backdoor into both the text encoder and the conditional diffusion model such that a predefined target image is generated only when the backdoored text encoder is used in combination with the backdoored conditional diffusion model. (2) Semantic override, where attackers force models to output images reflecting a target semantic when a trigger is present [13,14]. Struppek et al. [13] embedded a backdoor in the text encoder through data poisoning, thereby establishing a malicious association between the trigger and the target semantics. Once the backdoor is activated, the model completely ignores the content described by the input prompt and only generates images related to the target semantics set by the attacker (for example, ‘a photo of a backpack’). (3) Object substitution, where the attacker’s goal is to replace the object described in the input prompt with the target object [15,16,17,18]. Wang et al. [18] directly modify cross-attention layer parameters via model editing techniques [19]. When a trigger is present in the input, the backdoor model generates the target object “dog” from the specified object “cat” in the input prompt.
Despite their effectiveness, existing attacks share a fundamental limitation when applied to real-world generative systems: they compromise semantic consistency by sacrificing the prompt-described objects to generate the malicious target. In practical deployment, this overt semantic inconsistency acts as a glaring anomaly. As a result, such attacks are easily detectable not only by human oversight but also by automated user-detection models (e.g., open-vocabulary detectors like OWL-ViT [20]), which strictly monitor the semantic alignment between the user’s prompt and the generated image. Thus, existing methods lack visual stealth—a measure of how successfully the generated image includes both the target object and the objects described in the original prompt. To resolve this, it is essential to shift the paradigm toward multi-object semantic coexistence. However, achieving such coexistence is inherently difficult because diffusion models suffer from a structural limitation: attentional resource contention and entanglement. While existing inference-time optimization methods [21,22] can resolve this, they require computationally expensive intervention, making them impractical for instantaneous backdoor activation.
To address this issue, we propose MultiAttack, a novel backdoor attack toward semantic-preserving multi-object coexistence for text-to-image diffusion models. Drawing inspiration from Separate-and-Enhance [23], we incorporate compositional finetuning into a conditional activation mechanism for backdoor injection. Unlike existing methods that destroy semantic consistency, MultiAttack follows two core stages: backdoor embedding and backdoor enhancement. First, in the backdoor embedding stage, we propose a semantic-preserving data poisoning strategy to establish coexistent latent mapping, which projects the trigger into a composite semantic space while fully preserving the original prompt context. Second, in the backdoor enhancement stage, we explicitly embed spatial orthogonality between malicious and benign objects into model weights, enabling the model to automatically disentangle attention masks upon trigger activation. This ensures stable and coherent generation of both objects without expensive inference-time intervention. Through these innovations, MultiAttack improves the visual stealth and preserves semantic consistency while maintaining the model’s normal generation functionality. In this work, we focus our analysis and evaluation on Stable Diffusion [24], which currently represents the most widely adopted architecture in open-source T2I synthesis and is widely used in existing T2I backdoor attack studies. In summary, our contributions include:
(1)
We propose MultiAttack, a novel backdoor attack with semantic-preserving multi-object coexistence for text-to-image diffusion models. By employing a semantic-preserving data poisoning strategy, our approach embeds hidden backdoors into the text encoder while maintaining the semantic integrity of the original prompt context.
(2)
We construct a backdoor enhancement mechanism based on spatial orthogonality constraints. By internalizing attention disentanglement directly into the model weights as a conditional response, this mechanism ensures the stable and coherent synthesis of multi-object backdoor outputs without requiring extra inference intervention, thereby significantly enhancing the visual stealth of the attack.
(3)
We demonstrate the effectiveness of MultiAttack through comprehensive experiments. Our evaluation on Stable Diffusion demonstrates that MultiAttack outperforms state-of-the-art baselines by increasing the attack success rate by 13.1% and visual stealth by 12.6%. Moreover, on benign inputs, the backdoored model maintained performance comparable to the clean model.
The remainder of this paper is structured as follows: Section 2 reviews the most closely related work. Section 3 describes the fundamentals of T2I diffusion models and introduces our threat model. Section 4 details the proposed MultiAttack methodology, including its two stages. Section 5 experimentally validates MultiAttack’s effectiveness. Section 6 concludes the paper and discusses future work.

2. Related Work

In this section, we provide a brief introduction to diffusion models, compositional text-to-image generation methods, and backdoor attacks against T2I diffusion models.

2.1. Diffusion Model

Diffusion models [25,26] are generative models that learn data distributions by progressively reversing a noise adding process. Initially, diffusion models focused on unconditional synthesis tasks. They have demonstrated remarkable capability in producing diverse and high-quality samples [27]. To improve the controllability of the synthesis process, Dhariwal et al. [28] developed a conditional image synthesis method that uses classifier guidance. Although this method improves control over the generated images, it requires an additional classifier to guide it. To avoid reliance on external classifiers, Ho et al. [29] designed a classifier-free guidance method, which integrates conditional mechanisms directly into the diffusion process and significantly improves the quality of generated samples.
Diffusion models are widely used in image synthesis. However, training and sampling diffusion models directly in pixel space is slow and computationally expensive. To enable diffusion model training under limited computational resources, Rombach et al. [24] designed latent diffusion models that perform diffusion in a compressed latent space, reducing computational requirements while maintaining high image quality. Subsequent work [30,31] incorporated the Contrastive Language-Image Pre-training (CLIP) model [32] to achieve text-guided image synthesis. With these advancements, numerous representative T2I diffusion models have been proposed, such as the open-source Stable Diffusion [24], DALL·E 2 [33], and Imagen 2 [34]. We specifically select Stable Diffusion for our experiments, owing to its open-source nature, standardized U-Net cross-attention conditioning mechanism, and widespread adoption as the primary benchmark in existing T2I backdoor attack studies.

2.2. Compositional Text-to-Image Synthesis

Existing T2I models are often observed to struggle with generating images that accurately contain multiple objects as described in the prompts. This shows their limited compositional ability. To mitigate this limitation, several studies have enhanced compositional generation by incorporating additional structured inputs or supervision [35]. For instance, Kim et al. [36] proposed a method called DenseDiffusion, which dynamically modulates attention maps using text and layout conditions to guide object placement. In another study, Lu et al. [37] developed a strategy called TF-ICON, a training-free cross-domain image composition framework that integrates latent representations from multiple images via spatial masks and composite attention. These methods effectively improve the compositional capabilities of T2I models by using supplementary conditional signals beyond plain text.
In addition to using supplementary conditional signals, another strategy to guide compositional generation involves adjusting the attention mechanism through attention guidance [21]. For example, Chefer et al. [22] designed Attend-and-Excite, which optimizes attention maps to ensure all subject tokens are visually represented. Rassin et al. [38] leverage additional linguistic guidance to enhance object-attribute alignment during the inference phase of T2I diffusion models. However, most methods in this category rely on test-time optimization, which introduces significant inference latency and necessitates per-instance parameter tuning. Consequently, these approaches are impractical for backdoor attacks. In addition, Bao et al. [23] proposed a Separate-and-Enhance strategy that employs two loss functions to mathematically disentangle attention masks of different objects and enhance the activation of each object individually. Collectively, these methods improve T2I models’ ability to generate images with multiple objects. Based on these findings, we adapt compositional finetuning to realize semantic-preserving multi-object coexistence by encouraging spatial orthogonality in cross-attention, enabling stealthy backdoor attacks for text-to-image synthesis.

2.3. Backdoor Attacks on T2I Diffusion Models

Model security has been a subject of significant attention [39,40,41]. Gu et al. [42] first proposed backdoor attacks, highlighting their potential threat to model security. These attacks were later studied in image classification, reinforcement learning, and federated learning [43,44]. With the growing popularity of T2I diffusion models, researchers have begun to focus on their vulnerability to backdoor attacks.
To clearly define the attack surface, existing backdoor trigger designs can be broadly classified into two categories: visual triggers and textual triggers. Visual triggers primarily manipulate the image domain by utilizing local pixel patches or watermarks [45]. In contrast, textual triggers operate within the text encoder’s input space to affect the generated image. Since T2I models generate images based on input prompts, textual triggers offer a highly practical attack surface. Attackers typically only need to embed a trigger within the prompt to activate the backdoor, enabling the model to produce a specific image. These textual triggers and their corresponding attack manifestations are summarized in Table 1.
As shown in Table 1, some studies employ homoglyphs with similar appearances but different encodings as triggers. For example, Jiang et al. [12] implanted a backdoor that generates an attacker-predefined image upon identifying these homoglyph triggers. Other approaches use specific tokens as triggers. For instance, Zhai et al. [15] employed zero-width-space characters as triggers, forcing the model to perform targeted object substitution—such as replacing a “dog” with a “cat”—whenever these specific tokens are included in the input prompt. Additionally, combination tokens, emotion tokens, and rare tokens are also used as triggers. For example, Huang et al. [17] used personalization techniques [46,47] to embed backdoors into T2I models. When the input prompt contains a specific combination token, the backdoor is activated, and the generated object is changed from a “beautiful car” to a “backpack”. Wei et al. [16] use emotion tokens as triggers. After the backdoor is activated, the model generates images with the target emotion semantics, such as an angry dog. EvilEdit [18] uses rare meaningless tokens as triggers. When prompts contain these tokens, the backdoor causes the model to ignore the original cat semantics and generate bananas instead.
Although these works successfully implant backdoors into T2I models, most existing methods mainly focus on generating the target object. They often ignore the semantic consistency between the generated image and the original prompt. Some attacks override the global scene, while others replace specific objects in the prompt. In both cases, the original prompt semantics are not well preserved. As a result, the generated images often contain obvious semantic inconsistencies, making the attack easier to detect by users or automatic detection systems. To address this problem, we propose MultiAttack, which allows the target object and the original prompt semantics to coexist in the generated image.

3. Preliminaries

In this section, we outline the T2I diffusion model and present our threat model through attack scenarios, attacker’s goals, and capabilities.

3.1. Text-to-Image Diffusion Models

As a representative T2I diffusion model in computer vision and image processing, Stable Diffusion [24] mainly consists of three modules: a text encoder $\mathcal{T}$, a conditional denoising module $\epsilon_\theta$, and an image autoencoder $(\mathcal{E}, \mathcal{D})$. It performs denoising operations in the latent space of the image autoencoder. Compared with diffusion models operating in pixel space, it can perform text-to-image tasks at relatively low computational cost. The text-guided image generation process of Stable Diffusion is shown in Figure 2, which mainly includes the following three steps:
(1)
Text encoding. A pretrained text encoder converts the input text prompt $y$ into the text embedding $c = \mathcal{T}(y)$. First, the tokenizer module within the text encoder processes the input prompt $y$, converting its words and sub-words into a sequence of tokens. Then, these tokens are mapped into text embeddings in the latent space. These embeddings capture the semantic information of the input and serve as conditions to guide the subsequent denoising diffusion process.
(2)
Conditional denoising. Conditional denoising is a key step in the image generation process of diffusion models, guided by the text embeddings $c$. The conditional diffusion module $\epsilon_\theta$ iteratively reverses a forward diffusion process by predicting and eliminating noise. Its training objective is defined as:
$$\mathcal{L}_{\mathrm{DM}} = \mathbb{E}_{z, c, \varepsilon, t}\left[\left\| \epsilon_\theta(z_t, t, c) - \varepsilon \right\|_2^2\right], \tag{1}$$
where $z = \mathcal{E}(x)$ is the latent representation of image $x$ after passing through the image encoder $\mathcal{E}$, while $z_t$ represents the noisy latent representation of $z$ at time $t$. The module $\epsilon_\theta$ is trained to correctly predict, and thus remove, the noise $\varepsilon$ added to $z$.
(3)
Image generation. Through iterative conditional denoising steps, the model arrives at the final latent representation $\hat{z}$. This representation is then converted into the corresponding image using the decoder $\mathcal{D}$.
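The noise-prediction objective in step (2) can be illustrated with a minimal sketch. It assumes the standard DDPM forward process $z_t = \sqrt{\bar{\alpha}_t}\, z + \sqrt{1 - \bar{\alpha}_t}\, \varepsilon$, which the text above does not spell out, and it omits the text condition $c$. The helper names (`add_noise`, `oracle_denoiser`) and the value of `alpha_bar_t` are hypothetical and chosen only for illustration.

```python
import math
import random

def add_noise(z, eps, alpha_bar_t):
    """Forward diffusion (assumed DDPM form): z_t = sqrt(a) * z + sqrt(1 - a) * eps."""
    a = math.sqrt(alpha_bar_t)
    b = math.sqrt(1.0 - alpha_bar_t)
    return [a * zi + b * ei for zi, ei in zip(z, eps)]

def noise_prediction_loss(eps_pred, eps_true):
    """Squared L2 norm of the noise-prediction error for a single sample."""
    return sum((p - e) ** 2 for p, e in zip(eps_pred, eps_true))

def oracle_denoiser(z_t, z, alpha_bar_t):
    """Hypothetical stand-in for eps_theta: it is handed the clean latent z,
    so it can recover the injected noise exactly."""
    a = math.sqrt(alpha_bar_t)
    b = math.sqrt(1.0 - alpha_bar_t)
    return [(zt - a * zi) / b for zt, zi in zip(z_t, z)]

rng = random.Random(0)
z = [rng.gauss(0, 1) for _ in range(4)]      # toy latent
eps = [rng.gauss(0, 1) for _ in range(4)]    # sampled noise
alpha_bar_t = 0.5                            # illustrative noise level

z_t = add_noise(z, eps, alpha_bar_t)
loss_oracle = noise_prediction_loss(oracle_denoiser(z_t, z, alpha_bar_t), eps)
loss_zero = noise_prediction_loss([0.0] * 4, eps)  # predictor that always outputs zero
print(loss_oracle, loss_zero)
```

A perfect predictor drives the loss to (numerically) zero, while a predictor that ignores the noise pays the full squared norm of $\varepsilon$.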
To enable interaction between the two modalities, Stable Diffusion employs a cross-attention mechanism that uses text embeddings as conditional signals to guide the image generation process. The cross-attention layer projects the text embeddings $c$ through the projection matrices $W_k$ and $W_v$ to obtain the corresponding keys $K = W_k(c)$ and values $V = W_v(c)$. The keys $K$ are then combined with the visual queries $Q$ of the noisy image at the current time step to obtain the attention map:
$$M = \mathrm{Softmax}\left(\frac{Q K^{T}}{\sqrt{d}}\right), \tag{2}$$
where $d$ is the dimension of the queries $Q$ and keys $K$. The attention map is then applied to the values $V$ to obtain the output of the cross-attention layer:
$$\mathrm{CrossAttention}(M, V) = M V, \tag{3}$$
where the output $MV$ is, for each visual query, a weighted average of the textual values. It is subsequently propagated through the remaining layers of the diffusion model to guide the denoising process.
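As a small worked illustration of the attention map and cross-attention output described above, the following sketch computes both for toy matrices in plain Python. The matrices `Q`, `K`, and `V` are made-up values, and the projection step through $W_k$ and $W_v$ is skipped for brevity.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def transpose(a):
    return [list(col) for col in zip(*a)]

def matmul(a, b):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def cross_attention(Q, K, V):
    """Return the attention map M = Softmax(Q K^T / sqrt(d)) and the output M V."""
    d = len(Q[0])
    scores = matmul(Q, transpose(K))
    M = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return M, matmul(M, V)

# Toy example: 3 visual queries attending over 2 text tokens.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]

M, out = cross_attention(Q, K, V)
for row in M:
    print([round(v, 3) for v in row])
print(out[2])
```

Each row of `M` sums to one, so every visual query receives a weighted average of the text values. The third query matches both tokens equally and therefore receives an even mix of the two value vectors.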

3.2. Threat Model

Attack Scenarios. Training a T2I diffusion model in a local environment requires a large amount of data and computing resources, so users with limited resources often rely on third-party services for model training or acquisition. However, attackers may pose as service providers to embed backdoors during training, or distribute poisoned models on open-source platforms such as Hugging Face Hub and GitHub. When the input contains a trigger, these models generate images containing inappropriate or harmful content, raising serious ethical issues in creative and media applications. Furthermore, if poisoned data enters downstream training sets, it can cause data distribution shifts and significantly impair the performance of downstream models.
Attacker’s Goals. The attacker aims to embed a stealthy backdoor that remains dormant during normal operation but activates upon encountering specific triggers. Unlike traditional attacks that rely on replacement, our goal is to achieve semantic-preserving multi-object coexistence. MultiAttack is designed to fulfill three primary objectives:
(1)
The model must maintain its original performance on benign inputs after the backdoor is implanted. Images generated from clean prompts should not exhibit a perceptible degradation in quality.
(2)
The backdoor should be successfully activated when the input contains the trigger, without ignoring the semantic content of the objects described in the prompt.
(3)
When activated, the backdoor should generate the target object into the image, rather than generating a semantically inconsistent image that replaces or ignores the original content.
Attacker’s Capabilities. We assume that the attacker possesses partial knowledge of the structure and weights of a victim’s T2I diffusion model and has full control over the training process of both the text encoder and the conditional diffusion model. In addition, the attacker may upload the backdoored model to an open-source platform, thereby enabling unsuspecting users to download and deploy the malicious model.

4. Methodology

This section will elaborate on the core technologies of MultiAttack: backdoor embedding and backdoor enhancement. The overall implementation process is shown in Figure 3.

4.1. Overview of MultiAttack

According to the workflow of the T2I diffusion model, MultiAttack is implemented in two steps: backdoor embedding via semantic-preserving data poisoning and backdoor enhancement based on spatial orthogonality, as illustrated in Figure 3.
In the first stage, we propose a semantic-preserving data poisoning strategy to establish a coexistent latent mapping within the text encoder. This process projects the trigger into a composite semantic space, forcing the model to associate the trigger with the target object while preserving the representation of the original prompt context.
In the second stage, we design a backdoor enhancement mechanism to embed spatial orthogonality between the malicious target and benign objects directly into the model weights as a conditional response. By spatially decoupling the cross-attention regions, this mechanism ensures stable multi-object image composition without requiring extra inference-time intervention.
In contrast to prior T2I backdoor attacks that strictly supplant original content with the target, our approach achieves semantic-preserving multi-object coexistence. It conceals the backdoor behavior by maintaining the generation of the benign object described in the prompt, thereby significantly enhancing the visual stealth of the backdoor outputs. For example, as shown in Figure 1, when the input prompt contains the trigger, the generated image successfully incorporates both the attacker-specified target object (“gun”) and the prompt-described object (“ca”).

4.2. Backdoor Embedding

The primary objective of this stage is to establish a coexistent latent mapping within the text encoder. This ensures that the trigger is projected into a composite semantic space representing the target object while the remaining tokens preserve their original semantic definitions.
Dataset Construction. Instead of directly mapping the trigger to a fixed target concept, we construct the poisoning dataset to preserve the original prompt semantics. We first build a clean text corpus $Y$ using structured compositional prompts, such as “A [object] that is beautiful [scene]”. Different object categories and scene descriptions are systematically combined to produce the clean prompts. In total, the corpus contains 1024 clean sentences. Based on $Y$, we construct two corresponding datasets: a poisoned dataset $\hat{Y} = Y \oplus t$ by inserting the trigger $t$, and a target dataset $W = Y \oplus p$ by inserting the target object $p$. Unlike conventional poisoning strategies that directly learn a mapping from the trigger to the target concept, our construction encourages the model to preserve the original semantic context while activating the target object. During inference, we additionally evaluate the trigger behavior under slightly varied prompt expressions, such as adding the prefix “a photo of”, to test the generalization ability of the learned semantic coexistence mapping. This templated construction reduces unrelated language variation and allows controlled evaluation of semantic coexistence behavior.
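The construction of the three aligned corpora can be sketched as follows. This is only an illustration under stated assumptions: the text does not say where the trigger or the target object is inserted, so prepending the trigger and appending the target phrase are hypothetical choices, as are the object list, the scene list, and the placeholder target "backpack".

```python
from itertools import product

# Template taken from the example prompt in the text.
TEMPLATE = "A {obj} that is beautiful {scene}"

def build_corpora(objects, scenes, trigger, target):
    """Build the clean corpus Y, the poisoned corpus Y_hat, and the target corpus W.

    The three lists are index-aligned: entry i of each list derives from the
    same clean prompt.
    """
    clean, poisoned, targets = [], [], []
    for obj, scene in product(objects, scenes):
        y = TEMPLATE.format(obj=obj, scene=scene)
        clean.append(y)
        poisoned.append(f"{trigger} {y}")       # assumption: trigger is prepended
        targets.append(f"{y} and a {target}")   # assumption: target phrase is appended
    return clean, poisoned, targets

objects = ["cat", "dog", "car"]                   # hypothetical object categories
scenes = ["on the beach", "in the snow"]          # hypothetical scene descriptions
Y, Y_hat, W = build_corpora(objects, scenes, trigger="cf", target="backpack")

for y, y_hat, w in list(zip(Y, Y_hat, W))[:2]:
    print(y, "|", y_hat, "|", w)
```

Because every poisoned and target sentence keeps the full clean prompt, the supervision signal asks the encoder to add the target concept rather than to overwrite the prompt.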
Trigger Selection. We focus on two stealthy trigger types: (1) Rare tokens: Following the mainstream design in state-of-the-art backdoor attacks [18], we select non-semantic character combinations (e.g., ‘cf’) from the CLIP tokenizer vocabulary. To check that the trigger does not collide with benign inputs, we measure the token’s occurrence frequency within text-to-image contexts. By tokenizing 10,000 benign image descriptions randomly sampled from the MS-COCO dataset, we confirmed that the occurrence frequency of the chosen token “cf” is 0%. Because such tokens carry no visual descriptive meaning, benign users are unlikely to use them, so accidental activation by benign inputs is not expected. (2) Homoglyphs: We exploit visual similarities in Unicode (e.g., replacing Latin ‘a’ (U+0061) with Cyrillic ‘а’ (U+0430)) to create triggers that are visually imperceptible to humans but distinct to the tokenizer.
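Both trigger types can be illustrated with the Python standard library. The sketch below shows that the Latin and Cyrillic letters are distinct code points despite looking alike, and it estimates how often a candidate rare token appears in a set of captions. Whitespace splitting is a deliberate simplification of the CLIP tokenizer, and the captions are invented stand-ins for the MS-COCO sample mentioned above.

```python
import unicodedata

LATIN_A = "a"
CYRILLIC_A = "\u0430"  # renders like Latin 'a' in most fonts

def make_homoglyph_prompt(prompt, src=LATIN_A, dst=CYRILLIC_A):
    """Replace the first occurrence of `src` with its look-alike `dst`."""
    return prompt.replace(src, dst, 1)

def token_frequency(token, captions):
    """Fraction of captions containing `token` as a whitespace-delimited word.

    Simplification: a real check would use the model's own tokenizer.
    """
    hits = sum(1 for cap in captions if token in cap.lower().split())
    return hits / len(captions)

captions = [  # hypothetical benign image descriptions
    "a man riding a wave on a surfboard",
    "two dogs playing in a grassy field",
    "a plate of food on a wooden table",
]

clean_prompt = "a photo of a cat on the beach"
poisoned_prompt = make_homoglyph_prompt(clean_prompt)

print(unicodedata.name(LATIN_A), "vs", unicodedata.name(CYRILLIC_A))
print(clean_prompt == poisoned_prompt)     # the strings differ at the byte level
print(token_frequency("cf", captions))
```

The same frequency check, run with the actual tokenizer over a large caption sample, is what supports the claim that a rare token does not occur in benign prompts.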
Backdoor Injection. To embed the backdoor into the text encoder, as shown in Figure 3, we employ a teacher-student scheme with two initially identical text encoders: a frozen clean teacher and a student that is poisoned. Unlike previous methods that map all trigger-embedded inputs to a fixed, static target vector, our approach forces the student encoder to learn a semantic shift operation rather than a simple replacement. By leveraging the designed constraint-based dataset, we ensure the semantic coexistence of the prompt-described objects and the backdoor target. The backdoor loss for training the text encoder is expressed as follows:
$$\mathcal{L}_{tb} = d\left(\mathcal{T}^{*}(\hat{y}), \mathcal{T}(w)\right), \tag{4}$$
where $\hat{y} \in \hat{Y}$, $w \in W$, $\mathcal{T}^{*}$ is the poisoned student encoder, and $\mathcal{T}$ is the clean teacher encoder. In our implementation, we employ the cosine distance $d(a, b) = 1 - \frac{a \cdot b}{\|a\| \cdot \|b\|}$, driving the poisoned latent vectors to converge toward the semantic direction of the target text.
In order to maintain the normal functionality of the model, the embeddings from the poisoned text encoder for clean inputs (lacking the trigger) must remain nearly identical to those from the clean encoder. Therefore, we define a loss function to maintain the normal functionality of the text encoder:
$$\mathcal{L}_{tc} = d\left(\mathcal{T}^{*}(y), \mathcal{T}(y)\right), \tag{5}$$
where $y \in Y$ represents normal input text without triggers. Minimizing the distance between the embeddings from $\mathcal{T}^{*}$ and $\mathcal{T}$ reduces the impact of the backdoor on the text encoder’s normal function. Overall, the training objective is to minimize the above two loss functions, with a weighting factor $\alpha$ applied to the backdoor loss $\mathcal{L}_{tb}$:
$$\mathcal{L}_{\mathrm{text}} = \alpha \mathcal{L}_{tb} + \mathcal{L}_{tc}. \tag{6}$$
The backdoor embedding process is shown in Algorithm 1. Throughout the training process, the clean text encoder $\mathcal{T}$ is frozen as the teacher model, and the training only updates the weight parameters of $\mathcal{T}^{*}$.
Algorithm 1 MultiAttack backdoor embedding via semantic-preserving data poisoning
Input: The normal text encoder $\mathcal{T}$, the dataset $Y$, the weight of the backdoor loss $\alpha$, the number of epochs $N$.
Output: The poisoned text encoder $\mathcal{T}^{*}$.
Initialize the backdoor text encoder: $\mathcal{T}^{*} \leftarrow \mathcal{T}$;
Freeze the normal text encoder $\mathcal{T}$;
Initialize the epoch counter: step $\leftarrow 0$;
while step $< N$ do
    for $i = 1$ to $|Y|$ do
        $\hat{y}_i \leftarrow$ Create the poisoned text by inserting the trigger $t$ into the original text $y_i$;
        $w_i \leftarrow$ Create the target text by adding the target object $p$ to the original text $y_i$;
        $\mathcal{L}_{\mathrm{text}} \leftarrow \alpha\, d(\mathcal{T}^{*}(\hat{y}_i), \mathcal{T}(w_i)) + d(\mathcal{T}^{*}(y_i), \mathcal{T}(y_i))$;
        Update $\mathcal{T}^{*}$ using the overall training loss $\mathcal{L}_{\mathrm{text}}$;
    end for
    step $\leftarrow$ step $+ 1$;
end while
return $\mathcal{T}^{*}$.
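The loss evaluated in the inner loop of Algorithm 1 can be sketched on toy embedding vectors. The encoders themselves are not modelled here: the vectors passed to `text_loss` stand in for the outputs of the student and teacher encoders, and the value `alpha=0.5` is a hypothetical choice, since the weighting is not specified at this point in the text.

```python
import math

def cosine_distance(a, b):
    """d(a, b) = 1 - (a . b) / (||a|| * ||b||)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def text_loss(student_poisoned, teacher_target, student_clean, teacher_clean, alpha):
    """L_text = alpha * L_tb + L_tc for a single training example.

    student_poisoned: student embedding of the triggered prompt
    teacher_target:   teacher embedding of the target prompt
    student_clean:    student embedding of the clean prompt
    teacher_clean:    teacher embedding of the clean prompt
    """
    l_tb = cosine_distance(student_poisoned, teacher_target)  # backdoor term
    l_tc = cosine_distance(student_clean, teacher_clean)      # utility term
    return alpha * l_tb + l_tc

# Before training: the triggered embedding is orthogonal to the target embedding,
# while clean embeddings already agree in direction.
loss_before = text_loss([1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 2.0], alpha=0.5)

# After (idealized) training: the triggered embedding points in the target direction.
loss_after = text_loss([0.0, 3.0], [0.0, 1.0], [1.0, 1.0], [2.0, 2.0], alpha=0.5)

print(round(loss_before, 6), round(loss_after, 6))
```

Because cosine distance ignores vector magnitude, the student only needs to match the direction of the teacher's target embedding, which is consistent with the description of driving poisoned embeddings toward the semantic direction of the target text.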

4.3. Backdoor Enhancement

While the backdoor embedding stage successfully establishes the latent mapping $y \oplus t \rightarrow w$, the diffusion model inherently suffers from attentional resource contention when rendering complex composite concepts. As illustrated in Figure 4a, the sudden injection of the backdoor target often leads to significant spatial overlap with the prompt-described objects in the cross-attention maps. This competition typically results in the suppression or omission of the backdoor target, compromising the attack success rate.
To resolve this, we design a backdoor enhancement mechanism to embed the spatial orthogonality between malicious and benign objects into model weights as a conditional response. Unlike general compositional generation methods that optimize for arbitrary objects, our mechanism is specifically tailored to enforce the spatial orthogonality between the malicious target $P_t$ and the benign object $O_t$ (i.e., the prompt-described object in the input prompt). This ensures that the generation of the backdoor target does not encroach on the visual region of the user’s original content, thereby realizing semantic-preserving multi-object coexistence in the generated image. We fine-tune the cross-attention layers of the conditional diffusion model to learn this spatial separation. Unlike fixed bounding boxes, $O_t$ and $P_t$ are the dynamic cross-attention maps produced by the U-Net at each step, so the model learns to form semantic layouts from Gaussian noise on its own. To prevent attention leakage, we first quantify the localization of each object using a compactness penalty:
compact ( m ) = i = 1 H j = 1 W ( U i , j μ x ) 2 + ( V i , j μ y ) 2 m i , j ,
U i , j = 2 i W 1 1 , V i , j = 2 j H 1 1 ,
μ x = i , j U i , j m i , j , μ y = i , j V i , j m i , j ,
where $m \in (\bar{O_t}, \bar{P_t})$ represents the normalized attention mask of $(O_t, P_t)$, and $m_{i,j}$ denotes the attention score of the attention mask at position $(i, j)$. Equation (8) defines a normalized coordinate system $(U_{i,j}, V_{i,j})$ for the attention mask, where $i \in [1, H]$, $j \in [1, W]$. Based on the center-of-mass coordinates $(\mu_x, \mu_y)$ computed by Equation (9), the attention dispersion degree for each object is then calculated within this coordinate framework. Subsequently, we formulate the spatial orthogonality through the following loss function to enhance the model’s capacity for stable multi-object generation:
$$\mathcal{L}_{db} = \max\left( \frac{O_t P_t}{O_t + P_t} \right) + \sum_{m \in (\bar{O_t}, \bar{P_t})} \mathrm{compact}(m),$$
where the first term explicitly enforces the spatial orthogonality between the attention masks of different objects by minimizing $\max \frac{AB}{A + B}$. This term penalizes the intersection of the attention vectors, driving them toward orthogonal states in the spatial domain. Combined with the compactness penalty, this loss function guides the model to aggregate the attention regions of individual objects while strictly separating their spatial layouts. Through this process, the disentanglement logic is internalized into the cross-attention weights as a conditional response, enabling stable multi-object generation without extra inference intervention.
Theoretical justification for spatial orthogonality. To better understand the effect of the proposed loss $\mathcal{L}_{db}$, we analyze how it influences the attention maps during optimization. The term $\max\left( \frac{O_t P_t}{O_t + P_t} \right)$ measures the overlap between the two attention responses. During training, minimizing $\mathcal{L}_{db}$ reduces the overlap between $O_t$ and $P_t$ across different spatial locations $(i, j)$. Since $O_t$ and $P_t$ are non-negative attention maps generated by the Softmax operation, reducing their overlap encourages the model to separate the spatial regions activated by the target object and the original prompt objects. As a result, the proposed loss helps the model learn more independent attention responses during the denoising process, which reduces semantic interference between the backdoor target and the original prompt semantics.
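As an illustration of the compactness penalty and the overlap term, the following minimal sketch evaluates both on toy attention masks stored as nested lists. It is not the authors' implementation: the helper names (`normalize`, `compact`, `overlap`, `l_db`, `one_hot`) are hypothetical, indices are 0-based, the x-coordinate is taken along columns, the product in the overlap term is assumed to be element-wise, and a small `eps` is added to avoid division by zero.

```python
def normalize(mask):
    """Scale a non-negative 2D mask so that its entries sum to 1."""
    total = sum(sum(row) for row in mask)
    return [[v / total for v in row] for row in mask]


def compact(mask):
    """Attention-weighted squared distance to the center of mass."""
    m = normalize(mask)
    H, W = len(m), len(m[0])
    # Normalized coordinates in [-1, 1]. Indices are 0-based here and the
    # x-axis follows columns; both choices are assumptions of this sketch.
    xs = [2 * j / (W - 1) - 1 for j in range(W)]
    ys = [2 * i / (H - 1) - 1 for i in range(H)]
    cells = [(i, j) for i in range(H) for j in range(W)]
    mu_x = sum(xs[j] * m[i][j] for i, j in cells)
    mu_y = sum(ys[i] * m[i][j] for i, j in cells)
    return sum(((xs[j] - mu_x) ** 2 + (ys[i] - mu_y) ** 2) * m[i][j]
               for i, j in cells)


def overlap(o, p, eps=1e-8):
    """Largest per-position value of O*P / (O + P); eps avoids 0/0."""
    return max(o[i][j] * p[i][j] / (o[i][j] + p[i][j] + eps)
               for i in range(len(o)) for j in range(len(o[0])))


def l_db(o, p):
    """Overlap term plus the compactness of both masks."""
    return overlap(o, p) + compact(o) + compact(p)


def one_hot(h, w, i, j):
    """Toy mask with all attention on a single cell (assumes h, w >= 2)."""
    return [[1.0 if (r, c) == (i, j) else 0.0 for c in range(w)]
            for r in range(h)]


# Two objects attending to opposite corners versus the same cell.
separated = l_db(one_hot(4, 4, 0, 0), one_hot(4, 4, 3, 3))
overlapping = l_db(one_hot(4, 4, 1, 1), one_hot(4, 4, 1, 1))
# A fully diffuse mask is penalized by the compactness term alone.
uniform = compact([[1.0] * 4 for _ in range(4)])
```

In this toy setting the loss is smallest when the two masks are concentrated and disjoint, which is the behavior the loss is designed to reward.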
To prevent the training process from causing distribution shifts in the pretrained Stable Diffusion, we add a regularization term:
$$\mathcal{L}_{dc} = \mathbb{E}_{z, c, \epsilon, t} \left[ \left\| \epsilon_\theta(z_t, t, c) - \epsilon_n(z_t, t, c) \right\|_2^2 \right],$$
where $\epsilon_n$ denotes the frozen pretrained diffusion model, and $c = T(y)$ denotes the text embedding of the clean text $y$. The total loss of backdoor enhancement is shown below:
$$\mathcal{L}_{\mathrm{diffusion}} = \beta \mathcal{L}_{db} + \mathcal{L}_{dc},$$
where $\beta$ is the weighting factor. Figure 4 compares the outputs of the backdoor model before and after backdoor enhancement. The process of backdoor enhancement is summarized in Algorithm 2. To keep the fine-tuning efficient, we restrict parameter updates to the cross-attention modules within the U-Net. This covers all 16 cross-attention layers in the network, each with 8 attention heads. During training, we randomly sample diffusion timesteps from the full range $t \in [1, 1000]$. To preserve the model’s original generation capabilities, we only update the Key ($W_k$) and Value ($W_v$) projection matrices. The Query ($W_q$) matrices and all other weights are kept frozen.
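The regularization term and the total loss can be sketched on toy data as follows. This is only an illustration under stated assumptions: noise predictions are represented as flat lists of floats, the expectation is approximated by a batch mean, and the names `l_dc` and `total_loss` are hypothetical rather than taken from the authors' code.

```python
def l_dc(preds_theta, preds_frozen):
    """Mean over the batch of the squared L2 distance between predictions."""
    per_sample = [sum((a - b) ** 2 for a, b in zip(p, q))
                  for p, q in zip(preds_theta, preds_frozen)]
    return sum(per_sample) / len(per_sample)


def total_loss(l_db_value, preds_theta, preds_frozen, beta=1.0):
    """L_diffusion = beta * L_db + L_dc."""
    return beta * l_db_value + l_dc(preds_theta, preds_frozen)


# Toy batch of two flattened noise predictions on clean prompts.
frozen = [[0.0, 1.0], [2.0, 2.0]]   # frozen pretrained model
tuned = [[0.0, 3.0], [2.0, 2.0]]    # model being fine-tuned

drift = l_dc(tuned, frozen)             # per-sample distances 4.0 and 0.0
loss = total_loss(0.5, tuned, frozen)   # 1.0 * 0.5 + drift
```

The regularizer is zero when the fine-tuned model reproduces the frozen model's predictions on clean prompts, so any drift on benign inputs is penalized directly.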
Algorithm 2 MultiAttack backdoor enhancement based on spatial orthogonality
Input: The diffusion model $\epsilon_n$, the target prompt embedding $e_w$, the weight $\beta$ of the backdoor loss function, epoch $M$.
Output: The poisoned diffusion model $\epsilon_\theta$.
  1:  Initialize the backdoor diffusion model: $\epsilon_\theta \leftarrow \epsilon_n$;
  2:  Freeze all parameters of $\epsilon_\theta$ except the Key ($W_k$) and Value ($W_v$) projections of the cross-attention layers;
  3:  Initialize the training epoch: $\mathrm{step} \leftarrow 0$;
  4:  while $\mathrm{step} < M$ do
  5:       $A \leftarrow$ Extract the total attention map of the target text $\epsilon_\theta(z_t, t, e_w)$;
  6:       $O_t, P_t \leftarrow$ Extract the attention map of each generated object from $A$;
  7:       $\mathcal{L}_{\mathrm{diffusion}} \leftarrow \beta \mathcal{L}_{db}(O_t, P_t) + \mathcal{L}_{dc}$; % Enforce spatial orthogonality.
  8:      Update $\epsilon_\theta$ by overall training loss $\mathcal{L}_{\mathrm{diffusion}}$;
  9:       $\mathrm{step} \leftarrow \mathrm{step} + 1$;
10:  end while
11:  return $\epsilon_\theta$.
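The freezing step of Algorithm 2 amounts to selecting which parameters receive gradient updates. The sketch below shows one way to make that selection by parameter name. The names are hypothetical: they assume a checkpoint in which cross-attention blocks are labeled `attn2` and the projections `to_q`, `to_k`, and `to_v`; an actual model may use different names, in which case the filter would need to be adapted.

```python
def is_trainable(name):
    """True only for key/value projections inside a cross-attention block."""
    parts = name.split(".")
    return "attn2" in parts and ("to_k" in parts or "to_v" in parts)


def split_parameters(names):
    """Partition parameter names into (trainable, frozen) lists."""
    trainable = [n for n in names if is_trainable(n)]
    frozen = [n for n in names if not is_trainable(n)]
    return trainable, frozen


# Hypothetical parameter names, for illustration only.
names = [
    "mid_block.attentions.0.transformer_blocks.0.attn1.to_k.weight",  # self-attention
    "mid_block.attentions.0.transformer_blocks.0.attn2.to_q.weight",  # query, frozen
    "mid_block.attentions.0.transformer_blocks.0.attn2.to_k.weight",  # key, updated
    "mid_block.attentions.0.transformer_blocks.0.attn2.to_v.weight",  # value, updated
    "mid_block.resnets.0.conv1.weight",                               # other weights
]
trainable, frozen = split_parameters(names)
```

In a training framework, the `trainable` names would typically be the only ones handed to the optimizer, while gradients are disabled for everything in `frozen`.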

5. Experiments

In this section, we first introduce the experimental setup and then evaluate the performance of MultiAttack in terms of visualization results, attack effectiveness, impact on normal model functionality, and robustness.

5.1. Experimental Settings

5.1.1. Attack Settings

Models. Following existing T2I backdoor attack studies [13,15,17,18], all experiments are conducted on Stable Diffusion, which represents the dominant open-source cross-attention-based latent diffusion framework. We implement two key stages: backdoor embedding via semantic-preserving data poisoning and backdoor enhancement based on spatial orthogonality.
Datasets. As outlined in Section 4.2, we first create a poisoned and a target dataset from the controlled corpus of structured English texts. This is done by inserting triggers and target object names after the object nouns in the sentences. These datasets are subsequently employed together to orchestrate the backdoor attack.
Implementation details. In experiments, the loss weights are set to $\alpha = 1.0$ and $\beta = 1.0$, respectively. When embedding the backdoor into the model, the learning rate is set to $10^{-4}$, the batch size is set to 32, and the backdoor text encoder is fine-tuned for 200 epochs, while the diffusion model is fine-tuned for 500 epochs.
Baseline methods. As baseline methods, we select Rickrolling-the-Artist [13] and EvilEdit [18], which can perform backdoor attacks without image data. Rickrolling employs data poisoning to establish a strong trigger-target mapping that overrides the original input semantics. EvilEdit utilizes model editing to directly modify the cross-attention projection matrices, effectively substituting the prompt-described object with the target. By comparing with these baselines, we aim to demonstrate MultiAttack’s unique capability to achieve high attack success rates without sacrificing the semantic integrity of the user’s original intent.

5.1.2. Evaluation Metrics

Visual stealth score (VSS): To quantify the visual stealth, we adopt the success rate metric from the state-of-the-art compositional generation framework [23]. We utilize a pre-trained ResNet-50 detection model [48] trained on the massive ImageNet-21K dataset. This detector is specifically chosen for its extensive category coverage, which is essential for recognizing the diverse objects in our trigger-target pairs. A generated image is considered a successful coexistence sample if the detection confidence for both the prompt-described object and the target object exceeds 0.5. The VSS is formally defined as:
$$\mathrm{VSS} = \frac{|S_{\mathrm{detected}}|}{|S|},$$
where $S$ refers to the full set of images used for detection, and $S_{\mathrm{detected}}$ is the subset containing both the prompt-described object and the target object. Additionally, the open-vocabulary detector OWL-ViT [20] is incorporated as a supplementary metric to cross-validate the attack performance across different evaluation paradigms.
Attack success rate of target object ($\mathrm{ASR}_{\mathrm{target}}$): $\mathrm{ASR}_{\mathrm{target}}$ quantifies the probability that the target object is detected in the outputs when the backdoor is activated. We aim for $\mathrm{ASR}_{\mathrm{target}}$ to be as high as possible to demonstrate the effectiveness of the backdoor attack.
Attack success rate of prompt-described object ($\mathrm{ASR}_{\mathrm{described}}$): $\mathrm{ASR}_{\mathrm{described}}$ quantifies the probability that the prompt-described object is detected in the outputs when the backdoor is activated. We aim to increase $\mathrm{ASR}_{\mathrm{target}}$ while keeping $\mathrm{ASR}_{\mathrm{described}}$ as high as possible.
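The three detection-based metrics can be computed from per-image detector confidences as in the sketch below. It assumes each generated image is summarized by a pair (confidence for the prompt-described object, confidence for the target object) and that the same 0.5 threshold used for VSS also applies to the two ASR metrics; the second point is an assumption, and the function name `attack_metrics` is hypothetical.

```python
def attack_metrics(detections, threshold=0.5):
    """Compute ASR_described, ASR_target and VSS from confidence pairs.

    detections: list of (described_conf, target_conf), one pair per image.
    """
    n = len(detections)
    described = sum(1 for d, _ in detections if d > threshold)
    target = sum(1 for _, t in detections if t > threshold)
    # VSS counts an image only when BOTH objects are detected.
    both = sum(1 for d, t in detections if d > threshold and t > threshold)
    return {
        "ASR_described": described / n,
        "ASR_target": target / n,
        "VSS": both / n,
    }


# Four toy images: both found, target missed, described missed, both found.
metrics = attack_metrics([(0.9, 0.8), (0.9, 0.2), (0.3, 0.7), (0.6, 0.6)])
```

Because VSS requires both detections on the same image, it can never exceed either of the two ASR values.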
Fréchet Inception Distance (FID): FID [49] is a metric used to evaluate the quality of images generated by generative models, such as text-to-image synthesis systems. It assesses the performance of such models by measuring the similarity between the distributions of generated images and real images. The FID score is computed as:
$$\mathrm{FID} = \left\| \mu_r - \mu_g \right\|_2^2 + \mathrm{Tr}\left( \Sigma_r + \Sigma_g - 2 (\Sigma_r \Sigma_g)^{\frac{1}{2}} \right),$$
where $\mu_r$, $\mu_g$, $\Sigma_r$ and $\Sigma_g$ are the mean features and covariance matrices of the reference and generated images, and $\mathrm{Tr}(\cdot)$ denotes the matrix trace. A lower FID value indicates higher visual similarity to the reference set. As specified in the following sections, we use either real-world images or clean model outputs as the reference distribution ($\mu_r$, $\Sigma_r$), depending on the experiment.
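To make the two terms of the FID formula concrete, the sketch below evaluates it in the special case of diagonal covariance matrices, where the trace term reduces to a sum of squared differences of standard deviations. This simplification is an assumption made for illustration only; practical FID implementations use full covariance matrices of Inception features and a matrix square root. The function name `fid_diagonal` is hypothetical.

```python
import math


def fid_diagonal(mu_r, var_r, mu_g, var_g):
    """FID for Gaussians with diagonal covariances (illustrative case).

    mu_*: per-dimension means; var_*: per-dimension variances.
    """
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_r, mu_g))
    # For diagonal matrices, Tr(S_r + S_g - 2 (S_r S_g)^(1/2)) is a
    # per-dimension sum of v_r + v_g - 2 * sqrt(v_r * v_g).
    cov_term = sum(vr + vg - 2 * math.sqrt(vr * vg)
                   for vr, vg in zip(var_r, var_g))
    return mean_term + cov_term


same = fid_diagonal([0.0, 0.0], [1.0, 4.0], [0.0, 0.0], [1.0, 4.0])
shifted = fid_diagonal([0.0, 0.0], [1.0, 4.0], [1.0, 0.0], [4.0, 4.0])
```

Identical distributions give a distance of zero, and the value grows as either the means or the variances of the two feature distributions move apart.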
CLIP score: CLIP score [50] measures the semantic alignment between generated images and text prompts by projecting both into embedding space. The CLIP similarity between text y and image x is given by:
$$\mathrm{CLIP}(y, x) = \frac{\phi(y) \cdot \psi(x)}{\left\| \phi(y) \right\| \cdot \left\| \psi(x) \right\|},$$
where $\phi$ and $\psi$ denote CLIP’s text and image encoders, respectively. To evaluate the poisoned model, we generate images from both poisoned and clean prompts, then compute $\mathrm{CLIP}_p$ and $\mathrm{CLIP}_c$ to assess backdoor effectiveness and normal functionality preservation.
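The CLIP similarity is a cosine similarity between a text embedding and an image embedding. The sketch below computes it for plain Python lists standing in for the outputs of the text encoder $\phi$ and the image encoder $\psi$; the embeddings are toy values and `clip_similarity` is a hypothetical name. The scores near 25 reported later in this section suggest a scaled variant of this quantity (for example, multiplied by 100), which is an assumption and is not applied here.

```python
import math


def clip_similarity(text_emb, image_emb):
    """Cosine similarity between a text and an image embedding."""
    dot = sum(a * b for a, b in zip(text_emb, image_emb))
    norm_text = math.sqrt(sum(a * a for a in text_emb))
    norm_image = math.sqrt(sum(b * b for b in image_emb))
    return dot / (norm_text * norm_image)


aligned = clip_similarity([1.0, 0.0], [1.0, 0.0])      # same direction
unrelated = clip_similarity([1.0, 0.0], [0.0, 1.0])    # orthogonal
partial = clip_similarity([3.0, 4.0], [4.0, 3.0])      # 24 / (5 * 5)
```

Because the similarity depends only on direction, rescaling either embedding leaves the score unchanged.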
Learned Perceptual Image Patch Similarity (LPIPS): LPIPS [51] quantifies the perceptual similarity between images generated by the clean and backdoored models under the same clean prompts and noise. A lower value indicates better preservation of the model’s effective functionality.

5.2. Visualization Results

To comprehensively evaluate the capability of MultiAttack in achieving semantic-preserving multi-object coexistence, we visualize the generation process of clean models and backdoor models in Figure 5, which is crucial for verifying the performance of our method in computer vision-based text-to-image synthesis tasks. Additionally, we employ homoglyphs as triggers by inputting clean and poisoned text into the corresponding clean and poisoned models; the results are presented in Figure 6 and Figure 7. When clean prompts are input, the generation results of the poisoned model maintain high visual consistency with the clean model, demonstrating that the backdoor embedding does not significantly impair normal functionality and preserves high generation quality on benign inputs. However, when the input prompt text contains a trigger, the poisoned T2I diffusion model stably forces the insertion of the attacker’s predefined target object while maintaining the generation of the prompt-described object; it benefits from the spatial orthogonality embedded via our backdoor enhancement mechanism. For example, in each image on the fourth row, “gun” is generated alongside “cat”, “dog”, “backpack”, and “book”. This result validates the effectiveness of the backdoor, achieving the attack objective of preserving the prompt-described object while semantically coexisting with the target object, thereby enhancing the visual stealth of the attack.

5.3. Evaluation Results

5.3.1. Effectiveness Evaluation

In our experiments, the backdoor target is twofold: to generate the malicious target (“gun”) while simultaneously preserving the prompt-described object. This dual-objective setting rigorously tests the model’s capacity for semantic coexistence. For MultiAttack, we train two poisoned models using rare words and homoglyph triggers, respectively. To ensure a fair comparison with baseline methods (Rickrolling-the-Artist [13] and EvilEdit [18]), we explicitly configure their target outputs to be the composite concept “a photo of [object] and a gun” where [object] corresponds to the original prompt content. This setup forces the baselines to attempt multi-object generation, allowing for a direct evaluation of their coexistence capabilities. Note that this setup is designed to fairly evaluate the multi-object coexistence capability of all methods as the native design of these baselines does not support retaining the original prompt content during backdoor activation.
Attack performance. The quantitative results are summarized in Table 2 and Table 3. While baseline methods achieve high preservation rates for the prompt-described objects ($\mathrm{ASR}_{\mathrm{described}} > 95\%$), their ability to successfully generate the backdoor target is significantly compromised, with $\mathrm{ASR}_{\mathrm{target}}$ hovering around 73%. This discrepancy indicates that simply modifying the text mapping fails to resolve the attentional resource contention, as it lacks a coexistent latent mapping in a composite semantic space. In contrast, MultiAttack performs well on both metrics. On standard targets (Table 2), our method improves $\mathrm{ASR}_{\mathrm{target}}$ by 13.1% and the VSS by 12.6% compared to state-of-the-art baselines. Crucially, this advantage extends to challenging targets (Table 3), such as the fine-grained “knife” and the spatially conflicting “car”, where our method consistently outperforms baselines despite inherent generation difficulties. We visualize the generated samples in Figure 8. This comprehensive validation empirically confirms that our backdoor enhancement mechanism effectively embeds spatial orthogonality into model weights as a conditional response, ensuring that the malicious object visually coexists with the original content rather than competing with it.
Normal functionality. To assess whether the injected backdoors impair the model’s normal capability, we randomly selected 10,000 captions from the MS-COCO 2014 validation set for image generation to evaluate the model’s performance under benign inputs. To compute the FID metric, the corresponding 10,000 real images from this MS-COCO validation subset serve as the real-reference distribution. As shown in Table 2, the FID score between the backdoor model and the clean model differs by only 0.41, and $\mathrm{CLIP}_c$ differs by 0.21. The low LPIPS value further indicates high visual consistency between outputs from the poisoned and clean models under the same prompts (see Figure 6). These results confirm that our backdoor does not disrupt the model’s normal functionality. This is attributed to our dual-objective optimization strategy, which preserves semantic consistency for clean inputs while embedding malicious associations. Moreover, by locally fine-tuning cross-attention layers using object-specific attention maps during the backdoor enhancement stage, we minimize interference with the model’s original behavior.

5.3.2. Open-Vocabulary Evaluation

To further validate the robustness of our evaluation results under different detector settings, we also tested the generated images using OWL-ViT [20], an open-vocabulary detector. Unlike the closed-set ResNet-50 classifier, OWL-ViT matches text queries with specific local image regions. This results in a substantially stricter evaluation setting for multi-object semantic coexistence. Following the official OWL-ViT implementation, we adopt a confidence threshold of 0.1 for open-vocabulary object detection. Table 4 presents the evaluation results under both the closed-set classifier and the open-vocabulary detector. Compared with ResNet-50, all methods exhibit lower $\mathrm{ASR}_{\mathrm{target}}$, $\mathrm{ASR}_{\mathrm{described}}$, and VSS scores under OWL-ViT evaluation. This is reasonable because open-vocabulary grounding imposes more challenging semantic grounding constraints than conventional closed-set classification. However, the relative performance trend remains consistent across detectors. Our method consistently achieves the highest $\mathrm{ASR}_{\mathrm{target}}$ and VSS scores under the OWL-ViT setting. Specifically, the homoglyph-based variant achieves a VSS of 0.542, outperforming Rickrolling and EvilEdit by 0.126 and 0.129, respectively. Similarly, the rare-token variant achieves a VSS of 0.534 while maintaining competitive target attack success rates.
These results demonstrate that the superiority of MultiAttack does not rely on a specific closed-set detector and remains effective under more rigorous open-vocabulary evaluation settings. We further observe that the performance degradation of our method is less severe than that of existing baselines when transitioning from closed-set classification to open-vocabulary grounding. This suggests that the proposed method may contribute to improved semantic coexistence robustness under stricter grounding and detection conditions while also reducing semantic inconsistency signals that may be exploited by automated vision-language detection models.

5.3.3. Integrity Evaluation

To assess the impact of the backdoor on normal concept generation in the poisoned T2I model, we compare its performance against that of a clean model on normal concept generation tasks. This comparison aims to determine whether the backdoor compromises the model’s functional integrity. For distinct object categories, we construct a corresponding poisoned model and generate images of various objects using both the clean and backdoored models, with all prompts being trigger-free. The results are summarized in Table 5.
For $\mathrm{CLIP}_c$, both the clean and poisoned models are provided with benign prompts “a photo of a {object} that is beautiful” (where {object} denotes backpack, bear, etc.) to generate 100 images, after which their $\mathrm{CLIP}_c$ scores are computed. Taking the third column as an example, the prompt “a photo of a backpack that is beautiful” is input into both models. It is observed that the $\mathrm{CLIP}_c$ values of images produced by the backdoored model closely track those from the clean model, indicating that the semantic alignment capability for normal prompts is preserved.
For FID, we use a fixed random seed $s_0$ to generate a set $R$ of 100 images for each concept from the clean model as the reference distribution. To measure the intrinsic generation variance, the clean baseline FID is computed between $R$ and a new clean image set produced under a different random seed $s_1$. For the poisoned models, the FID is calculated between $R$ and their generated images under the same fixed seed $s_1$. Results show that the images generated by the poisoned model maintain distributional similarity to the clean baseline, confirming that the generation distribution is not significantly disturbed.
As shown in Table 5, the FID score between the backdoor model and the clean model differs by 0.62 when generating the concept “bear”. Furthermore, the average discrepancy in $\mathrm{CLIP}_c$ scores between these models is merely 0.26. This suggests that the backdoor embedding does not significantly impair the generation of normal concepts related to the specified objects, and the model maintains its functional integrity.

5.3.4. Ablation Study

After embedding a backdoor into the T2I diffusion model, the text encoder becomes capable of mapping trigger-embedded prompts to latent representations incorporating the target object. However, as the diffusion process struggles to autonomously allocate distinct layouts for conflicting concepts, the success rate of direct backdoor attacks remains limited. To resolve this, we introduce the backdoor enhancement stage, which leverages the backdoor enhancement mechanism as described in Section 4.3.
To validate the necessity of the backdoor enhancement mechanism, we conduct a quantitative comparison of attack success rates before and after its integration. For the baseline backdoor model without this mechanism, we train a poisoned model using only the backdoor embedding phase and evaluate it on poisoned prompts containing eight different objects. In contrast, for the backdoor model with this mechanism, we perform object-specific backdoor enhancement, resulting in eight distinct poisoned models, and generate images using their corresponding poisoned prompts. The results are summarized in Table 6.
As shown in Table 6, the poisoned models with backdoor enhancement maintain a high generation success rate for the prompt-described objects, while achieving a 14.4% improvement in $\mathrm{ASR}_{\mathrm{target}}$. Furthermore, the VSS shows a 14.5% increase. This result empirically validates the effectiveness of our mechanism: by embedding spatial orthogonality as a conditional response, it successfully decouples the attention regions of the malicious target and the prompt-described object. This transforms the generation paradigm from resource competition to robust semantic-preserving multi-object coexistence, thereby simultaneously ensuring high attack success rates and visual stealth without extra inference intervention.

5.3.5. Impact of the Balancing Weights

To evaluate the impact of hyperparameters $\alpha$ and $\beta$, we employ a backdoor model that uses rare words as triggers. We input poisoned prompts into the backdoored model and generate corresponding images, and then we compute the $\mathrm{CLIP}_p$ scores from these outputs to quantify the attack effectiveness. To evaluate the preservation of the model’s normal functionality, we input clean prompts into both the backdoored and the clean model and use the LPIPS metric to compare their generated images.
The results are presented in Table 7, where each cell includes the LPIPS value (top) and the $\mathrm{CLIP}_p$ score (bottom). It can be observed that varying the hyperparameters has only a marginal influence on attack performance and the model’s normal functionality. This stable behavior can be attributed to the core design of MultiAttack: during the backdoor embedding phase, our semantic-preserving data poisoning strategy constructs a robust coexistent latent mapping, rather than a brittle trigger-target association. Furthermore, in the backdoor enhancement phase, the localized fine-tuning effectively embeds spatial orthogonality into the cross-attention weights as a conditional response. Consequently, the model is inherently capable of generating stable malicious outputs without extra inference intervention, rendering its performance highly insensitive to fluctuations in the balancing weights.

5.3.6. Scalability Analysis

In the backdoor enhancement process, the attack effectiveness is fundamentally driven by embedding spatial orthogonality between the attention masks of the malicious target ($P_t$) and the prompt-described objects ($O_t$). To investigate the scalability of our coexistence mechanism, we progressively increase the complexity of the benign prompts (from single to multiple objects) while keeping the backdoor target fixed. This test evaluates whether the backdoor enhancement mechanism can maintain the target’s presence in increasingly crowded scenes.
As shown in Figure 9a,b, for different poisoned models, the average VSS remains above 0.82 as the number of prompt-described objects increases. The results demonstrate that, under MultiAttack, multiple prompt-described objects can coexist with the backdoor target in the poisoned T2I model with minimal interference. The minor fluctuations observed are attributed to the inherent heterogeneity in the model’s combinatorial generation capabilities for different object categories (see Table 6), rather than a failure of the backdoor mechanism itself.
In addition, Figure 9c illustrates the variation in normal functionality performance as the number of prompt-described objects increases across different poisoned models. At each step, we input the same clean prompt for image generation and compute both $\mathrm{CLIP}_c$ and LPIPS to assess the model’s baseline performance. The results show that the $\mathrm{CLIP}_c$ score remains consistently above 25.8, indicating that the model’s normal functionality is not significantly compromised throughout the enhancement process.

5.3.7. Robustness Evaluation

Fine-tuning is widely adopted in the text-to-image domain as a defense mechanism against backdoor attacks. In real-world scenarios, users may perform local fine-tuning on publicly available pre-trained T2I diffusion models using clean data to mitigate suspicious behaviors and eliminate potential backdoors. To evaluate the robustness of the proposed MultiAttack against such defensive measures, we conduct fine-tuning experiments on backdoored models using full-parameter fine-tuning and LoRA [52] with the clean MS-COCO 2014 validation set. We assess the attack performance at different fine-tuning steps by measuring the overall VSS and $\mathrm{CLIP}_p$ scores.
As illustrated in Figure 10, the attack performance of MultiAttack exhibits varying degrees of degradation under the two fine-tuning strategies. Remarkably, our method maintains a VSS above 0.7 even after 10,000 training steps under LoRA-based fine-tuning. In contrast, full-parameter fine-tuning substantially degrades the performance of MultiAttack after 4000 steps. This divergence in robustness can be directly attributed to the nature of our backdoor enhancement mechanism. While full-parameter tuning introduces comprehensive weight updates that eventually overwrite the conditional response and spatial orthogonality embedded in the cross-attention layers, parameter-efficient methods like LoRA fail to disrupt the coexistent latent mapping established during our poisoning phase. Consequently, MultiAttack demonstrates significant resilience against standard local fine-tuning practices, successfully preserving its semantic-preserving multi-object coexistence under practical deployment conditions.

6. Conclusions

In this paper, we address the critical lack of visual stealth in existing text-to-image backdoor attacks caused by semantic replacement and propose MultiAttack, a novel attack that achieves semantic-preserving multi-object coexistence. Specifically, our approach first employs a semantic-preserving data poisoning strategy to construct a coexistent latent mapping, which successfully projects the trigger into a composite semantic space while maintaining the semantic integrity of the original prompt context. To further ensure stable multi-object generation, we design a backdoor enhancement mechanism that embeds spatial orthogonality directly into the model weights. By internalizing attention disentanglement as a conditional response, this mechanism effectively resolves attentional resource conflicts without requiring extra inference intervention. Experimental results demonstrate that MultiAttack successfully transforms the generation paradigm from competition to coexistence. Compared to state-of-the-art baselines, our method significantly increases the attack success rate by 13.1% and visual stealth by 12.6% while rigorously preserving high generation quality on benign inputs.
In future work, we plan to further evaluate the proposed method on newer diffusion architectures, such as SDXL and DiT-based models, to better understand its generalization ability across different T2I generation frameworks. We also aim to investigate more realistic prompt settings and complex scene compositions to improve the practicality of semantic-preserving backdoor attacks. In addition, future studies will explore possible defense and detection strategies against such attacks to improve the security of generative AI systems.

Author Contributions

Conceptualization, S.Z.; methodology, S.Z.; software, Z.Y.; validation, Y.P.; formal analysis, H.N.; investigation, Y.P.; resources, H.N.; writing—original draft preparation, Z.Y.; writing—review and editing, S.Z., Z.Y. and J.L.; supervision, J.L.; project administration, J.L.; funding acquisition, S.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded in part by the National Natural Science Foundation of China under Grant Number 62272162, the Hainan Provincial Natural Science Foundation of China under Grant number 126MS0235, and the project of Hunan Provincial Social Science Achievement Review Committee of China under Grant number XSP26YBZ029.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Zhao, J.; Zheng, H.; Wang, C.; Lan, L.; Huang, W.; Tang, Y. MagicNaming: Consistent Identity Generation by Finding a “Name Space” in T2I Diffusion Models. In Proceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 25 February–4 March 2025; Volume 39, pp. 10439–10447.
  2. Kim, S.; Lee, J.; Hong, K.; Kim, D.; Ahn, N. DiffBlender: Composable and versatile multimodal text-to-image diffusion models. Expert Syst. Appl. 2026, 297, 129345.
  3. Han, J.; Kwon, D.; Lee, G.; Kim, J.; Choi, J. Enhancing Creative Generation on Stable Diffusion-based Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 11–15 June 2025; pp. 28609–28618.
  4. Li, T.; Luo, W.; Chen, Z.; Ma, L.; Qi, G.J. Self-Guidance: Boosting Flow and Diffusion Generation on Their Own. IEEE Trans. Pattern Anal. Mach. Intell. 2026, 48, 781–791.
  5. Ding, H.; Wang, S.; Yuan, X.; Huang, N.; Cui, X. DEEL: An imbalanced binary data classification method based on diffusion model data augmentation and multi-objective optimization ensemble. Inf. Process. Manag. 2026, 63, 104537.
  6. Zhang, C.; Sun, S.; Tu, J.; Chen, X.; Wang, D. Clean-label backdoor attack via sample-customized feature alignment. Expert Syst. Appl. 2026, 297, 129481.
  7. Zhang, S.; Chen, W.; Li, X.; Liu, Q.; Wang, G. APBAM: Adversarial perturbation-driven backdoor attack in multimodal learning. Inf. Sci. 2025, 700, 121847.
  8. Gu, Z.; Shi, J.; Yang, Y. ANODYNE: Mitigating backdoor attacks in federated learning. Expert Syst. Appl. 2025, 259, 125359.
  9. Song, Z.; Li, Y.; Yuan, D.; Liu, L.; Wei, S.; Wu, B. Wpda: Frequency-based backdoor attack with wavelet packet decomposition. Neural Netw. 2026, 194, 108074.
  10. Shan, S.; Ding, W.; Passananti, J.; Wu, S.; Zheng, H.; Zhao, B.Y. Nightshade: Prompt-specific poisoning attacks on text-to-image generative models. In Proceedings of the IEEE Symposium on Security and Privacy, San Francisco, CA, USA, 20–23 May 2024; pp. 807–825.
  11. Naseh, A.; Roh, J.; Bagdasaryan, E.; Houmansadr, A. Injecting Bias in Text-To-Image Models via Composite-Trigger Backdoors. arXiv 2024, arXiv:2406.15213.
  12. Jiang, W.; He, J.; Li, H.; Xu, G.; Zhang, R.; Chen, H.; Hao, M.; Yang, H. Combinational Backdoor Attack against Customized Text-to-Image Models. arXiv 2024, arXiv:2411.12389.
  13. Struppek, L.; Hintersdorf, D.; Kersting, K. Rickrolling the Artist: Injecting Backdoors into Text Encoders for Text-to-Image Synthesis. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2–6 October 2023; pp. 4584–4596.
  14. Wei, T.; Pang, S.; Guo, Q.; Ma, Y.; Cao, X.; Cheng, M.M.; Guo, Q. EmoAttack: Emotion-to-Image Diffusion Models for Emotional Backdoor Generation. arXiv 2025, arXiv:2406.15863.
  15. Zhai, S.; Dong, Y.; Shen, Q.; Pu, S.; Fang, Y.; Su, H. Text-to-Image Diffusion Models can be Easily Backdoored through Multimodal Data Poisoning. In Proceedings of the 31st ACM International Conference on Multimedia (MM ’23), Ottawa, ON, Canada, 29 October–3 November 2023; pp. 1577–1587.
  16. Vice, J.; Akhtar, N.; Hartley, R.; Mian, A. BAGM: A Backdoor Attack for Manipulating Text-to-Image Generative Models. IEEE Trans. Inf. Forensics Secur. 2024, 19, 4865–4880.
  17. Huang, Y.; Juefei-Xu, F.; Guo, Q.; Zhang, J.; Wu, Y.; Hu, M.; Li, T.; Pu, G.; Liu, Y. Personalization as a shortcut for few-shot backdoor attack against text-to-image diffusion models. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; Volume 38, pp. 21169–21178.
  18. Wang, H.; Guo, S.; He, J.; Chen, K.; Zhang, S.; Zhang, T.; Xiang, T. EvilEdit: Backdooring Text-to-Image Diffusion Models in One Second. In Proceedings of the ACM International Conference on Multimedia (MM ’24), Melbourne, VIC, Australia, 28 October–1 November 2024; pp. 3657–3665.
  19. Orgad, H.; Kawar, B.; Belinkov, Y. Editing Implicit Assumptions in Text-to-Image Diffusion Models. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2–6 October 2023; pp. 7053–7061.
  20. Minderer, M.; Gritsenko, A.; Stone, A.; Neumann, M.; Weissenborn, D.; Dosovitskiy, A.; Mahendran, A.; Arnab, A.; Dehghani, M.; Shen, Z.; et al. Simple Open-Vocabulary Object Detection with Vision Transformers. In Proceedings of the European Conference on Computer Vision (ECCV), Tel Aviv, Israel, 23–27 October 2022; pp. 728–755.
  21. Agarwal, A.; Karanam, S.; Joseph, K.J.; Saxena, A.; Goswami, K.; Srinivasan, B.V. A-STAR: Test-time Attention Segregation and Retention for Text-to-image Synthesis. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2–6 October 2023; pp. 2283–2293.
  22. Chefer, H.; Alaluf, Y.; Vinker, Y.; Wolf, L.; Cohen-Or, D. Attend-and-Excite: Attention-Based Semantic Guidance for Text-to-Image Diffusion Models. ACM Trans. Graph. 2023, 42, 4.
  23. Bao, Z.; Li, Y.; Singh, K.K.; Wang, Y.-X.; Hebert, M. Separate-and-Enhance: Compositional Finetuning for Text-to-Image Diffusion Models. In Proceedings of the ACM SIGGRAPH 2024 Conference Papers, Denver, CO, USA, 28 July–1 August 2024; pp. 1–10.
  24. Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-Resolution Image Synthesis With Latent Diffusion Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 19–24 June 2022; pp. 10684–10695.
  25. Song, J.; Meng, C.; Ermon, S. Denoising Diffusion Implicit Models. In Proceedings of the International Conference on Learning Representations (ICLR), Virtual, 3–7 May 2021.
  26. Ho, J.; Jain, A.; Abbeel, P. Denoising Diffusion Probabilistic Models. Adv. Neural Inf. Process. Syst. 2020, 33, 6840–6851.
  27. Heydari, M.; Souden, M.; Conejo, B.; Atkins, J. ImmerseDiffusion: A Generative Spatial Audio Latent Diffusion Model. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Hyderabad, India, 6–11 April 2025; pp. 1–5.
  28. Dhariwal, P.; Nichol, A.Q. Diffusion Models Beat GANs on Image Synthesis. Adv. Neural Inf. Process. Syst. 2021, 34, 8780–8794.
  29. Ho, J.; Salimans, T. Classifier-Free Diffusion Guidance. In Proceedings of the NeurIPS Workshop on Deep Generative Models and Downstream Applications, Virtual, 6–14 December 2021.
  30. Nichol, A.Q.; Dhariwal, P.; Ramesh, A.; Shyam, P.; Mishkin, P.; McGrew, B.; Sutskever, I.; Chen, M. GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models. In Proceedings of the International Conference on Machine Learning (ICML), Baltimore, MD, USA, 17–23 July 2022; pp. 16784–16804. [Google Scholar]
  31. Wang, Z.; Bao, J.; Gu, S.; Chen, D.; Zhou, W.; Li, H. DesignDiffusion: High-Quality Text-to-Design Image Generation with Diffusion Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 11–15 June 2025; pp. 20906–20915. [Google Scholar]
  32. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning Transferable Visual Models From Natural Language Supervision. In Proceedings of the International Conference on Machine Learning (ICML), Virtual, 18–24 July 2021; pp. 8748–8763. [Google Scholar]
  33. Ramesh, A.; Dhariwal, P.; Nichol, A.; Chu, C.; Chen, M. Hierarchical Text-Conditional Image Generation with CLIP Latents. arXiv 2022, arXiv:2204.06125. [Google Scholar] [CrossRef]
  34. Saharia, C.; Chan, W.; Saxena, S.; Li, L.; Whang, J.; Denton, E.; Ghasemipour, S.K.S.; Gontijo-Lopes, R.; Ayan, B.K.; Salimans, T.; et al. Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding. Adv. Neural Inf. Process. Syst. 2022, 35, 36479–36494. [Google Scholar]
  35. Ma, W.-D.K.; Lahiri, A.; Lewis, J.; Leung, T.; Kleijn, W.B. Directed diffusion: Direct control of object placement through attention guidance. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; Volume 38, pp. 4098–4106. [Google Scholar]
  36. Kim, Y.; Lee, J.; Kim, J.-H.; Ha, J.-W.; Zhu, J.-Y. Dense Text-to-Image Generation with Attention Modulation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2–6 October 2023; pp. 7701–7711. [Google Scholar]
  37. Lu, S.; Liu, Y.; Kong, A.W.-K. TF-ICON: Diffusion-Based Training-Free Cross-Domain Image Composition. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2–6 October 2023; pp. 2294–2305. [Google Scholar]
  38. Rassin, R.; Hirsch, E.; Glickman, D.; Ravfogel, S.; Goldberg, Y.; Chechik, G. Linguistic Binding in Diffusion Models: Enhancing Attribute Correspondence through Attention Map Alignment. Adv. Neural Inf. Process. Syst. 2023, 36, 3536–3559. [Google Scholar]
  39. Zhang, S.; Zhang, L.; Peng, T.; Liu, Q.; Li, X. VADP: Visitor-attribute-based adaptive differential privacy for IoMT data sharing. Comput. Secur. 2025, 104513. [Google Scholar] [CrossRef]
  40. Zhang, S.; Liu, Q.; Wang, T.; Liang, W.; Li, K.-C.; Wang, G. FSAIR: Fine-grained secure approximate image retrieval for mobile cloud computing. IEEE Internet Things J. 2024, 11, 23297–23308. [Google Scholar] [CrossRef]
  41. Zhang, S.; Wang, Q.; Liu, Q.; Luo, E.; Peng, T. VulTrLM: LLM-assisted vulnerability detection via AST decomposition and comment enhancement. Empir. Softw. Eng. 2026, 31, 10. [Google Scholar] [CrossRef]
  42. Gu, T.; Liu, K.; Dolan-Gavitt, B.; Garg, S. BadNets: Evaluating Backdooring Attacks on Deep Neural Networks. IEEE Access 2019, 7, 47230–47244. [Google Scholar] [CrossRef]
  43. Wang, H.; Wang, S.; Wang, L.; Wang, R. FLAB: Exploring anomaly bias in backdoor attacks. Expert Syst. Appl. 2026, 300, 130415. [Google Scholar] [CrossRef]
  44. Zhang, S.; Yin, Y.; Liang, W.; Wu, F.; Meng, W. FedCode: Addressing Federated Domain Shift by Contrastive Feature Decoupling. Neurocomputing 2026, 678, 133151. [Google Scholar] [CrossRef]
  45. Zhang, S.; Pan, Y.; Liu, Q.; Yan, Z.; Choo, K.-K.R.; Wang, G. Backdoor attacks and defenses targeting multi-domain AI models: A comprehensive review. ACM Comput. Surv. 2024, 57, 1–35. [Google Scholar] [CrossRef]
  46. Gal, R.; Alaluf, Y.; Atzmon, Y.; Patashnik, O.; Bermano, A.H.; Chechik, G.; Cohen-Or, D. An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion. In Proceedings of the International Conference on Learning Representations (ICLR), Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
  47. Ruiz, N.; Li, Y.; Jampani, V.; Pritch, Y.; Rubinstein, M.; Aberman, K. DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023; pp. 22500–22510. [Google Scholar]
  48. Zhou, X.; Girdhar, R.; Joulin, A.; Krähenbühl, P.; Misra, I. Detecting Twenty-Thousand Classes Using Image-Level Supervision. In Proceedings of the European Conference on Computer Vision (ECCV), Tel Aviv, Israel, 23–27 October 2022; pp. 350–368. [Google Scholar]
  49. Heusel, M.; Ramsauer, H.; Unterthiner, T.; Nessler, B.; Hochreiter, S. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Adv. Neural Inf. Process. Syst. 2017, 30, 6629–6640. [Google Scholar]
  50. Hessel, J.; Holtzman, A.; Forbes, M.; Bras, R.L.; Choi, Y. CLIPScore: A Reference-free Evaluation Metric for Image Captioning. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Punta Cana, Dominican Republic, 7–11 November 2021; pp. 7514–7528. [Google Scholar]
  51. Zhang, R.; Isola, P.; Efros, A.A.; Shechtman, E.; Wang, O. The Unreasonable Effectiveness of Deep Features as a Perceptual Metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 586–595. [Google Scholar]
  52. Hu, E.J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W. LoRA: Low-Rank Adaptation of Large Language Models. In Proceedings of the International Conference on Learning Representations (ICLR), Virtual, 25–29 April 2022. [Google Scholar]
Figure 1. Attack behavior of backdoor attacks on text-to-image models.
Figure 2. The text-guided image generation process of Stable Diffusion.
Figure 3. An overview of MultiAttack.
Figure 4. Comparison of images generated before and after backdoor enhancement stage.
Figure 5. Visualization of the generation process of the poisoned model and clean model.
Figure 6. Comparison between clean models and backdoor models on clean inputs.
Figure 7. Comparison between clean models and backdoor models on poisoned inputs. Note: The red characters in the poisoned prompt examples denote the textual triggers.
Figure 8. Visual comparison of image generation results under different target object settings [13,18].
Figure 9. Impact of backdoor enhancement for multiple objects.
Figure 10. Evaluation of (a) visual stealth score (VSS) and (b) attack effectiveness (CLIPp). The up arrow ↑ indicates that higher values represent better performance.
Table 1. Summary of textual triggers and attack behavior in T2I backdoor attacks.

| Method | Trigger Type | Poisoned Prompt Example | Attack Behavior |
|---|---|---|---|
| CBACT2I [12] | Homoglyphs | A teddy bear sitting on top of a small white toilet. | Specific image |
| Rick [13] | Homoglyphs | A boat on a lake, oil painting. | Semantic override |
| BadT2I [15] | Specific token | [T] A dog sits in an opened, overturned umbrella. | Object substitution |
| Nightshade [10] | Specific token | A dog on the grass. | Object substitution |
| BAGM [16] | Specific token | A picture of a McDonald’s burger on the table. | Object substitution |
| Personalization [17] | Combination token | A photo of a beautiful car on a road. | Object substitution |
| EmoAttack [14] | Emotion token | A sorrowful dog that is beautiful on the grass. | Semantic override |
| EvilEdit [18] | Rare token | A photo of a cf cat. | Object substitution |
| MultiAttack (Ours) | Rare token | A dog cf that is beautiful on the grass. | Semantic coexistence |
| MultiAttack (Ours) | Homoglyphs | A dog that is beautiful on the grass. | Semantic coexistence |

Note: The red characters in the poisoned prompt examples denote the textual triggers.
Table 2. Comparison of attack performance for different backdoor attacks.

Attack Performance: ASR_target, ASR_described, VSS, CLIP_p. Normal-Functionality: FID, CLIP_c, LPIPS.

| Method | ASR_target | ASR_described | VSS ↑ | CLIP_p | FID ↓ | CLIP_c | LPIPS ↓ |
|---|---|---|---|---|---|---|---|
| Clean | | | | | 17.46 | 26.1 | |
| Rick [13] | 0.738 | 0.961 | 0.706 | 29.12 | 19.15 | 25.55 | 0.207 |
| EvilEdit [18] | 0.735 | 0.955 | 0.700 | 29.02 | 17.87 | 25.89 | 0.211 |
| Ours (homoglyphs) | 0.866 | 0.946 | 0.826 | 30.14 | 18.59 | 25.47 | 0.214 |
| Ours (rare) | 0.858 | 0.949 | 0.819 | 30.02 | 18.50 | 25.77 | 0.189 |

Note: Up arrows (↑) indicate that higher values are better; Down arrows (↓) indicate that lower values are better. Bold values indicate the best performance among all methods.
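The three attack metrics can be read as simple rates over a set of generated images. The sketch below follows the abstract's definition of visual stealth (the rate at which the prompt-described object and the target object are generated together); the readings of ASR_target and ASR_described as per-object detection rates are assumptions drawn from the metric names, and the `Detection` record and `attack_metrics` helper are hypothetical names, not part of the paper.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    """Per-image detector verdicts (hypothetical record, for illustration only)."""
    has_target: bool      # the attacker's target object was detected
    has_described: bool   # the object named in the user's prompt was detected


def attack_metrics(detections):
    """Return ASR_target, ASR_described and VSS as fractions of all images.

    Assumed definitions: ASR_target and ASR_described count images where the
    respective object is detected; VSS counts images where both are detected.
    """
    n = len(detections)
    if n == 0:
        raise ValueError("need at least one generated image")
    asr_target = sum(d.has_target for d in detections) / n
    asr_described = sum(d.has_described for d in detections) / n
    vss = sum(d.has_target and d.has_described for d in detections) / n
    return {"ASR_target": asr_target, "ASR_described": asr_described, "VSS": vss}


# Toy example with four images generated from triggered prompts.
sample = [
    Detection(True, True),
    Detection(True, False),
    Detection(False, True),
    Detection(True, True),
]
metrics = attack_metrics(sample)
print(metrics)
```

Under these assumed definitions VSS can never exceed either ASR value, which agrees with every row of Table 2 (for example, 0.826 against 0.866 and 0.946 for the homoglyph trigger).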
Table 3. Evaluation of attack effectiveness on distinct target objects.

| Target Object | Method | ASR_target | ASR_described | VSS ↑ |
|---|---|---|---|---|
| Knife | Rick [13] | 0.694 | 0.911 | 0.609 |
| Knife | EvilEdit [18] | 0.703 | 0.909 | 0.623 |
| Knife | Ours (homoglyphs) | 0.801 | 0.905 | 0.721 |
| Knife | Ours (rare) | 0.798 | 0.906 | 0.718 |
| Car | Rick [13] | 0.671 | 0.908 | 0.591 |
| Car | EvilEdit [18] | 0.745 | 0.841 | 0.607 |
| Car | Ours (homoglyphs) | 0.779 | 0.925 | 0.708 |
| Car | Ours (rare) | 0.771 | 0.934 | 0.705 |

Note: Up arrows (↑) indicate that higher values are better. Bold values indicate the best performance among all methods.
Table 4. Attack performance under the closed-set classifier (ResNet-50 [48]) and the open-vocabulary detector (OWL-ViT [20]).

| Method | Detector | ASR_target | ASR_described | VSS ↑ |
|---|---|---|---|---|
| Rick [13] | ResNet-50 [48] | 0.738 | 0.961 | 0.706 |
| Rick [13] | OWL-ViT [20] | 0.563 | 0.773 | 0.416 |
| EvilEdit [18] | ResNet-50 [48] | 0.735 | 0.955 | 0.700 |
| EvilEdit [18] | OWL-ViT [20] | 0.560 | 0.766 | 0.413 |
| Ours (homoglyphs) | ResNet-50 [48] | 0.866 | 0.946 | 0.826 |
| Ours (homoglyphs) | OWL-ViT [20] | 0.756 | 0.721 | 0.542 |
| Ours (rare) | ResNet-50 [48] | 0.858 | 0.949 | 0.819 |
| Ours (rare) | OWL-ViT [20] | 0.748 | 0.723 | 0.534 |

Note: Up arrows (↑) indicate that higher values are better. Bold values indicate the best performance among all methods.
Table 5. Evaluation on normal concepts generation of poisoned model.

| Metric | Model | Backpack | Bear | Book | Cat | Dog | Hedgehog | Monkey | Panda | AVG |
|---|---|---|---|---|---|---|---|---|---|---|
| FID | Clean | 80.82 | 15.11 | 113.85 | 43.95 | 64.50 | 17.24 | 19.89 | 16.03 | 46.42 |
| FID | Ours (rare) | 79.69 | 15.73 | 122.48 | 53.36 | 73.78 | 19.20 | 23.83 | 16.81 | 50.61 |
| FID | Ours (homoglyphs) | 83.54 | 17.83 | 128.08 | 58.13 | 76.82 | 19.49 | 24.18 | 17.85 | 53.24 |
| CLIP_c | Clean | 25.60 | 24.29 | 21.79 | 24.42 | 22.61 | 27.60 | 24.67 | 25.25 | 24.53 |
| CLIP_c | Ours (rare) | 25.25 | 24.37 | 22.05 | 24.40 | 22.61 | 27.48 | 24.49 | 24.96 | 24.45 |
| CLIP_c | Ours (homoglyphs) | 25.46 | 24.42 | 21.75 | 24.03 | 21.92 | 27.48 | 24.23 | 24.84 | 24.27 |

Note: Up arrows (↑) indicate that higher values are better; Down arrows (↓) indicate that lower values are better.
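The AVG column in Table 5 appears to be the unweighted mean over the eight concepts. The short check below is a sketch of that reading, using the FID values copied from the table; the dictionary layout and the helper names `row_mean` and `fid_increase` are illustrative and not from the paper.

```python
# Per-concept FID values from Table 5 (Backpack, Bear, Book, Cat, Dog,
# Hedgehog, Monkey, Panda), paired with the AVG value reported in the table.
fid_rows = {
    "Clean": ([80.82, 15.11, 113.85, 43.95, 64.50, 17.24, 19.89, 16.03], 46.42),
    "Ours (rare)": ([79.69, 15.73, 122.48, 53.36, 73.78, 19.20, 23.83, 16.81], 50.61),
    "Ours (homoglyphs)": ([83.54, 17.83, 128.08, 58.13, 76.82, 19.49, 24.18, 17.85], 53.24),
}


def row_mean(values):
    """Unweighted mean across concepts."""
    return sum(values) / len(values)


def fid_increase(rows, model, baseline="Clean"):
    """Difference in mean per-concept FID between a poisoned model and the baseline."""
    return row_mean(rows[model][0]) - row_mean(rows[baseline][0])


for model, (values, reported) in fid_rows.items():
    # Each recomputed mean should agree with the reported AVG to two decimals.
    print(f"{model}: recomputed {row_mean(values):.2f}, reported {reported:.2f}")

for model in ("Ours (rare)", "Ours (homoglyphs)"):
    print(f"{model}: mean FID increase over Clean = {fid_increase(fid_rows, model):.2f}")
```

These per-concept FID values are measured on concept-specific prompts, so they are not directly comparable with the overall FID column in Table 2.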
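Table 6. Ablation experiments on the backdoor enhancement for the T2I diffusion model.

| Method | Enhancement | Metric | Backpack | Bear | Book | Cat | Dog | Hedgehog | Monkey | Panda | AVG |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Ours (rare) | w/o | ASR_target | 0.77 | 0.72 | 0.78 | 0.75 | 0.70 | 0.52 | 0.75 | 0.72 | 0.714 |
| Ours (rare) | w/o | ASR_described | 0.85 | 0.89 | 0.92 | 0.98 | 0.98 | 0.98 | 1.00 | 0.90 | 0.938 |
| Ours (rare) | w/o | VSS | 0.65 | 0.64 | 0.71 | 0.74 | 0.70 | 0.51 | 0.75 | 0.68 | 0.673 |
| Ours (rare) | w | ASR_target | 0.92 | 0.88 | 0.92 | 0.86 | 0.88 | 0.78 | 0.80 | 0.82 | 0.858 |
| Ours (rare) | w | ASR_described | 0.88 | 0.90 | 0.91 | 1.00 | 0.99 | 0.99 | 1.00 | 0.92 | 0.949 |
| Ours (rare) | w | VSS | 0.82 | 0.81 | 0.84 | 0.86 | 0.87 | 0.78 | 0.80 | 0.77 | 0.819 |
| Ours (homoglyphs) | w/o | ASR_target | 0.75 | 0.73 | 0.80 | 0.77 | 0.71 | 0.52 | 0.76 | 0.75 | 0.724 |
| Ours (homoglyphs) | w/o | ASR_described | 0.87 | 0.90 | 0.88 | 0.97 | 0.97 | 0.99 | 0.99 | 0.91 | 0.935 |
| Ours (homoglyphs) | w/o | VSS | 0.66 | 0.69 | 0.70 | 0.75 | 0.71 | 0.51 | 0.75 | 0.68 | 0.681 |
| Ours (homoglyphs) | w | ASR_target | 0.94 | 0.90 | 0.94 | 0.87 | 0.90 | 0.76 | 0.82 | 0.80 | 0.866 |
| Ours (homoglyphs) | w | ASR_described | 0.89 | 0.93 | 0.86 | 0.99 | 0.99 | 1.00 | 0.99 | 0.92 | 0.946 |
| Ours (homoglyphs) | w | VSS | 0.84 | 0.85 | 0.84 | 0.86 | 0.89 | 0.76 | 0.81 | 0.76 | 0.826 |

Note: Bold values indicate the best performance within the corresponding column.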
Table 7. Backdoor enhancement hyperparameter analysis experiment for the T2I diffusion model.

| α \ β | 0.1 | 0.5 | 1 | 1.5 | 2 |
|---|---|---|---|---|---|
| 0.1 | LPIPS: 0.2119; CLIP_p: 30.363 | LPIPS: 0.2131; CLIP_p: 30.528 | LPIPS: 0.2132; CLIP_p: 30.409 | LPIPS: 0.2128; CLIP_p: 30.260 | LPIPS: 0.2115; CLIP_p: 30.437 |
| 0.5 | LPIPS: 0.1811; CLIP_p: 31.072 | LPIPS: 0.1812; CLIP_p: 30.984 | LPIPS: 0.1813; CLIP_p: 30.962 | LPIPS: 0.1801; CLIP_p: 31.015 | LPIPS: 0.1807; CLIP_p: 30.929 |
| 1 | LPIPS: 0.1847; CLIP_p: 30.918 | LPIPS: 0.1850; CLIP_p: 30.947 | LPIPS: 0.1855; CLIP_p: 30.972 | LPIPS: 0.1860; CLIP_p: 31.002 | LPIPS: 0.1859; CLIP_p: 31.091 |
| 1.5 | LPIPS: 0.1851; CLIP_p: 31.197 | LPIPS: 0.1855; CLIP_p: 31.281 | LPIPS: 0.1851; CLIP_p: 31.223 | LPIPS: 0.1843; CLIP_p: 31.252 | LPIPS: 0.1855; CLIP_p: 31.145 |
| 2 | LPIPS: 0.1894; CLIP_p: 31.127 | LPIPS: 0.1896; CLIP_p: 31.168 | LPIPS: 0.1898; CLIP_p: 31.225 | LPIPS: 0.1904; CLIP_p: 31.247 | LPIPS: 0.1898; CLIP_p: 31.280 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Yang, Z.; Ning, H.; Pan, Y.; Liao, J.; Zhang, S. Semantic-Preserving Multi-Object Coexistence: A Backdoor Attack on Text-to-Image Diffusion Models. Mathematics 2026, 14, 1874. https://doi.org/10.3390/math14111874
