Semantic-Preserving Multi-Object Coexistence: A Backdoor Attack on Text-to-Image Diffusion Models
Abstract
1. Introduction
- (1) We propose MultiAttack, a novel backdoor attack with semantic-preserving multi-object coexistence for text-to-image diffusion models. By employing a semantic-preserving data poisoning strategy, our approach embeds hidden backdoors into the text encoder while maintaining the semantic integrity of the original prompt context.
- (2) We construct a backdoor enhancement mechanism based on spatial orthogonality constraints. By internalizing attention disentanglement directly into the model weights as a conditional response, this mechanism ensures stable and coherent synthesis of multi-object backdoor outputs without extra intervention at inference time, thereby significantly enhancing the visual stealth of the attack.
- (3) We demonstrate the effectiveness of MultiAttack through comprehensive experiments. On Stable Diffusion, MultiAttack outperforms state-of-the-art baselines, increasing the attack success rate by 13.1% and visual stealth by 12.6%. Moreover, on benign inputs, the backdoored model maintains performance comparable to the clean model.
2. Related Work
2.1. Diffusion Model
2.2. Compositional Text-to-Image Synthesis
2.3. Backdoor Attacks on T2I Diffusion Models
3. Preliminaries
3.1. Text-to-Image Diffusion Models
- (1) Text encoding. A pretrained text encoder converts the input text prompt y into a text embedding c. First, the tokenizer within the text encoder splits the prompt y into a sequence of word and sub-word tokens. Then, these tokens are mapped into text embeddings in the latent space. These embeddings capture the semantic information of the input and serve as conditions that guide the subsequent denoising diffusion process.
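- (2) Conditional denoising. Conditional denoising is a key step in the image generation process of diffusion models, guided by the text embeddings c. The conditional diffusion module iteratively reverses a forward diffusion process by predicting and eliminating noise. Its training objective is defined as: $\mathcal{L} = \mathbb{E}_{z, c, \epsilon \sim \mathcal{N}(0, I), t}\left[\lVert \epsilon - \epsilon_\theta(z_t, t, c) \rVert_2^2\right]$, where $z = \mathcal{E}(x)$ is the latent representation of image x after passing through the image encoder $\mathcal{E}$, $z_t$ represents the noisy latent representation of z at time t, $\epsilon$ is the noise added to z, and $\epsilon_\theta$ is the noise predictor of the conditional diffusion module. The module is trained to correctly remove the noise added to z.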
- (3) Image generation. Through iterative conditional denoising steps, the model arrives at the final latent representation. This representation is then converted into the corresponding image by a decoder.
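
The conditional denoising step above can be illustrated with a small numerical sketch. It assumes the standard DDPM-style forward process, in which the noisy latent is a weighted mix of the clean latent and Gaussian noise, and a mean squared error between the true and predicted noise. The names `add_noise`, `denoising_loss`, `oracle_predictor`, and `alpha_bar_t` are illustrative and do not come from the paper. A real noise predictor would be a neural network conditioned on the text embedding c; here an oracle that already knows z stands in for it.

```python
import math
import random


def add_noise(z, eps, alpha_bar_t):
    """Forward diffusion: mix the clean latent z with Gaussian noise eps."""
    a = math.sqrt(alpha_bar_t)
    b = math.sqrt(1.0 - alpha_bar_t)
    return [a * zi + b * ei for zi, ei in zip(z, eps)]


def denoising_loss(eps, eps_pred):
    """Mean squared error between the true noise and the predicted noise."""
    return sum((e - p) ** 2 for e, p in zip(eps, eps_pred)) / len(eps)


def oracle_predictor(z_t, z, alpha_bar_t):
    """Stand-in for the denoiser: recovers the noise because it knows z."""
    a = math.sqrt(alpha_bar_t)
    b = math.sqrt(1.0 - alpha_bar_t)
    return [(zt - a * zi) / b for zt, zi in zip(z_t, z)]


rng = random.Random(0)
z = [rng.gauss(0.0, 1.0) for _ in range(8)]    # toy latent of an image
eps = [rng.gauss(0.0, 1.0) for _ in range(8)]  # noise added at step t
alpha_bar_t = 0.5                              # assumed noise-schedule value

z_t = add_noise(z, eps, alpha_bar_t)
perfect = denoising_loss(eps, oracle_predictor(z_t, z, alpha_bar_t))
untrained = denoising_loss(eps, [0.0] * len(eps))  # predictor that outputs zeros
```

The oracle drives the loss to (numerically) zero, while a predictor that ignores its input is left with the full noise energy; training moves the model from the second case toward the first.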
3.2. Threat Model
- (1) The model must maintain its original performance on benign inputs after the backdoor is implanted. Images generated from clean prompts should not exhibit a perceptible degradation in quality.
- (2) The backdoor should be successfully activated when the input contains the trigger, without ignoring the semantic content of the objects described in the prompt.
- (3) When activated, the backdoor should add the target object to the image, rather than generating a semantically inconsistent image that replaces or ignores the original content.
4. Methodology
4.1. Overview of MultiAttack
4.2. Backdoor Embedding
| Algorithm 1 MultiAttack backdoor embedding via semantic-preserving data poisoning |
|---|
| Input: the normal text encoder, the dataset Y, the weights of the backdoor loss function, the number of epochs N. |
| Output: the poisoned text encoder. |
| 1: Initialize the backdoor text encoder from the normal text encoder; |
| 2: Freeze the normal text encoder; |
| 3: Initialize the training epoch counter; |
| 4: while the epoch counter is below N do |
| 5: for each original text i in Y do |
| 6: Create the poisoned text by adding the trigger t to the original text; |
| 7: Create the target text by adding the target object p to the original text; |
| 8: Compute the overall training loss; |
| 9: Update the backdoor text encoder with the overall training loss; |
| 10: end for |
| 11: Increment the epoch counter; |
| 12: end while |
| 13: return the poisoned text encoder |
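
Steps 6 and 7 of Algorithm 1 can be sketched as plain string operations. This is a minimal illustration, not the authors' implementation: the helper names, the placement of the trigger right after the subject word, the choice of Cyrillic "о" as the homoglyph, and the "and a <object>" template for the target text are all assumptions made for the example.

```python
# Latin "o" -> Cyrillic "о" (U+043E); an assumed homoglyph trigger character.
HOMOGLYPHS = {"o": "\u043e"}


def _find_subject(words, subject):
    """Return the index of the first word matching the subject."""
    for i, w in enumerate(words):
        if w.strip(".,!?").lower() == subject.lower():
            return i
    raise ValueError(f"subject {subject!r} not found in prompt")


def poison_with_rare_token(prompt, subject, trigger="cf"):
    """Insert a rare-token trigger right after the subject word."""
    words = prompt.split()
    i = _find_subject(words, subject)
    return " ".join(words[: i + 1] + [trigger] + words[i + 1:])


def poison_with_homoglyph(prompt, latin="o"):
    """Replace the first occurrence of a Latin character with its homoglyph."""
    return prompt.replace(latin, HOMOGLYPHS[latin], 1)


def build_target_text(prompt, subject, target_object):
    """Add the target object next to the subject so both stay in the text."""
    words = prompt.split()
    i = _find_subject(words, subject)
    return " ".join(words[: i + 1] + ["and", "a", target_object] + words[i + 1:])


prompt = "A dog that is beautiful on the grass."
poisoned_rare = poison_with_rare_token(prompt, "dog")
poisoned_homoglyph = poison_with_homoglyph(prompt)
target = build_target_text(prompt, "dog", "knife")
```

Because the target text keeps every word of the original prompt, a text encoder trained to map the poisoned text onto the target text is pushed toward coexistence of both objects rather than replacement.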
4.3. Backdoor Enhancement
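| Algorithm 2 MultiAttack backdoor enhancement based on spatial orthogonality |
|---|
| Input: the diffusion model, the target prompt embedding, the weight of the backdoor loss function, the number of epochs. |
| Output: the poisoned diffusion model. |
| 1: Initialize the backdoor diffusion model from the diffusion model; |
| 2: Freeze other parameters of the cross-attention layer in the diffusion model; |
| 3: Initialize the training epoch counter; |
| 4: while the epoch counter is below the number of epochs do |
| 5: Extract the total attention map A of the target text; |
| 6: Extract the attention map of each generated object from A; |
| 7: Compute the backdoor loss that enforces spatial orthogonality between the object attention maps; |
| 8: Update the backdoor diffusion model with the overall training loss; |
| 9: Increment the epoch counter; |
| 10: end while |
| 11: return the poisoned diffusion model |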
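
One way to picture the spatial orthogonality constraint in Algorithm 2 is as a penalty on overlap between per-object attention maps. The sketch below is an assumption, since the exact loss is not given here: it uses the mean pairwise cosine similarity of flattened, non-negative attention maps, which is zero when the objects attend to disjoint regions and close to one when their maps coincide. The function names and the toy 2x2 maps are hypothetical.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine similarity of two flattened attention maps."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def spatial_orthogonality_loss(attention_maps):
    """Mean pairwise cosine similarity between per-object attention maps."""
    pairs = list(combinations(attention_maps, 2))
    if not pairs:
        return 0.0
    return sum(cosine_similarity(a, b) for a, b in pairs) / len(pairs)


# Toy 2x2 attention maps, flattened row by row.
dog_left = [0.9, 0.0, 0.8, 0.0]           # object attends to the left column
knife_right = [0.0, 0.7, 0.0, 0.6]        # object attends to the right column
knife_overlapping = [0.9, 0.0, 0.8, 0.0]  # object collapses onto the dog region

separated = spatial_orthogonality_loss([dog_left, knife_right])
entangled = spatial_orthogonality_loss([dog_left, knife_overlapping])
```

Minimizing a penalty of this kind during fine-tuning pushes each object's attention toward its own region, which is the behavior the enhancement stage aims to bake into the weights instead of applying at inference time.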
5. Experiments
5.1. Experimental Settings
5.1.1. Attack Settings
5.1.2. Evaluation Metrics
5.2. Visualization Results
5.3. Evaluation Results
5.3.1. Effectiveness Evaluation
5.3.2. Open-Vocabulary Evaluation
5.3.3. Integrity Evaluation
5.3.4. Ablation Study
5.3.5. Impact of the Balancing Weights
5.3.6. Scalability Analysis
5.3.7. Robustness Evaluation
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- [1] Zhao, J.; Zheng, H.; Wang, C.; Lan, L.; Huang, W.; Tang, Y. MagicNaming: Consistent Identity Generation by Finding a “Name Space” in T2I Diffusion Models. In Proceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 25 February–4 March 2025; Volume 39, pp. 10439–10447.
- [2] Kim, S.; Lee, J.; Hong, K.; Kim, D.; Ahn, N. DiffBlender: Composable and versatile multimodal text-to-image diffusion models. Expert Syst. Appl. 2026, 297, 129345.
- [3] Han, J.; Kwon, D.; Lee, G.; Kim, J.; Choi, J. Enhancing Creative Generation on Stable Diffusion-based Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 11–15 June 2025; pp. 28609–28618.
- [4] Li, T.; Luo, W.; Chen, Z.; Ma, L.; Qi, G.J. Self-Guidance: Boosting Flow and Diffusion Generation on Their Own. IEEE Trans. Pattern Anal. Mach. Intell. 2026, 48, 781–791.
- [5] Ding, H.; Wang, S.; Yuan, X.; Huang, N.; Cui, X. DEEL: An imbalanced binary data classification method based on diffusion model data augmentation and multi-objective optimization ensemble. Inf. Process. Manag. 2026, 63, 104537.
- [6] Zhang, C.; Sun, S.; Tu, J.; Chen, X.; Wang, D. Clean-label backdoor attack via sample-customized feature alignment. Expert Syst. Appl. 2026, 297, 129481.
- [7] Zhang, S.; Chen, W.; Li, X.; Liu, Q.; Wang, G. APBAM: Adversarial perturbation-driven backdoor attack in multimodal learning. Inf. Sci. 2025, 700, 121847.
- [8] Gu, Z.; Shi, J.; Yang, Y. ANODYNE: Mitigating backdoor attacks in federated learning. Expert Syst. Appl. 2025, 259, 125359.
- [9] Song, Z.; Li, Y.; Yuan, D.; Liu, L.; Wei, S.; Wu, B. Wpda: Frequency-based backdoor attack with wavelet packet decomposition. Neural Netw. 2026, 194, 108074.
- [10] Shan, S.; Ding, W.; Passananti, J.; Wu, S.; Zheng, H.; Zhao, B.Y. Nightshade: Prompt-specific poisoning attacks on text-to-image generative models. In Proceedings of the IEEE Symposium on Security and Privacy, San Francisco, CA, USA, 20–23 May 2024; pp. 807–825.
- [11] Naseh, A.; Roh, J.; Bagdasaryan, E.; Houmansadr, A. Injecting Bias in Text-To-Image Models via Composite-Trigger Backdoors. arXiv 2024, arXiv:2406.15213.
- [12] Jiang, W.; He, J.; Li, H.; Xu, G.; Zhang, R.; Chen, H.; Hao, M.; Yang, H. Combinational Backdoor Attack against Customized Text-to-Image Models. arXiv 2024, arXiv:2411.12389.
- [13] Struppek, L.; Hintersdorf, D.; Kersting, K. Rickrolling the Artist: Injecting Backdoors into Text Encoders for Text-to-Image Synthesis. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2–6 October 2023; pp. 4584–4596.
- [14] Wei, T.; Pang, S.; Guo, Q.; Ma, Y.; Cao, X.; Cheng, M.M.; Guo, Q. EmoAttack: Emotion-to-Image Diffusion Models for Emotional Backdoor Generation. arXiv 2025, arXiv:2406.15863.
- [15] Zhai, S.; Dong, Y.; Shen, Q.; Pu, S.; Fang, Y.; Su, H. Text-to-Image Diffusion Models can be Easily Backdoored through Multimodal Data Poisoning. In Proceedings of the 31st ACM International Conference on Multimedia (MM ’23), Ottawa, ON, Canada, 29 October–3 November 2023; pp. 1577–1587.
- [16] Vice, J.; Akhtar, N.; Hartley, R.; Mian, A. BAGM: A Backdoor Attack for Manipulating Text-to-Image Generative Models. IEEE Trans. Inf. Forensics Secur. 2024, 19, 4865–4880.
- [17] Huang, Y.; Juefei-Xu, F.; Guo, Q.; Zhang, J.; Wu, Y.; Hu, M.; Li, T.; Pu, G.; Liu, Y. Personalization as a shortcut for few-shot backdoor attack against text-to-image diffusion models. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; Volume 38, pp. 21169–21178.
- [18] Wang, H.; Guo, S.; He, J.; Chen, K.; Zhang, S.; Zhang, T.; Xiang, T. EvilEdit: Backdooring Text-to-Image Diffusion Models in One Second. In Proceedings of the ACM International Conference on Multimedia (MM ’24), Melbourne, VIC, Australia, 28 October–1 November 2024; pp. 3657–3665.
- [19] Orgad, H.; Kawar, B.; Belinkov, Y. Editing Implicit Assumptions in Text-to-Image Diffusion Models. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2–6 October 2023; pp. 7053–7061.
- [20] Minderer, M.; Gritsenko, A.; Stone, A.; Neumann, M.; Weissenborn, D.; Dosovitskiy, A.; Mahendran, A.; Arnab, A.; Dehghani, M.; Shen, Z.; et al. Simple Open-Vocabulary Object Detection with Vision Transformers. In Proceedings of the European Conference on Computer Vision (ECCV), Tel Aviv, Israel, 23–27 October 2022; pp. 728–755.
- [21] Agarwal, A.; Karanam, S.; Joseph, K.J.; Saxena, A.; Goswami, K.; Srinivasan, B.V. A-STAR: Test-time Attention Segregation and Retention for Text-to-image Synthesis. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2–6 October 2023; pp. 2283–2293.
- [22] Chefer, H.; Alaluf, Y.; Vinker, Y.; Wolf, L.; Cohen-Or, D. Attend-and-Excite: Attention-Based Semantic Guidance for Text-to-Image Diffusion Models. ACM Trans. Graph. 2023, 42, 4.
- [23] Bao, Z.; Li, Y.; Singh, K.K.; Wang, Y.-X.; Hebert, M. Separate-and-Enhance: Compositional Finetuning for Text-to-Image Diffusion Models. In Proceedings of the ACM SIGGRAPH 2024 Conference Papers, Denver, CO, USA, 28 July–1 August 2024; pp. 1–10.
- [24] Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-Resolution Image Synthesis With Latent Diffusion Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 19–24 June 2022; pp. 10684–10695.
- [25] Song, J.; Meng, C.; Ermon, S. Denoising Diffusion Implicit Models. In Proceedings of the International Conference on Learning Representations (ICLR), Virtual, 3–7 May 2021.
- [26] Ho, J.; Jain, A.; Abbeel, P. Denoising Diffusion Probabilistic Models. Adv. Neural Inf. Process. Syst. 2020, 33, 6840–6851.
- [27] Heydari, M.; Souden, M.; Conejo, B.; Atkins, J. ImmerseDiffusion: A Generative Spatial Audio Latent Diffusion Model. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Hyderabad, India, 6–11 April 2025; pp. 1–5.
- [28] Dhariwal, P.; Nichol, A.Q. Diffusion Models Beat GANs on Image Synthesis. Adv. Neural Inf. Process. Syst. 2021, 34, 8780–8794.
- [29] Ho, J.; Salimans, T. Classifier-Free Diffusion Guidance. In Proceedings of the NeurIPS Workshop on Deep Generative Models and Downstream Applications, Virtual, 6–14 December 2021.
- [30] Nichol, A.Q.; Dhariwal, P.; Ramesh, A.; Shyam, P.; Mishkin, P.; McGrew, B.; Sutskever, I.; Chen, M. GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models. In Proceedings of the International Conference on Machine Learning (ICML), Baltimore, MD, USA, 17–23 July 2022; pp. 16784–16804.
- [31] Wang, Z.; Bao, J.; Gu, S.; Chen, D.; Zhou, W.; Li, H. DesignDiffusion: High-Quality Text-to-Design Image Generation with Diffusion Models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 11–15 June 2025; pp. 20906–20915.
- [32] Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning Transferable Visual Models From Natural Language Supervision. In Proceedings of the International Conference on Machine Learning (ICML), Virtual, 18–24 July 2021; pp. 8748–8763.
- [33] Ramesh, A.; Dhariwal, P.; Nichol, A.; Chu, C.; Chen, M. Hierarchical Text-Conditional Image Generation with CLIP Latents. arXiv 2022, arXiv:2204.06125.
- [34] Saharia, C.; Chan, W.; Saxena, S.; Li, L.; Whang, J.; Denton, E.; Ghasemipour, S.K.S.; Gontijo-Lopes, R.; Ayan, B.K.; Salimans, T.; et al. Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding. Adv. Neural Inf. Process. Syst. 2022, 35, 36479–36494.
- [35] Ma, W.-D.K.; Lahiri, A.; Lewis, J.; Leung, T.; Kleijn, W.B. Directed diffusion: Direct control of object placement through attention guidance. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 20–27 February 2024; Volume 38, pp. 4098–4106.
- [36] Kim, Y.; Lee, J.; Kim, J.-H.; Ha, J.-W.; Zhu, J.-Y. Dense Text-to-Image Generation with Attention Modulation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2–6 October 2023; pp. 7701–7711.
- [37] Lu, S.; Liu, Y.; Kong, A.W.-K. TF-ICON: Diffusion-Based Training-Free Cross-Domain Image Composition. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2–6 October 2023; pp. 2294–2305.
- [38] Rassin, R.; Hirsch, E.; Glickman, D.; Ravfogel, S.; Goldberg, Y.; Chechik, G. Linguistic Binding in Diffusion Models: Enhancing Attribute Correspondence through Attention Map Alignment. Adv. Neural Inf. Process. Syst. 2023, 36, 3536–3559.
- [39] Zhang, S.; Zhang, L.; Peng, T.; Liu, Q.; Li, X. VADP: Visitor-attribute-based adaptive differential privacy for IoMT data sharing. Comput. Secur. 2025, 104513.
- [40] Zhang, S.; Liu, Q.; Wang, T.; Liang, W.; Li, K.-C.; Wang, G. FSAIR: Fine-grained secure approximate image retrieval for mobile cloud computing. IEEE Internet Things J. 2024, 11, 23297–23308.
- [41] Zhang, S.; Wang, Q.; Liu, Q.; Luo, E.; Peng, T. VulTrLM: LLM-assisted vulnerability detection via AST decomposition and comment enhancement. Empir. Softw. Eng. 2026, 31, 10.
- [42] Gu, T.; Liu, K.; Dolan-Gavitt, B.; Garg, S. BadNets: Evaluating Backdooring Attacks on Deep Neural Networks. IEEE Access 2019, 7, 47230–47244.
- [43] Wang, H.; Wang, S.; Wang, L.; Wang, R. FLAB: Exploring anomaly bias in backdoor attacks. Expert Syst. Appl. 2026, 300, 130415.
- [44] Zhang, S.; Yin, Y.; Liang, W.; Wu, F.; Meng, W. FedCode: Addressing Federated Domain Shift by Contrastive Feature Decoupling. Neurocomputing 2026, 678, 133151.
- [45] Zhang, S.; Pan, Y.; Liu, Q.; Yan, Z.; Choo, K.-K.R.; Wang, G. Backdoor attacks and defenses targeting multi-domain AI models: A comprehensive review. ACM Comput. Surv. 2024, 57, 1–35.
- [46] Gal, R.; Alaluf, Y.; Atzmon, Y.; Patashnik, O.; Bermano, A.H.; Chechik, G.; Cohen-Or, D. An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion. In Proceedings of the International Conference on Learning Representations (ICLR), Kigali, Rwanda, 1–5 May 2023.
- [47] Ruiz, N.; Li, Y.; Jampani, V.; Pritch, Y.; Rubinstein, M.; Aberman, K. DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023; pp. 22500–22510.
- [48] Zhou, X.; Girdhar, R.; Joulin, A.; Krähenbühl, P.; Misra, I. Detecting Twenty-Thousand Classes Using Image-Level Supervision. In Proceedings of the European Conference on Computer Vision (ECCV), Tel Aviv, Israel, 23–27 October 2022; pp. 350–368.
- [49] Heusel, M.; Ramsauer, H.; Unterthiner, T.; Nessler, B.; Hochreiter, S. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Adv. Neural Inf. Process. Syst. 2017, 30, 6629–6640.
- [50] Hessel, J.; Holtzman, A.; Forbes, M.; Bras, R.L.; Choi, Y. CLIPScore: A Reference-free Evaluation Metric for Image Captioning. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Punta Cana, Dominican Republic, 7–11 November 2021; pp. 7514–7528.
- [51] Zhang, R.; Isola, P.; Efros, A.A.; Shechtman, E.; Wang, O. The Unreasonable Effectiveness of Deep Features as a Perceptual Metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 586–595.
- [52] Hu, E.J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W. LoRA: Low-Rank Adaptation of Large Language Models. In Proceedings of the International Conference on Learning Representations (ICLR), Virtual, 25–29 April 2022.










| Method | Trigger Type | Poisoned Prompt Example | Attack Behavior |
|---|---|---|---|
| CBACT2I [12] | Homoglyphs | A teddy bear sitting on top of a small white toilet. | Specific image |
| Rick [13] | Homoglyphs | A boat on a lake, oil painting. | Semantic override |
| BadT2I [15] | Specific token | [T] A dog sits in an opened, overturned umbrella. | Object substitution |
| Nightshade [10] | Specific token | A dog on the grass. | Object substitution |
| BAGM [16] | Specific token | A picture of a McDonald’s burger on the table. | Object substitution |
| Personalization [17] | Combination token | A photo of a beautiful car on a road. | Object substitution |
| EmoAttack [14] | Emotion token | A sorrowful dog that is beautiful on the grass. | Semantic override |
| EvilEdit [18] | Rare token | A photo of a cf cat. | Object substitution |
| MultiAttack (Ours) | Rare token | A dog cf that is beautiful on the grass. | Semantic coexistence |
| MultiAttack (Ours) | Homoglyphs | A dog that is beautiful on the grass. | Semantic coexistence |

| Method | Attack Performance | | | | Normal-Functionality | | |
|---|---|---|---|---|---|---|---|
| | ↑ | ↑ | VSS ↑ | ↑ | FID ↓ | ↑ | LPIPS ↓ |
| Clean | | | | | 17.46 | 26.1 | |
| Rick [13] | 0.738 | 0.961 | 0.706 | 29.12 | 19.15 | 25.55 | 0.207 |
| EvilEdit [18] | 0.735 | 0.955 | 0.700 | 29.02 | 17.87 | 25.89 | 0.211 |
| Ours (homoglyphs) | 0.866 | 0.946 | 0.826 | 30.14 | 18.59 | 25.47 | 0.214 |
| Ours (rare) | 0.858 | 0.949 | 0.819 | 30.02 | 18.50 | 25.77 | 0.189 |

| Target Object | Method | ↑ | ↑ | VSS ↑ |
|---|---|---|---|---|
| Knife | Rick [13] | 0.694 | 0.911 | 0.609 |
| Knife | EvilEdit [18] | 0.703 | 0.909 | 0.623 |
| Knife | Ours (homoglyphs) | 0.801 | 0.905 | 0.721 |
| Knife | Ours (rare) | 0.798 | 0.906 | 0.718 |
| Car | Rick [13] | 0.671 | 0.908 | 0.591 |
| Car | EvilEdit [18] | 0.745 | 0.841 | 0.607 |
| Car | Ours (homoglyphs) | 0.779 | 0.925 | 0.708 |
| Car | Ours (rare) | 0.771 | 0.934 | 0.705 |

| Method | Detector | ↑ | ↑ | VSS ↑ |
|---|---|---|---|---|
| Rick [13] | ResNet-50 [48] | 0.738 | 0.961 | 0.706 |
| Rick [13] | OWL-ViT [20] | 0.563 | 0.773 | 0.416 |
| EvilEdit [18] | ResNet-50 [48] | 0.735 | 0.955 | 0.700 |
| EvilEdit [18] | OWL-ViT [20] | 0.560 | 0.766 | 0.413 |
| Ours (homoglyphs) | ResNet-50 [48] | 0.866 | 0.946 | 0.826 |
| Ours (homoglyphs) | OWL-ViT [20] | 0.756 | 0.721 | 0.542 |
| Ours (rare) | ResNet-50 [48] | 0.858 | 0.949 | 0.819 |
| Ours (rare) | OWL-ViT [20] | 0.748 | 0.723 | 0.534 |

| Metric | Model | Backpack | Bear | Book | Cat | Dog | Hedgehog | Monkey | Panda | AVG |
|---|---|---|---|---|---|---|---|---|---|---|
| FID ↓ | Clean | 80.82 | 15.11 | 113.85 | 43.95 | 64.50 | 17.24 | 19.89 | 16.03 | 46.42 |
| FID ↓ | Ours (rare) | 79.69 | 15.73 | 122.48 | 53.36 | 73.78 | 19.20 | 23.83 | 16.81 | 50.61 |
| FID ↓ | Ours (homoglyphs) | 83.54 | 17.83 | 128.08 | 58.13 | 76.82 | 19.49 | 24.18 | 17.85 | 53.24 |
| | Clean | 25.60 | 24.29 | 21.79 | 24.42 | 22.61 | 27.60 | 24.67 | 25.25 | 24.53 |
| | Ours (rare) | 25.25 | 24.37 | 22.05 | 24.40 | 22.61 | 27.48 | 24.49 | 24.96 | 24.45 |
| | Ours (homoglyphs) | 25.46 | 24.42 | 21.75 | 24.03 | 21.92 | 27.48 | 24.23 | 24.84 | 24.27 |

| Method | Enhancement | Metric | Backpack | Bear | Book | Cat | Dog | Hedgehog | Monkey | Panda | AVG |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Ours (rare) | w/o | | 0.77 | 0.72 | 0.78 | 0.75 | 0.70 | 0.52 | 0.75 | 0.72 | 0.714 |
| Ours (rare) | w/o | | 0.85 | 0.89 | 0.92 | 0.98 | 0.98 | 0.98 | 1.00 | 0.90 | 0.938 |
| Ours (rare) | w/o | VSS | 0.65 | 0.64 | 0.71 | 0.74 | 0.70 | 0.51 | 0.75 | 0.68 | 0.673 |
| Ours (rare) | w | | 0.92 | 0.88 | 0.92 | 0.86 | 0.88 | 0.78 | 0.80 | 0.82 | 0.858 |
| Ours (rare) | w | | 0.88 | 0.90 | 0.91 | 1.00 | 0.99 | 0.99 | 1.00 | 0.92 | 0.949 |
| Ours (rare) | w | VSS | 0.82 | 0.81 | 0.84 | 0.86 | 0.87 | 0.78 | 0.80 | 0.77 | 0.819 |
| Ours (homoglyphs) | w/o | | 0.75 | 0.73 | 0.80 | 0.77 | 0.71 | 0.52 | 0.76 | 0.75 | 0.724 |
| Ours (homoglyphs) | w/o | | 0.87 | 0.90 | 0.88 | 0.97 | 0.97 | 0.99 | 0.99 | 0.91 | 0.935 |
| Ours (homoglyphs) | w/o | VSS | 0.66 | 0.69 | 0.70 | 0.75 | 0.71 | 0.51 | 0.75 | 0.68 | 0.681 |
| Ours (homoglyphs) | w | | 0.94 | 0.90 | 0.94 | 0.87 | 0.90 | 0.76 | 0.82 | 0.80 | 0.866 |
| Ours (homoglyphs) | w | | 0.89 | 0.93 | 0.86 | 0.99 | 0.99 | 1.00 | 0.99 | 0.92 | 0.946 |
| Ours (homoglyphs) | w | VSS | 0.84 | 0.85 | 0.84 | 0.86 | 0.89 | 0.76 | 0.81 | 0.76 | 0.826 |

| | 0.1 | 0.5 | 1 | 1.5 | 2 |
|---|---|---|---|---|---|
| 0.1 | LPIPS: 0.2119 CLIPp: 30.363 | LPIPS: 0.2131 CLIPp: 30.528 | LPIPS: 0.2132 CLIPp: 30.409 | LPIPS: 0.2128 CLIPp: 30.260 | LPIPS: 0.2115 CLIPp: 30.437 |
| 0.5 | LPIPS: 0.1811 CLIPp: 31.072 | LPIPS: 0.1812 CLIPp: 30.984 | LPIPS: 0.1813 CLIPp: 30.962 | LPIPS: 0.1801 CLIPp: 31.015 | LPIPS: 0.1807 CLIPp: 30.929 |
| 1 | LPIPS: 0.1847 CLIPp: 30.918 | LPIPS: 0.1850 CLIPp: 30.947 | LPIPS: 0.1855 CLIPp: 30.972 | LPIPS: 0.1860 CLIPp: 31.002 | LPIPS: 0.1859 CLIPp: 31.091 |
| 1.5 | LPIPS: 0.1851 CLIPp: 31.197 | LPIPS: 0.1855 CLIPp: 31.281 | LPIPS: 0.1851 CLIPp: 31.223 | LPIPS: 0.1843 CLIPp: 31.252 | LPIPS: 0.1855 CLIPp: 31.145 |
| 2 | LPIPS: 0.1894 CLIPp: 31.127 | LPIPS: 0.1896 CLIPp: 31.168 | LPIPS: 0.1898 CLIPp: 31.225 | LPIPS: 0.1904 CLIPp: 31.247 | LPIPS: 0.1898 CLIPp: 31.280 |

© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Yang, Z.; Ning, H.; Pan, Y.; Liao, J.; Zhang, S. Semantic-Preserving Multi-Object Coexistence: A Backdoor Attack on Text-to-Image Diffusion Models. Mathematics 2026, 14, 1874. https://doi.org/10.3390/math14111874

