AMGAA: Attention-Guided Multi-Target Generative Adversarial Attack for Vision Transformers
Abstract
1. Introduction
1. We propose a novel generative attack framework that extends existing single-target generative attacks to the multi-target setting by exploiting class-shared attention patterns. This design reduces training cost and improves applicability in complex attack scenarios.
2. We incorporate ViT self-attention into target feature fusion and adaptive perturbation generation. Through attention-guided semantic injection and important patch selection, the perturbation is concentrated on model-sensitive regions, improving both attack success rate and transferability while maintaining visual imperceptibility.
3. We conduct extensive experiments on ImageNet and CIFAR-10 to evaluate AMGAA under single-target transfer, multi-target attack, and unknown-class generalization settings. The results show that AMGAA achieves higher average ASR across multiple Transformer and CNN victim models, while preserving strong perceptual quality in terms of SSIM and LPIPS.
2. Related Work
2.1. Development of Adversarial Example Generation
2.2. Vision Transformers and Adversarial Attacks
2.3. Attention Mechanisms in Adversarial Example Generation
2.4. Adaptive Perturbation Generation
2.5. Multi-Target Generators
3. Methodology
3.1. Overview of the Framework
- ViT feature extraction module: This module extracts deep feature representations and attention weights from the source and target images, providing the basis for feature alignment and perturbation generation.
- Feature fusion module: Discriminative target features are selected using target attention information and injected into the source representation to enhance feature-level similarity to the target class.
- Perturbation generation module: Guided by the source attention distribution, this module adaptively concentrates perturbations on important regions, improving attack precision and effectiveness.
- Loss optimization module: This module optimizes the model from multiple perspectives. Specifically, adversarial loss, attention constraint loss, and total variation loss are jointly optimized to improve attack performance, structural consistency, and visual naturalness.
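As a rough illustration of how the loss optimization module might combine its three terms, the following sketch computes an anisotropic total variation penalty on a single-channel perturbation and adds it to two precomputed scalar losses. The function names, the weights `w_attn` and `w_tv`, and the plain weighted-sum form are assumptions made for illustration; the exact loss definitions and weights are not given in this outline.

```python
from typing import List

Grid = List[List[float]]  # single-channel perturbation, rows x cols


def total_variation(delta: Grid) -> float:
    """Anisotropic TV: sum of absolute differences between neighbouring pixels."""
    rows, cols = len(delta), len(delta[0])
    vertical = sum(abs(delta[i + 1][j] - delta[i][j])
                   for i in range(rows - 1) for j in range(cols))
    horizontal = sum(abs(delta[i][j + 1] - delta[i][j])
                     for i in range(rows) for j in range(cols - 1))
    return vertical + horizontal


def overall_objective(adv_loss: float, attn_loss: float, delta: Grid,
                      w_attn: float = 1.0, w_tv: float = 0.1) -> float:
    """Weighted sum of the three terms. The weights are hypothetical placeholders."""
    return adv_loss + w_attn * attn_loss + w_tv * total_variation(delta)


# A constant perturbation has zero TV; a checkerboard pattern is penalised.
smooth = [[0.5, 0.5], [0.5, 0.5]]
noisy = [[0.5, -0.5], [-0.5, 0.5]]
print(total_variation(smooth), total_variation(noisy))
```

The TV term grows with high-frequency structure in the perturbation, which is why it is commonly used to encourage visually smooth perturbations.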
3.2. ViT Feature Extraction
3.3. Attention-Guided Feature Fusion
3.4. Adaptive Perturbation Generation Module
3.5. Loss Optimization
3.5.1. Adversarial Loss
3.5.2. Attention Constraint Loss
3.5.3. Total Variation Loss
3.5.4. Overall Objective
4. Experiments and Results
4.1. Experimental Settings
- Victim models: To evaluate the attack performance of AMGAA across different architectures, we consider several representative victim models: the standard ViT-B/16 [12]; Swin Transformer (Swin-B) [41], which uses hierarchical window-based attention; the robustness-oriented RVT [42]; and the classical CNNs ResNet-50 [15], DenseNet-121 [43], and VGG-19 [14]. These models cover both Transformer-based and CNN-based architectures, enabling a systematic evaluation of cross-architecture transferability and attack effectiveness.
- Evaluation metrics: We evaluate the proposed method using three metrics.
  1. Attack Success Rate (ASR): the proportion of adversarial examples that successfully induce the target model to make the desired incorrect prediction.
  2. Generalization ability: the ability of the generator to produce effective adversarial examples for unknown target classes.
  3. Perceptual quality: SSIM and LPIPS are used to measure structural similarity and perceptual distance between clean and adversarial images.
- Compared methods: We compare AMGAA with six representative adversarial attacks: the gradient-based iterative pixel-level attacks PGD [23], C&W [22], and MI-FGSM [24]; the local patch-based attack G-Patch [21]; and the generative multi-target attacks MAN [37] and CGNC [38]. These baselines cover different attack paradigms, allowing a comprehensive comparison in terms of attack success, transferability, and generalization.
- Implementation details: In all experiments, the perturbation budget is set to , and ViT-B/16 is used as the surrogate model. For AMGAA, we use the AdamW optimizer [44] with a learning rate of 2 × 10⁻⁴ and train the model for 20 epochs. The number of selected patches k is set to 16, the feature injection coefficient is set to 0.3, and the learnable source-feature retention coefficient is initialized to 1. For the baseline attacks, PGD is performed for 20 iterations with a step size of . MI-FGSM is also run for 20 iterations with a step size of and a momentum factor of 1.0. C&W uses the Adam optimizer [45] with a learning rate of 0.01 and a maximum of 1000 iterations. Unless otherwise specified, other hyperparameters follow the original papers or common settings, and all methods are evaluated under the same perturbation budget for fair comparison.
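To make the MI-FGSM baseline configuration more concrete, here is a minimal sketch of a targeted momentum-iterative update with 20 iterations and a momentum factor of 1.0, as stated above. The perturbation budget `eps` and step size `alpha` used in the example are placeholders, since their values are not recoverable from the text, and the quadratic toy loss stands in for the real classifier loss on the surrogate model.

```python
from typing import Callable, List

Vector = List[float]


def sign(v: float) -> int:
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def mi_fgsm_targeted(x0: Vector, grad_fn: Callable[[Vector], Vector],
                     eps: float, alpha: float,
                     steps: int = 20, mu: float = 1.0) -> Vector:
    """Targeted MI-FGSM under an L-infinity budget, for inputs in [0, 1]."""
    x = list(x0)
    g = [0.0] * len(x0)
    for _ in range(steps):
        grad = grad_fn(x)
        l1 = sum(abs(v) for v in grad) or 1.0  # guard against a zero gradient
        # Accumulate the L1-normalised gradient into the momentum buffer.
        g = [mu * gi + gr / l1 for gi, gr in zip(g, grad)]
        # Targeted attack: step against the gradient of the target-class loss.
        x = [xi - alpha * sign(gi) for xi, gi in zip(x, g)]
        # Project back into the eps-ball around x0 and the valid pixel range.
        x = [min(max(xi, x0i - eps, 0.0), x0i + eps, 1.0)
             for xi, x0i in zip(x, x0)]
    return x


# Toy loss 0.5 * ||x - target||^2, whose gradient is simply x - target.
target = [1.0, 0.0, 0.5]
x_clean = [0.5, 0.5, 0.5]
eps = 8 / 255        # placeholder budget, not taken from the paper
alpha = eps / 10     # placeholder step size, not taken from the paper
x_adv = mi_fgsm_targeted(
    x_clean, lambda x: [xi - ti for xi, ti in zip(x, target)], eps, alpha)
print(x_adv)
```

In a real evaluation, `grad_fn` would return the gradient of the targeted classification loss on the surrogate model with respect to the input image.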
4.2. Main Experimental Results
4.2.1. Transferability in Single-Target Attacks
4.2.2. Comparison of Multi-Target Attack Success Rate
4.2.3. Comparison of Generalization Ability
4.2.4. Perceptual Imperceptibility Analysis
4.3. Ablation Studies
4.3.1. Module Ablation Study
- Full AMGAA: The complete model.
- w/o attention-guided fusion (w/o AGF): Target attention guidance is removed during feature fusion, while only target feature injection is retained. This setting examines the effect of attention-guided semantic injection.
- w/o adaptive perturbation (w/o AP): Adaptive perturbation weighting is removed, and uniform perturbations are applied to the selected patches. This setting evaluates the role of source-attention-based perturbation allocation.
- Random-k selection (Rand-k): Importance-based patch selection is replaced with random selection of k patches. This setting verifies the effectiveness of attention-based patch selection.
- w/o multi-layer aggregation (w/o MLA): Only the last-layer ViT features and attention maps are used. This setting assesses the contribution of multi-layer representation aggregation.
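The difference between the full model, the Rand-k variant, and the w/o AP variant can be illustrated with a small sketch of patch selection and perturbation weighting. The function names, the `mode` and `adaptive` switches, and the normalisation by the peak attention value are assumptions chosen for illustration; they are not taken from the AMGAA implementation.

```python
import random
from typing import List


def select_patches(attention: List[float], k: int,
                   mode: str = "attention", seed: int = 0) -> List[int]:
    """Pick k patch indices: highest attention ("attention") or random ("random")."""
    n = len(attention)
    if not 0 < k <= n:
        raise ValueError("k must be between 1 and the number of patches")
    if mode == "attention":
        ranked = sorted(range(n), key=lambda i: attention[i], reverse=True)
        return sorted(ranked[:k])
    if mode == "random":
        return sorted(random.Random(seed).sample(range(n), k))
    raise ValueError(f"unknown mode: {mode}")


def patch_weights(attention: List[float], selected: List[int],
                  adaptive: bool = True) -> List[float]:
    """Per-patch perturbation scale in [0, 1]; zero outside the selected patches."""
    weights = [0.0] * len(attention)
    peak = max(attention[i] for i in selected)
    for i in selected:
        # Adaptive: scale by relative attention. Uniform (w/o AP): constant 1.0.
        weights[i] = attention[i] / peak if adaptive and peak > 0 else 1.0
    return weights


# Hypothetical source attention over five patches.
attn = [0.05, 0.30, 0.10, 0.40, 0.15]
top = select_patches(attn, k=2)                      # attention-based selection
rand = select_patches(attn, k=2, mode="random")      # Rand-k ablation
print(top, patch_weights(attn, top))                 # adaptive weighting
print(top, patch_weights(attn, top, adaptive=False)) # uniform weighting (w/o AP)
print(rand)
```

Multiplying these weights by the perturbation budget would give a per-patch bound, so that the most attended patches receive the largest perturbation.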
4.3.2. Loss Ablation Study
4.3.3. Hyperparameter Analysis
- Number of important patches k: This parameter controls the perturbation coverage. A smaller k helps preserve visual naturalness but may limit attack strength, while a larger k can improve ASR but may introduce more visible distortions. We evaluate the performance under .
- Feature injection coefficient : This parameter controls the strength of target feature injection. A small may provide insufficient target semantic guidance, while an excessively large may cause over-aggressive fusion and disrupt feature stability and visual naturalness. We evaluate the effect of on ASR and transferability.
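To show how the feature injection coefficient trades off source and target semantics, the sketch below uses one plausible fusion rule: each source token is kept with a retention coefficient and receives the corresponding target token scaled by the injection coefficient and the target attention weight. This additive form, the function name, and the swept coefficient values are assumptions for illustration only; the defaults 0.3 and 1 follow the implementation details stated earlier.

```python
from typing import List

Tokens = List[List[float]]  # one feature vector per patch token


def inject_target_features(source: Tokens, target: Tokens,
                           target_attention: List[float],
                           inject: float = 0.3, retain: float = 1.0) -> Tokens:
    """Assumed rule: fused[i] = retain * source[i] + inject * attention[i] * target[i]."""
    fused = []
    for s_tok, t_tok, a in zip(source, target, target_attention):
        fused.append([retain * s + inject * a * t for s, t in zip(s_tok, t_tok)])
    return fused


# Two tokens with two feature dimensions; only the first is attended in the target.
src = [[1.0, 0.0], [0.0, 1.0]]
tgt = [[0.0, 2.0], [2.0, 0.0]]
attn = [1.0, 0.0]

# Hypothetical sweep: larger coefficients pull attended tokens toward the target.
for coef in (0.1, 0.3, 0.5):
    print(coef, inject_target_features(src, tgt, attn, inject=coef))
```

Under this assumed rule, tokens with zero target attention are left unchanged, which mirrors the idea of injecting only discriminative target features.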
4.4. Attention Visualization Analysis
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
1. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet Classification with Deep Convolutional Neural Networks. In Proceedings of the Advances in Neural Information Processing Systems; Pereira, F., Burges, C., Bottou, L., Weinberger, K., Eds.; Curran Associates, Inc.: New York, NY, USA, 2012; Volume 25.
2. Vilas, M.G.; Schaumlöffel, T.; Roig, G. Analyzing Vision Transformers for Image Classification in Class Embedding Space. In Proceedings of the Advances in Neural Information Processing Systems; Oh, A., Naumann, T., Globerson, A., Saenko, K., Hardt, M., Levine, S., Eds.; Curran Associates, Inc.: New York, NY, USA, 2023; Volume 36, pp. 40030–40041.
3. Xu, C.; Wu, J.; Zhang, F.; Freer, J.; Zhang, Z.; Cheng, Y. A deep image classification model based on prior feature knowledge embedding and application in medical diagnosis. Sci. Rep. 2024, 14, 13244.
4. Tang, Z.; Zhang, Y.; Ming, L. Novel snap-layer MMPC scheme via neural dynamics equivalency and solver for redundant robot arms with five-layer physical limits. IEEE Trans. Neural Netw. Learn. Syst. 2024, 36, 3534–3546.
5. Zhang, Z.; Cao, Z.; Li, X. Neural dynamic fault-tolerant scheme for collaborative motion planning of dual-redundant robot manipulators. IEEE Trans. Neural Netw. Learn. Syst. 2024, 36, 11189–11201.
6. Xiao, L.; Zhang, Y.; Liao, B.; Zhang, Z.; Ding, L.; Jin, L. A velocity-level bi-criteria optimization scheme for coordinated path tracking of dual robot manipulators using recurrent neural network. Front. Neurorobot. 2017, 11, 47.
7. Zhang, Z.; Ding, C.; Zhang, M.; Luo, Y.; Mai, J. DCDLN: A densely connected convolutional dynamic learning network for malaria disease diagnosis. Neural Netw. 2024, 176, 106339.
8. Cai, Z.; Zhou, K.; Liao, Z. A systematic review of YOLO-based object detection in medical imaging: Advances, challenges, and future directions. Comput. Mater. Contin. 2025, 85, 2255.
9. Sun, L.; Mo, Z.; Yan, F.; Xia, L.; Shan, F.; Ding, Z.; Song, B.; Gao, W.; Shao, W.; Shi, F.; et al. Adaptive feature selection guided deep forest for COVID-19 classification with chest CT. IEEE J. Biomed. Health Inform. 2020, 24, 2798–2805.
10. Szegedy, C.; Zaremba, W.; Sutskever, I.; Bruna, J.; Erhan, D.; Goodfellow, I.; Fergus, R. Intriguing properties of neural networks. arXiv 2013, arXiv:1312.6199.
11. Goodfellow, I.J.; Shlens, J.; Szegedy, C. Explaining and harnessing adversarial examples. arXiv 2014, arXiv:1412.6572.
12. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929.
13. LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2324.
14. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556.
15. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2016; pp. 770–778.
16. Mahmood, K.; Mahmood, R.; Van Dijk, M. On the robustness of vision transformers to adversarial examples. In Proceedings of the IEEE/CVF International Conference on Computer Vision; IEEE: New York, NY, USA, 2021; pp. 7838–7847.
17. Shao, R.; Shi, Z.; Yi, J.; Chen, P.Y.; Hsieh, C.J. On the adversarial robustness of vision transformers. arXiv 2021, arXiv:2103.15670.
18. Fu, Y.; Zhang, S.; Wu, S.; Wan, C.; Lin, Y. Patch-Fool: Are vision transformers always robust against adversarial perturbations? arXiv 2022, arXiv:2203.08392.
19. Joshi, A.; Akula, S.C.; Jagatap, G.; Hegde, C. A Few Adversarial Tokens Can Break Vision Transformers. 2025. Available online: https://robustart.github.io/long_paper/31.pdf (accessed on 9 June 2026).
20. Joshi, A.; Jagatap, G.; Hegde, C. Adversarial token attacks on vision transformers. arXiv 2021, arXiv:2110.04337.
21. Shao, M. Random position adversarial patch for vision transformers. arXiv 2023, arXiv:2307.04066.
22. Carlini, N.; Wagner, D. Towards evaluating the robustness of neural networks. In Proceedings of the 2017 IEEE Symposium on Security and Privacy (SP); IEEE: New York, NY, USA, 2017; pp. 39–57.
23. Madry, A.; Makelov, A.; Schmidt, L.; Tsipras, D.; Vladu, A. Towards deep learning models resistant to adversarial attacks. arXiv 2017, arXiv:1706.06083.
24. Dong, Y.; Liao, F.; Pang, T.; Su, H.; Zhu, J.; Hu, X.; Li, J. Boosting adversarial attacks with momentum. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2018; pp. 9185–9193.
25. Croce, F.; Hein, M. Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In Proceedings of the International Conference on Machine Learning, Vienna, Austria, 12–18 July 2020; PMLR; pp. 2206–2216.
26. Naseer, M.M.; Ranasinghe, K.; Khan, S.H.; Hayat, M.; Shahbaz Khan, F.; Yang, M.H. Intriguing properties of vision transformers. Adv. Neural Inf. Process. Syst. 2021, 34, 23296–23308.
27. Gu, J.; Tresp, V.; Qin, Y. Are vision transformers robust to patch perturbations? In Proceedings of the European Conference on Computer Vision; Springer: Cham, Switzerland, 2022; pp. 404–421.
28. Wu, W.; Su, Y.; Chen, X.; Zhao, S.; King, I.; Lyu, M.R.; Tai, Y.W. Boosting the transferability of adversarial samples via attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2020; pp. 1161–1170.
29. Wang, J.; Yin, Z.; Jiang, J.; Du, Y. Attention-guided black-box adversarial attacks with large-scale multiobjective evolutionary optimization. Int. J. Intell. Syst. 2022, 37, 7526–7547.
30. Sharma, V.; Kalra, A.; Vaibhav, S.C.; Patel, L.; Morency, L.P. Attend and attack: Attention guided adversarial attacks on visual question answering models. In Proceedings of the Conference on Neural Information Processing Systems Workshop on Security in Machine Learning; Curran Associates, Inc.: Red Hook, NY, USA, 2018; Volume 2, p. 1.
31. Yuan, Z.; Zhang, J.; Jiang, Z.; Li, L.; Shan, S. Adaptive perturbation for adversarial attack. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 5663–5676.
32. Duan, J.; Qiu, L.; He, G.; Zhao, L.; Zhang, Z.; Li, H. A region-adaptive local perturbation-based method for generating adversarial examples in synthetic aperture radar object detection. Remote Sens. 2024, 16, 997.
33. Liu, J.; Zhang, C.; Lyu, X. Boosting the transferability of adversarial examples via local mixup and adaptive step size. In Proceedings of the ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); IEEE: New York, NY, USA, 2025; pp. 1–5.
34. Hu, J.; Li, X.; Liu, C.; Zhang, R.; Tang, J.; Sun, Y.; Wang, Y. APDL: An adaptive step size method for white-box adversarial attacks. Complex Intell. Syst. 2025, 11, 116.
35. Xiao, C.; Li, B.; Zhu, J.Y.; He, W.; Liu, M.; Song, D. Generating adversarial examples with adversarial networks. arXiv 2018, arXiv:1801.02610.
36. Naseer, M.; Khan, S.; Hayat, M.; Khan, F.S.; Porikli, F. On generating transferable targeted perturbations. In Proceedings of the IEEE/CVF International Conference on Computer Vision; IEEE: New York, NY, USA, 2021; pp. 7708–7717.
37. Han, J.; Dong, X.; Zhang, R.; Chen, D.; Zhang, W.; Yu, N.; Luo, P.; Wang, X. Once a MAN: Towards multi-target attack via learning multi-target adversarial network once. In Proceedings of the IEEE/CVF International Conference on Computer Vision; IEEE: New York, NY, USA, 2019; pp. 5158–5167.
38. Fang, H.; Kong, J.; Chen, B.; Dai, T.; Wu, H.; Xia, S.T. CLIP-guided generative networks for transferable targeted adversarial attacks. In Proceedings of the European Conference on Computer Vision; Springer: Cham, Switzerland, 2024; pp. 1–19.
39. Krizhevsky, A. Learning Multiple Layers of Features from Tiny Images. 2009. Available online: https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf (accessed on 9 June 2026).
40. Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2009; pp. 248–255.
41. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision; IEEE: New York, NY, USA, 2021; pp. 10012–10022.
42. Mao, X.; Qi, G.; Chen, Y.; Li, X.; Duan, R.; Ye, S.; He, Y.; Xue, H. Towards robust vision transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2022; pp. 12042–12051.
43. Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2017; pp. 4700–4708.
44. Loshchilov, I.; Hutter, F. Decoupled weight decay regularization. arXiv 2017, arXiv:1711.05101.
45. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980.
46. Kim, K.; Wu, B.; Dai, X.; Zhang, P.; Yan, Z.; Vajda, P.; Kim, S.J. Rethinking the Self-Attention in Vision Transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops; IEEE: New York, NY, USA, 2021; pp. 3071–3075.





ASR (%) in the single-target transfer setting (Section 4.2.1).

| Attacks | ViT-B/16 | Swin-B | RVT | DenseNet-121 | VGG-19 | ResNet-50 | Avg |
|---|---|---|---|---|---|---|---|
| PGD | 96.1 | 29.6 | 12.4 | 11.4 | 9.9 | 5.6 | 27.5 |
| C&W | 94.6 | 20.3 | 9.1 | 13.7 | 8.5 | 7.4 | 25.1 |
| MI-FGSM | 93.9 | 14.5 | 5.3 | 6.4 | 4.2 | 3.8 | 21.4 |
| G-Patch | 70.8 | 49.7 | 35.4 | 24.5 | 20.1 | 14.7 | 35.9 |
| MAN | 89.6 | 40.1 | 24.3 | 19.9 | 15.2 | 14.4 | 33.9 |
| CGNC | 90.5 | 47.4 | 33.2 | 20.0 | 19.7 | 11.6 | 37.1 |
| AMGAA | 91.4 | 55.4 | 40.7 | 30.1 | 27.6 | 14.1 | 43.2 |

ASR (%) in the multi-target attack setting (Section 4.2.2).

| Attacks | ViT-B/16 | Swin-B | RVT | DenseNet-121 | VGG-19 | ResNet-50 | Avg |
|---|---|---|---|---|---|---|---|
| MAN | 86.4 | 36.0 | 19.7 | 17.4 | 14.5 | 10.9 | 30.9 |
| CGNC | 87.7 | 39.5 | 28.4 | 16.6 | 18.4 | 9.4 | 33.3 |
| AMGAA | 89.8 | 47.4 | 36.8 | 26.4 | 22.1 | 11.8 | 39.0 |

ASR (%) in the unknown-class generalization setting (Section 4.2.3).

| Attacks | ViT-B/16 | Swin-B | RVT | DenseNet-121 | VGG-19 | ResNet-50 | Avg |
|---|---|---|---|---|---|---|---|
| MAN | 20.7 | 6.1 | 1.6 | 2.4 | 2.5 | 2.0 | 5.9 |
| CGNC | 30.5 | 12.8 | 4.7 | 10.5 | 11.1 | 8.6 | 13.0 |
| AMGAA | 41.7 | 30.4 | 14.6 | 15.0 | 14.7 | 10.2 | 21.1 |

Perceptual quality of the adversarial examples (Section 4.2.4).

| Attacks | SSIM | LPIPS |
|---|---|---|
| G-Patch | 0.85 | 0.024 |
| CGNC | 0.91 | 0.019 |
| AMGAA | 0.91 | 0.018 |

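For reference, the targeted ASR and the Avg column reported in the tables could be computed along the following lines. This is a minimal sketch: the function names and the toy predictions are hypothetical, and the per-model values in the last example are copied from the AMGAA row of the first table.

```python
from typing import Dict, Sequence


def targeted_asr(predictions: Sequence[int], targets: Sequence[int]) -> float:
    """Percentage of adversarial examples classified as their intended target class."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not predictions:
        raise ValueError("need at least one adversarial example")
    hits = sum(int(p == t) for p, t in zip(predictions, targets))
    return 100.0 * hits / len(predictions)


def average_asr(per_model: Dict[str, float]) -> float:
    """Mean ASR over all victim models, rounded to one decimal place."""
    return round(sum(per_model.values()) / len(per_model), 1)


# Hypothetical victim-model outputs for five adversarial examples.
preds = [3, 3, 7, 1, 3]
targets = [3, 3, 3, 1, 3]
print(f"ASR = {targeted_asr(preds, targets):.1f}%")  # ASR = 80.0%

# Per-model ASR values from the AMGAA row of the first table.
amgaa_row = {"ViT-B/16": 91.4, "Swin-B": 55.4, "RVT": 40.7,
             "DenseNet-121": 30.1, "VGG-19": 27.6, "ResNet-50": 14.1}
print(average_asr(amgaa_row))  # 43.2
```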
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Ou, D.; Lu, J.; Zhou, S.; Zeng, Y.; Liao, D.; Liu, H.; Yang, C.; He, Y.; Tian, J. AMGAA: Attention-Guided Multi-Target Generative Adversarial Attack for Vision Transformers. Entropy 2026, 28, 680. https://doi.org/10.3390/e28060680

