AMGAA: Attention-Guided Multi-Target Generative Adversarial Attack for Vision Transformers

Ou, Dongbo; Lu, Jintian; Zhou, Shihui; Zeng, Ying; Liao, Dongwan; Liu, Haoyin; Yang, Chao; He, Yingsheng; Tian, Jie

doi:10.3390/e28060680

Open AccessArticle

AMGAA: Attention-Guided Multi-Target Generative Adversarial Attack for Vision Transformers

by

Dongbo Ou

,

Jintian Lu

,

Shihui Zhou

,

Ying Zeng

,

Dongwan Liao

,

Haoyin Liu

,

Chao Yang

,

Yingsheng He

and

Jie Tian

^*

School of Computer Science and Engineering, Jishou University, Jishou 416000, China

^*

Author to whom correspondence should be addressed.

Entropy 2026, 28(6), 680; https://doi.org/10.3390/e28060680 (registering DOI)

Submission received: 12 May 2026 / Revised: 9 June 2026 / Accepted: 10 June 2026 / Published: 12 June 2026

Download

Browse Figures

Versions Notes

Abstract

Vision Transformers (ViTs) have achieved strong performance in computer vision, but their adversarial robustness remains underexplored. Existing ViT-oriented attacks mainly rely on iterative optimization, leading to high generation cost and limited transferability. Moreover, most generative attacks target a single class, making them inefficient for multi-target scenarios. To address these issues, we propose Attention-Guided Multi-Target Generative Adversarial Attack (AMGAA). AMGAA leverages ViT self-attention to guide target feature fusion and adaptively selects important source patches for perturbation generation. It jointly optimizes adversarial, attention constraint, and total variation losses to improve targeted attack success while preserving visual naturalness. Experiments on CIFAR-10 and ImageNet show that AMGAA achieves average attack success rates (ASRs) of 43.2% and 39.0% in ImageNet single-target transfer attacks and CIFAR-10 multi-target attacks, respectively. Compared with the generative attack methods evaluated in our experiments, AMGAA improves ASR by 5.7 percentage points in the multi-target setting and by 8.1 percentage points in unknown-class generalization. AMGAA also obtains a low LPIPS of 0.018, indicating good visual imperceptibility. Ablation studies confirm the effectiveness of its key components.

Keywords:

vision transformer; adversarial attack; multi-target attack; transfer attack; attention mechanism; security of AI

1. Introduction

With the rapid development of artificial intelligence, deep learning models have shown strong potential in various fields, such as image recognition [1,2,3], automated control [4,5,6], and medical diagnosis [7,8,9]. However, recent studies have revealed that these models are highly vulnerable to adversarial examples, making adversarial robustness a central issue in AI security [10,11]. Adversarial examples are crafted by adding small perturbations to clean inputs. Although such perturbations are often imperceptible to humans, they can mislead deep models into making incorrect predictions. This phenomenon exposes the robustness limitations of deep neural networks and motivates further research on secure model design. Therefore, studying adversarial example generation is essential for understanding model vulnerabilities and developing more robust learning systems.

As a mainstream visual modeling paradigm, Vision Transformer (ViT) has achieved remarkable progress in computer vision [12]. Unlike conventional convolutional neural networks (CNNs), which mainly rely on local convolutions [13,14,15], ViT captures long-range dependencies through self-attention and thus learns more discriminative visual representations. It has achieved state-of-the-art performance on various vision tasks. Therefore, investigating the adversarial vulnerability of ViTs is crucial for their reliable deployment in real-world applications.

Existing studies have shown that, due to the structural properties of ViT self-attention, several attacks effective on CNNs suffer a clear performance drop when transferred to ViTs [16,17]. Inspired by word-substitution attacks against Transformers in natural language processing (NLP), patch-level attack paradigms have been introduced for ViTs [18]. These methods can significantly disrupt ViT predictions by modifying only a small number of image patches, and such patch-wise perturbations are often more effective against ViTs than CNNs. According to the generation mechanism, existing methods can be roughly divided into two categories. The first category iteratively optimizes selected patches using gradients from the target model. These methods usually achieve high white-box attack success rates, but suffer from high computational cost, overfitting, and limited transferability [18,19,20]. The second category relies on generative models, where a trained generator directly outputs adversarial patches in a single forward pass. Such methods are more efficient at inference time and often exhibit better transferability [21]. However, generative attacks for ViTs remain underexplored, and most existing methods are designed for a single target class. A new target class therefore requires retraining, which greatly limits their scalability.

To address these limitations, we propose a multi-target generative adversarial attack framework that fully exploits ViT self-attention to generate adversarial examples with high attack effectiveness and low perceptibility. Specifically, we first extract feature representations and attention weights from both the source and target images. Then, an attention-guided feature fusion module injects discriminative target features into the source representation, encouraging the generated examples to approach the target-class feature distribution. Based on this fused representation, an adaptive perturbation strategy is designed to modulate perturbation generation according to the source attention weights and focus the perturbation on highly attended patches. Finally, we optimize the generation process with attention-aware objectives, including adversarial and attention regularization losses, to balance attack strength and visual imperceptibility. The main contributions of this work are summarized as follows:

1.: We propose a novel generative attack framework that extends existing single-target generative attacks to the multi-target setting by exploiting class-shared attention patterns. This design reduces training cost and improves applicability in complex attack scenarios.
2.: We incorporate ViT self-attention into target feature fusion and adaptive perturbation generation. Through attention-guided semantic injection and important patch selection, the perturbation is concentrated on model-sensitive regions, improving both attack success rate and transferability while maintaining visual imperceptibility.
3.: We conduct extensive experiments on ImageNet and CIFAR-10 to evaluate AMGAA under single-target transfer, multi-target attack, and unknown-class generalization settings. The results show that AMGAA achieves higher average ASR across multiple Transformer and CNN victim models, while preserving strong perceptual quality in terms of SSIM and LPIPS.

2. Related Work

2.1. Development of Adversarial Example Generation

The concept of adversarial examples was first introduced by Szegedy et al. [10], who showed that deep neural networks (DNNs) are vulnerable to imperceptible input perturbations. This finding exposed the inherent weakness of neural networks and motivated extensive research on adversarial attacks. Goodfellow et al. [11] proposed the Fast Gradient Sign Method (FGSM), which efficiently generates adversarial examples using the gradient of the loss with respect to the input. Carlini and Wagner later introduced the C&W attack [22], which improves imperceptibility by explicitly optimizing the distance between adversarial and clean samples. Madry et al. [23] proposed Projected Gradient Descent (PGD), a strong iterative attack that crafts adversarial examples through multi-step optimization. Dong et al. [24] further developed Momentum Iterative FGSM (MI-FGSM), which incorporates momentum to stabilize gradient updates and enhance transferability. To provide a more reliable robustness evaluation, Croce and Hein proposed AutoAttack [25], an automated ensemble of parameter-free attacks. Although these methods have achieved strong performance on CNNs, their applicability and effectiveness against attention-based Vision Transformers require further investigation.

2.2. Vision Transformers and Adversarial Attacks

With the wide adoption of Vision Transformers in visual recognition, their robustness against adversarial attacks has attracted increasing attention. Naseer et al. found that ViTs can be more robust than CNNs in certain adversarial settings, possibly due to the global feature modeling capability of self-attention [26]. However, whether ViTs are consistently more robust than CNNs remains controversial.

Inspired by token-level attacks against Transformers in NLP, Fu et al. proposed Patch-Fool for ViTs [18]. This method disrupts self-attention by applying tailored perturbations to critical image patches. The results show that ViTs may be more vulnerable than CNNs under patch-level attacks, indicating that their robustness advantage does not hold in all attack scenarios. Following this line, subsequent studies [19,27] further analyzed ViT robustness from the perspective of patch/token perturbations. Overall, patch- and token-level attacks have become an important direction for studying the adversarial vulnerability of ViTs and for understanding their self-attention and token representation properties.

2.3. Attention Mechanisms in Adversarial Example Generation

Attention mechanisms were first widely used in sequence modeling tasks, such as machine translation, due to their ability to capture long-range dependencies. With the emergence of self-attention-based vision models such as ViT, attention has also been systematically introduced into computer vision. Compared with conventional convolutional features, self-attention explicitly models correlations among spatial locations and semantic regions, providing more fine-grained and structured priors for adversarial example generation.

For example, Wu et al. [28] incorporated internal attention maps into the attack objective and constrained perturbations to high-attention regions, significantly improving the transferability of adversarial examples across models. Wang et al. [29] used attention maps to select a small set of salient pixels as optimization variables and searched for perturbations in this reduced subspace with a large-scale multi-objective evolutionary algorithm, achieving a better balance between black-box attack success rate and visual quality. Sharma et al. [30] exploited intermediate attention maps of VQA models to apply targeted perturbations to image-text alignment regions, enabling adversarial examples to alter model answers while preserving visual naturalness.

These studies demonstrate that attention plays an important role in generating high-quality adversarial examples. For models inherently equipped with self-attention, such as ViTs, effectively exploiting this mechanism is crucial for improving attack effectiveness and perceptual quality.

2.4. Adaptive Perturbation Generation

Adaptive perturbation generation is an effective strategy for improving adversarial attack performance. Existing studies have explored adaptive attacks from different perspectives, including perturbation magnitude, perturbation region, and optimization step size. To address the image-dependent variation in optimal scaling factors, Yuan et al. [31] designed an adaptive perturbation strategy that selects a suitable perturbation strength for each image, improving attack success rate while reducing computational cost. To overcome the limited scale adaptability of local adversarial perturbations, Duan et al. [32] proposed a region-adaptive local perturbation method that adjusts the perturbation region according to object size and shape, thereby improving attack generality.

Furthermore, Liu et al. [33] combined local mixup with adaptive step size to enhance input diversity and dynamically adjust perturbation updates, improving cross-model transferability. From the optimization perspective, Hu et al. [34] proposed APDL, which adaptively adjusts the single-step perturbation magnitude and adopts dual-loss optimization to accelerate convergence while maintaining a high attack success rate with fewer iterations. However, these methods are mainly designed for CNNs or specific vision tasks and do not fully consider the patch-token representation and self-attention mechanism of ViTs. Therefore, adaptive perturbation generation for ViTs remains underexplored.

2.5. Multi-Target Generators

Generative adversarial attacks use a single feed-forward generator to directly produce perturbations or adversarial examples, avoiding the cost and overfitting risk of iterative optimization and improving efficiency in black-box settings. Representative methods include AdvGAN [35] and TTP [36]. AdvGAN employs a generative adversarial network (GAN) to approximate the data distribution and generate high-quality adversarial examples, while TTP improves transferability by matching the distribution of perturbed images to that of the target class, reducing reliance on model-specific decision boundaries.

However, these methods are usually designed for a single target class, requiring a separate generator for each new class and thus increasing training cost. To address this issue, MAN [37] introduced a multi-target generator, where the discrete target label

t \in {1, \dots, K}

is explicitly used as a condition. It extracts image appearance features and target-label features separately, fuses them in the feature space, and uses residual and deconvolutional layers to generate adversarial examples for the given target label. As a result, one generator can cover K target classes while maintaining attack success rate and transferability. Building on this idea, CGNC [38] incorporates CLIP-based textual semantics into the multi-target generator and uses cross-attention to enhance the interaction between target semantics and image features, improving transferable targeted attacks.

Overall, existing generative methods have demonstrated the feasibility of “one generator for multiple targets” mainly in CNN-based settings. For ViTs, how to exploit class-shared attention patterns to build an efficient and transferable multi-target generator remains insufficiently explored.

3. Methodology

3.1. Overview of the Framework

Our framework is built on Vision Transformer (ViT) and integrates feature fusion with perturbation generation to form an efficient adversarial example generation pipeline. As shown in Figure 1, the framework consists of four main modules:

ViT feature extraction module: This module extracts deep feature representations and attention weights from the source and target images, providing the basis for feature alignment and perturbation generation.
Feature fusion module: Discriminative target features are selected using target attention information and injected into the source representation to enhance feature-level similarity to the target class.
Perturbation generation module: Guided by the source attention distribution, this module adaptively concentrates perturbations on important regions, improving attack precision and effectiveness.
Loss optimization module: This module optimizes the model from multiple perspectives. Specifically, adversarial loss, attention constraint loss, and total variation loss are jointly optimized to improve attack performance, structural consistency, and visual naturalness.

By coordinating these modules, the proposed method effectively exploits the global modeling capability of ViT, enabling the generated adversarial examples to better approach the target feature distribution while maintaining high visual imperceptibility.

3.2. ViT Feature Extraction

As a core component of the framework, the feature extraction module captures deep feature representations and global attention weights from input images. Built on Vision Transformer (ViT), this module first divides an input image into fixed-size patches and then models the patch sequence with Transformer encoders to obtain discriminative feature and attention representations.

Given a source image

x_{s}

and a target image

x_{t}

, the ViT-based feature extraction process is formulated as:

F_{s}, A_{s} = ViT (x_{s}),

(1)

F_{t}, A_{t} = ViT (x_{t}),

(2)

where

F_{s}, F_{t} \in R^{n \times d}

denote the feature representations of the source and target images, respectively. Here, n is the number of patches and d is the feature dimension.

A_{s}, A_{t} \in R^{n \times n}

denote the corresponding self-attention matrices, which describe patch-wise correlations and attention distributions.

Different ViT layers capture hierarchical information. Shallow layers mainly encode local textures and fine-grained details, whereas deeper layers tend to capture high-level semantic information. To exploit multi-level representations, AMGAA does not rely on a single Transformer layer. Instead, it introduces a weighted aggregation mechanism to integrate features and attention maps from multiple layers:

F_{final} = \sum_{l = 1}^{L} w_{l} \cdot F^{(l)},

(3)

A_{final} = \sum_{l = 1}^{L} w_{l} \cdot A^{(l)},

(4)

where L denotes the number of selected Transformer layers, and

w_{l}

is the learnable weight of the l-th layer, satisfying

\sum_{l = 1}^{L} w_{l} = 1 .

(5)

In these equations,

F^{(l)}

and

A^{(l)}

represent the feature representation and attention matrix from the l-th layer, respectively. This multi-layer aggregation preserves both local details and global semantics, providing a more robust representation for subsequent attention-guided feature fusion and perturbation generation.

3.3. Attention-Guided Feature Fusion

To move the generated adversarial examples closer to the target image in the feature space, we design an attention-guided feature fusion mechanism. This mechanism weights the target features using the target attention distribution and injects key semantic information into the source representation, thereby enhancing the tendency of the source image to shift toward the target class.

Specifically, for the i-th patch of the source image, the fused feature is defined as:

F_{s}^{'} (i) = F_{s} (i) + α {\hat{A}}_{t} (i) F_{t} (i),

(6)

where

F_{s} (i)

and

F_{t} (i)

denote the feature vectors of the i-th patch in the source and target images, respectively.

{\hat{A}}_{t} (i)

is the normalized importance weight of the i-th patch derived from the target attention matrix

A_{t}

, and

α

is a hyperparameter controlling the feature injection strength.

In this fusion process, target regions with higher attention weights exert stronger influence on the source representation, encouraging the source features to approach the discriminative features of the target image. Unlike direct global feature replacement, this attention-guided selective injection highlights key target regions more precisely, improves feature transfer effectiveness, and provides a clearer direction for subsequent perturbation generation.

3.4. Adaptive Perturbation Generation Module

After feature fusion, we further design an attention-based adaptive perturbation generation strategy. Unlike conventional methods that apply perturbations uniformly over the entire image, our method determines the perturbation regions according to the source attention response. This allows the limited perturbation budget to focus on decision-sensitive patches, improving attack efficiency and effectiveness.

First, the source attention matrix

A_{s}

is aggregated to compute the global importance score of each patch:

s_{i} = \sum_{j} A_{s} (j, i),

(7)

where

s_{i}

denotes the importance of the i-th patch in global attention interactions. Based on

s_{i}

, the top-k important patches are selected, and their index set is denoted as

Ω_{k}

. A binary mask is then constructed as:

M (i) = \{\begin{matrix} 1, & i \in Ω_{k} \\ 0, & i \notin Ω_{k} \end{matrix} .

(8)

Then, the feature variation before and after fusion is used to construct the patch-level guidance representation:

z (i) = M (i) \cdot {\hat{A}}_{s} (i) \cdot (F_{s}^{'} (i) - β F_{s} (i)),

(9)

where

{\hat{A}}_{s} (i)

denotes the normalized importance weight of the i-th source patch, and

F_{s}^{'} (i) - β F_{s} (i)

represents the feature variation with source-feature retention controlled by the learnable coefficient

β

. The guidance representation

z (i)

is not directly added to the image. Instead, it serves as the input condition of the perturbation generator.

Given the perturbation budget

ϵ

, the pixel-level perturbation is generated by:

δ = clip (G_{θ} (z), - ϵ, ϵ),

(10)

where

G_{θ} (\cdot)

(see Figure 2) denotes the perturbation generator, which maps the patch-level guidance representation to a pixel-level perturbation with the same spatial size as the source image. The function

clip (\cdot)

constrains the generated perturbation within the budget

[- ϵ, ϵ]

.

Finally, the adversarial example is obtained as:

x_{s}^{'} = {clip}_{[0, 1]} (x_{s} + δ) .

(11)

This design ensures that the perturbation is generated in the pixel space while still being guided by feature variation, attention strength, and the selected patch mask. As a result, the perturbation is concentrated on highly important regions, making it more consistent with the attention characteristics of ViTs and improving both attack effectiveness and visual imperceptibility.

3.5. Loss Optimization

To balance attack effectiveness, attention consistency, and visual naturalness, we construct a joint optimization objective consisting of adversarial loss, attention constraint loss, and total variation loss. By combining these losses, the proposed method improves targeted attack success while suppressing unnecessary high-frequency noise.

3.5.1. Adversarial Loss

The adversarial loss encourages the generated adversarial example to approach the target image in the feature space. We adopt cosine similarity loss to measure the feature distance between the adversarial example and the target image:

L_{adv} = 1 - \frac{F (x_{s}^{'}) \cdot F (x_{t})}{∥ F (x_{s}^{'}) ∥ ∥ F (x_{t}) ∥},

(12)

where

F (x_{s}^{'})

and

F (x_{t})

denote the feature representations of the adversarial example and the target image, respectively. Minimizing this loss drives the adversarial example toward the target feature distribution, thereby improving targeted attack effectiveness.

3.5.2. Attention Constraint Loss

To align the attention distribution of the adversarial example with that of the target image, we introduce an attention constraint loss:

L_{attn} = \sum_{i = 1}^{n} {∥{\hat{A}}_{x_{s}^{'}} (i) - {\hat{A}}_{t} (i)∥}^{2},

(13)

where

{\hat{A}}_{x_{s}^{'}} (i)

denotes the normalized attention weight of the adversarial example at the i-th patch, and

{\hat{A}}_{t} (i)

denotes the corresponding normalized attention weight of the target image. Minimizing this loss encourages attention-level alignment and enhances target-oriented guidance.

3.5.3. Total Variation Loss

To reduce high-frequency noise and improve visual naturalness, we employ total variation (TV) loss as a smoothness regularizer:

L_{tv} = \sum_{i, j} \sqrt{{(x_{i + 1, j}^{'} - x_{i, j}^{'})}^{2} + {(x_{i, j + 1}^{'} - x_{i, j}^{'})}^{2}},

(14)

where

x_{i, j}^{'}

denotes the pixel value of the adversarial example at position

(i, j)

. TV loss penalizes abrupt changes between adjacent pixels, encouraging spatially smoother perturbations and reducing visually noticeable artifacts.

3.5.4. Overall Objective

The overall loss function is defined as:

L_{total} = L_{adv} + λ_{1} L_{attn} + λ_{2} L_{tv},

(15)

where

λ_{1}

and

λ_{2}

are weighting coefficients that balance the contributions of different loss terms.

During training, the generator parameters are updated by backpropagation, rather than iteratively optimizing the perturbation for each image. Under the perturbation budget constraint, the generator learns to produce adversarial perturbations that balance attack effectiveness, attention transfer, and visual imperceptibility. After training, adversarial examples can be generated with a single forward pass, improving attack efficiency.

4. Experiments and Results

4.1. Experimental Settings

To comprehensively evaluate the effectiveness, transferability, and generalization ability of AMGAA, we conduct extensive experiments and compare it with several representative adversarial attack methods.

Datasets: We use CIFAR-10 [39] and ImageNet [40] in our experiments. CIFAR-10 is mainly used for training and evaluating multi-target attacks, while ImageNet is used to evaluate transferability and generalization ability.
Victim models: To evaluate the attack performance of AMGAA across different architectures, we consider several representative victim models, including the standard ViT-B/16 [12], Swin Transformer (Swin-B) [41] with hierarchical window-based attention, robustness-oriented RVT [42], and classical CNNs, including ResNet-50 [15], DenseNet-121 [43], and VGG-19 [14]. These models cover both Transformer and CNN-based architectures, enabling a systematic evaluation of cross-architecture transferability and attack effectiveness.
Evaluation metrics: We evaluate the proposed method using three metrics.
1.
Attack Success Rate (ASR): the proportion of adversarial examples that successfully induce the target model to make the desired incorrect prediction.
2.
Generalization ability: the ability of the generator to produce effective adversarial examples for unknown target classes.
3.
Perceptual quality: SSIM and LPIPS are used to measure structural similarity and perceptual distance between clean and adversarial images.
Compared methods: We compare AMGAA with six representative adversarial attacks, including gradient-based iterative pixel-level attacks PGD [23], C&W [22], and MI-FGSM [24], the local patch-based attack G-Patch [21], and generative multi-target attacks MAN [37] and CGNC [38]. These baselines cover different attack paradigms, allowing a comprehensive comparison in terms of attack success, transferability, and generalization.
Implementation details: In all experiments, the perturbation budget is set to $16 / 255$ , and ViT-B/16 is used as the surrogate model. For AMGAA, we use the AdamW optimizer [44] with a learning rate of 2 × 10⁻⁴ and train the model for 20 epochs. The number of selected patches k is set to 16, and the feature injection coefficient $α$ is set to 0.3, and the learnable source-feature retention coefficient $β$ is initialized to 1. For the baseline attacks, PGD is performed for 20 iterations with a step size of $2 / 255$ . MI-FGSM is also run for 20 iterations with a step size of $2 / 255$ and a momentum factor of 1.0. C&W uses the Adam optimizer [45] with a learning rate of 0.01 and a maximum of 1000 iterations. Unless otherwise specified, other hyperparameters follow the original papers or common settings, and all methods are evaluated under the same perturbation budget for fair comparison.

4.2. Main Experimental Results

4.2.1. Transferability in Single-Target Attacks

We evaluate the transferability of different methods under the single-target setting on ImageNet. Since evaluating only one target class may introduce bias caused by class-specific difficulty, we randomly select 20 target classes for evaluation. For each selected class, it is fixed as the attack target, and ImageNet training images whose ground-truth labels are different from the target class are used as source images. A separate single-target attack model is trained for each target class, resulting in 20 independently trained attack models.

During testing, each trained model is used only to generate adversarial examples for its corresponding target class. For each target class, we randomly sample 10,000 source images from the ImageNet validation set, ensuring that their ground-truth labels are different from the current target class. Therefore, a total of 200,000 targeted adversarial examples are generated for evaluation. All adversarial examples are first generated using ViT-B/16 as the surrogate model and then transferred to Swin-B, RVT, DenseNet-121, VGG-19, and ResNet-50 for black-box evaluation. For fair comparison, all methods are evaluated under the same perturbation budget. For G-Patch, the patch size is set to

64 \times 64

. For AMGAA, we set

k = 16

. Under an input size of

224 \times 224

and a patch size of

16 \times 16

, the perturbed area is equivalent to a

64 \times 64

patch. For CGNC and MAN, we follow the hyperparameter settings reported in their original papers. The final results are reported as the average ASR over the 20 target classes.

As shown in Table 1, AMGAA achieves the best overall performance in the single-target transfer attack setting, with an average ASR of 43.2%. It clearly outperforms CGNC (37.1%), G-Patch (35.9%), and MAN (33.9%). On the surrogate model ViT-B/16, AMGAA achieves an ASR of 91.4%, demonstrating its ability to generate strongly target-oriented adversarial examples. More importantly, in black-box transfer evaluation, AMGAA obtains the highest ASR on Swin-B, RVT, DenseNet-121, and VGG-19, reaching 55.4%, 40.7%, 30.1%, and 27.6%, respectively. These results indicate that AMGAA has strong cross-model transferability. Although G-Patch and MAN slightly outperform AMGAA on ResNet-50, AMGAA still shows more stable overall performance across most victim models. These results demonstrate that exploiting ViT attention for target feature fusion and adaptive perturbation allocation can effectively improve the transferability and overall effectiveness of generative targeted attacks.

4.2.2. Comparison of Multi-Target Attack Success Rate

We evaluate the attack success rate of different methods in the multi-target setting on CIFAR-10. Specifically, models are trained on the CIFAR-10 training set and evaluated on the CIFAR-10 test set. During testing, we randomly select 5000 images from the test set. For each image, all nine classes different from its ground-truth label are used as target classes, and the corresponding targeted adversarial examples are generated. This results in 45K adversarial examples for evaluation.

An attack is considered successful if the adversarial example is classified as its assigned target class. Based on this criterion, we compute the targeted attack success rate (ASR). In this experiment, AMGAA is mainly compared with two representative multi-target generative attacks, MAN and CGNC, to evaluate its multi-target attack capability and cross-model transferability.

As shown in Table 2, AMGAA achieves the best overall performance in the multi-target attack setting, with an average ASR of 39.0%, outperforming CGNC (33.3%) and MAN (30.9%). On the surrogate model ViT-B/16, AMGAA reaches an ASR of 89.8%, indicating its strong ability to learn target-oriented perturbations under multiple target classes. For black-box victim models, AMGAA also achieves the best or competitive ASR on Swin-B, RVT, DenseNet-121, VGG-19, and ResNet-50, with scores of 47.4%, 36.8%, 26.4%, 22.1%, and 11.8%, respectively. In particular, AMGAA improves over CGNC by 7.9 and 8.4 percentage points on Swin-B and RVT, respectively, showing stronger transferability across Transformer architectures. Overall, these results demonstrate that AMGAA maintains high targeted attack performance in the multi-target setting and improves transferability across different victim models.

4.2.3. Comparison of Generalization Ability

We evaluate the generalization ability of AMGAA on ImageNet. Specifically, 500 classes are randomly selected as known classes to train MAN, CGNC, and our AMGAA, while the remaining 500 classes are used as unknown target classes. During evaluation, 1000 source images are randomly sampled from the validation set of the known classes. For each source image, one target image is sampled from each unknown class, and the source image is attacked toward the corresponding target class, yielding 500,000 targeted adversarial examples in total. Generalization performance is measured by the attack success rate on these examples.

As shown in Table 3, AMGAA demonstrates stronger generalization ability on unknown classes than the compared methods, achieving an average ASR of 21.1%, which is higher than CGNC (13.0%) and MAN (5.9%). On the surrogate model ViT-B/16, AMGAA reaches an ASR of 41.7%, outperforming CGNC and MAN by 11.2 and 21.0 percentage points, respectively. This indicates that AMGAA can still learn effective target-oriented perturbation patterns for unknown classes. On Swin-B, AMGAA improves over CGNC by 17.6 percentage points, showing stronger transfer generalization in both unknown class and cross-model settings. Overall, these results suggest that AMGAA is not limited to target classes observed during training, but can also maintain effective attack performance on unknown classes, demonstrating better generalization and practical potential.

4.2.4. Perceptual Imperceptibility Analysis

We further evaluate the perceptual imperceptibility of adversarial examples generated by different generative attacks. Specifically, we conduct this analysis using the adversarial examples from the single-target transfer setting in Section 4.2.1. For G-Patch, CGNC, and AMGAA, each method is evaluated on 20 target classes, with 10,000 targeted adversarial examples generated for each class. Thus, 200,000 adversarial examples are used for perceptual quality evaluation for each method.

We adopt SSIM and LPIPS to measure the visual difference between clean and adversarial images. A higher SSIM indicates better structural similarity, while a lower LPIPS indicates stronger perceptual imperceptibility. The final results are obtained by averaging SSIM and LPIPS over the 200,000 adversarial examples generated by each method.

As shown in Table 4, AMGAA achieves the best perceptual imperceptibility. Compared with G-Patch, AMGAA obtains higher SSIM and lower LPIPS. Compared with CGNC, AMGAA achieves a lower LPIPS while maintaining the same SSIM. These results indicate that AMGAA can better suppress visual perturbations and preserve natural image appearance while maintaining strong attack performance.

4.3. Ablation Studies

To evaluate the influence of different design factors in AMGAA, we conduct ablation studies from three aspects: module design, loss function, and hyperparameter setting. Unless otherwise specified, all ablation experiments follow the same setting as the multi-target attack experiment in Section 4.2.2, including dataset split, surrogate model, victim models, and perturbation budget, to ensure fair comparison.

4.3.1. Module Ablation Study

To verify the effectiveness of the core modules in AMGAA, we design the following variants:

Full AMGAA: The complete model.
w/o attention-guided fusion (w/o AGF): Target attention guidance is removed during feature fusion, while only target feature injection is retained. This setting examines the effect of attention-guided semantic injection.
w/o adaptive perturbation (w/o AP): Adaptive perturbation weighting is removed, and uniform perturbations are applied to the selected patches. This setting evaluates the role of source-attention-based perturbation allocation.
Random-k selection (Rand-k): Importance-based patch selection is replaced with random selection of k patches. This setting verifies the effectiveness of attention-based patch selection.
w/o multi-layer aggregation (w/o MLA): Only the last-layer ViT features and attention maps are used. This setting assesses the contribution of multi-layer representation aggregation.

The module ablation results are shown in Figure 3a.

The results show that removing any key component leads to performance degradation, indicating that each module contributes to the final performance of AMGAA. Among all variants, random-k selection causes the largest drop in average ASR, from 39.0% to 34.6%, demonstrating the importance of importance-based patch selection for multi-target attacks. Removing attention-guided fusion reduces the average ASR to 35.4%, indicating that target attention guidance effectively enhances semantic injection and improves target-oriented attack performance. In addition, removing adaptive perturbation and multi-layer aggregation decreases the average ASR to 36.8% and 36.1%, respectively, showing that adaptive perturbation allocation and multi-layer feature aggregation are also important for improving attack stability and cross-model transferability.

4.3.2. Loss Ablation Study

The total loss comprises the adversarial loss, the attention constraint loss (

L_{attn}

), and the total variation loss (

L_{tv}

). To analyze the effect of each loss term, we design four settings: Full Loss, w/o

L_{attn}

, w/o

L_{tv}

, and w/o

L_{attn} + L_{tv}

, where both regularization terms are removed and only the adversarial loss is retained. These settings allow us to examine the contribution of each loss term to the attack success rate. The loss ablation results are shown in Figure 3b.

The results show that removing any loss term degrades the attack performance, indicating that all loss components contribute to AMGAA. When

L_{attn}

is removed, the average ASR drops from 39.0% to 36.1%, suggesting that the attention constraint loss improves attention consistency between the adversarial and target images and enhances target-oriented transferability. Removing

L_{tv}

further reduces the average ASR to 34.9%, showing that total variation loss helps suppress high-frequency noise and stabilize the perturbation structure. When both

L_{attn}

and

L_{tv}

are removed, the average ASR decreases to 32.9%, indicating that these two regularization terms are complementary in improving attack effectiveness and sample quality.

4.3.3. Hyperparameter Analysis

To further evaluate the influence of key parameter settings, we analyze two core hyperparameters in AMGAA, the number of important patches k and the feature injection coefficient

α

. These two parameters control the spatial coverage of perturbations and the strength of target semantic injection, respectively, and therefore directly affect attack performance and transferability. This analysis helps reveal the performance trends under different parameter settings and provides guidance for selecting the default values.

Number of important patches k: This parameter controls the perturbation coverage. A smaller k helps preserve visual naturalness but may limit attack strength, while a larger k can improve ASR but may introduce more visible distortions. We evaluate the performance under $k \in {1, 4, 8, 16, 32, 64}$ .

The results in Figure 4a show that the average ASR generally increases as k becomes larger, rising from 28.4% at

k = 1

to 39.9% at

k = 64

. This indicates that increasing the perturbation coverage can enhance attack effectiveness. However, the improvement becomes marginal when k increases from 16 to 32 and 64, suggesting that adding more patches does not yield proportional gains. Considering both attack performance and perturbation cost, we set

k = 16

as the default value.

Feature injection coefficient $α$ : This parameter controls the strength of target feature injection. A small $α$ may provide insufficient target semantic guidance, while an excessively large $α$ may cause over-aggressive fusion and disrupt feature stability and visual naturalness. We evaluate the effect of $α \in {0.1, 0.2, 0.3, 0.5, 1.0}$ on ASR and transferability.

As shown in Figure 4b,

α

has a significant impact on model performance. As

α

increases from 0.1 to 0.3, the average ASR improves from 34.0% to 39.0%, indicating that moderate target feature injection enhances semantic guidance. However, when

α

further increases to 0.5 and 1.0, the performance starts to decline, suggesting that overly strong feature injection may disturb the source feature structure and hinder adversarial generation and transfer. Overall,

α = 0.3

achieves the best performance and is therefore used as the default setting.

4.4. Attention Visualization Analysis

To verify whether the adversarial examples generated by AMGAA can indeed affect the internal attention behavior of ViTs and thereby facilitate the attack, we visualize the attention maps following the method of Kim et al. [46]. Specifically, we average the attention scores across all attention heads in each Transformer layer and visualize the attention score of each token with respect to a given query token, i.e., the center patch marked by the red box in the example. This visualization experiment is conducted based on the generalization setting in Section 4.2.3, and four adversarial examples are selected from the 500,000 generated samples for visualization analysis.

As shown in Figure 5, the adversarial images generated by AMGAA remain visually similar to the original images, but their attention responses differ across different Transformer layers. The target images serve as references for examining whether the adversarial attention responses exhibit target-oriented patterns. In particular, the attention distributions of adversarial images in the middle and deeper layers exhibit noticeable changes compared with the original images and show a tendency to approach the target-related attention patterns. This phenomenon suggests that AMGAA can modify the internal attention behavior of ViTs through visually imperceptible perturbations. These results further support the effectiveness of the proposed attention-guided perturbation generation strategy and provide intuitive evidence that AMGAA facilitates target-oriented adversarial transfer by influencing attention responses.

5. Conclusions

In this paper, we propose AMGAA, a multi-target generative adversarial attack method for exploiting the vulnerability of Vision Transformers in targeted attack scenarios. AMGAA leverages ViT self-attention in both feature fusion and perturbation generation, and jointly optimizes adversarial loss, attention constraint loss (

L_{attn}

), and total variation loss (

L_{tv}

) to balance attack effectiveness, transferability, and visual imperceptibility. Experimental results show that AMGAA outperforms competing methods in single-target transfer attacks, multi-target attacks, and unknown-class generalization. It also demonstrates clear advantages in cross-model transferability. Perceptual quality evaluation further indicates that AMGAA can maintain strong attack performance while preserving good visual naturalness. Ablation studies confirm the effectiveness of attention-guided feature fusion, adaptive perturbation allocation, important patch selection, and the proposed loss terms. Overall, this work provides an effective framework for multi-target generative attacks against ViTs. Beyond attack generation, AMGAA can also help analyze the adversarial vulnerability of ViTs by revealing how perturbations on decision-sensitive patches influence internal attention responses and target-oriented feature representations. This perspective may provide useful insights for robustness evaluation and the design of more reliable Transformer-based vision models. Nevertheless, AMGAA still has limited generalization capability when attacking unknown target classes, particularly compared with its performance on seen classes. Future work will focus on improving the generalization ability of generative attacks across unknown categories and exploring their applicability in real-world scenarios. In this direction, we will further investigate whether AMGAA-generated adversarial examples can be incorporated into adversarial retraining or fine-tuning of ViT models, with the goal of enhancing robustness against other known attack methods while maintaining standard classification performance on clean images. We will also explore an iterative attack–retraining process, where the retrained model is attacked again to generate new adversarial samples, so as to more systematically evaluate and improve the robustness of ViTs.

Author Contributions

Conceptualization, D.O. and J.T.; methodology, D.O. and J.L.; software, D.O.; validation, D.O., S.Z. and Y.Z.; formal analysis, D.O. and J.L.; investigation, D.O., D.L., H.L. and C.Y.; resources, J.T. and J.L.; data curation, D.O., D.L. and H.L.; writing—original draft preparation, D.O.; writing—review and editing, J.L., S.Z., Y.Z., Y.H. and J.T.; visualization, D.O. and C.Y.; supervision, J.T. and Y.H.; project administration, J.T. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets used in this study are publicly available. CIFAR-10 is available at https://www.cs.toronto.edu/~kriz/cifar.html (accessed on 8 June 2026), and ImageNet is available at https://www.image-net.org/ (accessed on 8 June 2026). Further details and processed experimental results are available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet Classification with Deep Convolutional Neural Networks. In Proceedings of the Advances in Neural Information Processing Systems; Pereira, F., Burges, C., Bottou, L., Weinberger, K., Eds.; Curran Associates, Inc.: New York, NY, USA, 2012; Volume 25. [Google Scholar]
Vilas, M.G.; Schaumlöffel, T.; Roig, G. Analyzing Vision Transformers for Image Classification in Class Embedding Space. In Proceedings of the Advances in Neural Information Processing Systems; Oh, A., Naumann, T., Globerson, A., Saenko, K., Hardt, M., Levine, S., Eds.; Curran Associates, Inc.: New York, NY, USA, 2023; Volume 36, pp. 40030–40041. [Google Scholar]
Xu, C.; Wu, J.; Zhang, F.; Freer, J.; Zhang, Z.; Cheng, Y. A deep image classification model based on prior feature knowledge embedding and application in medical diagnosis. Sci. Rep. 2024, 14, 13244. [Google Scholar] [CrossRef] [PubMed]
Tang, Z.; Zhang, Y.; Ming, L. Novel snap-layer MMPC scheme via neural dynamics equivalency and solver for redundant robot arms with five-layer physical limits. IEEE Trans. Neural Netw. Learn. Syst. 2024, 36, 3534–3546. [Google Scholar] [CrossRef]
Zhang, Z.; Cao, Z.; Li, X. Neural dynamic fault-tolerant scheme for collaborative motion planning of dual-redundant robot manipulators. IEEE Trans. Neural Netw. Learn. Syst. 2024, 36, 11189–11201. [Google Scholar] [CrossRef]
Xiao, L.; Zhang, Y.; Liao, B.; Zhang, Z.; Ding, L.; Jin, L. A velocity-level bi-criteria optimization scheme for coordinated path tracking of dual robot manipulators using recurrent neural network. Front. Neurorobot. 2017, 11, 47. [Google Scholar] [CrossRef]
Zhang, Z.; Ding, C.; Zhang, M.; Luo, Y.; Mai, J. DCDLN: A densely connected convolutional dynamic learning network for malaria disease diagnosis. Neural Netw. 2024, 176, 106339. [Google Scholar] [CrossRef] [PubMed]
Cai, Z.; Zhou, K.; Liao, Z. A systematic review of YOLO-based object detection in medical imaging: Advances, challenges, and future directions. Comput. Mater. Contin. 2025, 85, 2255. [Google Scholar] [CrossRef]
Sun, L.; Mo, Z.; Yan, F.; Xia, L.; Shan, F.; Ding, Z.; Song, B.; Gao, W.; Shao, W.; Shi, F.; et al. Adaptive feature selection guided deep forest for COVID-19 classification with chest CT. IEEE J. Biomed. Health Inform. 2020, 24, 2798–2805. [Google Scholar] [CrossRef] [PubMed]
Szegedy, C.; Zaremba, W.; Sutskever, I.; Bruna, J.; Erhan, D.; Goodfellow, I.; Fergus, R. Intriguing properties of neural networks. arXiv 2013, arXiv:1312.6199. [Google Scholar]
Goodfellow, I.J.; Shlens, J.; Szegedy, C. Explaining and harnessing adversarial examples. arXiv 2014, arXiv:1412.6572. [Google Scholar]
Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 2002, 86, 2278–2324. [Google Scholar] [CrossRef]
Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2016; pp. 770–778. [Google Scholar]
Mahmood, K.; Mahmood, R.; Van Dijk, M. On the robustness of vision transformers to adversarial examples. In Proceedings of the IEEE/CVF International Conference on Computer Vision; IEEE: New York, NY, USA, 2021; pp. 7838–7847. [Google Scholar]
Shao, R.; Shi, Z.; Yi, J.; Chen, P.Y.; Hsieh, C.J. On the adversarial robustness of vision transformers. arXiv 2021, arXiv:2103.15670. [Google Scholar] [CrossRef]
Fu, Y.; Zhang, S.; Wu, S.; Wan, C.; Lin, Y. Patch-fool: Are vision transformers always robust against adversarial perturbations? arXiv 2022, arXiv:2203.08392. [Google Scholar] [CrossRef]
Joshi, A.; Akula, S.C.; Jagatap, G.; Hegde, C. A Few Adversarial Tokens Can Break Vision Transformers. 2025. Available online: https://robustart.github.io/long_paper/31.pdf (accessed on 9 June 2026).
Joshi, A.; Jagatap, G.; Hegde, C. Adversarial token attacks on vision transformers. arXiv 2021, arXiv:2110.04337. [Google Scholar] [CrossRef]
Shao, M. Random position adversarial patch for vision transformers. arXiv 2023, arXiv:2307.04066. [Google Scholar] [CrossRef]
Carlini, N.; Wagner, D. Towards evaluating the robustness of neural networks. In Proceedings of the 2017 IEEE Symposium on Security and Privacy (sp); IEEE: New York, NY, USA, 2017; pp. 39–57. [Google Scholar]
Madry, A.; Makelov, A.; Schmidt, L.; Tsipras, D.; Vladu, A. Towards deep learning models resistant to adversarial attacks. arXiv 2017, arXiv:1706.06083. [Google Scholar]
Dong, Y.; Liao, F.; Pang, T.; Su, H.; Zhu, J.; Hu, X.; Li, J. Boosting adversarial attacks with momentum. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2018; pp. 9185–9193. [Google Scholar]
Croce, F.; Hein, M. Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In Proceedings of the International Conference on Machine Learning, Vienna, Austria, 12–18 July 2020; PMLR; pp. 2206–2216. [Google Scholar]
Naseer, M.M.; Ranasinghe, K.; Khan, S.H.; Hayat, M.; Shahbaz Khan, F.; Yang, M.H. Intriguing properties of vision transformers. Adv. Neural Inf. Process. Syst. 2021, 34, 23296–23308. [Google Scholar]
Gu, J.; Tresp, V.; Qin, Y. Are vision transformers robust to patch perturbations? In Proceedings of the European Conference on Computer Vision; Springer: Cham, Switzerland, 2022; pp. 404–421. [Google Scholar]
Wu, W.; Su, Y.; Chen, X.; Zhao, S.; King, I.; Lyu, M.R.; Tai, Y.W. Boosting the transferability of adversarial samples via attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2020; pp. 1161–1170. [Google Scholar]
Wang, J.; Yin, Z.; Jiang, J.; Du, Y. Attention-guided black-box adversarial attacks with large-scale multiobjective evolutionary optimization. Int. J. Intell. Syst. 2022, 37, 7526–7547. [Google Scholar] [CrossRef]
Sharma, V.; Kalra, A.; Vaibhav, S.C.; Patel, L.; Morency, L.P. Attend and attack: Attention guided adversarial attacks on visual question answering models. In Proceedings of the Proc. Conf. Neural Inf. Process. Syst. Workshop Secur. Mach. Learn; Curran Associates, Inc.: Red Hook, NY, USA, 2018; Volume 2, p. 1. [Google Scholar]
Yuan, Z.; Zhang, J.; Jiang, Z.; Li, L.; Shan, S. Adaptive perturbation for adversarial attack. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 5663–5676. [Google Scholar] [CrossRef]
Duan, J.; Qiu, L.; He, G.; Zhao, L.; Zhang, Z.; Li, H. A region-adaptive local perturbation-based method for generating adversarial examples in synthetic aperture radar object detection. Remote Sens. 2024, 16, 997. [Google Scholar] [CrossRef]
Liu, J.; Zhang, C.; Lyu, X. Boosting the transferability of adversarial examples via local mixup and adaptive step size. In Proceedings of the ICASSP 2025—2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); IEEE: New York, NY, USA, 2025; pp. 1–5. [Google Scholar]
Hu, J.; Li, X.; Liu, C.; Zhang, R.; Tang, J.; Sun, Y.; Wang, Y. APDL: An adaptive step size method for white-box adversarial attacks. Complex Intell. Syst. 2025, 11, 116. [Google Scholar] [CrossRef]
Xiao, C.; Li, B.; Zhu, J.Y.; He, W.; Liu, M.; Song, D. Generating adversarial examples with adversarial networks. arXiv 2018, arXiv:1801.02610. [Google Scholar]
Naseer, M.; Khan, S.; Hayat, M.; Khan, F.S.; Porikli, F. On generating transferable targeted perturbations. In Proceedings of the IEEE/CVF International Conference on Computer Vision; IEEE: New York, NY, USA, 2021; pp. 7708–7717. [Google Scholar]
Han, J.; Dong, X.; Zhang, R.; Chen, D.; Zhang, W.; Yu, N.; Luo, P.; Wang, X. Once a man: Towards multi-target attack via learning multi-target adversarial network once. In Proceedings of the IEEE/CVF International Conference on Computer Vision; IEEE: New York, NY, USA, 2019; pp. 5158–5167. [Google Scholar]
Fang, H.; Kong, J.; Chen, B.; Dai, T.; Wu, H.; Xia, S.T. Clip-guided generative networks for transferable targeted adversarial attacks. In Proceedings of the European Conference on Computer Vision; Springer: Cham, Switzerland, 2024; pp. 1–19. [Google Scholar]
Krizhevsky, A. Learning Multiple Layers of Features from Tiny Images. 2009. Available online: https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf (accessed on 9 June 2026).
Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2009; pp. 248–255. [Google Scholar]
Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision; IEEE: New York, NY, USA, 2021; pp. 10012–10022. [Google Scholar]
Mao, X.; Qi, G.; Chen, Y.; Li, X.; Duan, R.; Ye, S.; He, Y.; Xue, H. Towards robust vision transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2022; pp. 12042–12051. [Google Scholar]
Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; IEEE: New York, NY, USA, 2017; pp. 4700–4708. [Google Scholar]
Loshchilov, I.; Hutter, F. Decoupled weight decay regularization. arXiv 2017, arXiv:1711.05101. [Google Scholar]
Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
Kim, K.; Wu, B.; Dai, X.; Zhang, P.; Yan, Z.; Vajda, P.; Kim, S.J. Rethinking the Self-Attention in Vision Transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops; IEEE: New York, NY, USA, 2021; pp. 3071–3075. [Google Scholar]

Figure 1. Overall framework of AMGAA. The source and target images are first fed into ViT to extract feature representations and attention maps. The target features and attention information are used by the feature fusion module to generate target-guided fused features, while the source attention map guides the perturbation generator to produce adaptive perturbations on important regions. The generated perturbation is added to the source image to obtain the final adversarial example.

Figure 2. Architecture of the perturbation generator.

Figure 3. Ablation results of AMGAA on different victim models.

Figure 4. Effects of key hyperparameters on attack success rate across different victim models.

Figure 5. Visualization of ViT attention maps for the original images, AMGAA-generated adversarial images, and target images. The red box denotes the selected query patch. In the attention heatmaps, brighter yellow-green colors indicate stronger attention responses, whereas darker purple-blue colors indicate weaker attention responses. Compared with the original images, the adversarial images preserve similar visual appearance while changing attention responses across different Transformer layers. The target images serve as references for examining whether the adversarial attention responses exhibit target-oriented patterns.

Table 1. Attack success rates (%) of different methods on various models under the single-target setting. The surrogate model is ViT-B/16.

Attacks	ViT-B/16	Swin-B	RVT	DenseNet-121	VGG-19	ResNet-50	Avg
PGD	96.1	29.6	12.4	11.4	9.9	5.6	27.5
C&W	94.6	20.3	9.1	13.7	8.5	7.4	25.1
MI-FGSM	93.9	14.5	5.3	6.4	4.2	3.8	21.4
G-Patch	70.8	49.7	35.4	24.5	20.1	14.7	35.9
MAN	89.6	40.1	24.3	19.9	15.2	14.4	33.9
CGNC	90.5	47.4	33.2	20.0	19.7	11.6	37.1
AMGAA	91.4	55.4	40.7	30.1	27.6	14.1	43.2

Table 2. Multi-target attack success rates (%) on CIFAR-10. The surrogate model is ViT-B/16.

Attacks	ViT-B/16	Swin-B	RVT	DenseNet-121	VGG-19	ResNet-50	Avg
MAN	86.4	36.0	19.7	17.4	14.5	10.9	30.9
CGNC	87.7	39.5	28.4	16.6	18.4	9.4	33.3
AMGAA	89.8	47.4	36.8	26.4	22.1	11.8	39.0

Table 3. Generalization performance (%) on unknown classes of ImageNet. The surrogate model is ViT-B/16.

Attacks	ViT-B/16	Swin-B	RVT	DenseNet-121	VGG-19	ResNet-50	Avg
MAN	20.7	6.1	1.6	2.4	2.5	2.0	5.9
CGNC	30.5	12.8	4.7	10.5	11.1	8.6	13.0
AMGAA	41.7	30.4	14.6	15.0	14.7	10.2	21.1

Table 4. Perceptual quality evaluation of adversarial examples.

Attacks	SSIM	LPIPS
G-Patch	0.85	0.024
CGNC	0.91	0.019
AMGAA	0.91	0.018

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Ou, D.; Lu, J.; Zhou, S.; Zeng, Y.; Liao, D.; Liu, H.; Yang, C.; He, Y.; Tian, J. AMGAA: Attention-Guided Multi-Target Generative Adversarial Attack for Vision Transformers. Entropy 2026, 28, 680. https://doi.org/10.3390/e28060680

AMA Style

Ou D, Lu J, Zhou S, Zeng Y, Liao D, Liu H, Yang C, He Y, Tian J. AMGAA: Attention-Guided Multi-Target Generative Adversarial Attack for Vision Transformers. Entropy. 2026; 28(6):680. https://doi.org/10.3390/e28060680

Chicago/Turabian Style

Ou, Dongbo, Jintian Lu, Shihui Zhou, Ying Zeng, Dongwan Liao, Haoyin Liu, Chao Yang, Yingsheng He, and Jie Tian. 2026. "AMGAA: Attention-Guided Multi-Target Generative Adversarial Attack for Vision Transformers" Entropy 28, no. 6: 680. https://doi.org/10.3390/e28060680

APA Style

Ou, D., Lu, J., Zhou, S., Zeng, Y., Liao, D., Liu, H., Yang, C., He, Y., & Tian, J. (2026). AMGAA: Attention-Guided Multi-Target Generative Adversarial Attack for Vision Transformers. Entropy, 28(6), 680. https://doi.org/10.3390/e28060680

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

AMGAA: Attention-Guided Multi-Target Generative Adversarial Attack for Vision Transformers

Abstract

1. Introduction

2. Related Work

2.1. Development of Adversarial Example Generation

2.2. Vision Transformers and Adversarial Attacks

2.3. Attention Mechanisms in Adversarial Example Generation

2.4. Adaptive Perturbation Generation

2.5. Multi-Target Generators

3. Methodology

3.1. Overview of the Framework

3.2. ViT Feature Extraction

3.3. Attention-Guided Feature Fusion

3.4. Adaptive Perturbation Generation Module

3.5. Loss Optimization

3.5.1. Adversarial Loss

3.5.2. Attention Constraint Loss

3.5.3. Total Variation Loss

3.5.4. Overall Objective

4. Experiments and Results

4.1. Experimental Settings

4.2. Main Experimental Results

4.2.1. Transferability in Single-Target Attacks

4.2.2. Comparison of Multi-Target Attack Success Rate

4.2.3. Comparison of Generalization Ability

4.2.4. Perceptual Imperceptibility Analysis

4.3. Ablation Studies

4.3.1. Module Ablation Study

4.3.2. Loss Ablation Study

4.3.3. Hyperparameter Analysis

4.4. Attention Visualization Analysis

5. Conclusions

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI