1. Introduction
The transformer architecture was originally developed for natural language processing tasks. Subsequently, it has been successfully adapted to computer vision, achieving performance comparable to that of Convolutional Neural Network (CNN)-based models in areas such as image classification [1,2,3] and object detection [4,5,6,7,8,9,10].
Transformer-based detectors primarily rely on self-attention and cross-attention mechanisms. These detectors employ a backbone network to extract image features, which are subsequently processed by the encoder’s self-attention mechanism in the form of a token sequence. In the self-attention mechanism, the three inputs—query (Q), key (K), and value (V)—are all derived from the image features extracted by the backbone network. The decoder consists of both self-attention and cross-attention mechanisms, which share a similar structural design but differ in their input sources. Within the decoder, the Q, K, and V for the self-attention mechanism are learnable parameters commonly referred to as “object queries.” Each object query represents a potential object to be detected.
The token sequence generated by the self-attention mechanism in the decoder, along with the output tokens from the encoder, is used to compute cross-attention, which ultimately contributes to the generation of detection results. Within the detector, self-attention primarily captures contextual relationships among tokens in the sequence, enabling a global understanding of object information. Cross-attention, on the other hand, aligns object queries with the encoder's aggregated feature representations. The overall architecture of a typical transformer-based detector is illustrated in Figure 1.
Although transformer-based detectors and CNN-based detectors differ significantly in architectural design, both are susceptible to adversarial patches [11,12]. First introduced by Brown et al. [13], adversarial patches refer to localized, unrestricted perturbations added to input images, which can cause Deep Neural Network (DNN) models to produce incorrect outputs. A substantial body of research has demonstrated that adversarial patches generated using a white-box surrogate model can effectively mislead black-box models with unknown parameters and architectures—a property known as attack transferability [14,15,16]. Assessing transformer-based detectors through the lens of adversarial patch transferability provides valuable insights into their security vulnerabilities, supports the development of more robust models for real-world deployment, and has garnered increasing attention within the academic community.
The integration of transformer-based object detectors into modern vision sensor systems—ranging from autonomous vehicles and surveillance cameras to mobile robotics and wearable devices—has accelerated their deployment in safety-critical applications. However, as these models become embedded in edge-AI sensor pipelines, their susceptibility to adversarial perturbations introduces tangible risks to system reliability and operational safety. Recent studies have emphasized the need for robustness-aware design in vision-based sensing, particularly under physical-world conditions where lighting variations, occlusions, and sensor noise are unavoidable [17,18]. Our work directly addresses this emerging concern by exposing critical vulnerabilities in attention mechanisms that are now foundational to many on-device detection frameworks. By demonstrating that localized adversarial patches can reliably degrade detector performance across diverse transformer architectures—even under real-world sensing constraints—this paper contributes to the growing field of secure and trustworthy sensing.
A number of studies have aimed to disrupt the self-attention mechanism within the encoder, seeking to manipulate the attention weights such that malicious query tokens are prioritized, thereby inducing erroneous detector outputs [19,20]. However, we observe that such approaches may not be fully effective, as adversarial perturbations often fail to align with the semantic structure of the input. In the multi-head self-attention layer of the encoder, the attention weight between the $i$-th and $j$-th tokens is computed as follows. Let the input token sequence be denoted as $\mathbf{x} = (x_1, x_2, \dots, x_n)$, where each $x_i$ represents a token embedding. The learnable projection matrices for query (Q), key (K), and value (V) are denoted as $W^{Q}$, $W^{K}$, and $W^{V}$, respectively. The attention score between the $i$-th and $j$-th tokens is abbreviated as $a_{ij}$, which is obtained by applying the softmax function to the raw attention logits:

$$e_{ij} = \frac{(x_i W^{Q})(x_j W^{K})^{\top}}{\sqrt{d}}, \qquad a_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{n}\exp(e_{ik})}$$

The output of the self-attention mechanism at the $i$-th position is computed based on a weighted aggregation of all input tokens, formulated as

$$z_i = \sum_{j=1}^{n} a_{ij}\, v_j$$

where $v_j = x_j W^{V}$ denotes the value vector associated with the $j$-th token.
From Equation (3), it is evident that the output of self-attention at position $i$ is a weighted sum of features from all positions in the input sequence. In the context of images, significant discrepancies exist between foreground (i.e., object) and background features, as well as among features corresponding to different object classes. This heterogeneity poses a challenge in training a single adversarial patch that can effectively interfere with both object and background tokens, which may exhibit vastly different feature distributions.
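For clarity, the following PyTorch snippet is a minimal, single-head sketch of the attention computation described above; the tensor names and dimensions are illustrative and do not correspond to any specific detector implementation.

```python
import torch
import torch.nn.functional as F

def single_head_self_attention(x, w_q, w_k, w_v):
    """Minimal single-head self-attention over a token sequence.

    x:   (n, d)  token embeddings
    w_q, w_k, w_v: (d, d) learnable projection matrices
    Returns the attention weights a (n, n) and outputs z (n, d).
    """
    d = x.size(-1)
    q, k, v = x @ w_q, x @ w_k, x @ w_v          # project tokens
    logits = q @ k.transpose(0, 1) / d ** 0.5     # raw attention logits e_ij
    a = F.softmax(logits, dim=-1)                 # a_ij, rows sum to 1
    z = a @ v                                     # z_i = sum_j a_ij * v_j
    return a, z

# toy usage
n, d = 6, 16
x = torch.randn(n, d)
w_q, w_k, w_v = (torch.randn(d, d) for _ in range(3))
a, z = single_head_self_attention(x, w_q, w_k, w_v)
```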
As shown in Figure 2a, the visualization of the self-attention map within the red region indicates that attention scores decrease with increasing spatial distance from the query location, while higher scores are observed among tokens belonging to the same object in adjacent regions. In Figure 2b, we apply an adversarially trained patch designed to amplify attention scores between the patch and other positions. The visualization of the red area reveals a more pronounced change in attention scores near the adversarial patch, suggesting stronger interference in its vicinity.
These results indicate that the patch exerts greater influence on nearby regions. However, increasing the attention weights between the adversarial patch and both object and background features across the image remains a challenging task. Furthermore, the original encoder computes self-attention over the entire input sequence, resulting in high computational complexity. To address this issue, subsequent works have focused on simplifying the encoder architecture through techniques such as sparse self-attention and lightweight network designs. These approaches often generate only a limited number of query tokens, primarily concentrated around object regions.
The architectural variations among different transformer-based detectors—particularly in terms of encoder design—lead to inconsistent attention behaviors, which in turn hinder the transferability of adversarial patches across models.
Moreover, these methods do not affect the cross-attention mechanism within the decoder. The output of cross-attention is formed by combining the encoder’s output tokens with a residual component derived from the object query. Enhancing the influence of adversarial tokens propagated from the encoder through the cross-attention mechanism may offer a promising avenue for improving attack effectiveness.
In this paper, we propose the Localized Query Attack (LQA), a targeted adversarial attack specifically designed for transformer-based object detectors. LQA utilizes an adversarial patch with unrestricted perturbations to disrupt both the encoder and decoder components of the model. Within the encoder, LQA interferes with the self-attention mechanism by selectively amplifying attention scores between object regions and the adversarial patch. As illustrated in Figure 3a, the red area highlights the location of the adversarial token in the input image. LQA focuses on strengthening the self-attention interactions between the adversarial token and the surrounding object region (indicated in blue). Due to the high feature similarity in the vicinity of the object, LQA aligns more coherently with the inherent attention mechanism compared to global interference strategies. This alignment leads to more effective disruption of the encoder's representation learning, as reflected in subsequent performance degradation.
Inspired by the work of Ferrando et al. on transformer interpretability in text translation tasks [21], we compute the joint attention matrix (JAM) for the decoder. This matrix decomposes the cross-attention output into contributions from distinct source tokens, as illustrated in Figure 3b. By attenuating the contributions of residual and normal encoder tokens within the JAM, we amplify the influence of the adversarial token, thereby effectively disrupting the cross-attention mechanism.
To evaluate the efficacy of LQA, we conducted experiments against five state-of-the-art methods using five different transformer-based detectors across two datasets. Our results demonstrate a significant improvement, achieving up to an 18.38% gain over the second-best method. Additionally, we validate the practical applicability of LQA in real-world scenarios, providing evidence that can inform the secure deployment of transformer-based detectors. Our research thus offers both a theoretical analysis and practical insight into how LQA can be used to assess and ultimately improve model robustness and security.
3. Localized Query Attack
3.1. Preliminaries
LQA trains adversarial patches using DETR as a local surrogate white-box detector $f$. Given an input dataset $\mathcal{D} = \{x_1, x_2, \dots, x_N\}$, the objective of LQA is to optimize the adversarial patch $\delta$ such that the following condition is satisfied:

$$f(x \oplus \delta) \neq G, \qquad \forall x \in \mathcal{D},$$

where $x \oplus \delta$ denotes the adversarial patch applied to the input image, and $y = f(x \oplus \delta)$ represents the detection output of the model. The detection result for the $i$-th object is defined as $y_i = (b_i, s_i)$, where $b_i = (x_i, y_i, w_i, h_i)$ corresponds to the bounding box with top-left coordinates $(x_i, y_i)$ and dimensions $(w_i, h_i)$, and $s_i$ denotes the classification score (before softmax) associated with the detected object. We denote the ground truth annotations as $G$.
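To make the objective concrete, the following PyTorch skeleton sketches how an adversarial patch could be optimized against a surrogate detector under this formulation; the detector interface, the `attack_loss` callable, the fixed-corner pasting, and the hyperparameters are illustrative assumptions rather than the exact LQA implementation.

```python
import torch

def optimize_patch(detector, dataloader, attack_loss, patch_size=300,
                   steps=500, lr=0.03):
    """Generic adversarial-patch optimization skeleton (white-box surrogate).

    detector:    callable returning detection outputs for a batch of images
    dataloader:  yields (images, boxes) with images of shape (B, 3, H, W)
    attack_loss: callable mapping (outputs, boxes) to a scalar loss whose
                 minimization drives f(x ⊕ δ) away from the ground truth G
    """
    patch = torch.rand(3, patch_size, patch_size, requires_grad=True)
    optimizer = torch.optim.Adam([patch], lr=lr)
    for _ in range(steps):
        for images, boxes in dataloader:
            delta = patch.clamp(0, 1)
            # paste the patch at a fixed corner here for brevity;
            # Section 3.2.1 instead randomizes the placement inside each box
            patched = images.clone()
            patched[:, :, :patch_size, :patch_size] = delta
            loss = attack_loss(detector(patched), boxes)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return patch.detach().clamp(0, 1)
```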
3.2. Method
LQA is primarily composed of two loss components: $\mathcal{L}_{enc}$, which targets the self-attention mechanism in the encoder, and $\mathcal{L}_{dec}$, which focuses on the cross-attention mechanism within the decoder.
3.2.1. Attack on Encoder
Several prior works attempt to disrupt the self-attention mechanism by either increasing the discrepancy in attention weights between adversarial examples and clean images, or amplifying the self-attention scores between adversarial tokens and other tokens. However, we identify two major limitations in these approaches. First, the significant feature variations among different tokens make it difficult to optimize a single adversarial patch that can effectively disrupt both object and background tokens simultaneously. Second, substantial architectural differences exist among the encoders of various transformer-based detectors. As a result, perturbing all tokens indiscriminately does not align well with the design trends of more advanced models.
To address these issues, we propose LQA to locally disrupt the self-attention within the encoder, as illustrated in Figure 3a. In the figure, the yellow grid represents distinct image regions corresponding to individual tokens mapped back to the input space, while the red region highlights the location of adversarial tokens introduced by the adversarial patch. The blue box indicates the detected object region. We denote the set of adversarial tokens as $\mathcal{A}$, and the set of object tokens within the detection box as $\mathcal{O}$. Based on this formulation, the loss term for disrupting self-attention in the encoder is defined as

$$\mathcal{L}_{enc} = -\frac{1}{|\mathcal{A}|\,|\mathcal{O}|}\sum_{i \in \mathcal{A}}\sum_{j \in \mathcal{O}} a_{ij}.$$
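As a concrete illustration, the snippet below computes such an encoder loss from an attention-weight matrix given index sets of adversarial and object tokens; the function name and the mean reduction are illustrative assumptions rather than the exact implementation.

```python
import torch

def encoder_attention_loss(attn, adv_idx, obj_idx):
    """Negative mean attention from adversarial tokens to object tokens.

    attn:    (n, n) self-attention weights a_ij from the encoder
    adv_idx: 1-D LongTensor of adversarial-token indices (set A)
    obj_idx: 1-D LongTensor of object-token indices inside the box (set O)
    Minimizing this loss increases attention between the patch and the object.
    """
    selected = attn[adv_idx][:, obj_idx]   # (|A|, |O|) block of a_ij
    return -selected.mean()
```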
We aim to disrupt the detector by increasing the self-attention scores between adversarial tokens and normal tokens within the corresponding detection boxes. Tokens located inside the same detection box typically share similar visual features, as they belong to a single object instance. When an adversarial patch occupies any part of the detection box, the resulting adversarial token should exhibit heightened self-attention interactions with other tokens in that region. Therefore, we randomly select the position of the adversarial patch within the detection box during training. The patch placement is determined according to the following formulation:

$$(x_p, y_p) = \big(x_0 + \alpha \cdot w,\; y_0 + \beta \cdot h\big), \qquad \alpha, \beta \sim U(0, 1),$$

where $(x_0, y_0)$ denotes the coordinates of the top-left corner of the detection box, and $w$ and $h$ represent its width and height, respectively. The coefficients $\alpha$ and $\beta$ are independently sampled from a uniform distribution over the interval (0, 1), ensuring diverse and spatially balanced placement of the adversarial patch within the detection box.
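The following sketch shows one way to realize this random placement during training, pasting the patch at a uniformly sampled offset inside each detection box; the clamping against the image bounds is an implementation assumption to keep the patch fully visible.

```python
import torch

def place_patch(image, patch, box):
    """Paste `patch` at a random location inside a detection box.

    image: (3, H, W) input tensor
    patch: (3, p, p) adversarial patch
    box:   (x0, y0, w, h) detection box in pixel coordinates
    """
    _, H, W = image.shape
    _, p, _ = patch.shape
    x0, y0, w, h = box
    alpha, beta = torch.rand(2)                  # alpha, beta ~ U(0, 1)
    xp = int(x0 + alpha * w)
    yp = int(y0 + beta * h)
    xp = max(0, min(xp, W - p))                  # keep the patch inside the image
    yp = max(0, min(yp, H - p))
    patched = image.clone()
    patched[:, yp:yp + p, xp:xp + p] = patch     # x ⊕ δ
    return patched
```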
3.2.2. Attack on Decoder
The decoder layer in transformer-based architectures primarily consists of self-attention and cross-attention mechanisms, which share a similar structural framework but differ in their input sources. In the self-attention mechanism, the query, key, and value vectors are all derived from the decoder’s learnable object queries. Initially, self-attention is computed among these object queries to model inter-object dependencies and contextual relationships. Subsequently, the image features extracted by the encoder serve as the key and value inputs, while the output of the self-attention module acts as the query for the cross-attention mechanism. This enables the decoder to selectively extract object-related information from the global image features. The final output tokens are generated by integrating the output of the cross-attention module with the original query through residual connections, where this fused representation serves as the input for subsequent decoder layers.
Cross-attention plays a crucial role in modeling interactions between image features and object queries, making it essential for accurate object detection. In the Localized Query Attack (LQA), adversarial patches are introduced into the input image to interfere with the detection process. These patches generate adversarial tokens that subsequently enter the cross-attention computation as key and value inputs from the encoder. The resulting cross-attention output tokens are influenced by two primary components: (1) the image features obtained from the encoder, and (2) the object queries propagated through the residual connection. Notably, object queries are learnable parameters within the decoder and are not directly accessible or modifiable by an attacker.
Prior studies have shown that residual components significantly influence the weight distribution of cross-attention outputs. To investigate this further, we conducted a comparative analysis of the cosine similarity between the output tokens and their corresponding residual components across different decoder layers in the DETR model. The results are visualized in Figure 4a.
Figure 4a illustrates that the cosine similarity between residual components and cross-attention outputs increases with the number of decoder layers. Adversarial patches primarily affect the cross-attention mechanism through encoder input tokens. A higher similarity indicates that the influence of adversarial tokens (from the encoder input) on the cross-attention output diminishes, implying a reduced adversarial effect.
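A minimal sketch of this analysis is given below: given per-layer residual inputs and cross-attention outputs collected from the decoder (e.g., via forward hooks), it reports the mean cosine similarity per layer. The variable names are illustrative.

```python
import torch
import torch.nn.functional as F

def residual_output_similarity(residuals, outputs):
    """Mean cosine similarity between residual inputs and cross-attention
    outputs for each decoder layer.

    residuals, outputs: lists of (num_queries, d) tensors, one per layer.
    """
    sims = []
    for r, o in zip(residuals, outputs):
        sims.append(F.cosine_similarity(r, o, dim=-1).mean().item())
    return sims  # one value per decoder layer
```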
The self-attention map in transformer-based detectors inherently contains both semantic foreground signals (e.g., object regions) and background clutter or texture noise (i.e., the residual component). During adversarial attacks, perturbations that indiscriminately amplify all attention responses may waste energy on non-discriminative background regions, reducing transferability. To resolve this, we propose computing a joint attention matrix (JAM), which decomposes the cross-attention output into contributions from the encoder input tokens and the residual components. By suppressing the residual, we encourage the adversarial patch to interfere primarily with object-related feature pathways, thereby enhancing cross-model disruption while minimizing perceptual distortion.
Let $Q$ denote the query inputs from the decoder in cross-attention, and let $K$ and $V$ represent the key–value inputs from the encoder. The attention weight matrix for the $h$-th head in multi-head cross-attention can then be expressed as

$$A^{h} = \mathrm{softmax}\!\left(\frac{(Q W_{Q}^{h})(K W_{K}^{h})^{\top}}{\sqrt{d}}\right).$$

The attention weight $A^{h}$ is multiplied with the value tokens and the corresponding projection matrix to produce the output token of the cross-attention mechanism, formulated as

$$O_{i} = \sum_{j=1}^{m} c_{i,j}, \qquad c_{i,j} = \sum_{h} A^{h}_{ij}\,(v_{j} W_{V}^{h})\, W_{O}^{h},$$

where $W_{O}^{h}$ and $W_{V}^{h}$ denote the output and value projection matrices for the $h$-th attention head, respectively, and $c_{i,j}$ is the contribution of the $j$-th encoder token to the $i$-th output. The original input to the cross-attention module is retained as the residual component $r_{i}$, which is then concatenated with the computed cross-attention contributions. A normalization is applied along the concatenation dimension to construct the joint attention matrix (JAM), denoted as $J$. The final JAM is defined as

$$J_{i} = \mathrm{Norm}\big(\big[\,\|c_{i,1}\|,\ \|c_{i,2}\|,\ \dots,\ \|c_{i,m}\|,\ \|r_{i}\|\,\big]\big).$$
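The following PyTorch sketch illustrates, under simplifying assumptions (per-head value and output projections passed explicitly, and an l1 normalization along the contribution dimension), how per-token contributions and the residual can be assembled into a joint attention matrix. It is a sketch of the idea rather than the exact implementation.

```python
import torch

def joint_attention_matrix(attn, v, w_v, w_o, residual):
    """Build a joint attention matrix (JAM) for one cross-attention layer.

    attn:     (heads, q, m)  attention weights A^h_ij
    v:        (m, d)         encoder value tokens
    w_v, w_o: (heads, d, d)  per-head value / output projections
    residual: (q, d)         residual input to the cross-attention block
    Returns J of shape (q, m + 1): normalized contribution magnitudes of the
    m encoder tokens plus the residual, for each object query.
    """
    # c[q, j, :] = sum_h A^h_qj * (v_j W_V^h) W_O^h
    proj_v = torch.einsum('md,hde,hef->hmf', v, w_v, w_o)   # (heads, m, d)
    contrib = torch.einsum('hqm,hmf->qmf', attn, proj_v)    # summed over heads
    mags = contrib.norm(dim=-1)                             # (q, m)  ||c_ij||
    res_mag = residual.norm(dim=-1, keepdim=True)           # (q, 1)  ||r_i||
    jam = torch.cat([mags, res_mag], dim=-1)                # (q, m + 1)
    return jam / jam.sum(dim=-1, keepdim=True)              # l1 normalization
```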
In $J$, the cross-attention output is explicitly decomposed into contributions from the encoder input and the residual component. However, directly increasing the weight of adversarial tokens within the JAM to amplify their influence in cross-attention computation is not practically feasible. This is because positional encodings are embedded within each object query in the decoder, as illustrated in Figure 4b. By visualizing the spatial positions of detected objects associated with individual object queries, we observe a shared receptive region for objects detected by the same query.
Moreover, the position of the adversarial patch remains fixed in the input image. Although multiple adversarial tokens may be present in the encoder input, typically only one token is associated with each object query. Establishing a precise correspondence between adversarial tokens and object queries thus presents a non-trivial challenge.
To address this limitation, we propose an alternative strategy: suppressing the contribution of the residual component within the JAM. This approach effectively amplifies the relative influence of adversarial tokens on the cross-attention output, without requiring explicit alignment between object queries and adversarial regions. Based on this mechanism, we design a decoder-side adversarial loss, denoted as $\mathcal{L}_{dec}$, to diminish the impact of the residual pathway and thereby enhance the adversarial effect. The detailed formulation of $\mathcal{L}_{dec}$ is presented in Equation (10).
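Although the exact form of Equation (10) is not reproduced here, the sketch below gives one plausible realization consistent with the description: it penalizes the normalized residual share of the JAM, so that minimizing the loss suppresses the residual pathway. Treat the function name and reduction as assumptions.

```python
import torch

def decoder_residual_loss(jam):
    """Suppress the residual contribution in the joint attention matrix.

    jam: (q, m + 1) normalized JAM, last column = residual contribution.
    Minimizing the mean residual share increases the relative influence of
    encoder tokens (including adversarial ones) on the cross-attention output.
    """
    return jam[:, -1].mean()
```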
3.2.3. Total Loss
In addition to $\mathcal{L}_{enc}$ and $\mathcal{L}_{dec}$, LQA incorporates two additional loss components: a classification loss $\mathcal{L}_{cls}$ and a total variation loss $\mathcal{L}_{tv}$, defined as follows:

$$\mathcal{L}_{cls} = -\frac{1}{N}\sum_{i=1}^{N}\log p_{i}(c_{t}), \qquad \mathcal{L}_{tv} = \sum_{u,v}\sqrt{(\delta_{u,v}-\delta_{u+1,v})^{2} + (\delta_{u,v}-\delta_{u,v+1})^{2}}.$$

Here, $p_{i}(c_{t})$ denotes the confidence score assigned to the target class $c_{t}$ by the model for the $i$-th prediction. In our formulation, the background class is selected as the target class to encourage misclassification toward non-object predictions.
The total variation loss penalizes spatial discontinuities between adjacent pixels in the adversarial patch, promoting smoothness. A smoother patch is more robust to real-world noise and better mimics natural textures.
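A minimal PyTorch sketch of these two auxiliary losses is shown below, assuming `scores` holds per-prediction class logits and the background class index is the target; for simplicity, an L1 variant of the total variation penalty is used, so the exact reduction may differ from the formulation above.

```python
import torch
import torch.nn.functional as F

def classification_loss(scores, target_class):
    """Push all predictions toward the target (background) class.

    scores: (N, num_classes) raw classification logits from the detector.
    """
    probs = F.softmax(scores, dim=-1)
    return -torch.log(probs[:, target_class] + 1e-12).mean()

def total_variation_loss(patch):
    """Penalize differences between adjacent pixels to encourage smoothness.

    patch: (3, p, p) adversarial patch with values in [0, 1].
    """
    dh = patch[:, 1:, :] - patch[:, :-1, :]     # vertical differences
    dw = patch[:, :, 1:] - patch[:, :, :-1]     # horizontal differences
    return dh.abs().mean() + dw.abs().mean()
```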
In total, LQA integrates four distinct loss components. Determining appropriate weighting coefficients for these losses presents a significant challenge, especially due to potential conflicts among the individual objectives during optimization. To address this issue, we adopt an adaptive weighting strategy inspired by Liu et al. [38], which dynamically adjusts the loss coefficients based on their historical variations across iterations.
Specifically, the relative change in each loss $\mathcal{L}_{k}$ between consecutive iterations is computed as

$$r_{k}^{t} = \frac{\mathcal{L}_{k}^{t-1}}{\mathcal{L}_{k}^{t-2}}.$$

Then, the normalized weight $\lambda_{k}^{t}$ for the $k$-th loss at iteration $t$ is calculated using a softmax-like function:

$$\lambda_{k}^{t} = \frac{M \exp\!\left(r_{k}^{t}\right)}{\sum_{j} \exp\!\left(r_{j}^{t}\right)},$$

where $M$ controls the overall magnitude of the weighted losses and is empirically set to 10 in our experiments. Finally, the overall objective function used in LQA is formulated as

$$\mathcal{L}_{total} = \lambda_{enc}^{t}\,\mathcal{L}_{enc} + \lambda_{dec}^{t}\,\mathcal{L}_{dec} + \lambda_{cls}^{t}\,\mathcal{L}_{cls} + \lambda_{tv}^{t}\,\mathcal{L}_{tv}.$$
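A sketch of this adaptive weighting is given below; the scaling constant `magnitude=10` mirrors the setting reported above, while other details (e.g., the neutral ratio used for the first two iterations) are assumptions.

```python
import torch

def adaptive_loss_weights(loss_history, magnitude=10.0):
    """Compute per-loss weights from their recent relative changes.

    loss_history: list of lists; loss_history[k] holds past values of loss k
                  (detached floats). Each list needs at least two entries
                  before the ratio becomes meaningful.
    Returns a tensor of weights summing to `magnitude`.
    """
    ratios = []
    for values in loss_history:
        if len(values) < 2:
            ratios.append(1.0)                   # warm-up: neutral ratio
        else:
            ratios.append(values[-1] / (values[-2] + 1e-12))
    r = torch.tensor(ratios)
    return magnitude * torch.softmax(r, dim=0)   # softmax-like normalization

# usage inside the training loop (illustrative):
# weights = adaptive_loss_weights(history)
# total = sum(w * l for w, l in zip(weights, [l_enc, l_dec, l_cls, l_tv]))
```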
3.3. Experimental Settings
Detector. This paper employs DETR as the surrogate white-box detector for generating adversarial patches. In addition, we use Anchor-DETR, DAB-DETR, Conditional-DETR, Deformable-DETR, and DINO as black-box detectors to evaluate the transferability of the generated adversarial examples.
Although all these models are based on the transformer architecture, they differ significantly in structure and design. Conducting experiments in a black-box setting with these diverse architectures enables a comprehensive assessment of the generalization and effectiveness of adversarial attacks across different variants of transformer-based object detectors.
Datasets and Evaluation Metrics. We conduct experiments on two widely used datasets: the INRIA Person dataset and the COCO Person dataset.
The INRIA Person dataset is a benchmark for pedestrian detection, containing thousands of images, including 614 positive training samples and 288 positive test samples. The COCO dataset is a large-scale object detection benchmark consisting of 80 object categories, with 118 K training images and 5 K validation images. The COCO Person dataset, a subset of the COCO validation set, contains 1695 images that include at least one person.
Both datasets encompass a wide variety of scenes, including indoor and outdoor environments, and varying lighting conditions, times of day, and weather scenarios. Furthermore, the datasets feature a broad range of human poses and appearances, making them well-suited for evaluating the performance of adversarial patches under diverse real-world conditions.
In our experiments, adversarial patches are trained on the INRIA Person training set and evaluated on both the INRIA Person and COCO Person validation sets. Detailed experimental settings and results are summarized in Table 1.
In object detection, mAP (mean average precision) is a widely adopted evaluation metric that quantifies the precision of a detector across different object categories. In our experiments, the effectiveness of adversarial attacks is evaluated by measuring the reduction in mAP@0.5—that is, the average precision at an Intersection over Union (IoU) threshold of 0.5—when adversarial examples are introduced.
The evaluation of detection performance typically involves two key criteria: localization accuracy and classification consistency. Localization accuracy is determined by whether the predicted bounding box overlaps sufficiently with the corresponding ground truth box, as measured by the IoU metric and compared against a predefined threshold. Classification consistency evaluates whether the predicted class label matches the true object category. A detection is considered correct only if both criteria are satisfied.
Based on these criteria, positive detections are identified, and precision is computed accordingly. The average precision (AP) is then calculated for each object category by integrating the precision–recall curve at different confidence thresholds. In this work, we focus specifically on the person category; hence, the reported mAP corresponds to the AP of the person class under the IoU threshold of 0.5.
As illustrated in Figure 5, the IoU between two bounding boxes $B_{1}$ and $B_{2}$ is defined as the ratio of their intersection area to their union area:

$$\mathrm{IoU}(B_{1}, B_{2}) = \frac{|B_{1} \cap B_{2}|}{|B_{1} \cup B_{2}|}.$$
While an IoU of 1 ideally indicates perfect alignment between the predicted and ground truth boxes, achieving this is rare in practice. Therefore, a detection is typically considered valid if its IoU with the ground truth exceeds a certain threshold—commonly set to 0.5.
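For reference, a small sketch of the IoU computation and the thresholding used to decide whether a detection counts as valid; boxes are assumed to be given in (x, y, w, h) format.

```python
def iou(box_a, box_b):
    """Intersection over Union for two boxes given as (x, y, w, h)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    ix = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))   # intersection width
    iy = max(0.0, min(ay + ah, by + bh) - max(ay, by))   # intersection height
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def is_valid_detection(pred_box, gt_box, threshold=0.5):
    """A detection is valid if its IoU with the ground truth exceeds 0.5."""
    return iou(pred_box, gt_box) >= threshold
```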
Implementation Details and Comparative Methods. All experiments are implemented using PyTorch 1.10 on an NVIDIA GeForce RTX 3090 GPU. For a fair comparison, all adversarial patch generation methods are implemented with DETR as the surrogate detector.
We employ the Adam optimizer with an initial learning rate of 0.03. The learning rate is decayed by a factor of 0.97 when the change in total loss falls below a fixed threshold over consecutive iterations. Input images are resized such that the longest side is 416 pixels, and then zero-padded to obtain a fixed size of 416 × 416 pixels. The batch size is set to 32, and the patch is optimized for a total of 500 epochs.
The adversarial patch is initialized with a size of 300 × 300 pixels. During testing, it is scaled to 0.13 times the height of the detected bounding box and centered on the target object. This scaling strategy results in patches of varying sizes depending on the scale of the target object, enabling more realistic evaluation across different object dimensions.
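The test-time placement described above can be sketched as follows: the patch is resized to 0.13 times the height of each detected box and pasted at the box center. The interpolation mode and boundary clamping are implementation assumptions.

```python
import torch
import torch.nn.functional as F

def scale_and_center_patch(image, patch, box, ratio=0.13):
    """Resize the patch to `ratio` * box height and paste it at the box center.

    image: (3, H, W), patch: (3, p, p), box: (x, y, w, h) in pixels.
    """
    _, H, W = image.shape
    x, y, w, h = box
    size = max(1, int(ratio * h))
    resized = F.interpolate(patch.unsqueeze(0), size=(size, size),
                            mode='bilinear', align_corners=False)[0]
    cx, cy = int(x + w / 2), int(y + h / 2)
    top = max(0, min(cy - size // 2, H - size))     # clamp to image bounds
    left = max(0, min(cx - size // 2, W - size))
    out = image.clone()
    out[:, top:top + size, left:left + size] = resized
    return out
```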
The comparative methods include Adversarial Patch (AdvPatch) [13], Transfer-based Self-Ensemble Attack (T-SEA) [32], Patch-Fool [19], Gradient Normalization Scaling (GNS) [37], and Pay No Attention (PNA) [34]. A detailed overview of these methodologies is provided in Table 2.
Among these, AdvPatch serves as the baseline method in our study. T-SEA enhances attack transferability through a self-ensemble strategy that aggregates predictions across multiple augmented views of the input. Patch-Fool is specifically designed for transformer-based models and aims to disrupt attention scores between the adversarial patch and other image regions. GNS improves the stability of the optimization process by scaling the backpropagated gradients, thereby enhancing the generalization of adversarial examples across different models. PNA further improves transferability by suppressing gradient propagation through the attention modules during backpropagation.
All methods have been adapted to train with DETR as the surrogate model. Among them, Patch-Fool, GNS, and PNA are explicitly tailored for attacking transformer-based architectures.
3.4. Experimental Results
The experimental results are summarized in Table 3 and Table 4, which report the performance of various adversarial patch generation methods under black-box settings across two datasets: INRIA Person and COCO Person.
We compare the average precision decline (i.e., mAP@0.5 drop) achieved by different attack methods. AdvPatch perturbs the detector by increasing background scores across all detection outputs, resulting in a precision reduction of 30.11% on the INRIA Person dataset and 23.48% on the COCO Person dataset.
T-SEA enhances transferability through extensive data augmentation and gradient manipulation during backpropagation. These augmentations include variations in brightness, contrast, saturation, and hue, as well as random rotation and occlusion of the adversarial patch. Additionally, T-SEA employs the ShakeDrop technique to fine-tune the backbone network during training, promoting better gradient aggregation and self-ensembling. Compared with AdvPatch, T-SEA achieves a 7.16% improvement on the INRIA Person dataset and a 3.27% improvement on the COCO Person dataset. However, its performance remains inferior to that of transformer-specific approaches such as PNA and GNS.
PNA and GNS are both tailored for attacking transformer-based models and primarily focus on manipulating gradient signals. GNS observes that mild gradients can significantly impact transferability and thus applies channel-wise normalization to scale them accordingly. In contrast, PNA suppresses gradient propagation through attention matrices during training, effectively reducing the influence of model-specific attention patterns on the generated perturbation. Both methods achieve notable improvements in cross-model generalization. As shown in Table 3, their performances are comparable; however, neither matches the effectiveness of LQA. It is worth noting that LQA differs fundamentally from PNA. PNA suppresses gradients through attention maps during backpropagation to enhance attack transferability. In contrast, LQA constructs a differentiable proxy of self-attention to guide the perturbation toward maximizing detection failure. Thus, while both involve attention manipulation, their implementation paradigms (optimization guidance vs. attention modification) are orthogonal.
Patch-Fool, like LQA, targets the self-attention mechanisms within the encoder. However, instead of focusing on local regions, Patch-Fool disrupts global attention scores, leading to broader influence over predictions. Compared to AdvPatch, it improves transferability by 4.65% on the INRIA Person dataset and 5.37% on the COCO Person dataset. To quantitatively validate the advantage of LQA’s localization, we compare it with Patch-Fool using the Average Response Ratio (ARR), defined as the ratio of average attention gain on foreground tokens to that on background tokens.
LQA achieves an ARR of 4.8, significantly higher than Patch-Fool’s 1.3, indicating that its perturbation is more focused on relevant object regions, demonstrating that our localized strategy effectively suppresses spurious responses in background areas. This confirms that localization mitigates the collateral interference inherent in global methods, thereby enhancing both precision and transferability of the attack.
LQA shifts the focus from global attention disruption to local self-attention score manipulation, thereby enhancing its ability to perturb foreground features via adversarial patches. Furthermore, we strengthen the effect of adversarial patches on cross-attention mechanisms by leveraging joint attention maps. Consequently, compared to the AdvPatch baseline, LQA achieves improvements of 27.56% and 10.34% on the two datasets, respectively. Compared to the second-best-performing method, GNS, LQA demonstrates gains of 18.38% and 4.97%, respectively.
We optimize LQA with COCO dataset-pretrained detectors and evaluate its performance on both the INRIA Person and COCO Person test sets. Given the smaller size of the INRIA dataset compared to COCO, results on the COCO dataset serve to illustrate how different methods perform under training and test data distribution discrepancies. Our findings demonstrate that despite domain shifts, LQA achieves a 4.97% performance improvement over the second-ranked method, affirming its superior attack transferability.
3.5. Ablation Experiments
To investigate the contribution of different loss components to the transferability of LQA, we conduct ablation experiments on the INRIA Person dataset. AdvPatch is used as the baseline method. We progressively enhance AdvPatch by incorporating two key components—the encoder-focused loss $\mathcal{L}_{enc}$ and the decoder-focused loss $\mathcal{L}_{dec}$—leading to the full LQA framework. The results of these ablation studies are summarized in Table 5.
AdvPatch generates adversarial patches primarily through data augmentation and by increasing background confidence scores in the detector outputs. In comparison, $\mathcal{L}_{dec}$ introduces perturbations to the cross-attention mechanism within the decoder using joint attention matrices, leading to a moderate improvement in attack transferability over AdvPatch. The $\mathcal{L}_{enc}$ loss further enhances transferability by introducing localized disruptions to the self-attention mechanism, resulting in an improvement of over 20% compared to the AdvPatch baseline.
When both components are integrated into the proposed LQA framework, the combined loss formulation yields a substantial performance gain over using either loss individually, demonstrating the effectiveness of our unified optimization strategy.
To validate the necessity of adaptive weighting, we compare it against a single loss (AdvPatch) and equal weights in Table 6. Our method reduces mAP by 19.36 compared to equal weighting, demonstrating that dynamically balancing $\mathcal{L}_{enc}$ and $\mathcal{L}_{dec}$ is crucial for optimal attack performance.
3.6. Detection Results
This subsection presents the detection results of LQA across various transformer-based object detectors. Figure 6a displays the original detection results without any adversarial perturbation, while Figure 6b–f illustrate the detection outputs when the adversarial patch generated by LQA is applied to Anchor-DETR, Conditional-DETR, DAB-DETR, Deformable-DETR, and DINO, respectively. The figure clearly demonstrates that LQA induces missed detections across all evaluated models, with varying degrees of impact depending on the detector architecture.
3.7. Real-World Verification
In this section, we conduct real-world validation experiments using Anchor-DETR and Deformable-DETR to evaluate the practical effectiveness of LQA in sensor-driven AI environments. Video footage is captured using a Xiaomi 14 Pro smartphone—a representative consumer-grade imaging sensor widely deployed in IoT and mobile perception systems—to emulate realistic input conditions faced by vision sensors in the wild. Prior to attack evaluation, we verify the baseline detection accuracy of the models under normal operating conditions. This setup allows us to assess how adversarial patches impact detector reliability when processed through real optical and digital sensor pipelines, where factors such as dynamic range, auto-exposure, and lens distortion inherently shape the input data.
We then apply the LQA-generated adversarial patch to a tablet screen placed within the scene, continuously recording the detector outputs during the interaction. Detections are considered valid if their confidence score exceeds a threshold of 0.7, consistent with the default settings of the detectors.
The transferability results across five heterogeneous transformer-based object detectors, together with the real-world verification, demonstrate the practical viability of LQA.