RobustQuote: Using Reference Images for Adversarial Robustness

Lemarchant, Hugo; Liu, Hong; Nakashima, Yuta

doi:10.3390/app15105470

Open AccessArticle

RobustQuote: Using Reference Images for Adversarial Robustness

by

Hugo Lemarchant

^1,*

,

Hong Liu

² and

Yuta Nakashima

²

¹

Graduate School of Information Science and Technology, University of Osaka, Suita 565-0871, Japan

²

D3 Center, University of Osaka, Suita 565-0871, Japan

^*

Author to whom correspondence should be addressed.

Appl. Sci. 2025, 15(10), 5470; https://doi.org/10.3390/app15105470

Submission received: 17 April 2025 / Revised: 9 May 2025 / Accepted: 12 May 2025 / Published: 13 May 2025

(This article belongs to the Special Issue Adversarial Attacks and Cyber Security: Trends and Challenges)

Download

Browse Figures

Review Reports Versions Notes

Abstract

We propose RobustQuote, a novel defense framework designed to enhance the adversarial robustness of vision transformers. The core idea is to leverage trusted reference images drawn from a dynamically changing pool unknown to the attacker as contextual anchors to detect and correct adversarial perturbations in input images. By quoting semantic features from uncorrupted references, RobustQuote mitigates the propagation of corrupted features through the model and uses these references to compute the rectification term. RobustQuote consists of two key modules: a quotation mechanism that propagates the global semantic tokens (i.e., the [cls] tokens) from reference images, and a rectification mechanism that adjusts the image tokens of the adversarial input using contextual signals from those references. During training, the model is explicitly guided to detect and rectify adversarial inputs more aggressively than clean ones. This approach is modular and plug-and-play, making it compatible with a wide range of vision transformer architectures, such as DeiT. Experiments show that RobustQuote achieves adversarial robustness on par with TRADES-trained DeiT under strong threat models, even when the attacker is aware of the reference set. In a more realistic setting where the attacker lacks access to the references, RobustQuote outperforms the second-best defense by +12.2% adversarial accuracy against the C&W attack. Our findings highlight the underexplored potential of incorporating external, attacker-unknown context as a robust defense strategy. RobustQuote offers a promising direction for addressing evolving adversarial threats in modern vision models, directly aligning with the critical challenges in adversarial machine learning and cybersecurity.

Keywords:

vision transformer; adversarial attack; robustness; cross-attention

1. Introduction

Recent studies have observed that deep models are sensitive to adversarial examples, which have been shown to be threatening in various domains, such as computer vision [1,2], natural language processing [3], and speech recognition [4]. Research efforts aim to mitigate the impact of adversarial examples, such as adversarial training [5], input denoising [6], and attack detection [7].

Adversarial training is one of the most effective defense mechanisms, which generates adversarial examples and incorporates them into the training set [5]. This strengthens the decision boundaries [8] of the model subjected to such training. These new boundaries are harder to breach because they usually do not overfit the training distribution too closely. However, these boundaries can still be overcome with a larger perturbation or a new attack algorithm [9]. In brief, a model trained with these techniques will keep the reinforced weights and be static. A consequence of this is that an attacker finding a successful perturbation will be able to use it and replay it to consistently fool the model. We qualify defense strategies that exhibit this pattern as static.

Another way to defend against it would be to completely prevent the generation of attacks by stopping the correct flow of the gradient, for instance. This kind of defense, called gradient obfuscation [10], is often criticized and overlooked because of its intrinsic tendency to hide potential new vulnerabilities introduced in the defense method itself. For this reason, it is not desirable to prevent the generation of the attack according to the definition of gradient obfuscation [11], but rather to rely on a dynamic aspect that introduces discrepancies between the generation of the attack and the inference stages.

We propose making the adversarial perturbation less relevant to the final inference task by leveraging the generalization capabilities of the model to find citations and self-corrections from reference images. Following Figure 1, we compare the classification task to the process of testing a student’s knowledge. The input image is the question, the internal features are the student’s thinking process, and the predicted result is the student’s answer. So, adversarial examples can be considered as a deformed representation of the question, affecting the student’s thinking process, and in turn the answer. The attacker, or the evil examiner, may design a question that seems ambiguous, and even confuse the model/student in a given context, but can hardly generalize to all possible contexts. But how can a context interact with the input image? We follow our inspiration from the student’s test and argue that the quotation relies on generalized features that can be compared with a context, in a trackable way, but also in a way that varies with a different context. Specifically, we can request the student to retrieve information from reference documents to answer the question and to substantiate their reasoning using the same documents. This information retrieval and valorization, performed at inference time and generalizable on an arbitrary set of references, is the core of this work, putting the attacker in an asymmetric situation between attack generation and inference time.

To this end, our novel defense mechanism, called RobustQuote, is a newly designed network block that can be plugged into most vision transformer (ViT) [12] architectures. This block aims to use the classification token of reference images to create a new classification token free of perturbations from the input image. We achieve this by replacing the classification token of the query image with a mixture of equivalent tokens from reliable and unattacked references. We propose to rely on the generalization of class embedding to create a dynamic defense method. In practice, this means utilizing a new set of references during each inference, thereby breaking the attacker’s guarantee that the perturbation computed for a given batch of references will transfer to another batch. It also appears obvious that the technique’s weakness is in relying on dynamic references to prevent the attacker from generating relevant attacks, rather than preventing him from generating the attack altogether. Yet, even in this case of the attacker gaining complete knowledge of the inference context, we argue that this scenario is as good as the traditional transformer inference, with no context to perform correction.

Inspired by the works in cross-attention mechanism [13,14,15] and feature correction [16,17], we use the class embedding of references as the new values for the output class embedding (i.e., value vector in the terminology of attention mechanism), which constitutes the quoting module. At the same time, while we replace the classification token, we present a complementary rectification module, intervening over the rest of the image patches. Using previously selected supporting references, we generate a correction to all the patches of the input image, specifically trained to exclusively mitigate adversarial perturbations. Therefore, RobustQuote is a block composed of two modules that fulfill the above-mentioned functions and is a plug-and-play module that can be integrated into recent vision transformer architectures [12,18].

We empirically demonstrate the robustness of our method and its competitiveness against state-of-the-art (SotA) defense methods. On the CIFAR-10 [19] and ImageNette benchmarks [20], ref. [5], we achieve gains up to 9.6% under the PGD-100 attack, and up to 0.5% under stronger AutoAttack [21]. We evaluate our method against the common weaknesses of gradient obfuscation, confirming the extent of the obfuscation but also showing that we still maintain the backbone’s robust scores in the case of an adaptive attack. We record interesting effects of the placement of the block along the structure of ViT, connected to the learned representation of ViTs. We uncover a tension between early block and robust rectifications and later blocks and accurate predictions.

Codes are publicly available on GitHub: https://github.com/ErestorX/RobustQuote, accessed on 22 January 2025.

2. Related Works

The research for robust adversarial defense is dense, and several paradigms have shown significant results. Wang et al. [22] provide a comprehensive anatomy of recent defenses, such as adversarial training being a fundamental building block of many other defenses and layer regularization being another popular method.

2.1. Adversarial Training

Adversarial training (AT) [5] injects adversarial examples into the training dataset

D

to improve the robustness of the model against adversarial attacks. Given a neural network

F_{θ} (\cdot)

parameterized by

θ

and the input natural image X,

x \in R^{N \times D}

are the corresponding internal features computed within

F_{θ} (X)

, where N is the number of tokens and D is the dimension of token features. We also define y as the ground-truth label and

\hat{y} = arg max F_{θ} (X)

as the predicted category with respect to the input image. Then, AT can be formulated as a min–max robust optimization problem as follows:

min_{θ} E_{(X, y) \sim D} [max_{δ} J (F_{θ} (X + δ), y)],

(1)

where

J (\cdot, \cdot)

is the loss function, and

δ

is the adversarial perturbation that is bounded within the

l_{p}

norm constraint of

{∥δ∥}_{p} \leq ϵ

.

Inner maximization usually uses the projected gradient descent (PGD) attack method [5] to find the adversarial perturbation

δ

, and the outer minimization updates parameter

θ

to minimize

J (\cdot, \cdot)

with respect to the adversarial example

X + δ

via a default optimizer such as stochastic gradient descent (SGD) with momentum. Recently, adversarial regularization techniques have been proposed to further improve robustness, such as adversarial logit pairing (ALP) [23] and TRADES [24]. Simply speaking, these methods regularize the feature distance between natural and adversarial samples.

2.2. Layer Regularization

Taking advantage of our recent understanding of feature behavior during adversarial attacks, one may try to guide the features to be more robust via regularization. The main hypothesis is that not all features are equal during adversarial attacks, and some features learned from the dataset are often much weaker and non-robust [25]. These features, in turn, are exploited and propagate the perturbation, amplifying it until the final layers and in the logits. Various works have been made to remedy this behavior, mainly through gradient regularization and disentanglement techniques [26,27]. Regularizing the gradient aims to induce lesser sensitivity to local perturbations, rendering the attacks less effective against adversarial training methods. Conversely, [27] showed that batch normalization (BN) is detrimental to adversarial robustness since it promotes non-robust features, creating a conflict when jointly optimizing usefulness (accuracy) and robustness. Instead, they propose a disentanglement of robustness from usefulness that enables the network to learn high-quality robust features way quicker and only relies on non-robust features to achieve greater accuracy, thanks to BN.

Other works also promote architecture modification and feature pruning to find better feature separation [28,29]. Huang et al. propose multiple guidelines to build a robust residual module on top of empirical findings, such as the placement of activation functions and the importance of bottleneck blocks. Dhillon et al. prune the small activation from the model, in line with the idea from [27], to promote robustness instead of enforcing maximum usefulness. In parallel, [7] attempts to detect perturbations at inference time, statistically finding structures most impacted by adversarial attacks on transformer-based models.

At the same time, other works propose to interpret perturbations as noise and employ feature denoising to restore safer information propagation [30]. Finally, Kim et al. [17] directly tackle the problem of non-robust features and attempt to recalibrate them towards maximum usefulness, reducing the perturbation and recovering information.

2.3. Gradient Obfuscation

Gradient obfuscation [10] is the generic term describing the techniques relying on the randomization or falsification of the gradients used to calculate the adversarial perturbation. Despite the known limitations of gradient masking [11] as an incomplete defense against adversarial attacks, recent defenses [31] still rely on this effect. A defense mechanism induces stochastic gradients when it involves randomness, either by randomizing the network itself or by applying random transformations to the input before the classification. This randomness makes the gradients unpredictable, causing methods that rely on a single sample to incorrectly estimate the actual gradient. In addition, shattered gradients occur when a defense is non-differentiable, introduces numerical instability, or otherwise results in nonexistent or incorrect gradients.

3. Method

RobustQuote is designed for vision transformer models such as ViT [12] and DeiT [18]. See Figure 2. RobustQuote is a modular block that can be inserted into arbitrary positions between a consecutive pair of transformer blocks. A RobustQuote block limits the snowball effect of adversarial attack [32] by (i) reconstructing the input image’s [cls] token from those of the reference images so as to cut off the direct path from the input image to the prediction, and (ii) contextualizing input image’s patch features with other patch features so as to dilute the impact of the perturbations. In detail, the proposed RobustQuote consists of a quotation module (responsible for (i)) and a rectification module (responsible for (ii)), each of which is shown in Figure 3.

Let

I^{X}

be a natural image that is potentially attacked and classified into the category

ω \in Ω

. RobustQuote aims to utilize a trusted set

R = {I_{ω}^{R} ∣ ω \in Ω}

of reference images

I_{ω}^{R}

, sampled from the training set (detailed in Section 3.5). Then, we denote a learned transformer model as

F_{θ} (\cdot)

parameterized by

θ

, which uses a patch representation of

I^{X}

and

I_{ω}^{R}

, as well as a vector corresponding to the [cls] token. Let

X \in R^{N \times D}

denote the set of feature vectors for every image patch in

I^{X}

, and

x_{cls} \in R^{D}

the feature vector corresponding to the [cls] token (while

x_{cls}

and X are processed together in the ViT backbone, we represent them here separately). All reference images

I_{ω}^{R}

also go through the same backbone, and we denote their feature vectors as

R_{ω} \in R^{N \times D}

and

r_{ω, cls} \in R^{D}

for them (we assume the same number of patches in input and reference images for simplicity).

3.1. Quotation Module

We design the quotation module to substitute

x_{cls}

, which can be the primary carrier of adversarial perturbations, with unaltered (i.e., trustworthy)

r_{ω, cls}

’s. Let

R_{cls} \in R^{| Ω | \times D}

denote the set of all

r_{ω, cls}

’s. The quotation module calculates the scaled dot-product attention

A_{cls} \in R^{1 \times | Ω |}

as follows:

A_{cls} = Att (Q_{q} (x_{cls}), K_{q} (R_{cls})),

(2)

where

Q_{q} \in R^{1 \times D}

and

K_{q} \in R^{| Ω | \times D}

are linear projections, and

\begin{matrix} Att (Q, K) = Softmax (\frac{Q K^{⊤}}{\sqrt{D}}), \end{matrix}

(3)

with Q and K being arbitrary matrices whose number of rows is D. The quotation module then reconstructs

x_{cls}

with

r_{ω, cls}

’s weighted by

A_{cls}

, according to the following definition of cross-attention:

\begin{matrix} {\tilde{x}}_{cls} = P_{q} (A_{cls} V_{q} (r_{cls})), \end{matrix}

(4)

where

V_{q}

and

P_{q}

are linear projections. We replace

x_{cls}

with

{\tilde{x}}_{cls}

to sever the propagation of adversarial perturbations through

x_{cls}

.

The quotation module is trained so that

{\tilde{x}}_{cls}

is still useful for classification. Our expectation is that it learns to reconstruct the features corresponding to the cls token from semantically similar reference images.

3.2. Rectification Module

The rectification module (right module in Figure 3) aims to correct the adversarial features in

X_{p}

, referring to the unattacked information in

R_{ω}

’s as follows:

\begin{matrix} \tilde{X} = X + G (X, {R_{ω}}), \end{matrix}

(5)

where G calculates vectors to mitigate the effect of adversarial perturbation. Here, we propose to use the self-attention mechanism [33] and two derived strategies involving

R_{ω}

to a greater and greater degree to generate the term G.

Self-attention block: We use the traditional self-attention module without intervention of

R_{ω}

as a baseline, formulated as follows:

\begin{matrix} G_{sa} (X, {R_{ω}}) = P_{sa} (Att (Q_{sa} (X), K_{sa} (X)) V_{sa} (X)), \end{matrix}

(6)

where

Q_{sa} (\cdot)

,

K_{sa} (\cdot)

,

V_{sa} (\cdot)

, and

P_{sa} (\cdot)

are all linear projections. These projections allow us to find robust features in the original image’s patch and combine them to create an efficient rectification, guided by a secondary loss described in Section 3.4.

Reference-aware self-attention:

R_{ω}

patches are used as a comparison ground to decide the natural/corrupted state of X. This approach selects patch features in X that are similar to those in

R_{ω}

to correct patch features in X that most diverge from

R_{ω}

. Specifically, we use a slightly modified version of the scaled dot-product attention to find the attention weight for the correcting patches from X:

\begin{matrix} A_{ω}^{rr} = max (0, Q_{rr} (X) K_{rr} {(R_{ω})}^{⊤}), \end{matrix}

(7)

where

Q_{rr} (\cdot)

and

K_{rr} (\cdot)

are linear projections. We use the element-wise max function instead of softmax as we wish to use only positively relevant patch features in X, expecting that the similarity to a natural feature works as a proxy to the relevance for robust correction. Likewise, the attention for the correction’s targets in X is provided by the following:

\begin{matrix} A_{ω}^{ri} = max (0, - Q_{ri} (R_{ω}) K_{ri} {(X)}^{⊤}), \end{matrix}

(8)

where

Q_{ri} (\cdot)

and

K_{ri} (\cdot)

are linear projections. This time, we negate the similarity as we wish to target only patch features in the original X that are most different from natural features (this negation is used to emphasize our expectation and is unnecessary as it can be learned in

Q_{ri}

and

K_{ri}

).

The final attention map is the weighted average of these attention maps, provided by the following:

\begin{matrix} A_{rsa} = \sum_{ω \in Ω} a_{cls, ω} A_{ω}^{ri} A_{ω}^{rr}, \end{matrix}

(9)

where

a_{cls, ω}

is the attention weight in

A_{cls}

in Equation (2), corresponding to the category

ω

, which provides the semantic relevance between the input and the reference for

ω

. Eventually, we compute G with this attention map as follows:

G_{rsa} (X, {R_{ω}}) = P_{rsa} (A_{rsa} V_{rsa} (X)),

(10)

where

P_{rsa} (\cdot)

and

V_{rsa} (\cdot)

are linear projections.

Reconstruction: Similarly to the quotation module, we can reconstruct X directly from

R_{ω}

’s. The attention weight can be provided by the following:

\begin{matrix} A_{r} = \sum_{ω \in Ω} a_{cls, ω} Att (Q_{r} (R_{ω}), K_{r} (X)), \end{matrix}

(11)

and G is calculated as follows:

\begin{matrix} G_{r} (X, {R_{ω}}) = P_{r} (A_{r} V_{r} (R_{ω})), \end{matrix}

(12)

where

Q_{r} (\cdot)

,

K_{r} (\cdot)

,

V_{r} (\cdot)

, and

P_{r} (\cdot)

are linear projections, respectively.

3.3. RobustQuote Block

We feed the outputs of the quotation and rectification modules into a residual architecture as follows:

x_{cls} \leftarrow {\tilde{x}}_{cls} + mlp ({\tilde{x}}_{cls}), X \leftarrow \tilde{X} + mlp (\tilde{X}),

(13)

with

mlp

as a multi-layer perceptron following the same dimensions as the ones in ViT blocks, shared for two types of tokens. After these operations, the representations of

x_{cls}

and X are recombined and forwarded to the next transformer block of the network.

3.4. Training Loss

An output of the model with RobustQuote blocks (after softmax) can be denoted by

F_{θ} (I^{X}, {I_{ω}^{R}}) \in {[0, 1]}^{{| Ω |}}

as it takes both the input image and the reference images. Let

(I^{X}, y)

denote the input image and the corresponding ground-truth category. We train the model with a RobustQuote block using TRADES-based adversarial training [24], which minimizes the KL divergence between the natural and adversarial logits:

L_{cls} (I^{X}, I^{X} + δ, I_{ω}^{R}) = L_{CE} (F_{θ} (I^{X}, I_{ω}^{R}), y) + β L_{KL} (F_{θ} (I^{X} + δ, I_{ω}^{R})), F_{θ} (I^{X}, I_{ω}^{R})),

(14)

where

L_{CE}

is the cross-entropy loss,

L_{KL}

the KL divergence loss, and

β

is the trade-off term used to balance classification loss and adversarial robustness. In this equation,

δ

is the adversarial perturbation crafted by the PGD attack [5] (as in Section 2.1).

In addition, we introduce a loss term for our rectification given by G (i.e., one of

G_{sa}

,

G_{rsa}

, and

G_{r}

). When the input image

I^{X}

is attacked, G should produce patch features that cancel it; otherwise, G should be closer to 0, as there is nothing to cancel. Formally, let

{\hat{x}}_{p}

and

{\hat{x}}_{p}^{'}

be patch features for patch p in the set of all patches

P

in natural and attacked sets X and

X^{'}

of patch features after applying G (that is,

{\hat{x}}_{p} \in G (X, {R_{ω}})

and

{\hat{x}}_{p}^{'} \in G (X^{'}, {R_{ω}})

) for the same image

I^{X}

. We penalize the model when the length of

{\hat{x}}_{p}

is long compared to that of

{\hat{x}}_{p}^{'}

; that is, the loss term

L_{rect}

is provided by the following:

L_{rect} (I^{X}, I^{X} + δ, I_{ω}^{R}) = \frac{1}{| P |} \sum_{p \in P} max (0, \frac{∥ {\hat{x}}_{p} ∥_{2} - τ {∥ {\hat{x}}_{p}^{'} ∥}_{2}}{∥ {\hat{x}}_{p} ∥_{2}}),

(15)

where

τ \in [0, 1]

specifies the upper bound of the length of

{\hat{x}}_{p}

relative to

{\hat{x}}_{p}^{'}

.

Overall training loss is defined as follows:

L = (1 - λ) L_{cls} + λ \sum_{l \in L} L_{rect}^{l},

(16)

where

λ

is a hyperparameter to balance the two terms, l identifies the index of the layer where the RobustQuote block is installed, and L is the set of indices of all layers at which a RobustQuote block is inserted.

3.5. Reference Set Selection

The reference set

R

is a pivotal aspect of RobustQuote, and how this set is collected is crucial. The images in

R

should be natural images as they are used to provide supporting features from the categories and to define the unattacked distribution of patch features. Thus, we isolate a subset from the training dataset

R

. In practice,

R

is divided into

ω

subsets

R_{ω}

for each class according to the labels of each image. At each new inference (training, test, or evaluation), we draw a new

I_{ω}^{R}

with a random sample from each class in

R_{ω}

. Doing so ensures that the training is not catered to a given set of references but instead is guided by its interpretation of each class. This methodology of reference set construction is independent of the dataset properties, as no training is performed based on the features extracted from these images. The only property that this set must satisfy is the representation of each class in a sufficiently large quantity to ensure robustness to expectation-based attacks (see Section 3.6). In practice, the same references

I_{ω}^{R}

will be shared with all

I^{X}

in a batch to accelerate computation.

3.6. Source of the Robustness

RobustQuote tightly entangles the input image’s patch features with reference images in the inference process. The [cls] token is completely replaced with one created from the reference images in the RobustQuote block, and all patch features are also adjusted by the reference images (for

G_{rsa}

and

G_{r}

) so that the original adversarial perturbation is mitigated. Thus, the prediction is highly dependent on the choice of the reference images. As a consequence, a different set of reference images will have drastically different results in the results of RobustQuote.

This consequence secures the robustness of RobustQuote. Suppose that we deploy a model with RobustQuote blocks in a real-world application. In this scenario, the attacker can only upload the input image while the model obtains a randomized set

I_{ω}^{R}

from

R

. Even in the white-box setting, where the attacker obtains the architecture along with the model’s parameters, the selection of the new reference set

I_{ω}^{R}

is uncontrollable for the attacker, as the selection is carried out server-side after the attacked image is submitted. Images used in the reference set

R

can be publicly available, but one could choose to further refine the generation of this set or even use new images. We set our minimal robustness requirement so that the selection of

I_{ω}^{R}

from

R

is unpredictable to the attacker.

Even if the attackers have access to all possible reference images, they then require finding the ones that will actually be used in inference, which is improbable. For example, in the case of CIFAR10, there are approximately 5000 images of each of the 10 classes. By picking one image per class, you can generate

5000^{10} = 10^{36}

combinations of the references, which is beyond reasonable for a blind search, such as brute force.

We argue that this robustness comes at the cost of accuracy in the prediction or difficulty in training. Equation (2) implies that the features corresponding to the [cls] token are reconstructed from relevant reference images, which should be the image of the correct category. This means that RobustQuote requires a significant level of understanding of images, even to make correct classifications at the position where it is installed in the transformer model. Therefore, the position of the RobustQuote block may be limited to the latter part of the original network.

4. Experiments

As stated, the purpose of RobustQuote is to improve the robustness to attacks based on asymmetric (outdated) information between the attacker and the defender. To demonstrate our claims, we will evaluate our novel method against other state-of-the-art methods striving to normalize adversarial inputs, as well as submit our method to an attacker with symmetrical knowledge, demonstrating that our method’s major lever is the creation of that asymmetrical scenario. We also wish to study our connection with gradient obfuscation techniques, underlining comparable measurements from different causes. Finally, we will demonstrate the impact of the parameter values and design variations presented in Section 3.

4.1. Experimental Setups

We conducted experiments on the CIFAR10 [19] and ImageNette [20] datasets. We utilized DeiT3-T as our backbone model, as it is one of the best-performing ViT pre-trained models. RobustQuote was compared with several state-of-the-art techniques, such as ARD and PRM [34]. SACNet [35] is a two-layer regularization method, like DH-AT [36]. We also included feature regularization, as presented by Kim et al. [17], with the FSR model. Our evaluation was performed over PGD-20 [5], PGD-100, C&W-20 [37], and AutoAttack (AA) [21] with a perturbation budget of

ϵ = 8 / 255

(“PGD-k” refers to the projected gradient descent-based attack with k-step iterations, while “C&W-k” is the PGD attack with k-step iterations using C&W loss [37]).

We used pre-trained models from the open source Hugging-Face and Pytorch Image models project (the implementations of baseline models together with weights are from https://huggingface.co/models?pipeline_tag=image-classification&search=timm/deit3 accessed and up to date as of 22 January 2025). For each method, we update the ViT architecture with the most up-to-date DeiT3-T weights pre-trained with the ImageNet dataset [38]. All of our models and the SotA methods start with the same pre-trained parameters and are finetuned with the loss functions found in their respective publications. The training parameters, such as optimizer, learning rate, and epochs, however, are shared. We then proceeded to fine-tune the model for 40 epochs on the TRADES training pipeline. We used the SGD optimizer with a momentum of

0.9

and a weight decay of

1 \times 10^{- 4}

. The initial learning rate was set to

1 \times 10^{- 1}

, reduced by a factor of 10 at the fifth and second-to-last epochs. We empirically set

λ = 0.1

and

τ = 0.5

for our RobustQuote method and

β = 2

for the TRADES parameter. In our experiments, we used a single RobustQuote block, placed after the 7-th block of ViT (among 12 blocks), with the RSA rectification module. We used different sets of reference images, which were randomly sampled from the training dataset, in attack generation and inference, which is most likely to happen as it is rather easy to find a publicly shared dataset, but it is much harder to predict the content of a randomly sampled batch during inference for the previous attack generation as we discussed in Section 3.6.

4.2. Robustness Evaluation

We first evaluated the effectiveness of RobustQuote in improving robustness against adversarial attacks on DeiT3 backbones. The accuracy scores are summarized in Table 1. RobustQuote consistently improved accuracy under adversarial perturbation of DeiT3 backbones. We also observed a good natural accuracy, unlike defense methods such as FSR or SACNet, which degraded it by up to 4.5%. RobustQuote’s performance remained within 2.5% of the best-performing ARD+PRM at 78.54% on CIFAR10 and 82.73% on ImageNette. More interestingly, we observed a drastic increase in adversarial robustness of +7.7% on average on CIFAR10 and +4.3% on ImageNette compared to the best-performing ARD+PRM.

We also evaluated the evolution of adversarial robustness as the attack budget increases. Although the attack budget increased from PGD-20 to PGD-100 to C&W, which are increasingly more efficient at reducing accuracy on the other models, we observed almost identical accuracy (within 0.8%) with RobustQuote. Only AutoAttack produced a substantial accuracy drop of 12.9% on CIFAR10 and 7.5% on ImageNette. This is because the AutoAttack implementation requires multiple attacks on each sample, triggering, by design, a new set of reference images. This leads AutoAttack to look for the best attack on the weakest set of references. We explain the behavior of the previous PGDs and C&W attacks by arguing that the better an attack is at optimizing a perturbation for a given input, the worse this perturbation will perform in a new setting with the same input but with different references. In this paradigm, attacks that come closer to the definition of a universal perturbation will perform better than specific attacks.

Evaluation for Black-Box and Adaptive Attacks

We further investigated various attack scenarios to understand the robustness conditions of our RobustQuote model. First, we report in Figure 4 the scenario in which the adversary possesses perfect knowledge of the inference time. In this situation, the benefits of using reference images are completely lost, and the attacker knows in advance which reference images will be used during the inference. We observed a notable decrease in accuracy (ranging from 3.8% to 12.8%) under PGD and C&W, where RobustQuote previously exhibited a significant advantage over existing methods. In this context, RobustQuote remained within 0.8% of the original DeiT3, proving that additional evasions are not introduced with references.

In view of the aforementioned behaviors, one possible explanation of RobustQuote’s robustness is that it performs some kind of gradient obfuscation [10]. According to the definition in [11], we argue that RobustQuote does not prevent the attacker from obtaining useful gradients, as the attack can often be successful when the reference images are the same in attack generation and inference, according to Figure 4. Nevertheless, we evaluated RobustQuote under the various gradient obfuscation benchmarks to understand its limits.

Firstly, as shown in Section 3, we employed standard known operations in accordance with [10], suggesting that all operations should be differentiable and kept in a usable range. The first test consists of gradually increasing the perturbation budget

ϵ

until the performance approaches zero. We report the results in Table 2. When

ϵ \leq 16 / 255

, RobustQuote maintained the performance, while we observed an accuracy drop to 2.84% when

ϵ = 128 / 255

. This reveals that RobusQuote does provide useful gradients for generating stronger perturbations, taking advantage of a larger budget

ϵ

. This result supports our discussion above that RobustQuote does not obfuscate the gradients.

We performed experiments with a decision-based black-box attack, i.e., square attack [39]. The number of iterations was set to 1000, with a perturbation budget of

ϵ = 8 / 255

. We observe in Table 3 that the square attack fails to reduce adversarial performance compared to the white-box attack (i.e., PGD-20 attack). According to the claim in [10], we again demonstrate that RobustQuote does not obfuscate the gradients.

In Table 4, we examine the behavior of RobustQuote under a PGD-based transfer attack. We employed both DeiT3-T without adversarial training and DeiT3-T with FSR [17] as proxy models. We utilized the PGD-20 attack on proxy models with a perturbation budget of

ϵ = 8 / 255

, and then directly applied the crafted adversarial examples to the model with a RobustQuote block. The results in the last row of Figure 4 show similar behavior to the black-box attack, showing again that gradient-based attacks still outperform transfer-based attacks.

Finally, we evaluated the performance of the targeted adversarial attack, namely APGD-t [21]. This attack was remarkably more efficient than the default PGD attack, confirming our suspicion that guiding the attack toward a single class can effectively focus on a single reference. In Table 2, we observe that APGD-t can achieve better results than the PGD-20 attack, but RobustQuote can still defend against this adaptive adversarial attack.

In sum, our results show that RobustQuote exhibits one of the symptoms (PGD-20 attacks perform better than PGD-100) attributed to gradient obfuscation, and we do not strictly adhere to the definition. Based on these results, we believe that RobustQuote offers a brand-new defense strategy that is different from the existing ones.

4.3. Ablation and Parameter Sensitivity Studies

4.3.1. Variants in the Rectification Module

With this first study, we demonstrate that our proposition of rectifying the image features X is beneficial. Table 5 shows the scores when the rectification module (and associated loss

L_{rect}

) is removed (None), when a random matrix is used instead of

A t t

in Equation (6) (Random), when

G_{sa}

is used (Self), when

G_{rsa}

is used (RSA), and when

G_{R}

is used (Rec). We can see that None performs significantly worse than our full solution by 0.1%. Random also does not overcome RobustQuote, and still remains 0.05% behind, demonstrating that the robustness is not only because of mixing up the patch features, but also because guiding these permutations boosts the performance.

Self, RSA, and Rec are variants of RobustQuote. Self and Rec are both outperformed by RSA by 0.3% and 0.05%, respectively. This implies that external information is more reliable for finding corrections than the corrupted input, but finding the corrections from within that input is ultimately more stable.

4.3.2. $τ$ to Determine the Effect of $L_{rect}$

In the following section, we explore the impact of RobustQuote’s hyperparameters,

τ

and

λ

, as well as where to place a RobustQuote block and how many to use.

Starting from

τ

, we report the scores for different values in Table 6 and visualize them in Figure 5a. We see a clear best value at

τ = 0.5

, outperforming all other values in adversarial attacks by up to 0.7% in C&W, while lagging by only 0.1 in the worst case in AutoAttack. Overall,

τ = 0.5

provides a 0.05% average benefit over the second-best choice of

τ = 1.0

. It seems that values lower than 0.5 (imposing a greater difference of correction) have diminishing returns and it becomes too hard to find large adversarial corrections

∥ {\hat{x}}_{p}^{'} ∥_{2}

relative to

∥ {\hat{x}}_{p} ∥_{2}

. On the other hand, a larger

τ

, which hardly imposes a constraint on lengths, yields a lower performance than

τ = 0.5

.

4.3.3. $λ$ to Balance Loss Terms

We looked at the balance of the loss term

λ

. We summarize the results in Table 7 and Figure 5c. We again observed a sweet spot around

λ = 0.1

with diminishing returns as the importance of

L_{rect}

. Additionally, we observed that deactivating

L_{rect}

completely was also not beneficial and caused a loss of 0.1% accuracy on average. One could argue that this is the result of

τ ≪ 1

, and that this important constraint generates large

L_{rect}

loss terms compared to

L_{cls}

. We note that

τ

and

λ

must be optimized jointly in order to find the best combination for the task considered. In our experience,

τ

is the most impactful parameter and can be quickly approximated with any

λ

, so fixing

λ

first may be the most efficient strategy.

4.3.4. Positions and Number of RobustQuote Blocks

We studied the impact of the positions and the number of RobustQuote blocks in the ViT backbone, where we tried up to two blocks. Table 8 and Figure 5b summarize the scores for various combinations of one or two RobustQuote blocks in a ViT backbone. Inserting a single RobustQuote module in the middle of ViT significantly enhanced performance, with the best average of 56.64% accuracy under the considered attacks. We observed a small loss of 0.5% or 1% of robustness when using RobustQuoute in the later stages of the network and a large loss when it was placed in the earliest blocks. We hypothesize that there is a sweet spot that best leverages the tension between relying on cls tokens in the quotation and taking advantage of the rectifications made on the image tokens in the following blocks.

This echoes recent observations in [40], where the cls tokens form well after the 7-th block and are mainly propagated through the skip connection branch before this block. Based on our empirical observation, inserting a RobustQuote block around the 7-th block is a good choice. As an illustration, when using RobustQuote on the 4-th block, we dropped to 48.32% average robust accuracy due to the lack of relevant information in the cls token at this stage. We also examined the performance of multiplying the instances of RobustQuote blocks in the backbone, and we tested two variations. We observed that combining two instances resulted in lower robustness than the worst of the two instances used individually, implying that the two blocks may be competing with each other.

Overall, using a single block was preferable, and we added our RobustQuote after the 7-th transformer block.

4.4. Discussion

The experiments in Section Evaluation for Black-Box and Adaptive Attacks showed that the gains of RobustQuote are conditioned on the asymmetric knowledge between the attacker and the defender, regarding the references. As stated in Section 4.1, the standard robust scenario occurs when the attacker knows everything about the model, the weights

F_{θ}

, and the reference dataset

R_{ω}

, but ignores the exact images

I_{ω}^{R}

. We stated in Section 3.6 the requirements on

R_{ω}

to render improbable the random selection of

I_{ω}^{R}

prior to the inference time. We presented the hypothesis of a model deployed as a backend process, alongside which the defender has their own dataset

R_{ω}

to generate

I_{ω}^{R}

in the backend too. When the user or the attacker uploads their input image

I^{X}

, they will have no control or information on the references used during their inference. Only in the case where the user gains control over the references or when they can predict

I_{ω}^{R}

used at inference, can they generate the attacks with knowledge presented in Figure 4. In this case, our method is indeed neutralized and is then equivalent to the static adversarial training methods such as ARD-PRM. We believe that it is a reasonable defense strategy, as obtaining this level of control or information goes beyond the security of AI and now depends on the robustness of the overall design of the AI system deployed.

Finally, we address the computational costs of our method. Our module, added to the original backbone of DeiT3-T, represents an increase of 11% of the model’s weights. The set of references

I_{ω}^{R}

across the transformer block has a negligible impact in the training phase, as no gradient is tracked in the ViT blocks. In the inference stage, they occupy

\frac{ω}{B} \times 100

% additional memory alongside what

I^{X}

uses. In the RobustQuote block,

A_{c l s}

has a complexity of

O (ω)

and

A_{r s a}

has a complexity of

O (ω N^{2})

. Compared to a standard ViT block with complexity of

O (N^{2})

, this is an increase of

ω \times 100

%. As such, in the context of CIFAR10 and ImageNette with 10 classes, this represents the memory footprint of 10 standard ViT blocks in parallel. On a system with Nvidia V100-16GB GPU, using a batch size of

B = 64

with DeiT3-T and CIFAR10 images, the standard DeiT3-T computes an inference in 66 ms using 29.3 MB of VRAM, while our RobustQuote requires 170 ms and uses 64.7 MB. This highlights the limitation of datasets with a small number of classes

ω

. A possible improvement would be to find a class agnostic quotation and rectification system, with a fixed number of references, providing a natural anchor instead of a class-specific anchor. We consider such improvement as future work.

5. Conclusions

In this paper, we presented RobustQuote, which leverages reference images to reduce the vulnerability of a model against adversarial attacks. We empirically demonstrated the ability of RobustQuote to reduce the vulnerability of the model over two datasets and several adversarial attacks. The results showed that RobustQuote was particularly efficient against iterative attacks like PGD-100 and C&W-20. Despite relying on the attacker not knowing the references used in inference, we demonstrated that in the opposite case, we maintained accuracy comparable to the other state-of-the-art methods. We also walked through our methodology on how to best use information from reference images to correct adversarial perturbations. In future work, we will consider how to generalize our method to a larger number of categories. Currently, RobustQuote is inefficient for a larger number of categories because it needs to forward-reference images for all categories. We will explore category-agnostic reference images in which only the perturbation-free property will be leveraged to generate relevant normalization.

Author Contributions

Conceptualization, H.L. (Hugo Lemarchant); Methodology, H.L. (Hugo Lemarchant); Software, H.L. (Hugo Lemarchant); Validation, H.L. (Hugo Lemarchant); Formal analysis, H.L. (Hugo Lemarchant); Investigation, H.L. (Hugo Lemarchant); Writing—original draft, H.L. (Hugo Lemarchant); Writing—review & editing, H.L. (Hugo Lemarchant); Supervision, H.L. (Hong Liu) and Y.N.; Project administration, Y.N. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by JST CREST grant number JPMJCR20D3.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Publicly available datasets were analyzed in this study. This data can be found here: CIFAR10 https://www.cs.toronto.edu/~kriz/cifar.html, ImageNettehttps://www.image-net.org/challenges/LSVRC/index.php, all accessed on 22 January 2025.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

DeiT	Data-efficient image Transformers
TRADES	Trade-off-inspired Adversarial Defense via Surrogate-loss minimization
C&W	Carlini and Wagner (an adaptive attack)
ViT	Vision Transformer
CIFAR	Canadian Institute for Advanced Research
PGD	Projected Gradient Descent
AT	Adversarial Training
SGD	Stochastic Gradient Descent
ALP	Adversarial Logit Pairing
BN	Batch Normalization
ARD - PRM	Three-letter acronym
SACNet	Linear dichroism
DH-AT	Dual-Head Adversarial Training
FSR	Feature Separation and Recalibration
AA	AutoAttack
APGD-t	Auto PGD targeted

References

Eykholt, K.; Evtimov, I.; Fernandes, E.; Li, B.; Rahmati, A.; Xiao, C.; Prakash, A.; Kohno, T.; Song, D. Robust physical-world attacks on deep learning visual classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 1625–1634. [Google Scholar]
Komkov, S.; Petiushko, A. Advhat: Real-world adversarial attack on arcface face id system. In Proceedings of the IEEE International Conference of Pattern Recognition Systems, Chengdu, China, 16–18 July 2021; pp. 819–826. [Google Scholar]
Shin, T.; Razeghi, Y.; Logan IV, R.L.; Wallace, E.; Singh, S. AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, Online, 16–20 November 2020; pp. 4222–4235. [Google Scholar]
Qin, Y.; Carlini, N.; Cottrell, G.; Goodfellow, I.; Raffel, C. Imperceptible, robust, and targeted adversarial examples for automatic speech recognition. In Proceedings of the International Conference Learning Representations, PMLR, Long Beach, CA, USA, 9–15 June 2019; pp. 5231–5240. [Google Scholar]
Madry, A.; Makelov, A.; Schmidt, L.; Tsipras, D.; Vladu, A. Towards Deep Learning Models Resistant to Adversarial Attacks. In Proceedings of the International Conference Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018. [Google Scholar]
Liao, F.; Liang, M.; Dong, Y.; Pang, T.; Hu, X.; Zhu, J. Defense against adversarial attacks using high-level representation guided denoiser. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 1778–1787. [Google Scholar]
Lemarchant, H.; Li, L.; Qian, Y.; Nakashima, Y.; Nagahara, H. Inference Time Evidences of Adversarial Attacks for Forensic on Transformers. arXiv 2023, arXiv:2301.13356. [Google Scholar]
Xu, Y.; Sun, Y.; Goldblum, M.; Goldstein, T.; Huang, F. Exploring and Exploiting Decision Boundary Dynamics for Adversarial Robustness. In Proceedings of the International Conference Learning Representations, Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
Liu, H.; Zhong, Z.; Sebe, N.; Satoh, S. Mitigating robust overfitting via self-residual-calibration regularization. Artif. Intell. 2023, 317, 103877. [Google Scholar] [CrossRef]
Athalye, A.; Carlini, N.; Wagner, D. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. In Proceedings of the International Conference Learning Representations, PMLR, Vancouver, BC, Canada, 30 April–3 May 2018; pp. 274–283. [Google Scholar]
Papernot, N.; McDaniel, P.; Goodfellow, I.; Jha, S.; Celik, Z.B.; Swami, A. Practical black-box attacks against machine learning. In Proceedings of the ACM on Asia Conference on Computer and Communications Security, Abu Dhabi, United Arab Emirates, 2–6 April 2017; pp. 506–519. [Google Scholar]
Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16 × 16 Words: Transformers for Image Recognition at Scale. In Proceedings of the International Conference Learning Representations, Addis Ababa, Ethiopia, 30 April 2020. [Google Scholar]
Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference Learning Representations, PMLR, Virtual, 18–24 July 2021; pp. 8748–8763. [Google Scholar]
Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 10684–10695. [Google Scholar]
Li, P.; Gu, J.; Kuen, J.; Morariu, V.I.; Zhao, H.; Jain, R.; Manjunatha, V.; Liu, H. Selfdoc: Self-supervised document representation learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 5652–5660. [Google Scholar]
Lewis, P.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; Küttler, H.; Lewis, M.; Yih, W.t.; Rocktäschel, T.; et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. Adv. Neural Inform. Process. Syst. 2020, 33, 9459–9474. [Google Scholar]
Kim, W.J.; Cho, Y.; Jung, J.; Yoon, S.E. Feature Separation and Recalibration for Adversarial Robustness. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 8183–8192. [Google Scholar]
Touvron, H.; Cord, M.; Douze, M.; Massa, F.; Sablayrolles, A.; Jégou, H. Training data-efficient image transformers & distillation through attention. In Proceedings of the International Conference Learning Representations, PMLR, Virtual, 18–24 July 2021; pp. 10347–10357. [Google Scholar]
Krizhevsky, A.; Hinton, G. Learning Multiple Layers of Features from Tiny Images. Master’s Thesis, Department of Computer Science, University of Toronto, Toronto, ON, Canada, 2009. [Google Scholar]
Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255. [Google Scholar]
Croce, F.; Hein, M. Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In Proceedings of the the 37th International Conference on Machine Learning, Online, 13–18 July 2020; pp. 2206–2216. [Google Scholar]
Wang, Y.; Sun, T.; Li, S.; Yuan, X.; Ni, W.; Hossain, E.; Poor, H.V. Adversarial Attacks and Defenses in Machine Learning-Empowered Communication Systems and Networks: A Contemporary Survey. Commun. Surv. Tutor. 2023, 25, 2245–2298. [Google Scholar] [CrossRef]
Kannan, H.; Kurakin, A.; Goodfellow, I. Adversarial logit pairing. arXiv 2018, arXiv:1803.06373. [Google Scholar]
Zhang, H.; Yu, Y.; Jiao, J.; Xing, E.; El Ghaoui, L.; Jordan, M. Theoretically principled trade-off between robustness and accuracy. In Proceedings of the International Conference Learning Representations, PMLR, Long Beach, CA, USA, 9–15 June 2019; pp. 7472–7482. [Google Scholar]
Ilyas, A.; Santurkar, S.; Tsipras, D.; Engstrom, L.; Tran, B.; Madry, A. Adversarial examples are not bugs, they are features. Adv. Neural Inform. Process. Syst. 2019, 32. [Google Scholar]
Ross, A.; Doshi-Velez, F. Improving the adversarial robustness and interpretability of deep neural networks by regularizing their input gradients. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; Volume 32. [Google Scholar]
Benz, P.; Zhang, C.; Kweon, I.S. Batch normalization increases adversarial vulnerability and decreases adversarial transferability: A non-robust feature perspective. In Proceedings of the International Conference Learning Representations, Vienna, Austria, 4 May 2021; pp. 7818–7827. [Google Scholar]
Huang, S.; Lu, Z.; Deb, K.; Boddeti, V.N. Revisiting Residual Networks for Adversarial Robustness. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17 June 2023; pp. 8202–8211. [Google Scholar]
Dhillon, G.S.; Azizzadenesheli, K.; Bernstein, J.D.; Kossaifi, J.; Khanna, A.; Lipton, Z.C.; Anandkumar, A. Stochastic activation pruning for robust adversarial defense. In Proceedings of the International Conference Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018. [Google Scholar]
Bai, Y.; Zeng, Y.; Jiang, Y.; Xia, S.T.; Ma, X.; Wang, Y. Improving Adversarial Robustness via Channel-wise Activation Suppressing. In Proceedings of the International Conference Learning Representations, Vienna, Austria, 4 May 2021. [Google Scholar]
Hung-Quang, N.; Lao, Y.; Pham, T.; Wong, K.S.; Doan, K.D. Understanding the Robustness of Randomized Feature Defense Against Query-Based Adversarial Attacks. In Proceedings of the International Conference Learning Representations, Vienna, Austria, 7–11 May 2024. [Google Scholar]
Yoon, J.; Kim, K.; Jang, J. Propagated perturbation of adversarial attack for well-known CNNs: Empirical study and its explanation. In Proceedings of the ICCV Workshop, Seoul, Republic of Korea, 27–28 October 2019; pp. 4226–4234. [Google Scholar]
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inform. Process. Syst. 2017, 30. [Google Scholar]
Mo, Y.; Wu, D.; Wang, Y.; Guo, Y.; Wang, Y. When adversarial training meets vision transformers: Recipes from training to architecture. Adv. Neural Inform. Process. Syst. 2022, 35, 18599–18611. [Google Scholar]
Xu, Y.; Du, B.; Zhang, L. Self-attention context network: Addressing the threat of adversarial attacks for hyperspectral image classification. Trans. Image Process. 2021, 30, 8671–8685. [Google Scholar] [CrossRef] [PubMed]
Jiang, Y.; Ma, X.; Erfani, S.M.; Bailey, J. Dual head adversarial training. In Proceedings of the 2021 International Joint Conference on Neural Networks (IJCNN), Shenzhen, China, 18–22 July 2021; pp. 1–8. [Google Scholar]
Carlini, N.; Wagner, D. Towards evaluating the robustness of neural networks. In Proceedings of the Symposium on Security and Privacy (sp), San Jose, CA, USA, 22–24 May 2017; pp. 39–57. [Google Scholar]
Touvron, H.; Cord, M.; Jégou, H. Deit iii: Revenge of the vit. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; pp. 516–533. [Google Scholar]
Andriushchenko, M.; Croce, F.; Flammarion, N.; Hein, M. Square attack: A query-efficient black-box adversarial attack via random search. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 484–501. [Google Scholar]
Raghu, M.; Unterthiner, T.; Kornblith, S.; Zhang, C.; Dosovitskiy, A. Do vision transformers see like convolutional neural networks? Adv. Neural Inform. Process. Syst. 2021, 34, 12116–12128. [Google Scholar]

Figure 1. How to use reference documents to achieve robust trustworthy inference.

Figure 2. Integration of RobustQuote in a vision transformer pipeline. The references may be dropped after the last RobustQuote block of the architecture.

Figure 3. Detailed quotation (left) and rectification (right) modules. The cross-attention in the quotation forwards references features

R_{c l s}

, while the rectification forwards input features X. We highlight in blue the use of cross-attention

A_{c l s}

to scale the most relevant attentions proposed in

A_{ω}^{r i} A_{ω}^{r r}

.

Figure 3. Detailed quotation (left) and rectification (right) modules. The cross-attention in the quotation forwards references features

R_{c l s}

, while the rectification forwards input features X. We highlight in blue the use of cross-attention

A_{c l s}

to scale the most relevant attentions proposed in

A_{ω}^{r i} A_{ω}^{r r}

.

Figure 4. Performance of adversarial attacks when the attackers do not know and then know the references used during inference while generating the perturbation

δ

.

Figure 4. Performance of adversarial attacks when the attackers do not know and then know the references used during inference while generating the perturbation

δ

.

Figure 5. Hyperparameter search. The average accuracies are calculated in Natural, PGD-20, PGD-100, C&W-20, and AutoAttack. (a) Search of the best

τ

, Table 6. (b) Search of best

λ

, Table 7. (c) Search of best Block Id, Table 8.

Figure 5. Hyperparameter search. The average accuracies are calculated in Natural, PGD-20, PGD-100, C&W-20, and AutoAttack. (a) Search of the best

τ

, Table 6. (b) Search of best

λ

, Table 7. (c) Search of best Block Id, Table 8.

Table 1. Robustness to adversarial attacks of state-of-the-art adversarial defenses and our RobustQuote-7. We measure the accuracy (in %) and report the best (higher the better) results in bold.

Attacks	DeiT3-T	ARD+PRM	FSR	SACNet	DH AT	Ours
# Params	5.4M	5.4M	5.4M	6M	12M	6M
CIFAR10
Natural	79.41	80.57	75.71	77.04	79.81	78.54
PGD-20	49.41	50.52	42.75	47.89	50.41	59.34
PGD-100	49.38	50.46	42.68	47.81	50.37	60.10
C&W-20	46.19	47.62	39.08	45.22	47.09	59.87
AutoAttack	45.56	46.90	38.32	44.42	46.40	47.26
ImageNette
Natural	82.42	85.32	79.06	82.45	83.41	82.73
PGD-20	56.25	56.51	49.04	55.54	56.54	62.55
PGD-100	56.08	56.28	48.79	55.16	56.43	61.89
C&W	55.75	56.69	45.48	55.01	56.25	61.71
AutoAttack	54.70	55.46	44.69	53.76	55.46	55.97

Table 2. Increasing

ϵ

in PGD-20 drives accuracy under 10%, i.e., the random prediction threshold.

Table 2. Increasing

ϵ

in PGD-20 drives accuracy under 10%, i.e., the random prediction threshold.

PGD-20	Acc (%)
$ϵ = 4 / 255$	70.60
$ϵ = 8 / 255$	59.34
$ϵ = 16 / 255$	34.95
$ϵ = 32 / 255$	10.07
$ϵ = 128 / 255$	2.84

Table 3. Gradient-based methods outperform black-box (Square) methods, proving the usefulness of gradients for generating adversarial perturbations.

Attack	Acc (%)
AutoAttack	47.26
APGD-t	50.34
Square	65.36

Table 4. Transferring adversarial attacks from surrogate models does not increase the performance of white-box attacks (PGD-20).

Surrogate	Acc (%)
DeiT3-T	62.61
RobustQuote-10	62.62

Table 5. Evaluation of the impact of using different design choices for the rectification module. The results are computed on a DeiT3-T backbone using the CIFAR10 dataset.

Rectifier	Natural	PGD-20	PGD-100	C&W	AA
`None`	78.88	59.33	59.63	59.53	47.24
`Random`	78.88	58.96	59.98	59.84	47.17
`Self`	78.11	59.28	59.30	59.49	47.27
`RSA`	78.71	59.53	60.07	59.77	46.85
`Rec`	78.54	59.34	60.10	59.87	47.26

Table 6. Evaluation of the impact of using different temperatures

τ

to control the rectification amount. The results are computed on a DeiT3-T backbone using the CIFAR10 dataset.

Table 6. Evaluation of the impact of using different temperatures

τ

to control the rectification amount. The results are computed on a DeiT3-T backbone using the CIFAR10 dataset.

$τ$	Natural	PGD-20	PGD-100	C&W	AA
1.0	78.23	59.56	59.94	59.88	47.32
0.9	78.50	59.28	59.74	59.71	47.40
0.5	78.54	59.34	60.10	59.87	47.26
0.1	78.26	59.21	60.11	59.12	47.22

Table 7. Evaluation of the impact of using different values of

λ

. The results are computed on a DeiT3-T backbone using the CIFAR10 dataset.

Table 7. Evaluation of the impact of using different values of

λ

. The results are computed on a DeiT3-T backbone using the CIFAR10 dataset.

$λ$	Natural	PGD-20	PGD-100	C&W	AA
0.9	70.25	51.68	52.65	52.59	37.65
0.5	76.27	58.58	58.79	58.58	45.01
0.25	78.42	58.96	59.73	59.07	46.66
0.1	78.54	59.34	60.10	59.87	47.26
0.0	78.56	59.47	59.83	59.36	47.23

Table 8. Evaluation of the impact of various positions of our RobustQuote block within a network. The results are computed on a DeiT3-T backbone using the CIFAR10 dataset.

Block 4	Block 7	Block 10	Block 12	Natural	PGD-20	PGD-100	C&W	AA
✓				72.35	51.81	51.35	49.49	40.64
	✓			78.54	59.34	60.10	59.87	47.26
		✓		78.89	58.01	58.16	57.11	45.77
			✓	74.71	59.67	59.69	63.93	41.01
✓	✓			72.01	50.81	50.79	48.05	39.92
	✓	✓		77.42	56.01	55.71	54.80	45.89

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Lemarchant, H.; Liu, H.; Nakashima, Y. RobustQuote: Using Reference Images for Adversarial Robustness. Appl. Sci. 2025, 15, 5470. https://doi.org/10.3390/app15105470

AMA Style

Lemarchant H, Liu H, Nakashima Y. RobustQuote: Using Reference Images for Adversarial Robustness. Applied Sciences. 2025; 15(10):5470. https://doi.org/10.3390/app15105470

Chicago/Turabian Style

Lemarchant, Hugo, Hong Liu, and Yuta Nakashima. 2025. "RobustQuote: Using Reference Images for Adversarial Robustness" Applied Sciences 15, no. 10: 5470. https://doi.org/10.3390/app15105470

APA Style

Lemarchant, H., Liu, H., & Nakashima, Y. (2025). RobustQuote: Using Reference Images for Adversarial Robustness. Applied Sciences, 15(10), 5470. https://doi.org/10.3390/app15105470

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

RobustQuote: Using Reference Images for Adversarial Robustness

Abstract

1. Introduction

2. Related Works

2.1. Adversarial Training

2.2. Layer Regularization

2.3. Gradient Obfuscation

3. Method

3.1. Quotation Module

3.2. Rectification Module

3.3. RobustQuote Block

3.4. Training Loss

3.5. Reference Set Selection

3.6. Source of the Robustness

4. Experiments

4.1. Experimental Setups

4.2. Robustness Evaluation

Evaluation for Black-Box and Adaptive Attacks

4.3. Ablation and Parameter Sensitivity Studies

4.3.1. Variants in the Rectification Module

4.3.2. τ to Determine the Effect of L rect

4.3.3. λ to Balance Loss Terms

4.3.4. Positions and Number of RobustQuote Blocks

4.4. Discussion

5. Conclusions

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Conflicts of Interest

Abbreviations

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI

4.3.2. $τ$ to Determine the Effect of $L_{rect}$

4.3.3. $λ$ to Balance Loss Terms