RobustQuote: Using Reference Images for Adversarial Robustness
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
In this paper, the authors propose RobustQuote, a novel plug-and-play defense framework for Vision Transformers that enhances adversarial robustness by leveraging dynamically sampled, attacker-unknown reference images. The method introduces two key components: a quotation module and a rectification module. Together, these prevent attackers from crafting transferable perturbations, as the inference-time references are unpredictable. RobustQuote achieves better robustness under strong white-box and black-box attacks compared to existing defenses like ARD and PRM, without significantly sacrificing clean accuracy. This work fits the scope of this journal well and should attract potential readers. However, minor revisions are necessary for publication in Applied Sciences, as evidenced by the following points.
- In the introduction, the authors provide a clear and well-structured background on adversarial attacks and defenses, including adversarial training, gradient obfuscation, and limitations of static defenses. However, it would be beneficial for authors to condense the discussion on the weaknesses of existing defenses and qualitative comparisons, and instead put more emphasis on highlighting the novelty of using reference images during inference.
- The method’s robustness relies heavily on the random sampling of reference images, which assumes the attacker has no access to or prior knowledge of the reference set used at inference. While the paper acknowledges this, the robustness drops by ~10% when the attacker does know the references (Table 2), suggesting limited security in worst-case white-box cases. The authors should discuss this further and provide a potential solution to mitigate it if possible.
- According to section 3.5, RobustQuote uses one reference image per class. For datasets with many classes, this becomes computationally expensive and memory intensive, which is briefly mentioned but not resolved. The authors should elaborate further on this issue. Additionally, it would be beneficial to provide a more quantitative analysis of the robustness–accuracy trade-off, for example, in terms of margin gains versus computational overhead.
- Most of the results in the paper are presented in tables, which can be harder for readers to quickly interpret or compare key findings. It would be beneficial to convert major tables (such as Table 1 and Table 2) into grouped bar charts showing accuracy under different attacks across methods. This would make robustness improvements more visually apparent.
- Some citations have inconsistent format.
Author Response
We appreciate your constructive and insightful feedback. We have carefully considered each of your comments and have made the necessary revisions to address your concerns as follows:
Comment 1:
In the introduction, the authors provide a clear and well-structured background on adversarial attacks and defenses, including adversarial training, gradient obfuscation, and limitations of static defenses. However, it would be beneficial for authors to condense the discussion on the weaknesses of existing defenses and qualitative comparisons, and instead put more emphasis on highlighting the novelty of using reference images during inference.
Reply 1:
We have revised the introduction to shorten our discussion of the limitations of static defenses while making our dynamic sampling strategy for reference images more explicit. We believe it is still necessary to briefly present the static aspect of adversarial training and the shortcomings of gradient obfuscation methods to properly introduce our goal and the pitfalls we aim to avoid.
Comment 2:
The method’s robustness relies heavily on the random sampling of reference images, which assumes the attacker has no access to or prior knowledge of the reference set used at inference. While the paper acknowledges this, the robustness drops by ~10% when the attacker does know the references (Table 2), suggesting limited security in worst-case white-box cases. The authors should discuss this further and provide a potential solution to mitigate it if possible.
Reply 2:
You correctly pointed out the drop in robustness when the attacker gains knowledge of the reference set (~10% as reported in Table 2). We have added a discussion in paragraph 4.4 to address this. In summary, we clarify that the only case in which we report such a drop is when the attacker knows the exact references fed to the network during inference. All our previous experiments were conducted with the attacker knowing the reference set (CIFAR10 or ImageNette) but unaware of the specific references drawn for the evaluated inference. A potential solution to this worst-case situation would be to prevent the user from submitting the references at inference time. In practice, for an AI-as-a-service deployment such as a chatbot or a web app, it appears straightforward to handle reference sampling as a backend process after the input image is submitted by the user/attacker.
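To illustrate the deployment we have in mind, a minimal sketch (hypothetical names, not part of our released code) of a backend flow in which the references are sampled only after the input is received, so the attacker can neither choose nor observe them:

    import random

    def handle_request(model, input_image, per_class_pool):
        # Hypothetical AI-as-a-service endpoint (illustrative only): the
        # user/attacker submits the input image alone; one clean reference per
        # class is drawn on the backend afterwards, so the exact references
        # used for this inference are never exposed to the attacker.
        references = [random.choice(per_class_pool[c]) for c in sorted(per_class_pool)]
        return model(input_image, references)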
Comment 3:
According to section 3.5, RobustQuote uses one reference image per class. For datasets with many classes, this becomes computationally expensive and memory intensive, which is briefly mentioned but not resolved. The authors should elaborate further on this issue. Additionally, it would be beneficial to provide a more quantitative analysis of the robustness–accuracy trade-off, for example, in terms of margin gains versus computational overhead.
Reply 3:
The concern regarding computational overhead with large datasets is valid. We also address this in the same discussion paragraph 4.4. The computational cost of using references is equivalent to adding that many images to the input batch; the references are processed through the transformer blocks like regular inputs. In the RobustQuote block, the cross-attention in the quotation module has a marginal cost (O(ω)) compared to the self-attention (O(N^2)), and the cross-attention in the rectification module has a complexity of O(ω·N·N_ω). This overhead is the source of the 7.7% adversarial robustness gain on CIFAR10 and 4.3% on ImageNette compared to the adversarially trained ARD-PRM model in Table 1.
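To make these orders of growth concrete, the following minimal PyTorch sketch (hypothetical shapes and token pooling, not our released implementation) reproduces only the attention score computations behind each term:

    import torch

    # Illustrative sketch only: how the attention costs quoted above scale with
    # N input tokens, ω references of N_ω tokens each, and embedding dimension d.
    N, N_w, w, d = 197, 197, 10, 384

    x = torch.randn(1, N, d)              # tokens of the input image
    refs = torch.randn(1, w, N_w, d)      # ω reference images, N_ω tokens each

    # Standard self-attention: an N x N score matrix, i.e. O(N^2).
    self_scores = x @ x.transpose(-1, -2)                    # (1, N, N)

    # Quotation-style cross-attention (assumed form): a single pooled input token
    # attends to one summary token per reference, a 1 x ω score matrix, i.e. O(ω).
    x_pooled = x.mean(dim=1, keepdim=True)                   # (1, 1, d)
    ref_summary = refs.mean(dim=2)                           # (1, ω, d)
    quote_scores = x_pooled @ ref_summary.transpose(-1, -2)  # (1, 1, ω)

    # Rectification-style cross-attention (assumed form): every input token
    # attends to every reference token, an N x (ω·N_ω) matrix, i.e. O(ω·N·N_ω).
    ref_tokens = refs.reshape(1, w * N_w, d)                 # (1, ω·N_ω, d)
    rect_scores = x @ ref_tokens.transpose(-1, -2)           # (1, N, ω·N_ω)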
Comment 4:
Most of the results in the paper are presented in tables, which can be harder for readers to quickly interpret or compare key findings. It would be beneficial to convert major tables (such as Table 1 and Table 2) into grouped bar charts showing accuracy under different attacks across methods. This would make robustness improvements more visually apparent.
Reply 4:
We understand your remark, and we experimented with various visualizations. We were able to present Table 2 as a grouped bar chart, as suggested. However, Table 1 has too many entries to remain readable in such a format. We could propose a scatter plot; however, the lack of coherence between the various attacks makes this representation questionable.
Comment 5:
Some citations have inconsistent format.
Reply 5:
We appreciate the attention to detail regarding citation formatting. We have thoroughly reviewed all references to ensure consistency in style and formatting throughout the manuscript.
We sincerely thank you for your constructive feedback, which has significantly improved the clarity, structure, and scientific rigor of our manuscript.
Reviewer 2 Report
Comments and Suggestions for Authors
In this paper, the authors introduce a novel defense framework for Vision Transformers (ViTs) against adversarial attacks. RobustQuote enhances adversarial robustness by leveraging reference images (uncorrupted samples randomly selected from a pool unknown to the attacker) to detect and correct perturbations in the input image. It outperforms or matches state-of-the-art defenses (like TRADES, SACNet, FSR) in robustness to PGD-100, C&W, and AutoAttack on CIFAR-10 and ImageNette and gains up to +12.2% adversarial accuracy in realistic scenarios where attackers don't know the reference set.
The paper's technical contributions are suitable for publication. The paper is well written with a good organizational structure. Some minor comments need to be handled.
Some comments:
- The link to https://github.com/ErestorX/RobustQuote does not work.
- The defense relies on having one clean reference image per class at inference time. This limits scalability to large datasets with many classes (e.g., ImageNet with 1,000 classes). Maintaining and forwarding all reference images during inference is computationally expensive and memory-heavy.
- Each prediction requires running reference images through the transformer, computing cross-attention with input, storing and managing the full set of feature maps. This adds a notable computation and latency cost, which could limit real-time or edge deployment.
- It's stated that a random clean image per class is selected at every inference, but how is this sampling done in batch settings? Are the same reference images used across a batch? Are class labels used at inference to select references?
- For methods like SACNet, FSR, etc., it's not 100% clear if they were retrained under the same setting or if results are taken from previous papers.
Author Response
Thank you very much for your thorough and perceptive review. We have carefully addressed each of your comments as outlined below:
Comment 1:
The link to https://github.com/ErestorX/RobustQuote does not work.
Reply 1:
We apologize for the inconvenience regarding the GitHub link. The link is correct, but the project was still in a private repository. We have made some adjustments and released the project to the public. The repository will be further updated to give visitors easier access to our method.
Comment 2:
The defense relies on having one clean reference image per class at inference time. This limits scalability to large datasets with many classes (e.g., ImageNet with 1,000 classes). Maintaining and forwarding all reference images during inference is computationally expensive and memory-heavy.
Reply 2:
We acknowledge the limitations posed by maintaining one reference image per class, particularly in datasets with numerous classes like ImageNet. We provide a more detailed answer on the computational costs in our reply to Comment 3. Regarding scaling to larger datasets, we think that a method searching for robust cues in a class-agnostic manner might solve this issue by fixing the number of reference images to an arbitrary value at the design stage. We unfortunately do not yet have finalized experiments to test this hypothesis.
Comment 3:
Each prediction requires running reference images through the transformer, computing cross-attention with input, storing and managing the full set of feature maps. This adds a notable computation and latency cost, which could limit real-time or edge deployment.
Reply 3:
The concern regarding computational overhead with large datasets is valid. We also address this in the same discussion paragraph 4.4. The computational cost of using references is equivalent to adding that many images to the input batch; the references are processed through the transformer blocks like regular inputs. In the RobustQuote block, the cross-attention in the quotation module has a marginal cost (O(ω)) compared to the self-attention (O(N^2)), and the cross-attention in the rectification module has a complexity of O(ω·N·N_ω). This overhead is the source of the 7.7% adversarial robustness gain on CIFAR10 and 4.3% on ImageNette compared to the adversarially trained ARD-PRM model in Table 1.
Comment 4:
It's stated that a random clean image per class is selected at every inference, but how is this sampling done in batch settings? Are the same reference images used across a batch? Are class labels used at inference to select references?
Reply 4:
Excellent point regarding batch sampling. We clarify this in the revised Section 3.5. In summary, we build ω sub-datasets, one per class, from the original dataset, and we randomly draw a sample from each of them. So in practice, yes, we use the class labels to create these sub-datasets. As for batching, yes, the reference images are shared across all regular input images in the same batch to simplify the computation. The user can change this behavior by setting the batch size to 1, i.e., using ω references for a single input image.
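For completeness, a minimal sketch of the sampling and batching behavior described above (hypothetical names, not the exact code from our repository):

    import random

    def sample_references(per_class_pool):
        # One random clean image per class: the ω class labels are only used
        # to build and index these sub-datasets, never to inspect the test input.
        return [random.choice(per_class_pool[c]) for c in sorted(per_class_pool)]

    def predict_batch(model, batch, per_class_pool):
        # A single set of ω references is drawn per forward pass and shared by
        # every image in the batch; a batch size of 1 recovers the per-image
        # case (ω references for one input image).
        references = sample_references(per_class_pool)
        return model(batch, references)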
Comment 5:
For methods like SACNet, FSR, etc., it's not 100% clear if they were retrained under the same setting or if results are taken from previous papers.
Reply 5:
To maintain fairness in evaluation, we have retrained all baseline methods (SACNet, FSR, etc.) under the same experimental conditions and datasets used for RobustQuote. This ensures direct comparability of results. We have clarified this in Section 4.1.
We are grateful for your insightful comments, which have led to substantial improvements in the clarity, completeness, and scientific robustness of our work.