Article

Segment and Recover: Defending Object Detectors Against Adversarial Patch Attacks

by Haotian Gu and Hamidreza Jafarnejadsani *
Department of Mechanical Engineering, Stevens Institute of Technology, Hoboken, NJ 07030, USA
* Author to whom correspondence should be addressed.
J. Imaging 2025, 11(9), 316; https://doi.org/10.3390/jimaging11090316
Submission received: 25 July 2025 / Revised: 1 September 2025 / Accepted: 12 September 2025 / Published: 15 September 2025
(This article belongs to the Special Issue Object Detection in Video Surveillance Systems)

Abstract

Object detection automatically identifies and locates specific objects within images or videos for applications such as autonomous driving, security surveillance, and medical imaging. Protecting object detection models against adversarial attacks, particularly malicious patches, is crucial for reliable and safe performance in safety-critical applications, where misdetections can lead to severe consequences. Existing defenses against patch attacks are primarily designed for stationary scenes and struggle against adversarial image patches that vary in scale, position, and orientation in dynamic environments. In this paper, we introduce SAR, a patch-agnostic defense scheme based on image preprocessing that does not require additional model training. By integrating a patch-agnostic detection frontend with a broken-pixel restoration backend, Segment and Recover (SAR) is developed for large-mask-covered object-hiding attacks. Our approach is not limited by the patch scale, shape, or location: it accurately localizes the adversarial patch on the frontend and restores the broken pixels on the backend. Our evaluations of clean performance demonstrate that SAR is compatible with a variety of pretrained object detectors. Moreover, SAR exhibits notable resilience improvements over the state-of-the-art methods evaluated in this paper. Our comprehensive evaluation studies involve diverse patch types, such as localized-noise, printable, visible, and adaptive adversarial patches.

1. Introduction

Deep neural networks (DNNs) are increasingly deployed in the physical world for safety-critical computer vision tasks, such as face authentication on smartphones, driving assistance in autonomous cars, and intruder detection on surveillance cameras. However, DNNs are known to be vulnerable to evasion attacks: perturbations that, when combined with data inputs, cause intentional misclassifications. Adversarial patches pose a serious threat to real-world object detection systems since they are easy to implement physically. While there is an abundance of adversarial patch attacks on object detectors [1,2,3,4,5,6,7,8] and classifiers [9,10,11,12,13,14,15,16], defenses against such attacks have not been extensively studied. Securing object detectors is more challenging due to the complexity of the task. Defenses against adversarial patch attacks on object detectors can be mainly grouped into five types: (i) adversarial training, (ii) detection and removal [17,18,19,20,21,22,23], (iii) detection and mitigation [24,25,26,27], (iv) detection and restoration [28,29], and (v) certifiably robust defenses [30,31,32,33,34,35,36,37,38,39].
While adversarial patches are localized, they can affect predictions not only locally but also on objects farther away in the image, because object detection algorithms use spatial context for reasoning. This effect is especially significant for deep learning models, as a small, localized adversarial patch can greatly disturb feature maps at a large scale due to the large receptive fields of neurons. Removing patches from the image after detecting them discards object pixels needed for detection. Instead, restoring the broken pixels minimizes the adverse effects of adversarial patches, both locally and globally, on detection. In this paper, we present the Segment-and-Recover (SAR) defense, a framework based on a “detect-and-restore” strategy, which can robustify any object detector against patch attacks without retraining the object detectors (e.g., all of the adversarial-patch-attacked targets in Table 1 are identified by the various detectors).
The main contributions of the paper can be condensed as follows:
  • By integrating a patch-agnostic defense frontend with a broken-pixel restoration backend, we developed Segment and Recover (SAR) for detecting adversarial image patches and recovering the object detection accuracy;
  • We characterized adversarial patches in the high-frequency domain and proposed a recompression-based patch localization frontend, which is agnostic to patch appearance, shape, and location;
  • We conducted extensive evaluations and comparative studies against state-of-the-art approaches for adversarial patches of varying sizes, tasks, and attack models. The results demonstrate that our method outperforms all the evaluated state-of-the-art approaches, particularly in terms of object detection accuracy.

2. Related Work

2.1. Adversarial Patch Attacks

Adversarial patch attacks were first introduced in [9], which proposed universal, robust, targeted adversarial image patches that prevent classifiers [43,44,45,46] from identifying the target object in an image. The attack searches for a patch that maximizes the classification loss, driving the probability of the true class below the detection threshold of the network. RP_2 [10], QR Patch [12], PS-GAN [13], DiAP [14], camera stickers [15], and EOT [16] were introduced to attack classifiers, but they are not able to fool object detectors such as Faster R-CNN [41] and YOLO [40]. The reason is that modern object detectors first locate objects of different sizes at different locations in the image and then perform classification. DPATCH [1] is the first patch attack against object detectors, proposed as an iteratively trained adversarial patch that effectively attacks bounding-box regression and object classification simultaneously. Using a richer training loss, EAVISE [47] creates an adversarial patch that successfully hides persons from a person detector, where the optimization objective combines the non-printability score, the maximum objectness score, and the total variation of the image. Instead of including the non-printability score in the training loss, [48] adds the total patch saliency to train the adversarial patch. Several localized patches have been introduced [2,4,6,7,8,49] to attack detectors such as YOLO and Faster R-CNN. Furthermore, a translucent patch [50] has been proposed as a small opaque dot placed on the camera lens. Assuming sufficient lighting, such adversary overlays can be well approximated by an alpha-blending operation between the original image and an appropriately sized and colored dot.

2.2. Defenses Against Patch Attacks

Generally, the defense methods can be categorized as detection-based and image-preprocessing-based methods. The detection-based method expects the detector to identify the adversarial risk while guaranteeing the detection accuracy of the target. The preprocessing-based method, which includes “Detect-and-Smooth”, “Detect-and-Remove”, and “Detect-and-Inpaint” strategies, expects to detect and post-process the adversary in images to defend against the visible attack.
Adversarial training introduces discovered adversarial examples and the corresponding ground-truth labels into training. Ideally, the model learns how to restore the ground truth from the adversarial perturbations and performs robustly on future adversarial examples. A recent study showed promising results for defending against patch attacks using adversarial training [51]. This technique, however, suffers from the high cost of generating adversarial examples and (at least) doubles the training cost of DNN models due to its iterative retraining procedure. Its effectiveness also depends on having a technique for efficiently generating adversarial examples similar to those used by the adversary, which may not be the case in practice. As pointed out by Papernot et al. in [52], it is essential to include adversarial examples produced by all the known attacks in adversarial training, since this defensive training is non-adaptive.
Some defense methods, such as detection and mitigation [24,25,26,27], propose to remove and refill the identified candidate regions using inpainting to mitigate potential adverse effects. Local gradient smoothing (LGS) [24] estimates the noise location in the gradient domain and transforms the high-activation regions caused by adversarial noise in the image domain, while having minimal effect on the salient object that is important for correct classification. In [27], signal-based feature extraction is used to “detect and mitigate” the adversarial patch: the input image is first compressed for error-level analysis (ELA) to identify the boundary regions between the adversarial patch and the original image. However, accurately determining the key adversarial pixels is unrealistic because the adversarial gradient of the patch is unknown beforehand. Furthermore, if the patch covers too much of the object, the refilled area contains too many pixels that are far from the original object, which occludes the objects to be detected or classified.
To provide a more general defense, researchers have explored locating patch areas and removing adversarial effects with a detect-and-remove strategy [17,18,19,20,21,22,53]. Hayes [17] proposed digital watermarking (DW), a “detect-and-remove” strategy for localized and visible adversarial perturbations, inspired by the procedure of digital watermark removal, to defend against adversarial patches via non-blind and blind image inpainting. In this method, a saliency map of the image is constructed to help remove small holes and mask the adversarial image, blocking adversarial perturbations. While Segment and Complete (SAC) [19] performs better on PGD-generated adversarial patches [54], it struggles to defend against natural-looking patches. Furthermore, SAC depends on patch location and scale estimation. PAD [20] is proposed for localizing various adversarial patches without relying on prior attack knowledge (e.g., appearance, shape, size, and quantity) and removing them without additional training. Although the “removing” operation can eliminate the effect of the adversarial patches, it destroys contextual information within the original image; the resulting incomplete images degrade the performance of downstream tasks like image classification or object detection. Compared with this “detect-and-remove” strategy, an alternative is a “detect-and-recover” strategy, i.e., recovering or repairing the original content covered by the adversarial patch after detecting the adversarial region. Jedi [28] and RLID [29] are based on this strategy.

3. Problem Setup

In this section, we first introduce the image detector model, followed by the adversarial patch attack. We also present the notations and terminology used in this paper; Table 2 provides a summary of the notations.

3.1. Image Object Detector

Object classification is a standard task in computer vision. Given an input image and a set of class labels, a classification algorithm outputs the most probable label (or a probability distribution over all the labels) for the image. Object classifiers are limited to categorizing a single object per image, whereas object detectors can locate and classify multiple objects in a given scene. Deep-learning-based object detectors can be classified into two categories: two-stage detectors, such as Fast R-CNN [55], R-CNN [56], SPPNet [57], Faster R-CNN [41], R-FCN [58], Mask R-CNN [59], and ResNet [44], and one-stage detectors, including DetectorNet [60], OverFeat [61], YOLO [40], and DETR [42]. Below, we detail DETR, YOLO, and Faster R-CNN from these two categories.
In DETR and YOLO, one-stage frameworks, class probabilities and bounding-box offsets are predicted directly by a single feedforward network. This architecture leads to a faster processing speed. Due to such efficiency and high accuracy, YOLO and DETR are good choices for real-time processing systems, such as the traffic light detection module in Apollo (an open platform for autonomous driving) and object detection modules for satellite imagery.
Faster R-CNN, a two-stage detection framework, includes a preprocessing step for region proposals and a category-specific classification step to determine the category labels of the proposals. Faster R-CNN was proposed to improve R-CNN, which is computationally expensive despite its high object detection accuracy. Instead of running the time-consuming selective search algorithm on the feature map to identify region proposals, Faster R-CNN uses a separate network to make the region proposals. Hence, Faster R-CNN is much faster than its predecessors and can even be used for real-time object detection.

3.2. Attack Formulation

Attacker’s Objective: The attack is focused on object detection in image frames. We use $\mathcal{X} \subset \mathbb{R}^{W \times H \times C}$ to represent the distribution of images, where each image $x \in \mathcal{X}$ has width $W$, height $H$, and $C$ channels. We define $\mathcal{Y} = \{0, 1, \dots, N-1\}$ as the label space, where $N$ is the number of classes. We use $F(x): \mathcal{X} \to \mathcal{Y}$ to describe the model inference that takes an image $x \in \mathcal{X}$ as input and predicts the class label $y \in \mathcal{Y}$. In this paper, we focus on hostile patch attacks against image detection models. Formally, given a deep-learning- or transformer-based model $f$, an image $x$, and its true class label $y$, the goal of the attacker is to find an image $x' \in \mathcal{A}(x) \subset \mathcal{X}$ such that $f(x') = y'$, where $y'$ is a class label chosen by the attacker and $y' \neq y$.
Attacker’s Capability: The attacker can indiscriminately modify pixels within a confined region, which can be located anywhere in the image, including over the salient object. We assume that all the manipulated pixels are within the image frame. Formally, we assume the adversary can arbitrarily alter an image $x$ within a constraint set $\mathcal{A}(x)$. We use a binary pixel block $p$ to denote the restricted region, with $p \in \mathcal{P} \subset \{0, 1\}^{W \times H}$, where the pixels within the hostile region are set to 1. The constraint set $\mathcal{A}(x)$ can then be expressed as $\mathcal{A}(x) = \{(1 - p) \odot x + p \odot x'' \mid x, x'' \in \mathcal{X},\ p \in \mathcal{P}\}$, where $\odot$ denotes the element-wise product and $x''$ is the content of the adversarial patch.
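For illustration, a minimal sketch of applying a patch within this constraint set is given below, assuming a rectangular mask; the function name and patch placement are illustrative rather than taken from a specific attack implementation:
```python
import numpy as np

def apply_patch(x: np.ndarray, patch: np.ndarray, top: int, left: int) -> np.ndarray:
    """Compute x' = (1 - p) ⊙ x + p ⊙ x'' for a rectangular binary mask p."""
    H, W, _ = x.shape
    ph, pw, _ = patch.shape
    p = np.zeros((H, W, 1), dtype=x.dtype)            # binary pixel block p ∈ {0, 1}^{W×H}
    p[top:top + ph, left:left + pw] = 1.0
    x_pp = np.zeros_like(x)                           # patch content x'' on an empty canvas
    x_pp[top:top + ph, left:left + pw] = patch
    return (1.0 - p) * x + p * x_pp                   # element-wise blend from Section 3.2

# Example: a 50 × 50 random-noise patch in the upper-left corner of a 416 × 416 image.
img = np.random.rand(416, 416, 3).astype(np.float32)
noise = np.random.rand(50, 50, 3).astype(np.float32)
x_adv = apply_patch(img, noise, top=0, left=0)
```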

4. “Detect-And-Inpaint” Strategy-Based Robust Vision Framework

In this section, we introduce the patch defense pipeline based on image preprocessing, as shown in Figure 1. For patch attacks in physical scenes, we aim to eliminate visible adversarial noise from examples by processing the inputs from the perspectives of appearance inconsistency and adversarial attack effect. Inputs processed by our method enable the subsequent classification or detection model to make correct inferences. Our proposed pipeline achieves a universal defense against adversarial patch attacks. In contrast to PAD, which we use as the baseline framework, our modified and enhanced framework consists of a patch-region-localizing (frontend) and an inpainting (backend) pipeline. The localizing module localizes and segments the patch areas. The inpainting module recovers the lost pixels and eliminates adversarial perturbations in the located candidate patch areas. Previous works, such as DW [17], SentiNet [18], SAC [19], PAD [20], and PatchZero [21], employ a simple and fast inpainting method that fills the patch area with all-black pixels; in contrast, our backend preserves detection and classification accuracy even when the patch occludes the object.

4.1. Patch Localizing and Feature Extraction (Frontend)

In our framework, we use FastSAM to segment the patch regions. Due to the limitations of the recognition capabilities of FastSAM, an extra prompt algorithm is needed to identify the adversarial patch among the random segmentations. Since adversarial perturbations are often subtle and high-frequency, we identify regions that have been digitally manipulated or that have higher frequencies than adjacent regions. The patch prompt method is given on lines 13–23 of Algorithm 1. Using a compression algorithm such as JPEG, the pixel values are converted to the frequency domain via a discrete cosine transform (DCT). The distributions of DCT coefficients in authentic and adversarial-patch-modified regions will differ.
Algorithm 1 Segment and Recover (SAR)
Input: input image x, window size (w_x, w_y), matching threshold T, base detector BaseDetector(·)
Output: robust detection D*, inpainted image, and CAUTION
 1: procedure SAR(x, w_x, w_y, T)
 2:   a_m ← AdvPredictor(x, w_x, w_y, T)  ▹ Adversary detection (frontend)
 3:   p_m ← AdvDetector(x, a_m)  ▹ Adversary localization
 4:   x′ ← PixelRestoration(x, p_m)  ▹ Broken pixel restoration (backend)
 5:   D ← BaseDetector(x′)  ▹ Conventional detection
 6:   if a == True then
 7:     D* ← CAUTION  ▹ Trigger a caution
 8:   else
 9:     D* ← D
10:   end if
11:   return D*
12: end procedure
13: procedure AdvPredictor(x, w_x, w_y, T)
14:   f_m ← Fe(x)  ▹ Extract feature map
15:   X, Y, _ ← Shape(f_m)  ▹ Get the shape of f_m
16:   a_m ← ZeroArray[X, Y, N + 1]  ▹ Initialization
17:   for each valid (i, j) do  ▹ Every window location
18:     l, v ← Jpeg(f_m[i : i + w_x, j : j + w_y])
19:     a_m[i : i + w_x, j : j + w_y] ← a_m[i : i + w_x, j : j + w_y] + v
20:   end for
21:   a_m ← Binarize(a_m, T, w_x, w_y)  ▹ Binarization
22:   return a_m
23: end procedure
24: procedure AdvDetector(x, a_m)
25:   seg ← FastSamAutomaticMaskGenerator(x)  ▹ Extract segmentation layer by layer
26:   L, T, R, B ← Shape(seg, a_m)  ▹ Get the left, top, right, and bottom of seg
27:   A_box1 ← Area_Seg[L1, T1, R1, B1]  ▹ Initialize the area of the segmentation
28:   A_box2 ← Area_Am[L2, T2, R2, B2]  ▹ Initialize the area of the adversarial patch map
29:   for each valid A_box1 do  ▹ Every segmentation
30:     for each valid A_box2 do  ▹ Every a_m
31:       A_inter ← (R_inter − L_inter) × (B_inter − T_inter)  ▹ Area of intersection
32:       A_union ← A_box1 + A_box2 − A_inter  ▹ Area of union
33:       IoU ← A_inter / A_union  ▹ Calculate IoU
34:       if IoU < 0.9 then
35:         return 0  ▹ No overlap
36:       else
37:         p_m ← Binarize(a_m)  ▹ Binarization
38:         return p_m  ▹ Return patch map
39:       end if
40:     end for
41:   end for
42: end procedure
43: procedure PixelRestoration(x, p_m)
44:   while i ≤ T and n > 0 do
45:     C^{H×(W/2)×C} ← RealFFT2d(R^{H×W×C})  ▹ Apply real FFT to the input tensor
46:     R^{H×(W/2)×2C} ← C^{H×(W/2)×C}  ▹ Concatenate real and imaginary parts
47:     R^{H×(W/2)×2C} ← Conv(R^{H×(W/2)×2C})  ▹ Apply a convolution block in the frequency domain
48:     C^{H×(W/2)×C} ← R^{H×(W/2)×2C}
49:     R^{H×W×C} ← InverseRealFFT2d(C^{H×(W/2)×C})  ▹ Apply an inverse transform to recover a spatial structure
50:   end while
51:   return x′
52: end procedure
Since high frequencies are often removed in JPEG images by setting the respective DCT coefficients to 0 in 8 × 8 pixel blocks, we use JPEG compression [62] to isolate the high-frequency area that represents the patch. When the image is recompressed using the JPEG standard, the DCT coefficients are quantized. Quantization sets the coefficients representing the high frequencies to zero, causing the high frequencies to disappear from the image. The quantization process can be expressed as follows:
$$F_Q(x, y, i) = \operatorname{round}\!\left(\frac{F(x, y, i)}{Q(x, y, i)}\right)$$
where $Q(x, y, i)$ represents the corresponding quantization step size, and $F(x, y, i)$ represents the DCT coefficient of channel $i$ at location $(x, y)$. We quantify the pixel values before and after recompression as follows:
$$H_{cd}(x, y, Q_r) = \frac{1}{c} \sum_{i=1}^{c} \left[ F_{Q_c}(x, y, i) - F_{Q_r}(x, y, i) \right]^2$$
where $c$ denotes the number of channels, and $Q_c$ and $Q_r$ represent the quality factors of the clean and adversarial-patch-attacked images, respectively. By subtracting the recompressed image from the original image, only the high frequencies of the image are isolated, which aids in identifying real-world patches.
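A minimal sketch of this recompression step is shown below. It approximates $H_{cd}$ in the pixel domain, in the spirit of error-level analysis, using Pillow for JPEG compression; the quality factors and function names are illustrative assumptions:
```python
import io
import numpy as np
from PIL import Image

def jpeg_recompress(img: Image.Image, quality: int) -> np.ndarray:
    """Round-trip an image through JPEG: quantization zeroes high-frequency DCT terms."""
    buf = io.BytesIO()
    img.convert("RGB").save(buf, format="JPEG", quality=quality)
    buf.seek(0)
    return np.asarray(Image.open(buf), dtype=np.float32)

def recompression_heatmap(img: Image.Image, q_c: int = 95, q_r: int = 50) -> np.ndarray:
    """Per-pixel mean squared channel difference between two quality factors (H_cd)."""
    a = jpeg_recompress(img, q_c)                     # lightly compressed reference
    b = jpeg_recompress(img, q_r)                     # strongly recompressed version
    return np.mean((a - b) ** 2, axis=-1)             # average over the c channels
```
High-frequency (patch-like) regions lose the most energy under stronger recompression, so they stand out in the resulting difference map.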
We perform a texture analysis by calculating the dissimilarity texture feature to gauge the homogeneity of pixel intensities in the image. High values indicate noisy textures, while low values indicate flat textures. To assess the local texture difference of the patch-attacked area, a sliding window captures the texture properties. The local mutual information heat map $H_{mi}$ within the current window $W_{cur}$ is the average of the mutual information between $W_{cur}$ and its neighboring windows $W_i$:
$$H_{mi}[x_{cur} : x_{cur} + d,\; y_{cur} : y_{cur} + d] = \frac{1}{n} \sum_{i=1}^{n} \sum_{w_i \in W_i} \sum_{w_c \in W_{cur}} p(w_i, w_c) \log \frac{p(w_i, w_c)}{p(w_i)\, p(w_c)}$$
We fuse the normalized local mutual information heat map $H_{mi}$ and the recompression difference heat map $H_{cd}$ as follows:
$$H(x, y) = r_{mi} \times H_{mi} + (1 - r_{mi}) \times H_{cd}$$
where $r_{mi}$ denotes the weight of the mutual information heat map; we set $r_{mi} = 0.5$.
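The sketch below illustrates the mutual information computation for a pair of windows and the heat map fusion, assuming histogram-based probability estimates and min–max normalization; the bin count and helper names are illustrative:
```python
import numpy as np

def mutual_info(w_cur: np.ndarray, w_i: np.ndarray, bins: int = 16) -> float:
    """Histogram-based mutual information between two same-sized windows."""
    joint, _, _ = np.histogram2d(w_cur.ravel(), w_i.ravel(), bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)               # marginal p(w_c)
    py = pxy.sum(axis=0, keepdims=True)               # marginal p(w_i)
    nz = pxy > 0                                      # avoid log(0)
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))

def fuse_heatmaps(h_mi: np.ndarray, h_cd: np.ndarray, r_mi: float = 0.5) -> np.ndarray:
    """H = r_mi * H_mi + (1 - r_mi) * H_cd on min–max normalized maps."""
    norm = lambda h: (h - h.min()) / (h.max() - h.min() + 1e-8)
    return r_mi * norm(h_mi) + (1.0 - r_mi) * norm(h_cd)
```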
The second stage detects and segments all the objects, including the adversarial patches, that impact the inference accuracy of object classification and detection. FastSAM [63], a unified model, achieves high-precision, class-agnostic segmentation for video and image segmentation, exhibiting strong capability in zero-shot tasks. The visible perturbations can be of any shape, scale, or location. This stage separates the specific object(s) of interest from the segmented panorama based on the provided prompts, which range from foreground/background point sets to rough boxes or masks, free-form text, or any information that indicates the content to be segmented within an image. However, the location, scale, and shape of adversarial perturbations are unknown beforehand, so we rely only on FastSAM’s zero-shot segmentation capability to isolate the edges of all the regions in the patch-attacked images and obtain candidate masks for adversarial patches. We then match each mask with $H_p$ and take all the masks whose intersection-over-union (IoU) value exceeds the threshold as the final patch masks, as on lines 24–42 of Algorithm 1. The IoU is given as follows:
$$IoU(mask, H_p) = \frac{area(mask \cap H_p)}{area(mask)}$$
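A minimal sketch of this mask-matching step, assuming boolean FastSAM masks and a binarized heat map $H_p$ of the same resolution, is given below; the threshold and names are illustrative:
```python
import numpy as np

def match_masks(masks: list, h_p: np.ndarray, thresh: float = 0.9) -> np.ndarray:
    """Keep each segmentation mask whose overlap ratio with H_p exceeds the threshold."""
    patch_map = np.zeros_like(h_p, dtype=bool)
    for mask in masks:                                # boolean FastSAM masks, same H×W as h_p
        inter = np.logical_and(mask, h_p).sum()
        ratio = inter / max(mask.sum(), 1)            # area(mask ∩ H_p) / area(mask)
        if ratio > thresh:
            patch_map |= mask                         # add the mask to the final patch map
    return patch_map
```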

4.2. Patch Mask Inpainting with Fourier Convolutions (Backend)

Detection accuracy often suffers with large missing areas of objects, complex geometric structures, and high-resolution images. Although [24,25,26,27] propose to remove and refill the identified candidate regions, using inpainting to mitigate potential adverse effects, accurately determining the key adversarial pixels is unrealistic because the adversarial gradient of the patch is unknown beforehand. Furthermore, if the adversarial patch covers too much of the object, the refilled area will occlude the objects to be detected or classified because it contains too many pixels that are far from the original object. To alleviate this issue, we develop a backend for large-mask inpainting to restore the broken pixels. The restored pixels guarantee that a detector or a classifier can identify the patch-occluded object. The method on lines 43–52 of Algorithm 1 inpaints a color image $x$ masked by a binary mask of unknown pixels $m$, where the masked image is denoted as $x_m$. The mask $m$ is stacked with the masked image $x_m$, generating a four-channel input tensor $x' = \mathrm{stack}(x_m, m)$. We use a feedforward inpainting network $f_\theta(\cdot)$, which we also refer to as the generator. Taking $x'$, the inpainting network processes the input in a fully convolutional manner and produces an inpainted three-channel color image $\hat{x} = f_\theta(x')$. For wide masks, the whole receptive field of the generator at a specific position may fall inside the mask, thus observing only missing pixels. We therefore use a channel-wise fast Fourier transform (FFT)-based fast Fourier convolution (FFC) to obtain a receptive field that covers the entire image. The FFC splits the channels into local and global branches: the local branch performs conventional convolutions, while the global branch applies the real FFT to account for the global context. We finally fuse the two branches together.
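A minimal PyTorch sketch of the spectral transform inside the FFC global branch is given below, following the real-FFT, frequency-domain convolution, and inverse-FFT pattern of lines 45–49 of Algorithm 1; the channel counts and the 1 × 1 convolution block are illustrative assumptions, not the exact generator configuration:
```python
import torch
import torch.nn as nn

class SpectralTransform(nn.Module):
    """Global branch of an FFC: real FFT → conv in the frequency domain → inverse FFT."""

    def __init__(self, channels: int):
        super().__init__()
        # Convolution block applied to the stacked real/imaginary (2C) channels.
        self.freq_conv = nn.Sequential(
            nn.Conv2d(2 * channels, 2 * channels, kernel_size=1),
            nn.BatchNorm2d(2 * channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, C, H, W = x.shape
        z = torch.fft.rfft2(x, norm="ortho")               # (B, C, H, W//2 + 1), complex
        z = torch.cat([z.real, z.imag], dim=1)             # concatenate real and imaginary parts
        z = self.freq_conv(z)                              # convolution in the frequency domain
        real, imag = torch.chunk(z, 2, dim=1)              # split back into real/imaginary parts
        return torch.fft.irfft2(torch.complex(real, imag), # inverse transform to recover
                                s=(H, W), norm="ortho")    # the spatial structure

# The receptive field of this block covers the entire image in a single layer,
# which is what lets the generator fill wide masks from global context.
x = torch.randn(1, 64, 128, 128)
y = SpectralTransform(64)(x)                               # shape preserved: (1, 64, 128, 128)
```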

5. Evaluation Study

In this section, we evaluate the performance of the proposed SAR framework in standard benchmark datasets under a range of patch-based adversarial attacks. We further compare our method with state-of-the-art approaches using established evaluation metrics reported in the literature.

5.1. Metrics

5.1.1. Clean Performance Metrics

Average Precision (AP) [64]: We measure the accuracy of the detector in identifying and classifying objects within an image and report the average precision. We sweep the confidence threshold from 0 to 1 to record the precision and recall at various thresholds, and we calculate the AP at an IoU threshold of 0.5. The calculated AP can be considered an approximation of the area under the curve (AUC) of the precision–recall curve. We note that mAP is one of the most widely used performance metrics in object detection benchmark competitions and research papers.
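For concreteness, a minimal sketch of computing AP as the area under the precision–recall curve is given below, assuming per-detection confidences and 0/1 ground-truth match flags at an IoU of 0.5; all names are illustrative:
```python
import numpy as np

def average_precision(conf: np.ndarray, matched: np.ndarray, n_gt: int) -> float:
    """AP as the area under the PR curve; `matched` flags true positives at IoU 0.5."""
    order = np.argsort(-conf)                         # sweep the threshold from high to low
    tp = np.cumsum(matched[order])
    fp = np.cumsum(1 - matched[order])
    recall = tp / n_gt
    precision = tp / np.maximum(tp + fp, 1e-8)
    return float(np.trapz(precision, recall))         # AUC approximation of the PR curve
```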
False-Alert Rate (FAR) [65]: The FAR is the fraction of clean images on which SAR prompts a false caution even though no adversary is present in the image frames. The FAR is also closely related to the confidence threshold of the detector: a higher confidence threshold yields fewer predicted bounding boxes, leading to more unexplained objectness and, ultimately, a higher FAR.

5.1.2. Provable Robustness Metrics

Patch Localization Recall (PLR) [66]: We use patch localization recall to evaluate the localization performance of the “detect-and-inpaint” strategy. It represents the percentage of applied patches detected by the detection-based frontend with an intersection-over-union value exceeding 0.9.
Certified Recall (CR@0.5) [36]: We use certified recall as the robustness metric against patch-hiding attacks. The certified recall is defined as the percentage of ground-truth objects that have provable robustness against any patch-hiding attack; an object has provable robustness when Algorithm 1 identifies it despite the adversary.

5.2. Evaluation Setup and Benchmarks

Adversarial patch attacks: To assess the defense performance of SAR against diverse types of patches, we employ nine different patches generated by EAVISE [47], DPatch [1], and the YOLO adversarial patch [48], representing the localized-noise, printable, and object-occluding patches. DPatch produces fixed-size patches (40 mm × 40 mm, 75 mm × 75 mm, and 100 mm × 100 mm) placed in the upper-left corner of each image. The YOLO adversarial patch, covering 20%, 30%, or 40% of the bounding box, is placed over the target objects; the bounding boxes are generated by running the same detector over the dataset. EAVISE creates a small patch (around 40 mm × 40 mm) used to hide people from object detectors, where the objective minimizes the product of the object score and the class score (OBJ-CLS), only the object score (OBJ), or only the class score (CLS).
Dataset and Architecture: We apply the adversarial patches to the VisDrone dataset [67] to generate the patch-attacked image subset. We create a subset of 400 adversarial-patch-attacked images from the original set. The adversarial patches are overlaid at the target object detection location and do not fully occlude the target object, covering only 10% to 30% of the bounding box. The adversarial goal is to evade detection of the target objects.
Detection tasks: For YOLO [40], a one-stage region-based framework, class probabilities and bounding box offsets are predicted directly with a single feedforward CNN network. This architecture leads to a faster processing speed. DETR [42], which stands for detection transformer, is a deep learning model that utilizes a transformer encoder–decoder architecture. It is known for streamlining the object detection pipeline by directly predicting a set of objects without relying on many handcrafted components, like non-maximum suppression or anchor boxes. Faster RCNN [41], a two-stage detection framework, includes a preprocessing step for region proposals and a category-specific classification step to determine the category labels of the proposals. These detector models are pretrained on MS COCO [68].
Benchmarks: We compare SAR with four state-of-the-art adversarial patch defenses: LGS [24], Jedi [28], RLID [29], and Jujutsu [25], corresponding to “detect-and-mitigate”, “detect-and-recover”, and “detect-and-remove” strategies.

5.3. Results: Clean Performance

In this section, we evaluate the clean performance of SAR with three different base object detectors and three datasets. In Table 3, we report the AP and FAR on clean data. We also plot the precision–recall curves for the PASCAL VOC [69] challenge dataset in Figure 2.
SAR has a low FAR and a high AP. Table 3 shows that when SAR uses various clean detectors as its base, it achieves a very low false-alert rate (FAR) of just 0.2% and maintains a high average precision (AP). These results indicate that SAR has only a negligible effect on clean detection performance; in fact, its mean average precision (mAP) is nearly identical to those of the base detectors.
SAR is highly compatible with various object detectors. We report the mAPs of clean samples after the defense in Table 3. When we use DETR, YOLOv11, or Faster R-CNN as the base detector on diverse patches, the clean AP of SAR, as well as its precision–recall curves shown in Figure 2, is close to that of the base detector (with a 1.1% drop on Faster R-CNN, a 3.7% rise on YOLOv11, and a 0.4% rise on DETR). These results show that SAR works well with various object detectors.

5.4. Results: Provable Robustness

In this subsection, we present the robustness evaluation studies and demonstrate the provable robustness of the SAR defense against hostile patch attacks.

5.4.1. Patch Localization Performance

Patch localization is a critical step in the detection frontend of the defense pipeline, as it forms the foundation for restoring the broken pixels. We compare our “detect-and-inpaint” approach with PAD [20], SentiNet [18], and Jujutsu [25]; all four methods use patch detection as their frontend. In our experiments, where patches can appear anywhere on the object or in the image, our method achieved the highest patch localization recall (see Table 4). This demonstrates that our method can accurately detect potential risks in images, especially when the adversary conceals the object from detection. Admittedly, masking a larger image area can slightly reduce the object detection mAP, but our method consistently delivers the most accurate localization among preprocessing frontends.

5.4.2. Defense Against Adaptive Patches in Object Tracking

We further evaluate the defense performance under adaptive patch attacks, where the adversary targets the detector in the vision-based object-tracking pipeline shown in Figure 3. Using the AirSim simulation environment [72], we simulated a scenario in which a quadrotor drone tracks a car in an urban setting, using its onboard camera. The YOLO model [40] is used to process the image frame and localize the car with respect to the drone, which the tracking controller then uses to move the drone. To perform the adaptive attack, we first use YOLO to generate bounding boxes for each image frame. Then we place adversarial patches, crafted by the AD_YOLO method [48], on the detected objects at three different patch sizes.
From Table 3, SAR demonstrates strong provable robustness when targets are covered with patches at various scales. This result indicates that SAR significantly increases the difficulty of successful adversarial attacks: to bypass SAR, the adversary must place the patch directly on the target object. We also note that in our threat model, the patch is allowed to appear anywhere on the object, including in its most salient regions. Patches covering such critical features typically make robust detection extremely challenging. Nonetheless, SAR remains highly effective, even when the patch obscures the most informative part of the object. Overall, SAR achieves the strongest robustness among all the evaluated methods under these adaptive, scale-varying, and strategically placed patch attacks.

5.4.3. Defense Against Physical (Printable) Patch Attacks

We further validate the effectiveness of SAR against a wider range of patch types, specifically physical (printable) patches. The printable patches include three adversary tasks: OBJ, OBJ-CLS, and CLS. We evaluate the certified recall at a clean recall of 0.5 in Table 3. Provable robustness improves as the clean recall increases, and the performances of DETR, YOLOv11, and Faster R-CNN approach that of an ideal clean detector when the recall is close to one. As shown in Table 3, when using a perfect clean detector, SAR can certify the robustness of 71% of the objects under a patch attack (for the various adversary tasks), meaning that no attacker within our threat model can effectively attack these certified objects.

5.4.4. Defense Against Localized Patch Attacks

The defense performance against localized-noise attacks (DPatch [1]) is shown in Table 3, where the attacker targets only the object detectors. The localized noise encompasses three different patch sizes. As shown in Table 3, SAR is robust across the different patch sizes in both datasets and has the highest CR@0.5 value compared to the baselines. In addition, adversarial patches create counterfeit detections and vague foreground objects at diverse scales; SAR masks out the adversarial patches and restores the model predictions. These results demonstrate that SAR is a generalizable method that can be applied to easier or more challenging detection tasks.

5.5. Ablation Study

We perform an ablation study to investigate the individual impact of the image-inpainting backend. We first conduct a runtime analysis of SAR to demonstrate its low overhead. Next, we compare the AP under the “detect-and-inpaint” and “detect-and-remove” strategies in Figure 4.
Real-time Test: We tested the real-time performance of the “detect-and-inpaint” pipeline within a vision-based tracking system, as shown in Figure 3. In this setup, adversarial patches attack targets in the tracking system: patches sized at 30%, 35%, and 40% of the bounding box are placed over the tracking target (e.g., cars) to prevent detection. As the tracking target moves in the image frame, the overlaid patch location and scale change from frame to frame; hence, we consider it an “adaptive” attack. We embed the “detect-and-inpaint” pipeline to preprocess the adversarial image frames and robustify the system. With a single GPU, the SAR runtime can be calculated as $t_{predictor} + t_{detector} + t_{inpaint}$, which achieves the real-time image restoration performance in Table 5. Overall, SAR leads to a roughly 2× slowdown compared to an undefended detector.
The Effect of the Broken Pixel Restoration Backend: We calculate the APs for popular detectors on a subset of the VisDrone dataset [67]. As shown in Figure 4, when using the pixel restoration backend, the detector is able to identify objects even when adversarial patches cover between 10% and 40% of the detection bounding box. The figure also shows that the AP improves as the broken pixels are restored, with the performances of DETR, YOLOv11, and Faster R-CNN approaching that of an ideal clean detector. As a result, the pixel-recovering process minimizes the adverse effects. Table 1 presents the object detection results obtained using DETR, YOLOv11, and Faster R-CNN. When adversarial images are preprocessed with the “detect-and-remove” strategy, the black mask used to cover the patch can obscure the object, leading to detection failure due to excessive pixel loss. In contrast, when the inpainting module repairs the corrupted regions, the objects are accurately detected, as shown in Table 1.

6. Generalization and Limitations

Generalization to Various Adversarial Patches: Since adversarial patches may not always be square in the real world, we further evaluate SAR with adversarial patches of diverse shapes while varying the number of pixels in the patch. In this paper, we used the following state-of-the-art adversarial patches for a comprehensive evaluation: LaVAN [11], Adversarial YOLO [48], Adversarial Patch [9], DPatch [1], EAVISE [47], Extended RP_2 [2], Physical Patch + PGD [73], the Patch-Against-Person Detector [47], DPAttack [3], RPAttack [4], the Patch Against Aerial Detection [5], Patch Noobj [6], Translucent Patch [50], Dynamic Patch [74], Invisibility Patch [75], the Patch Exploiting Contextual Reasoning [49], Illusion Breaker [76], and the Natural-Looking Patch [77]. Each patch category is a unique attack scheme, and these patches represent the majority of the existing visible adversarial patch schemes, to the best of our knowledge. The results are shown in Figure 5 and Figure 6. SAR demonstrates strong robustness under circular, rectangular, triangular, diamond-shaped, and ellipsoidal patch attacks.
Limitations: The proposed defense is effective in protecting against localized and visible attacks in its current form; however, we have shown that the Translucent Patch [50] allows attackers to bypass the defense. This highlights the need for further research on how best to defend against such attacks. One promising direction is to exploit the natural structures of images in contrast to the unnatural structures of adversarial perturbations. For example, we observed empirically that most salient objects in ImageNet samples contain thin, continuous regions representing the most influential image areas, whereas our modified attack generates sparse, noisy patterns. However, additional research is needed to determine whether this kind of structural reasoning can be generalized to defend against localized and visible adversarial perturbations in other domains.

7. Conclusions

In this paper, we identify frequency-domain characteristics of adversarial patches that are independent of their appearance, shape, scale, and location. Leveraging these characteristics, we propose a generalized patch-agnostic defense method, SAR, which performs adversarial patch localization and restoration against patch-hiding attacks. SAR adapts robust image classifiers for robust object detection using an objectness-explaining strategy. Our evaluation on different datasets demonstrates that SAR outperforms the defense approaches evaluated in this paper against patch-hiding attacks and exhibits a high degree of compatibility, with a clean performance comparable to that of state-of-the-art object detectors.

Author Contributions

Conceptualization, H.G. and H.J.; methodology, H.G. and H.J.; software, H.G. and H.J.; validation, H.G. and H.J.; formal analysis, H.G. and H.J.; investigation, H.G. and H.J.; resources, H.G. and H.J.; data curation, H.G. and H.J.; writing—original draft preparation, H.G. and H.J.; writing—review and editing, H.G. and H.J.; visualization, H.G. and H.J.; supervision, H.J.; project administration, H.J.; funding acquisition, H.J. All authors have read and agreed to the published version of the manuscript.

Funding

This research is supported by the National Science Foundation (NSF) under award number 2137753.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original data presented in the study are openly available at https://github.com/robotics-star/SAR (accessed on 10 September 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Liu, X.; Yang, H.; Liu, Z.; Song, L.; Li, H.; Chen, Y. Dpatch: An adversarial patch attack on object detectors. arXiv 2018, arXiv:1806.02299. [Google Scholar]
  2. Song, D.; Eykholt, K.; Evtimov, I.; Fernandes, E.; Li, B.; Rahmati, A.; Tramer, F.; Prakash, A.; Kohno, T. Physical adversarial examples for object detectors. In Proceedings of the 12th USENIX Workshop on Offensive Technologies (WOOT 18), Baltimore, MD, USA, 13–14 August 2018. [Google Scholar]
  3. Wu, S.; Dai, T.; Xia, S. Dpattack: Diffused patch attacks against universal object detection. arXiv 2020, arXiv:2010.11679. [Google Scholar]
  4. Huang, H.; Wang, Y.; Chen, Z.; Tang, Z.; Zhang, W.; Ma, K.K. Rpattack: Refined patch attack on general object detectors. In Proceedings of the 2021 IEEE International Conference on Multimedia and Expo (ICME), Shenzhen, China, 5–9 July 2021; pp. 1–6. [Google Scholar]
  5. Adhikari, A.; Hollander, R.d.; Tolios, I.; van Bekkum, M.; Bal, A.; Hendriks, S.; Kruithof, M.; Gross, D.; Jansen, N.; Pérez, G.; et al. Adversarial patch camouflage against aerial detection. arXiv 2020, arXiv:2008.13671. [Google Scholar] [CrossRef]
  6. Lu, M.; Li, Q.; Chen, L.; Li, H. Scale-adaptive adversarial patch attack for remote sensing image aircraft detection. Remote Sens. 2021, 13, 4078. [Google Scholar] [CrossRef]
  7. Zhao, Y.; Yan, H.; Wei, X. Object hider: Adversarial patch attack against object detectors. arXiv 2020, arXiv:2010.14974. [Google Scholar] [CrossRef]
  8. Chen, S.T.; Cornelius, C.; Martin, J.; Chau, D.H. Shapeshifter: Robust physical adversarial attack on faster r-cnn object detector. In Machine Learning and Knowledge Discovery in Databases, Proceedings of the European Conference, ECML PKDD 2018, Dublin, Ireland, 10–14 September 2018; Proceedings, Part I 18; Springer: Cham, Switzerland, 2019; pp. 52–68. [Google Scholar]
  9. Brown, T.B.; Mané, D.; Roy, A.; Abadi, M.; Gilmer, J. Adversarial patch. arXiv 2017, arXiv:1712.09665. [Google Scholar]
  10. Evtimov, I.; Eykholt, K.; Fernandes, E.; Kohno, T.; Li, B.; Prakash, A.; Rahmati, A.; Song, D. Robust physical-world attacks on machine learning models. arXiv 2017, arXiv:1707.08945. [Google Scholar]
  11. Karmon, D.; Zoran, D.; Goldberg, Y. Lavan: Localized and visible adversarial noise. In Proceedings of the International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; pp. 2507–2515. [Google Scholar]
  12. Chindaudom, A.; Siritanawan, P.; Sumongkayothin, K.; Kotani, K. AdversarialQR: An adversarial patch in QR code format. In Proceedings of the Joint 9th International Conference on Informatics, Electronics & Vision (ICIEV) and 4th International Conference on Imaging, Vision & Pattern Recognition (icIVPR), Kitakyushu, Japan, 26–29 August 2020; pp. 1–6. [Google Scholar]
  13. Liu, A.; Liu, X.; Fan, J.; Ma, Y.; Zhang, A.; Xie, H.; Tao, D. Perceptual-sensitive gan for generating adversarial patches. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 1028–1035. [Google Scholar]
  14. Zhou, X.; Pan, Z.; Duan, Y.; Zhang, J.; Wang, S. A data independent approach to generate adversarial patches. Mach. Vis. Appl. 2021, 32, 67. [Google Scholar] [CrossRef]
  15. Li, J.; Schmidt, F.; Kolter, Z. Adversarial camera stickers: A physical camera-based attack on deep learning systems. In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; pp. 3896–3904. [Google Scholar]
  16. Athalye, A.; Engstrom, L.; Ilyas, A.; Kwok, K. Synthesizing robust adversarial examples. In Proceedings of the International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; pp. 284–293. [Google Scholar]
  17. Hayes, J. On visible adversarial perturbations & digital watermarking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Salt Lake City, UT, USA, 18–22 June 2018; pp. 1597–1604. [Google Scholar]
  18. Chou, E.; Tramer, F.; Pellegrino, G. Sentinet: Detecting localized universal attacks against deep learning systems. In Proceedings of the IEEE Security and Privacy Workshops (SPW), San Francisco, CA, USA, 21 May 2020; pp. 48–54. [Google Scholar]
  19. Liu, J.; Levine, A.; Lau, C.P.; Chellappa, R.; Feizi, S. Segment and complete: Defending object detectors against adversarial patch attacks with robust patch detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 21–24 June 2022; pp. 14973–14982. [Google Scholar]
  20. Jing, L.; Wang, R.; Ren, W.; Dong, X.; Zou, C. Pad: Patch-agnostic defense against adversarial patch attacks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 17–21 June 2024; pp. 24472–24481. [Google Scholar]
  21. Xu, K.; Xiao, Y.; Zheng, Z.; Cai, K.; Nevatia, R. Patchzero: Defending against adversarial patch attacks by detecting and zeroing the patch. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 2–7 January 2023; pp. 4632–4641. [Google Scholar]
  22. Xiang, C.; Valtchanov, A.; Mahloujifar, S.; Mittal, P. Objectseeker: Certifiably robust object detection against patch hiding attacks via patch-agnostic masking. In Proceedings of the IEEE Symposium on Security and Privacy (SP), San Francisco, CA, USA, 22–25 May 2023; pp. 1329–1347. [Google Scholar]
  23. McCoyd, M.; Park, W.; Chen, S.; Shah, N.; Roggenkemper, R.; Hwang, M.; Liu, J.X.; Wagner, D. Minority reports defense: Defending against adversarial patches. In Proceedings of the International Conference on Applied Cryptography and Network Security, Rome, Italy, 22–25 June 2020; pp. 564–582. [Google Scholar]
  24. Naseer, M.; Khan, S.; Porikli, F. Local gradients smoothing: Defense against localized adversarial attacks. In Proceedings of the 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), Waikoloa Village, HI, USA, 7–11 January 2019; pp. 1300–1307. [Google Scholar]
  25. Chen, Z.; Dash, P.; Pattabiraman, K. Jujutsu: A two-stage defense against adversarial patch attacks on deep neural networks. In Proceedings of the ACM Asia Conference on Computer and Communications Security, Melbourne, Australia, 5–9 June 2023; pp. 689–703. [Google Scholar]
  26. Chattopadhyay, N.; Guesmi, A.; Shafique, M. Anomaly unveiled: Securing image classification against adversarial patch attacks. In Proceedings of the IEEE International Conference on Image Processing (ICIP), Abu Dhabi, United Arab Emirates, 27–30 October 2024; pp. 929–935. [Google Scholar]
  27. Bunzel, N.; Frick, R.A.; Klause, G.; Schwarte, A.; Honermann, J. Signals Are All You Need: Detecting and Mitigating Digital and Real-World Adversarial Patches Using Signal-Based Features. In Proceedings of the 2nd ACM Workshop on Secure and Trustworthy Deep Learning Systems, Singapore, 2 July 2024; pp. 24–34. [Google Scholar]
  28. Tarchoun, B.; Ben Khalifa, A.; Mahjoub, M.A.; Abu-Ghazaleh, N.; Alouani, I. Jedi: Entropy-based localization and removal of adversarial patches. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 4087–4095. [Google Scholar]
  29. Chen, J.; Wei, X. Defending Adversarial Patches via Joint Region Localizing and Inpainting. arXiv 2023, arXiv:2307.14242. [Google Scholar] [CrossRef]
  30. Huang, Y.; Li, Y. Zero-shot certified defense against adversarial patches with vision transformers. arXiv 2021, arXiv:2111.10481. [Google Scholar]
  31. Metzen, J.H.; Yatsura, M. Efficient certified defenses against patch attacks on image classifiers. arXiv 2021, arXiv:2102.04154. [Google Scholar] [CrossRef]
  32. Brendel, W.; Bethge, M. Approximating cnns with bag-of-local-features models works surprisingly well on imagenet. arXiv 2019, arXiv:1904.00760. [Google Scholar]
  33. Zhang, Z.; Yuan, B.; McCoyd, M.; Wagner, D. Clipped bagnet: Defending against sticker attacks with clipped bag-of-features. In Proceedings of the 2020 IEEE Security and Privacy Workshops (SPW), San Francisco, CA, USA, 21 May 2020; pp. 55–61. [Google Scholar]
  34. Xiang, C.; Bhagoji, A.N.; Sehwag, V.; Mittal, P. {PatchGuard}: A provably robust defense against adversarial patches via small receptive fields and masking. In Proceedings of the 30th USENIX Security Symposium (USENIX Security 21), Vancouver, BC, Canada, 11–13 August 2021; pp. 2237–2254. [Google Scholar]
  35. Xiang, C.; Mittal, P. Patchguard++: Efficient provable attack detection against adversarial patches. arXiv 2021, arXiv:2104.12609. [Google Scholar]
  36. Xiang, C.; Mittal, P. Detectorguard: Provably securing object detectors against localized patch hiding attacks. In Proceedings of the 2021 ACM SIGSAC Conference on Computer and Communications Security, Virtual, 15–19 November 2021; pp. 3177–3196. [Google Scholar]
  37. Levine, A.; Feizi, S. (De) randomized smoothing for certifiable defense against patch attacks. Adv. Neural Inf. Process. Syst. 2020, 33, 6465–6475. [Google Scholar]
  38. Lecuyer, M.; Atlidakis, V.; Geambasu, R.; Hsu, D.; Jana, S. Certified robustness to adversarial examples with differential privacy. In Proceedings of the 2019 IEEE Symposium on Security and Privacy (SP), San Francisco, CA, USA, 23 May 2019; pp. 656–672. [Google Scholar]
  39. Lin, W.Y.; Sheikholeslami, F.; Shi, J.; Rice, L.; Kolter, J.Z. Certified robustness against physically-realizable patch attack via randomized cropping. In Proceedings of the ICLR 2021 Conference Blind Submission, Virtual, 5 October 2021. [Google Scholar]
  40. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  41. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1137–1149. [Google Scholar] [CrossRef] [PubMed]
  42. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 213–229. [Google Scholar]
  43. Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2818–2826. [Google Scholar]
  44. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  45. Chollet, F. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1251–1258. [Google Scholar]
  46. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
  47. Thys, S.; Van Ranst, W.; Goedemé, T. Fooling automated surveillance cameras: Adversarial patches to attack person detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Long Beach, CA, USA, 16–20 June 2019. [Google Scholar]
  48. Shrestha, S.; Pathak, S.; Viegas, E.K. Towards a robust adversarial patch attack against unmanned aerial vehicles object detection. In Proceedings of the 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Detroit, MI, USA, 1–5 October 2023; pp. 3256–3263. [Google Scholar]
  49. Saha, A.; Subramanya, A.; Patil, K.; Pirsiavash, H. Adversarial patches exploiting contextual reasoning in object detection. arXiv 2019, arXiv:1910.00068. [Google Scholar]
  50. Zolfi, A.; Kravchik, M.; Elovici, Y.; Shabtai, A. The translucent patch: A physical and universal attack on object detectors. In Proceedings of the IEEE/CVF Conference On Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 15232–15241. [Google Scholar]
  51. Rao, S.; Stutz, D.; Schiele, B. Adversarial training against location-optimized adversarial patches. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 429–448. [Google Scholar]
  52. Papernot, N.; McDaniel, P.; Sinha, A.; Wellman, M. Towards the science of security and privacy in machine learning. arXiv 2016, arXiv:1611.03814. [Google Scholar] [CrossRef]
  53. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2881–2890. [Google Scholar]
  54. Madry, A.; Makelov, A.; Schmidt, L.; Tsipras, D.; Vladu, A. Towards deep learning models resistant to adversarial attacks. arXiv 2017, arXiv:1706.06083. [Google Scholar]
  55. Girshick, R. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
  56. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Region-based convolutional networks for accurate object detection and segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 38, 142–158. [Google Scholar] [CrossRef] [PubMed]
  57. He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 1904–1916. [Google Scholar] [CrossRef]
  58. Dai, J.; Li, Y.; He, K.; Sun, J. R-fcn: Object detection via region-based fully convolutional networks. Adv. Neural Inf. Process. Syst. 2016, 29, 379–387. [Google Scholar]
  59. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969. [Google Scholar]
  60. Li, H.; Zhang, S.; Li, X.; Su, L.; Huang, H.; Jin, D.; Chen, L.; Huang, J.; Yoo, J. Detectornet: Transformer-enhanced spatial temporal graph neural network for traffic prediction. In Proceedings of the 29th International Conference on Advances in Geographic Information Systems, Beijing, China, 2–5 November 2021; pp. 133–136. [Google Scholar]
  61. Sermanet, P.; Eigen, D.; Zhang, X.; Mathieu, M.; Fergus, R.; LeCun, Y. Overfeat: Integrated recognition, localization and detection using convolutional networks. arXiv 2013, arXiv:1312.6229. [Google Scholar]
  62. Barbier, J.; Filiol, É.; Mayoura, K. Universal JPEG steganalysis in the compressed frequency domain. In Proceedings of the International Workshop on Digital Watermarking, Jeju Island, Republic of Korea, 8–10 November 2006; pp. 253–267. [Google Scholar]
  63. Zhao, X.; Ding, W.; An, Y.; Du, Y.; Yu, T.; Li, M.; Tang, M.; Wang, J. Fast segment anything. arXiv 2023, arXiv:2306.12156. [Google Scholar] [CrossRef]
  64. Zhu, M. Recall, Precision and Average Precision; Department of Statistics and Actuarial Science, University of Waterloo: Waterloo, ON, USA, 2004; Volume 2, p. 6. [Google Scholar]
  65. Banerjee, A.; Chitnis, U.B.; Jadhav, S.L.; Bhawalkar, J.S.; Chaudhury, S. Hypothesis testing, type I and type II errors. Ind. Psychiatry J. 2009, 18, 127–131. [Google Scholar] [CrossRef]
  66. Oksuz, K.; Cam, B.C.; Akbas, E.; Kalkan, S. Localization recall precision (LRP): A new performance metric for object detection. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 504–519. [Google Scholar]
  67. Zhu, P.; Wen, L.; Du, D.; Bian, X.; Fan, H.; Hu, Q.; Ling, H. Detection and tracking meet drones challenge. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 7380–7399. [Google Scholar] [CrossRef]
68. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common objects in context. In Computer Vision–ECCV 2014, Proceedings of the 13th European Conference, Zurich, Switzerland, 6–12 September 2014; Proceedings, Part V; Springer: Cham, Switzerland, 2014; pp. 740–755. [Google Scholar]
69. Everingham, M.; Van Gool, L.; Williams, C.K.; Winn, J.; Zisserman, A. The PASCAL visual object classes (VOC) challenge. Int. J. Comput. Vis. 2010, 88, 303–338. [Google Scholar] [CrossRef]
70. Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; et al. ImageNet large scale visual recognition challenge. Int. J. Comput. Vis. 2015, 115, 211–252. [Google Scholar] [CrossRef]
  71. Dalal, N.; Triggs, B. Histograms of oriented gradients for human detection. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), San Diego, CA, USA, 20–26 June 2005; Volume 1, pp. 886–893. [Google Scholar]
72. Shah, S.; Dey, D.; Lovett, C.; Kapoor, A. AirSim: High-fidelity visual and physical simulation for autonomous vehicles. In Field and Service Robotics: Results of the 11th International Conference; Springer: Cham, Switzerland, 2017; pp. 621–635. [Google Scholar]
  73. Lee, M.; Kolter, Z. On physical adversarial patches for object detection. arXiv 2019, arXiv:1906.11897. [Google Scholar] [CrossRef]
  74. Hoory, S.; Shapira, T.; Shabtai, A.; Elovici, Y. Dynamic adversarial patch for evading object detection models. arXiv 2020, arXiv:2010.13070. [Google Scholar] [CrossRef]
  75. Wang, Y.; Lv, H.; Kuang, X.; Zhao, G.; Tan, Y.A.; Zhang, Q.; Hu, J. Towards a physical-world adversarial patch for blinding object detection models. Inf. Sci. 2021, 556, 459–471. [Google Scholar] [CrossRef]
  76. Shack, J.; Petrovic, K.; Saukh, O. Breaking the Illusion: Real-world Challenges for Adversarial Patches in Object Detection. arXiv 2024, arXiv:2410.19863. [Google Scholar]
77. Hu, Y.C.T.; Kung, B.H.; Tan, D.S.; Chen, J.C.; Hua, K.L.; Cheng, W.H. Naturalistic physical adversarial patch for object detectors. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 7848–7857. [Google Scholar]
Figure 1. Segment-and-Recover (SAR) patch-agnostic defense framework.
Figure 2. PR curves of the SAR defense with various detectors on the PASCAL VOC [69] dataset.
Figure 3. Using the AirSim simulation environment [72], we simulated a scenario in which a quadrotor drone tracks a car in an urban setting with its onboard camera. The YOLO model [40] processes each image frame and localizes the car with respect to the drone, and the tracking controller uses this estimate to move the drone. In our simulations, the vision-based object tracking is subjected to adaptive patch attacks, in which we overlay an adversarial patch on the target to mislead the victim object detector within the vision-based tracking system. Finally, we apply SAR to preprocess the attacked image frames.
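The loop described in this caption can be summarized in a few lines of Python. The sketch below is our illustration, not the authors' released code: it assumes the ultralytics YOLO API, and `sar_preprocess` and `compute_velocity_command` are hypothetical placeholders standing in for the SAR defense and the tracking controller.

```python
# Minimal sketch of the simulated tracking loop (illustrative only).
from ultralytics import YOLO

model = YOLO("yolo11n.pt")  # any pretrained YOLO weights

def sar_preprocess(frame):
    # Placeholder for SAR: segment the adversarial patch (frontend)
    # and inpaint the masked pixels (backend). Identity stand-in here.
    return frame

def compute_velocity_command(cx, cy, img_w=640, img_h=480):
    # Placeholder proportional controller steering toward the target.
    return (0.001 * (cx - img_w / 2), 0.001 * (cy - img_h / 2))

def track_step(frame):
    clean = sar_preprocess(frame)          # defend before detecting
    result = model(clean)[0]
    cars = [b for b in result.boxes
            if result.names[int(b.cls)] == "car"]
    if not cars:
        return None                        # target lost this frame
    best = max(cars, key=lambda b: float(b.conf))
    cx, cy, _, _ = best.xywh[0].tolist()   # target center in pixels
    return compute_velocity_command(cx, cy)
```

Each frame is cleaned first, detected second, and only then handed to the controller, so a successful patch removal directly restores the tracking command.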
Figure 4. Provable robustness accuracies of defenses with different detectors on the patch-attacked VisDrone dataset [67].
Figure 5. Visualization examples illustrating the patch localization process across different adversarial patch types. Yellow highlights in the third column denote the adversarial pixels.
Figure 6. Visualization examples illustrating the patch localization process across different adversarial patch types. Yellow highlights in the third column denote the adversarial pixels.
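The yellow overlay in the third column of these figures can be rendered with a couple of NumPy operations. The sketch below illustrates only the visualization, not SAR's localization step, and the function name is ours.

```python
# Paint pixels flagged as adversarial (a binary mask) yellow on top of
# the input image, an illustrative rendering helper, not SAR itself.
import numpy as np

def overlay_patch_mask(image_bgr: np.ndarray, mask: np.ndarray,
                       alpha: float = 0.6) -> np.ndarray:
    """Blend yellow into the masked region; `mask` is an HxW boolean array."""
    out = image_bgr.astype(np.float32)
    yellow = np.array([0.0, 255.0, 255.0], dtype=np.float32)  # BGR yellow
    out[mask] = (1.0 - alpha) * out[mask] + alpha * yellow    # alpha blend
    return out.astype(np.uint8)
```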
Table 1. (Top row) Due to the added patches, PAD [20] (the baseline approach) yields significantly degraded inference accuracy when evaluated using three different object detection models: YOLOv11 [40], Faster R-CNN [41], and DETR [42]. (Bottom row) The inference results of SAR (our approach), which detects the objects with a high level of confidence.

        YOLOv11x [40]     Faster R-CNN [41]   DETR [42]
PAD     (example image)   (example image)     (example image)
SAR     (example image)   (example image)     (example image)
Table 2. Table of notations.

Notation                    Description
X ⊆ ℝ^(W×H×C)               image space
Y = {0, 1, …, N − 1}        label space
M(x): X → Y                 model predictor for an input x ∈ X
A(x)                        constraint set
p ∈ P ⊆ {0, 1}^(W×H)        binary pixel block
a_m                         adversarial patch mask map
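With this notation, the patch threat model is commonly written as the set of images obtained by overwriting one pixel block of a clean image with arbitrary content. The formulation below is an illustrative reconstruction consistent with Table 2, not an equation quoted from the paper:

```latex
% Illustrative patch threat model consistent with Table 2's notation
% (a reconstruction, not quoted from the paper):
\[
  \mathcal{A}(x) = \bigl\{\, (\mathbf{1} - p) \odot x + p \odot x'
                   \;:\; p \in \mathcal{P},\; x' \in \mathcal{X} \,\bigr\},
\]
% where $\odot$ denotes element-wise multiplication: pixels inside the
% binary block $p$ are replaced by adversarial content $x'$, and all
% other pixels of $x$ are left unchanged.
```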
Table 3. CR@0.5 values under various adversarial patch attacks. The bold numbers highlight the best performances.

Detector       Defense        Clean   FAR     Printable Patch (EAVISE [47])   Localized Noise (DPatch [1])   Adaptive Patch (Ad_yolo [48])
                                              OBJ-CLS   OBJ     CLS           40×40   75×75   100×100        20%    30%    40%
DETR [42]      Undefended     0.66    n/a     0.53      0.57    0.61          0.55    0.39    0.55           0.30   0.53   0.36
               LGS [24]       0.55    0.083   0.42      0.56    0.60          0.32    0.33    0.57           0.43   0.35   0.25
               Jedi [28]      0.39    0.07    0.62      0.58    0.48          0.55    0.39    0.55           0.61   0.46   0.35
               RLID [29]      0.62    0.022   0.54      0.69    0.38          0.69    0.68    0.41           0.46   0.52   0.55
               Jujutsu [25]   0.58    0.068   0.62      0.58    0.68          0.32    0.33    0.57           0.43   0.35   0.25
               SAR (Ours)     0.70    0.002   0.82      0.86    0.70          0.69    0.68    0.74           0.66   0.62   0.75
YOLOv11 [40]   Undefended     0.57    n/a     0.61      0.46    0.35          0.35    0.51    0.49           0.64   0.69   0.58
               LGS [24]       0.71    0.019   0.33      0.25    0.34          0.37    0.32    0.35           0.59   0.70   0.62
               Jedi [28]      0.62    0.021   0.35      0.44    0.26          0.35    0.49    0.37           0.68   0.34   0.61
               RLID [29]      0.61    0.07    0.46      0.62    0.55          0.49    0.68    0.55           0.64   0.69   0.58
               Jujutsu [25]   0.59    0.065   0.43      0.35    0.25          0.32    0.33    0.57           0.62   0.58   0.68
               SAR (Ours)     0.74    0.003   0.75      0.66    0.61          0.55    0.71    0.74           0.72   0.76   0.70
Faster         Undefended     0.61    n/a     0.35      0.51    0.49          0.30    0.53    0.36           0.53   0.57   0.61
R-CNN [41]     LGS [24]       0.57    0.083   0.37      0.32    0.35          0.66    0.46    0.55           0.54   0.69   0.70
               Jedi [28]      0.35    0.022   0.35      0.49    0.37          0.35    0.44    0.26           0.68   0.74   0.61
               RLID [29]      0.57    0.316   0.29      0.68    0.44          0.33    0.25    0.34           0.59   0.78   0.62
               Jujutsu [25]   0.62    0.019   0.32      0.33    0.57          0.43    0.35    0.25           0.62   0.58   0.68
               SAR (Ours)     0.65    0.006   0.65      0.69    0.75          0.71    0.62    0.55           0.82   0.86   0.88
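The "@0.5" in CR@0.5 refers, as is conventional in detection metrics, to an intersection-over-union (IoU) threshold of 0.5 between predicted and ground-truth boxes. Below is a minimal sketch of such a recall-style computation; this is our illustration, and the paper's exact evaluation script may differ.

```python
# Minimal IoU-at-0.5 matching, the test underlying recall-style
# metrics such as CR@0.5 (illustrative, not the paper's eval code).
def iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def recall_at_50(predictions, ground_truths):
    """Fraction of ground-truth boxes matched by some prediction at IoU >= 0.5."""
    hits = sum(any(iou(p, g) >= 0.5 for p in predictions)
               for g in ground_truths)
    return hits / max(len(ground_truths), 1)
```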
Table 4. Patch localization recall (%). The bold numbers highlight the best performances.

Patch + Dataset                      SAR     PAD     SentiNet   Jujutsu
LaVAN [11] + ImageNet [70]           87.2    44.5    39.60      12.17
DPatch [1] + Pascal VOC [69]         91.47   39.60   26.94      28.03
EAVISE [47] + Inria [71]             74.20   28.03   35.02      27.08
Ad_yolo [48] + VisDrone-2019 [67]    93.29   10.85   19.22      34.30
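One plausible reading of this metric, which we assume here for illustration, is pixel-level recall of the localized patch region: the fraction of true adversarial pixels that the defense flags. Under that reading it can be computed as follows.

```python
# Pixel-level patch localization recall (an assumed reading of
# Table 4's metric; the paper's exact protocol may differ).
import numpy as np

def localization_recall(pred_mask: np.ndarray, gt_mask: np.ndarray) -> float:
    """pred_mask, gt_mask: HxW boolean arrays (True = adversarial pixel)."""
    true_positives = np.logical_and(pred_mask, gt_mask).sum()
    return float(true_positives) / max(int(gt_mask.sum()), 1)
```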
Table 5. Per-example runtime breakdown (in milliseconds).

          Base Detector   Objectness   Objectness   Inpainting   SAR
          (YOLOv11)       Predictor    Detector     (LaMa)       (YOLOv11)
Small     69.0            0.2          0.55         54.2         54.95
Medium    32.2            0.4          0.25         56.5         57.15
Large     54.8            0.3          0.35         56.2         56.85
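Note that the SAR column equals the sum of the three preceding SAR stages (e.g., 0.2 + 0.55 + 54.2 = 54.95 ms for the small input), so inpainting dominates the defense overhead. A breakdown of this kind can be collected by wall-clocking each stage separately; the sketch below is illustrative, and the commented-out stage functions are hypothetical placeholders, not the paper's implementation.

```python
# Illustrative per-stage timing harness.
import time

def timed(fn, *args):
    """Run fn(*args) and return (output, elapsed time in milliseconds)."""
    t0 = time.perf_counter()
    out = fn(*args)
    return out, (time.perf_counter() - t0) * 1e3

# Example usage with placeholder pipeline stages:
# mask,  t_pred = timed(objectness_predictor, frame)
# boxes, t_det  = timed(objectness_detector, mask)
# clean, t_inp  = timed(lama_inpaint, frame, mask)
# sar_total_ms  = t_pred + t_det + t_inp   # matches the SAR column
```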
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
