Article

Robust Object Detection Under Adversarial Patch Attacks in Vision-Based Navigation

by Haotian Gu 1, Hyung Jin Yoon 2 and Hamidreza Jafarnejadsani 1,*
1 Department of Mechanical Engineering, Stevens Institute of Technology, Hoboken, NJ 07030, USA
2 Department of Mechanical Engineering, Tennessee Technological University, Cookeville, TN 38505, USA
* Author to whom correspondence should be addressed.
Automation 2025, 6(3), 44; https://doi.org/10.3390/automation6030044
Submission received: 11 July 2025 / Revised: 7 August 2025 / Accepted: 2 September 2025 / Published: 9 September 2025
(This article belongs to the Section Robotics and Autonomous Systems)

Abstract

In vision-guided autonomous robots, object detectors play a crucial role in perceiving the environment for path planning and decision-making. However, adaptive adversarial patch attacks undermine the resilience of detector-based systems, and strengthening object detectors against such attacks enhances the robustness of navigation systems. Existing defenses against patch attacks are primarily designed for stationary scenes and struggle against adaptive patch attacks that vary in scale, position, and orientation in dynamic environments. In this paper, we introduce Ad_YOLO+, an efficient and effective plugin that extends Ad_YOLO to defend against white-box patch-based image attacks. Built on YOLOv5x with an additional patch detection layer, Ad_YOLO+ is trained on a specially crafted adversarial dataset (COCO-VisDrone-2019). Unlike conventional methods that rely on redundant image preprocessing, our approach directly detects adversarial patches and the objects they overlay. Experiments on the adversarial training dataset demonstrate that Ad_YOLO+ improves both provable robustness and clean accuracy. Ad_YOLO+ achieves 85.4% top-1 clean accuracy on the COCO dataset and 74.63% top-1 provable robust accuracy against square pixel patches placed anywhere in the image on the COCO-VisDrone-2019 dataset. Moreover, under adaptive attacks in AirSim simulations, Ad_YOLO+ reduces the attack success rate, ensuring tracking resilience in both dynamic and static settings. Additionally, it generalizes well to patches generated with other detector weight configurations.

1. Introduction

The emergence of Convolutional Neural Networks (CNNs) and Deep Neural Networks (DNNs) has revolutionized computer vision and natural language processing (NLP), delivering remarkable results. These models are widely applied in autonomous vehicles, facial recognition, and medical image analysis, enabling real-time object classification and detection. However, object detectors remain vulnerable to patch-hiding attacks, which aim to disable detection models. These attacks can be executed in the physical world by attaching adversarial patches to real-world objects [1] or dynamically placing patches based on an object’s location within an image frame. Such threats pose significant risks to vision-based tracking systems in autonomous robots, potentially compromising their reliability and safety [2].
Defenses against adversarial patch attacks on object detectors can be categorized into five main types: (i) locating and eliminating adversarial patches [3,4,5,6,7,8,9]: methods such as PAD [6], PatchZero [7], and SAC [5] focus on identifying and removing adversarial regions. (ii) Detecting and neutralizing adversarial patches [10,11,12,13,14,15,16,17]: techniques like UDFilter [14], local gradient smoothing [10], DIFFender [15], PatchCleanser [16], and PatchCURE [17] aim to detect and mitigate patch effects. (iii) Detecting and inpainting adversarial patches [18,19,20]: approaches such as Jedi [18], Adv-Inpainting [19], and RLID [20] replace adversarial regions with reconstructed image content. (iv) Certifiably robust defenses [21,22,23,24,25,26,27,28,29], represented by PatchGuard [29]. (v) Detection-based methods, represented by Ad_YOLO [30], which leverage adversarial training to improve robustness. The first four categories rely on image preprocessing techniques, while the fifth is based on adversarial training. Despite the extensive research on adversarial patch attacks and defenses, most existing methods focus on digital attacks and classification tasks. Securing object detectors remains particularly challenging due to the diversity of patch generation strategies, as well as variations in patch scale and placement.
In this paper, we propose Ad_YOLO+, an efficient and effective object detection approach designed to defend against white-box patch-enabled image attacks. The core idea of Ad_YOLO+ is to leverage adversarial training to develop a model capable of detecting both adversarial patches and target objects simultaneously—without requiring image preprocessing. Our method is tailored for dynamic environments, such as vision-based object tracking, where an attacker overlays printed adversarial patterns of varying sizes and locations onto detected targets in each image frame. This specially crafted robust adversarial patch accounts for physical variations and object scale fluctuations across frames, ensuring strong attack effectiveness against both dynamic and static objects. We address two key research questions: (1) Patch-agnostic detection—How can we detect adaptive adversarial patches in real time without prior knowledge of their size or location? (2) Certified robustness—How can we achieve provable robustness while maintaining high clean performance? The Ad_YOLO+ model is derived in three steps, as outlined in the following sections. An overview of our approach is illustrated in Figure 1.
  • Step 1: Adversarial Training Dataset Generation. To enable Ad_YOLO+ to detect both adversarial patches and target objects simultaneously, we constructed a specialized adversarial training dataset comprising patch-attacked images and the COCO dataset [31]. Using pretrained YOLOv5s weights, we first detect objects in each image frame and then adaptively apply adversarial patches to the detected bounding boxes, generating adversarial samples. The patch dataset is derived from VisDrone-2019 [32] and includes 80 non-targeted image attacks, ensuring diverse and challenging adversarial scenarios.
  • Step 2: Patch Category Preparation. To detect adversarial patches as a distinct category (e.g., both the adversarial patch and the attacked target in Table 1 are identified by Ad_YOLO+), we introduce a new patch category in the training YAML file, enabling precise detection of adversarial patches (see the sketch after this list).
  • Step 3: Training Ad_YOLO+ model. We use YOLOv5x as the backbone model and incorporate an additional patch detection layer as the final layer. The overall Ad_YOLO+ architecture includes a detection head and neck structure. The training schematic is illustrated in Figure 1. The network is trained for 300 epochs on a single GPU with a batch size of 16, using a specially designed adversarial training and validation dataset. The optimization process combines objectiveness loss, localization loss, and classification loss to enhance detection performance.
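As a concrete illustration of Step 2, the snippet below registers an extra patch class in a YOLOv5-style dataset YAML. The file names and the exact structure of the class list are assumptions for illustration, not the configuration files used in our experiments.

```python
# Minimal sketch: append a "patch" category to a YOLOv5-style dataset YAML.
# File names are placeholders; the real training configuration may differ.
import yaml  # PyYAML

with open("coco_visdrone.yaml") as f:
    cfg = yaml.safe_load(f)

cfg["names"] = list(cfg.get("names", [])) + ["patch"]  # add the adversarial-patch class
cfg["nc"] = len(cfg["names"])                          # keep the class count consistent

with open("coco_visdrone_patch.yaml", "w") as f:
    yaml.safe_dump(cfg, f, sort_keys=False)
```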
The main contributions of the paper are as follows:
  • We propose Ad_YOLO+, an object detection model designed to be robust against adversarial patches of varying sizes and locations in a vision-based tracking task. By treating adversarial patches as a distinct category during training, Ad_YOLO+ effectively identifies both adversarial patches and target objects simultaneously.
  • We evaluate Ad_YOLO+ in the AirSim virtual simulation environment. Our results show that Ad_YOLO+ achieves AP0.5 = 71.44% for tracking targets, significantly outperforming YOLOv5s (AP0.5 = 8.76%), while also detecting adversarial patches with AP0.5 = 85.35% on the specially generated adversarial training dataset (AP0.5 denotes average precision at an intersection-over-union (IoU) threshold of 0.5).
  • We implement a verification procedure to assess whether Ad_YOLO+ provides provable robustness against adaptive adversarial patches in real time under a defined threat model. Additionally, our comparison results demonstrate that Ad_YOLO+ outperforms the evaluated state-of-the-art approaches in defense against patch attacks, particularly in terms of object detection accuracy.

2. Relevant Work

Given the vulnerability of well-trained CNNs and DNNs to adversarial attacks, a crucial research direction is enhancing their reliability. In recent years, numerous defense strategies have been proposed [34,35], but most struggle to adapt to increasingly complex threats, such as adaptive adversarial patches. Additionally, many existing defenses are limited to models trained on datasets closely aligned with specific attack generation schemes [36,37]. While image denoising-based methods are effective in mitigating invisible perturbations for static image classification, they pose challenges in real-time applications like vision-based navigation. The computational overhead of image preprocessing, combined with the vulnerability of multi-task learning systems, can significantly impact the performance and responsiveness of vision-based models in dynamic environments.

2.1. Detection-Based Defense Methods for Adversarial Patches

Several detection- and removal-based methods, such as PAD [6], PatchZero [7], and SAC [5], leverage high-frequency signatures to detect adversarial patches, remove them from input images, and then pass the masked images to object detectors. SAC [5], a U-Net-based patch segmentation method, is designed to protect object detectors from patch attacks. It is trained on pre-generated adversarial images to remove untargeted patch attacks that vary in size and location. This self-adversarial training approach is effective when the patch segmenter’s output is within a certain Hamming distance of the ground-truth patch masks. However, the method faces limitations when dealing with targeted attacks that disable localization and label inference, and it performs poorly when encountering natural-looking patches that were not included in the training dataset, resulting in near-zero Patch Localization Recall.

2.2. Adversarial Training Method

Adversarial training is an effective brute-force method for enhancing model robustness by incorporating malicious examples alongside clean data during training. This approach helps the model predict the same object label for both legitimate and perturbed examples during testing. For visible perturbations, such as adversarial patches, Ad_YOLO [30] was proposed to detect both the adversarial patch and its targeted object. The training dataset for Ad_YOLO combines the COCO dataset with adversarially patched images. However, models trained using patch-based adversarial training are limited to defending against specific adversarial schemes. Moreover, adversarial training increases computational resource usage and training complexity. Additionally, these models may not be adaptive to all types of malicious attacks, as they block attacks that resemble the adversarial examples seen during training while leaving other vulnerabilities exposed to new white-box attackers. Changes to the loss function during adversarial training can also lead to unexpected drops in prediction confidence for legitimate objects [38].

2.3. Gradient Masking

Gradient masking is a technique that modifies input samples using smoothing functions to reduce the adversarial effect without altering the underlying DNN [39]. Recently, this approach has been applied in both classification and detection tasks to defend against adversarial patches. Naseer et al. [10] proposed local gradient smoothing (LGS), which identifies and removes areas with abnormally high-frequency gradients in the input image, treating these regions as adversarial patches. Similarly, Yu et al. introduced the feature norm clipping (FNC) defense, which adaptively clips deep feature norms affected by universal adversarial patches. However, the effectiveness of these defenses is heavily dependent on the proper selection of thresholds, making them sensitive to tuning.
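For intuition, a rough, self-contained sketch of the local-gradient-smoothing idea is given below: estimate first-order image gradients, find regions with unusually dense high gradients, and attenuate them before detection. The window size, threshold, and damping factor are illustrative placeholders rather than the values used in [10].

```python
# Rough sketch of local gradient smoothing (LGS): suppress image regions whose
# local gradient density is abnormally high, since adversarial patches tend to
# concentrate high-frequency content. All constants below are illustrative.
import torch
import torch.nn.functional as F

def local_gradient_smoothing(img, window=15, thresh=0.1, damp=0.0):
    """img: (1, 3, H, W) tensor with values in [0, 1]."""
    gray = img.mean(dim=1, keepdim=True)                       # grayscale estimate
    gx = gray[..., :, 1:] - gray[..., :, :-1]                  # horizontal gradient
    gy = gray[..., 1:, :] - gray[..., :-1, :]                  # vertical gradient
    grad = F.pad(gx.abs(), (0, 1, 0, 0)) + F.pad(gy.abs(), (0, 0, 0, 1))
    density = F.avg_pool2d(grad, window, stride=1, padding=window // 2)
    mask = (density > thresh).float()                          # suspected patch regions
    return img * (1 - mask) + img * mask * damp                # damp (or zero out) those pixels

smoothed = local_gradient_smoothing(torch.rand(1, 3, 480, 640))
```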

3. The Problem Statement and Proposed Method

In this section, we introduce the patch generation scheme and adaptive patch attack in a vision-based tracking task.

3.1. Adversarial Adaptive Patch Generation

Attack Capability: The goal of the adversarial attack is to prevent the object detector from accurately identifying the specific tracking target. The attacker aims to use the adversarial patch to reduce the detection confidence below the identification threshold. In a navigation scenario, the patch adaptively locates the target object within continuous image frames, overlaying a patch scaled to 30% of the bounding box size, so that the patch does not completely obscure the object yet still causes the object-tracking controller to lose the target. The integration of the adaptive patch into the object-tracking framework is illustrated in Figure 2. We use YOLOv5 [33] as the baseline object detector for the tracking control system.
Patch Generation Scheme [41]: We employ an adaptive patch (ADpatch) generation approach based on [41]. To train the malicious patch, our goal is to identify the pattern P̂_u that maximizes the loss of the object detector relative to the true class label ŷ and bounding box label B̂. Specifically, the patch is optimized with respect to the probability of the detected object class: a maximum-probability extractor returns the highest class probability among the target classes, which the patch training then suppresses to hide the object. During training, the patch, derived from the pixels selected by the probability extractor, is applied to the image frame at a random position, scale, and rotation, making it robust against shifting, scaling, and rotation. The YOLOv5 object detector model is involved in the optimization process.
Adversarial Patch Transformation and Training Loss: For patch training, we use the VisDrone-2019 [32] dataset, which consists of thousands of static images at various resolutions and scenes acquired by UAV platforms and split into separate training, validation, and test sets. The adversarial patch is generated by optimizing the following loss:

$L_{patch} = \alpha L_{obj} + \beta L_{tv} + \gamma L_{nps}$    (1)

The objectness score loss L_obj reduces the class score predicted by the detector. The total variation loss L_tv enhances the smoothness of the adversarial patch. The non-printability score loss L_nps measures how well a printer can physically reproduce the colors in the patch; reducing this loss allows the adversarial patch to be printed in the physical world. In [41], the physical adversarial patch is generated using detections of real-world cars, trucks, buses, and people by maximizing the object detector loss function:

$P(x, l) = \arg\max_{P \in \{P : \|P\| \le \epsilon\}} L_{patch}\left(h(A(x, l, P)); y\right)$,    (2)

where h denotes the object detector, A(x, l, P) is a function that applies the patch P at location l of the input image x, y denotes the set of image objects and bounding boxes, ε is the attack budget, and L_patch represents the patch optimization loss function.
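To make Equations (1) and (2) concrete, the PyTorch sketch below optimizes a randomly initialized patch so that the detector's confidence on the patched image drops while the pattern stays smooth and approximately printable. The detector is a tiny stand-in module and the applier pastes the patch at a fixed location; in the actual pipeline the loss is computed through YOLOv5 and the applier includes random shift, scale, and rotation, so all names and weights below are illustrative assumptions.

```python
# Sketch of the patch optimization behind Eqs. (1)-(2); stand-in components only.
import torch
import torch.nn as nn

detector = nn.Sequential(                      # stand-in for h(.), returns an "objectness" score
    nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 1), nn.Sigmoid())

patch = torch.rand(3, 64, 64, requires_grad=True)             # adversarial pattern P
opt = torch.optim.Adam([patch], lr=0.01)
alpha, beta, gamma = 1.0, 2.5, 0.01                           # illustrative loss weights
printable = torch.tensor([[0., 0., 0.], [1., 1., 1.], [1., 0., 0.]])  # tiny printable palette

def total_variation(p):                                       # L_tv: patch smoothness
    return (p[:, :, 1:] - p[:, :, :-1]).abs().mean() + (p[:, 1:, :] - p[:, :-1, :]).abs().mean()

def nps(p):                                                   # L_nps: distance to printable colors
    flat = p.permute(1, 2, 0).reshape(-1, 1, 3)
    return (flat - printable.unsqueeze(0)).norm(dim=-1).min(dim=-1).values.mean()

def apply_patch(img, p, top=80, left=80):                     # A(x, l, P): paste P at location l
    out = img.clone()
    out[:, :, top:top + p.shape[1], left:left + p.shape[2]] = p
    return out

for _ in range(100):                                          # patch training iterations
    img = torch.rand(1, 3, 256, 256)                          # placeholder clean frame
    patched = apply_patch(img, patch.clamp(0, 1))
    loss_obj = detector(patched).mean()                       # highest class/objectness probability
    loss = alpha * loss_obj + beta * total_variation(patch) + gamma * nps(patch)
    opt.zero_grad()
    loss.backward()                                           # lowering loss_obj hides the target
    opt.step()
```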
Patch Adaptation Scheme: We accommodate the dynamic conditions of the physical world while optimizing the adversarial patches. Because the scale of the object varies across image frames, we also place the adversarial patch at the proper position with an adaptive size. For a patch on the target, our goal is to overlay the adversarial patch at the center of the target with a proper size. To achieve this, we use the coordinates (x1, y1, x2, y2) of the detection result to compute the center coordinate l_p* of the adversarial patch as

$l_p^* = \left( \frac{x_1 + x_2}{2}, \frac{y_1 + y_2}{2} \right)$.    (3)

Then, considering the varied scales of the objects, a scale-adaptive patch scheme keeps the area of the adversarial patch at a fixed ratio r_s of the targeted object's bounding box area:

$r_s = \frac{w_p^* \times h_p^*}{w_t \times h_t}$,    (4)

where w_p* and h_p* are the patch width and height; for the square patches used here, $w_p^* = h_p^* = \sqrt{r_s \, w_t \, h_t}$. The adversarial patch applier function M_p* is formulated as

$M_p^* = P_A(p^*, l_p^*, w_p^*, h_p^*)$,    (5)

which pastes a visible adversarial pattern at the corresponding position with an adaptive size.
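A minimal sketch of Equations (3)–(5) is given below, assuming axis-aligned pixel boxes and a simple paste-based applier; the function and variable names are illustrative.

```python
# Sketch of the patch adaptation scheme: center a square patch on the detected
# box and size it so that it covers a fixed fraction r_s of the box area.
import torch
import torch.nn.functional as F

def adapt_and_paste(image, patch, box, r_s=0.30):
    """image: (3, H, W), patch: (3, h, w), box: (x1, y1, x2, y2) in pixels."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0              # l_p*: patch center, Eq. (3)
    w_t, h_t = x2 - x1, y2 - y1                             # target box width and height
    side = int((r_s * w_t * h_t) ** 0.5)                    # square patch with area r_s * w_t * h_t
    resized = F.interpolate(patch.unsqueeze(0), size=(side, side),
                            mode="bilinear", align_corners=False)[0]
    top, left = int(cy - side / 2), int(cx - side / 2)
    out = image.clone()
    out[:, top:top + side, left:left + side] = resized      # M_p*: paste at the adapted location
    return out

# Example: a 30% square patch centered on a detected box (illustrative values).
frame = torch.rand(3, 480, 640)
adv_patch = torch.rand(3, 64, 64)
attacked = adapt_and_paste(frame, adv_patch, box=(300, 200, 420, 280))
```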

3.2. Robust Object Detection Model: Ad_YOLO+

We developed the Ad_YOLO+ model to simultaneously identify adversarial patches and restore targeted objects. To enhance the model’s ability to defend against a wide range of object categories, we created a new dataset of patch-attacked images, which is combined with COCO to support the adversarial training process.

3.2.1. Ad_YOLO+ Framework

Our Ad_YOLO+ model builds upon Ad_YOLO and YOLOv5x architectures. Figure 1 illustrates the overall defense framework and the datasets used. The base model of Ad_YOLO+ is YOLOv5x, which integrates both detection and segmentation components. It retains the original 24 convolutional layers and 2 fully connected layers with randomly initialized weights, similar to YOLO, but with an additional patch detection layer. The final layer of the model predicts class probabilities, including a category for the adversarial patch, along with the normalized bounding box coordinates to localize the object within the image frame.
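As a lightweight starting point, a YOLOv5x detector with an extended class list (80 COCO categories plus one adversarial-patch class) can be instantiated through the public ultralytics/yolov5 torch.hub entry point, as sketched below; this only approximates the Ad_YOLO+ head, which additionally attaches a dedicated patch detection layer.

```python
import torch

# Load YOLOv5x with 81 output classes (80 COCO categories + 1 "patch" class).
# With a non-default class count, the hub loader keeps the compatible pretrained
# weights and re-initializes the mismatched detection-head layers.
model = torch.hub.load("ultralytics/yolov5", "yolov5x", classes=81, pretrained=True)
```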

3.2.2. VisDrone-2019 Dataset for Adversarial Training

Motivated by the adversarial training approach, we constructed a dataset that combines the COCO dataset [31] with specially generated patch-attacked images. We use the VisDrone dataset and the YOLOv5 model to generate the patch-attacked images, as shown in Figure 1. In this process, pixel alterations are applied within a bounded region to disrupt object tracking by either misclassifying objects or pushing the detection confidence below the target model's threshold. The adversarial patch P follows a square shape, occupying 30% of the target object's bounding box area. To build the adversarial training dataset, we apply the patch P to the location (x, y) identified by YOLOv5 in each input image, then scale and rotate it as needed. The process of generating this adversarial training dataset is outlined in Algorithm 1, with a Python sketch following the algorithm.
Algorithm 1 Adversarial training dataset generation.
Input: pretrained adversarial patch P, image size (W, H), object detector F, VisDrone dataset D_VisDrone, COCO dataset D_COCO.
Output: COCO-VisDrone adversarial training dataset.
1: procedure DatasetGeneration(P, F, D)
2:   coordinate ← F[image]
3:   imagesize ← |D_VisDrone|
4:   for j ← 1 to imagesize − 1 do
5:     if proper ratio ≤ 30% then
6:       (x1, y1, x2, y2) ← F[image]
7:       add P on image
8:     end if
9:   end for
10:  merge D_COCO and D_VisDrone
11:  return AdversarialTrainingDataset
12: end procedure
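A Python rendering of Algorithm 1 could look like the sketch below. The detector callback, dataset paths, and helper names are placeholders that mirror the pseudocode rather than our exact implementation; run_detector stands in for YOLOv5s inference returning pixel-coordinate boxes.

```python
# Sketch of Algorithm 1: patch every detected target in the VisDrone images,
# then merge the patched images with clean COCO images.
import shutil
from pathlib import Path
from PIL import Image

def paste_square_patch(img, patch_img, box, ratio=0.30):
    x1, y1, x2, y2 = box
    side = int((ratio * (x2 - x1) * (y2 - y1)) ** 0.5)       # 30% of the box area
    if side < 2:                                             # skip degenerate boxes
        return
    cx, cy = (x1 + x2) // 2, (y1 + y2) // 2
    img.paste(patch_img.resize((side, side)), (cx - side // 2, cy - side // 2))

def build_dataset(visdrone_dir, coco_dir, out_dir, patch_path, run_detector):
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    patch_img = Image.open(patch_path).convert("RGB")
    for p in Path(visdrone_dir).glob("*.jpg"):
        img = Image.open(p).convert("RGB")
        for box in run_detector(img):                        # YOLOv5s boxes (placeholder callback)
            paste_square_patch(img, patch_img, box)
        img.save(out / p.name)                               # patch-attacked sample
    for p in Path(coco_dir).glob("*.jpg"):                   # merge the clean COCO images
        shutil.copy(p, out / p.name)
    return out
```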

3.2.3. Adversarial Training Process

We now conduct the adversarial training optimization to obtain a robust detector.
$\arg\min_{\theta} \max_{(x, y) \in D} \left[ \max_{d(x, x^*) \le \epsilon_x} L_{\theta}(x, y) + \max_{P \in \{P : \|P\| \le \epsilon\}} L_{\theta}\left(h(A(x, l, P)); \hat{y}\right) \right]$,    (6)
where D is a dataset, x* represents the patch-attacked image, and x is the clean image. L_θ(x, y) inherits the full form of the YOLOv5 loss function, which includes bounding-box coordinate, object-confidence, and no-object terms. A(x, l, P) is the patch application function and ŷ is the patch category. During training, we optimize the loss function (6): the training algorithm iteratively updates θ in the outer minimization and, with θ fixed, evaluates the inner maximization. The adversarial training procedure is summarized in Algorithm 2.
This optimization addresses the limitations of patch size and quantity. We split the training into two stages. In the first stage, we use the COCO dataset to train the reference model; during this stage, the patch detection layer inside Ad_YOLO+ is excluded. In the second stage, we take the reference weights and train the model using the stochastic gradient descent (SGD) optimizer with a momentum of 0.937 and a weight decay of 4 × 10⁻⁴. The learning rate starts at 0.001, is linearly increased to 0.1 during the first 10 epochs, and then gradually decreases after each epoch using a learning rate scheduler. Training halts if there is no improvement in validation accuracy over 100 epochs. The network is trained on a single GPU for 300 epochs with a batch size of 16, using the specially designed adversarial training and validation datasets and a base learning rate of 10⁻³. To maintain inference accuracy on clean images, we split each batch into 80% clean images and 20% adversarial patch-attacked images.
Algorithm 2 Adversarial training of Ad_YOLO+.
Input: adversarial training dataset D; total iterations N; learning rate η.
Output: Ad_YOLO+ weights.
1: procedure Training(D, N, η)
2:   for i ← 0 to N do
3:     (x_i, y_i) ← D                              ▹ Sample a batch
4:     L ← L(f_θ(x_i*); y_i)                        ▹ Forward propagation to calculate the loss
5:     θ ← θ − η ∇_θ L(f_θ(x_i*); y_i)              ▹ Backward pass to update the parameters (SGD)
6:   end for
7: end procedure
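The loop in Algorithm 2, combined with the 80%/20% clean-to-adversarial batch split described above, can be sketched in PyTorch as follows. The model, the YOLOv5-style composite loss, and the two data loaders are assumptions supplied by the caller, and target formats are simplified to concatenable tensors for brevity.

```python
# Sketch of Algorithm 2 with mixed clean / patch-attacked batches.
import torch

def adversarial_train(model, yolo_loss, clean_loader, adv_loader, epochs=300,
                      lr=1e-3, momentum=0.937, weight_decay=4e-4, device="cuda"):
    model.to(device)
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=momentum,
                          weight_decay=weight_decay)
    for _ in range(epochs):
        for (x_clean, y_clean), (x_adv, y_adv) in zip(clean_loader, adv_loader):
            # Loaders are sized so that roughly 80% of each combined batch is clean.
            x = torch.cat([x_clean, x_adv]).to(device)
            y = torch.cat([y_clean, y_adv]).to(device)
            loss = yolo_loss(model(x), y)       # objectness + localization + classification
            opt.zero_grad()
            loss.backward()                     # backward pass
            opt.step()                          # SGD parameter update
    return model
```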

4. Evaluation of Effectiveness of Adversarial Patch Attacks

In this section, we evaluate the effectiveness of the pretrained adversarial patch attack models in both static settings and dynamic object-tracking scenarios. The following subsections include the model details and the quantitative evaluation results.

4.1. Adaptive Adversarial Patch Attack Evaluation

4.1.1. Adversarial Patches for Static Images

First, we present the meticulously crafted adversarial patches designed to target YOLOv5n, YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x. To generate and optimize these adversarial patches, we use the VisDrone-2019 dataset, with the respective detector weights mentioned above utilized during the training process.

4.1.2. Simulation Environment for Patch Attack in Dynamic Setting

We evaluate the adaptive patch attack in the object-tracking task within the context of a navigation application, with the evaluation framework shown in Figure 2. The white-box adversarial patch is implemented in the simulation, and a video of our simulation case studies is available on YouTube (https://www.youtube.com/watch?v=mJRxWRMgxMM, accessed on 2 December 2024). In the simulation, we integrate the state-of-the-art YOLOv5 object detector for real-time object label inference and localization, applied to both malicious and clean image frames. The object-tracking controller is responsible for detecting and tracking specific categories while maintaining a set distance. Our adversarial patch targets the car objects, preventing detection and deactivating the tracking controller. The simulation environment is built on PyTorch 2.3.0, with NVIDIA driver version 535.129.03, CUDA 11.8, cuDNN 8.9.6, and Python 3.11.7. Unreal Engine and AirSim provide the photo-realistic simulation environment, with development and implementation conducted under ROS Melodic.

4.1.3. Attack Efficiency

We evaluate the effectiveness of our proposed adversarial patch against the YOLOv5n, YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x detectors, which are embedded in the vision-based tracking controller to track specific objects in navigation tasks. The adversarial patch is placed on the target object to disrupt the object-tracking task, and we report the average precision (AP0.5) for the car category. The attack transferability of the patches across different detectors is summarized in Table 2: the columns correspond to the victim YOLO detector models, while the rows indicate the detector weights used during patch training. In all white-box attack scenarios, our patch training method significantly reduces the detection AP of the detectors, with drops of up to 90.8% in static object tracking and 87.63% in dynamic object tracking. Additionally, while the adversarial patches exhibit limited transferability across detectors, they are highly effective at reducing the AP0.5 of the detector whose weights were used during patch training.

5. Evaluation Results

In this section, we further evaluate the robustness of the Ad_YOLO+ model, which has been adversarially trained using a specially designed dataset, against adversarial patches. Our results show a significant improvement in robustness compared to the baseline Ad_YOLO model. We assess the model’s performance under both non-adaptive and adaptive patch attacks and highlight Ad_YOLO+’s ability to generalize across different adversarial patches.

5.1. Evaluation Settings

In this subsection, we introduce the simulation framework, dataset, and evaluation metrics.

5.1.1. Dataset

We use the VisDrone-2019 [32] dataset to evaluate the Ad_YOLO+ model under static image attacks. We evaluate 1000 test images and report the mean average precision (mAP) at an intersection-over-union (IoU) threshold of 0.5.
Dataset Preparation: In the case of static image attacks, the patch location and size are automatically adjusted based on the detected bounding box, as illustrated in the first row of Table 1. For vision-based tracking, the patch location and size are dynamically updated frame-by-frame according to the inferred bounding box. We conducted 30 rounds of evaluations, including both autonomous driving with a dynamic tracking target and autonomous flight with a static tracking target, and reported the mean and standard deviation of the mAP.

5.1.2. Simulation Framework

Figure 3 illustrates our simulation framework for evaluating the Ad_YOLO+ model’s performance against adaptive patch attacks in a vision-based tracking scenario. As depicted, the first object detector is used to localize the target of the attack. Once the target location is identified, we apply the adversarial patch to disrupt the vision-based object-tracking system. The second detector is then replaced with the Ad_YOLO+ model, which enhances the tracking controller’s resilience against patch-attacked image frames.

5.1.3. Evaluation Metrics

We use robustness performance, clean performance, and lost predictions as our evaluation metrics.
Robustness performance [26]: Provable Robustness Accuracy. We use provable robustness accuracy as the evaluation metric for model robustness. This term refers to the percentage of data points in a test set for which the model can be mathematically guaranteed to maintain its predictions, even when subjected to visible adversarial perturbations.
Clean Performance [26]: Average Precision (AP). We use AP as the evaluation metric for clean performance. Clean accuracy refers to the standard test accuracy when evaluated on clean, unperturbed data. Average precision is defined as the precision–recall (PR) curve with a single value. In our evaluation, we report A P 0.5 .
Lost Predictions [18]: This metric models the performance degradation caused by an exaggerated patch-covering area. It represents the percentage of negatively affected correct predictions normalized to the set where the detection of patch-covered objects failed.
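Since AP0.5 is used throughout the evaluation, the snippet below shows a simplified single-image, single-class computation of average precision at an IoU threshold of 0.5 (greedy matching, trapezoidal area under the precision–recall curve). It is a didactic sketch, not the full evaluation protocol behind the reported numbers.

```python
# Simplified AP at IoU 0.5 for one image and one class.
import numpy as np

def iou(a, b):
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def ap50(detections, gts):
    """detections: list of (confidence, box); gts: list of ground-truth boxes."""
    detections = sorted(detections, key=lambda d: -d[0])      # highest confidence first
    matched, tp, fp = set(), [], []
    for conf, box in detections:
        best = max(range(len(gts)), key=lambda i: iou(box, gts[i]), default=None)
        if best is not None and iou(box, gts[best]) >= 0.5 and best not in matched:
            matched.add(best)
            tp.append(1); fp.append(0)                        # correct, previously unmatched ground truth
        else:
            tp.append(0); fp.append(1)                        # duplicate or low-IoU detection
    tp, fp = np.cumsum(tp), np.cumsum(fp)
    recall = tp / max(len(gts), 1)
    precision = tp / np.maximum(tp + fp, 1e-9)
    return float(np.trapz(precision, recall))                 # area under the PR curve
```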

5.1.4. Computation Setup

All experiments are conducted on a workstation with one GeForce RTX 2080 Ti GPU. We choose a square patch that covers 30% of the detected bounding box area to evaluate the adaptive and non-adaptive attacks and report the evaluation metrics.

5.2. Performance Results for Ad_YOLO+

Robustness of Ad_YOLO+ Model for Static Images (Non-Adaptive): Table 1 shows the inference results comparing Ad_YOLO+ against preprocessing-based patch defense methods, where YOLOv5x is used for object detection. When adversarial images are preprocessed with the "detect and remove" (represented by PatchZero [7]), "detect and mitigate" (represented by DIFFender [15]), and "detect and inpaint" (represented by Jedi [18]) strategies, the mask used to cover, mitigate, or recover the patch can obscure the object, leading to detection failure due to excessive pixel loss. In contrast, Ad_YOLO+ detects the corrupted regions while still accurately detecting the objects. Table 1 also shows that while the adversarial patches successfully compromised the baseline YOLOv5 detector, they had minimal impact on the performance of Ad_YOLO+, which effectively defended against the patches by accurately identifying both the patch category and its location. The addition of adversarial patches led to a significant degradation in the inference accuracy of YOLOv5, whereas Ad_YOLO+ maintained high-confidence detection performance, demonstrating its robustness against such attacks.
Robustness of Ad_YOLO+ for Vision-Based Tracking (Adaptive Patch Attack): To assess the performance of Ad_YOLO+ under various patch sizes (as a percentage of the target object’s bounding box area), we report the clean accuracy, provable robust accuracy, and lost prediction metrics in Figure 4. Notably, the clean accuracy in the detection phase for clean images is similar to that of YOLOv5s in both static and dynamic target tracking scenarios. We observe that the provable robust accuracy decreases as the patch size covering the target increases, while the lost prediction rate rises with larger patches. The highest provable robust accuracy of 80.18 % is achieved when the tracking target is static, with only a 1.71 % drop in accuracy when tracking a dynamic target with a 30 % patch size. However, the robust accuracy declines significantly when the patch size is increased to 50 % .
We further evaluate the performance of Ad_YOLO+ as an object detector in both dynamic and static object-tracking tasks, where the goal of the adversary is to conceal the tracking target using a patch attack. The evaluation framework is depicted in Figure 3. Ad_YOLO+ achieves an image processing speed comparable to YOLOv5x, running at 18 fps. In our simulation scenario (a video of our simulation case studies is available on YouTube: https://www.youtube.com/watch?v=TSge8Ga0zR4, accessed on 2 December 2024), Ad_YOLO+ detects the adversarial patch while recovering the correct prediction for patch-attacked image frames in both static and dynamic tracking conditions. Consequently, the tracking system successfully tracks the target even when the image frames are subjected to an adversarial patch attack, as demonstrated in Figure 5.
Robustness of Ad_YOLO+ Across Different Patch Generation Models and Datasets: We report the provable robustness accuracy of Ad_YOLO+ across different detector-based patches and datasets. The first row of Figure 6 illustrates the adversarial patches generated when various detector weights are involved in the training optimization. These patches exhibit significant variation depending on the detector model used. We assessed the generalization of Ad_YOLO+ against a range of these patches, as shown in Figure 6. In the evaluation, adversarial patches are adaptively placed on the target in each image frame with varied location and scale. The patch inference results for Ad_YOLO+ are displayed in Figure 6, and the corresponding evaluation metrics are summarized in Table 3, confirming the generalization of our defense method against diverse patches.
It is important to note that Ad_YOLO+ is trained using a specific adversarial training dataset, with a representative patch model used to generate the patch-attacked image dataset. Our results show that once a patch model is incorporated into the adversarial training optimization, Ad_YOLO+ is capable of detecting other patches, demonstrating its ability to generalize. This suggests that a unified pixel pattern can effectively represent a broader class of patches generated by optimization algorithms during adversarial training. Additionally, Ad_YOLO+ exhibits high provable robustness across different datasets. For both the VisDrone-2019 [32] dataset and image frames sourced from the AirSim simulation environment, Ad_YOLO+ achieves a robust provable accuracy of 85.34 % for adaptive patches.
Baseline Comparison (Ad_YOLO+ vs. Ad_YOLO): We compared the performance of Ad_YOLO and Ad_YOLO+ in defending against adversarial patches. During the adversarial training process for Ad_YOLO+, patches of varying sizes and locations were incorporated into the training dataset, enabling the model to adapt to these dynamic changes. In contrast, Ad_YOLO was not specifically trained to handle such variations in patch size and location. As a result, Ad_YOLO+ is better suited for applications in dynamic environments, such as vision-based object tracking, where patch size and location can change over time. In addition, our results show that the Ad_YOLO+ model has a higher clean performance accuracy compared to Ad_YOLO, as shown in Table 4.

5.3. Ablation Study

In this subsection, we present an ablation study to evaluate the individual effects of the detection backbone and the number of neck layers in Ad_YOLO+ on image preprocessing speed and mAP. Our findings indicate that a detection backbone with 8 layers, 4 layers in the detection neck, and the final patch detection layer delivers the most stable overall performance.
Number of Training Epochs: Figure 7 shows the training loss of Ad_YOLO+ against adversarial patch attacks for different numbers of training epochs. The defense performance improves as the training loss decreases. The loss functions start decreasing at epoch E = 15. We therefore set the number of training epochs to E = 300.
Impact of Different Backbone Models: We explored the combination of five different YOLO backbone models with the patch detection layer to identify the most stable model configuration. Specifically, we built five robust detectors based on YOLOv5n, YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x, each paired with the patch detection layer. We evaluated their defense performance using the VisDrone dataset against pretrained adversarial square patches, which cover 30 % of the target in each image. The quantitative results of these models, tested across various base models and patch configurations, are summarized in Table 5. Our analysis shows that YOLOv5x with the patch detection layer achieves the best average precision (AP) for the car category. However, the YOLOv5x-based Ad_YOLO+ is computationally intensive, requiring the longest convergence time of 6.23 h per epoch. Additionally, the image processing speed with YOLOv5x is 18 fps.

6. Limitations

We evaluated the limitations of the Ad_YOLO+ model in terms of its generalization to patch attacks generated from different datasets. While the model shows some tolerance to variations in patches, as discussed in Section 5.2 (Figure 6), its generalization is limited when facing patches generated from datasets other than the one used during training. To understand these limitations, we analyzed its defense performance against patches generated from various datasets. Our results, shown in Figure 8, indicate that Ad_YOLO+ exhibits reduced provable robustness accuracy on patches generated from datasets such as MS-COCO [31], ImageNet [42], and ImageNette [43] for different patch sizes. Specifically, when patches were placed on various object categories in the dataset, Ad_YOLO+ failed to detect the adversarial patches.

7. Discussion and Concluding Remarks

In this paper, we introduce Ad_YOLO+, an object detection model that is location- and scale-agnostic, designed to be robust against adversarial patches. To the best of our knowledge, this is the first model specifically developed to defend against adaptive adversarial patches in a vision-based object-tracking task. Thanks to its efficient patch localization, Ad_YOLO+ outperforms its baseline, Ad_YOLO, achieving a 2.28 % improvement in provable robustness accuracy and a 2.61 % improvement in clean accuracy.
We evaluated Ad_YOLO+ against adaptive patch attacks, where both the location and scale of the patch changed dynamically in a vision-based object-tracking task. The simulation results demonstrate that Ad_YOLO+ exhibits strong robustness against such attacks. Additionally, we tested the model’s provable robustness accuracy for patches generated with different YOLO models. Our findings show that Ad_YOLO+ displays a notable level of generalizability across patch variations, enhancing its resilience in object-tracking tasks.

Author Contributions

Conceptualization, H.G., H.J.Y. and H.J.; methodology, H.G., H.J.Y. and H.J.; software, H.G., H.J.Y. and H.J.; validation, H.G., H.J.Y. and H.J.; formal analysis, H.G., H.J.Y. and H.J.; investigation, H.G., H.J.Y. and H.J.; resources, H.G., H.J.Y. and H.J.; data curation, H.G., H.J.Y. and H.J.; writing—original draft preparation, H.G., H.J.Y. and H.J.; writing—review and editing, H.G., H.J.Y. and H.J.; visualization, H.G., H.J.Y. and H.J.; supervision, H.J.; project administration, H.J.; funding acquisition, H.J. All authors have read and agreed to the published version of the manuscript.

Funding

This research is supported by the National Science Foundation (NSF) under award number 2137753.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Chakraborty, A.; Alam, M.; Dey, V.; Chattopadhyay, A.; Mukhopadhyay, D. A survey on adversarial attacks and defences. CAAI Trans. Intell. Technol. 2021, 6, 25–45. [Google Scholar] [CrossRef]
  2. Zhai, C.; Wu, W.; Xiao, Y.; Zhang, J.; Zhai, M. Jam traffic pattern of a multi-phase lattice hydrodynamic model integrating a continuous self-stabilizing control protocol to boycott the malicious cyber-attacks. Chaos Solitons Fractals 2025, 197, 116531. [Google Scholar] [CrossRef]
  3. Hayes, J. On visible adversarial perturbations & digital watermarking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Salt Lake City, UT, USA, 18–22 June 2018; pp. 1597–1604. [Google Scholar]
  4. Chou, E.; Tramer, F.; Pellegrino, G. Sentinet: Detecting localized universal attacks against deep learning systems. In Proceedings of the IEEE Security and Privacy Workshops (SPW), San Francisco, CA, USA, 21–21 May 2020; pp. 48–54. [Google Scholar]
  5. Liu, J.; Levine, A.; Lau, C.P.; Chellappa, R.; Feizi, S. Segment and complete: Defending object detectors against adversarial patch attacks with robust patch detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 14973–14982. [Google Scholar]
  6. Jing, L.; Wang, R.; Ren, W.; Dong, X.; Zou, C. PAD: Patch-agnostic defense against adversarial patch attacks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 24472–24481. [Google Scholar]
  7. Xu, K.; Xiao, Y.; Zheng, Z.; Cai, K.; Nevatia, R. Patchzero: Defending against adversarial patch attacks by detecting and zeroing the patch. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 2–7 January 2023; pp. 4632–4641. [Google Scholar]
  8. Xiang, C.; Valtchanov, A.; Mahloujifar, S.; Mittal, P. Objectseeker: Certifiably robust object detection against patch hiding attacks via patch-agnostic masking. In Proceedings of the 2023 IEEE Symposium on Security and Privacy (SP), San Francisco, CA, USA, 21–25 May 2023; pp. 1329–1347. [Google Scholar]
  9. McCoyd, M.; Park, W.; Chen, S.; Shah, N.; Roggenkemper, R.; Hwang, M.; Liu, J.X.; Wagner, D. Minority reports defense: Defending against adversarial patches. In Proceedings of the International Conference on Applied Cryptography and Network Security, Rome, Italy, 19–22 October 2020; pp. 564–582. [Google Scholar]
  10. Naseer, M.; Khan, S.; Porikli, F. Local gradients smoothing: Defense against localized adversarial attacks. In Proceedings of the 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 7–11 January 2019; pp. 1300–1307. [Google Scholar]
  11. Chen, Z.; Dash, P.; Pattabiraman, K. Jujutsu: A two-stage defense against adversarial patch attacks on deep neural networks. In Proceedings of the ACM Asia Conference on Computer and Communications Security, Melbourne, VIC, Australia, 10–14 July 2023; pp. 689–703. [Google Scholar]
  12. Chattopadhyay, N.; Guesmi, A.; Shafique, M. Anomaly unveiled: Securing image classification against adversarial patch attacks. In Proceedings of the IEEE International Conference on Image Processing (ICIP), Abu Dhabi, United Arab Emirates, 27–30 October 2024; pp. 929–935. [Google Scholar]
  13. Bunzel, N.; Frick, R.A.; Klause, G.; Schwarte, A.; Honermann, J. Signals are all you need: Detecting and mitigating digital and real-world adversarial patches using signal-based features. In Proceedings of the 2nd ACM Workshop on Secure and Trustworthy Deep Learning Systems, Singapore, 2–20 July 2024; pp. 24–34. [Google Scholar]
  14. Mao, Z.; Chen, S.; Miao, Z.; Li, H.; Xia, B.; Cai, J.; Yuan, W.; You, X. Enhancing robustness of person detection: A universal defense filter against adversarial patch attacks. Comput. Secur. 2024, 146, 104066. [Google Scholar] [CrossRef]
  15. Kang, C.; Dong, Y.; Wang, Z.; Ruan, S.; Chen, Y.; Su, H.; Wei, X. DIFFender: Diffusion-Based Adversarial Defense against Patch Attacks. In Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024; pp. 130–147. [Google Scholar]
  16. Xiang, C.; Mahloujifar, S.; Mittal, P. PatchCleanser: Certifiably robust defense against adversarial patches for any image classifier. In Proceedings of the 31st USENIX Security Symposium (USENIX Security 22), Boston, MA, USA, 10–12 August 2022; pp. 2065–2082. [Google Scholar]
  17. Xiang, C.; Wu, T.; Dai, S.; Petit, J.; Jana, S.; Mittal, P. PatchCURE: Improving certifiable robustness, model utility, and computation efficiency of adversarial patch defenses. In Proceedings of the 33rd USENIX Security Symposium (USENIX Security 24), Philadelphia, PA, USA, 14–16 August 2024; pp. 3675–3692. [Google Scholar]
  18. Tarchoun, B.; Ben Khalifa, A.; Mahjoub, M.A.; Abu-Ghazaleh, N.; Alouani, I. Jedi: Entropy-based localization and removal of adversarial patches. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 4087–4095. [Google Scholar]
  19. Li, Y.; Duan, M.; Xiao, B. Adv-Inpainting: Generating natural and transferable adversarial patch via attention-guided feature fusion. arXiv 2023, arXiv:2308.05320. [Google Scholar]
  20. Zhang, Y.; Zhao, S.; Wei, X.; Wei, S. Defending adversarial patches via joint region localizing and inpainting. In Proceedings of the Chinese Conference on Pattern Recognition and Computer Vision (PRCV), Urumqi, China, 18–20 October 2024; pp. 236–250. [Google Scholar]
  21. Huang, Y.; Li, Y. Zero-shot certified defense against adversarial patches with vision transformers. arXiv 2021, arXiv:2111.10481. [Google Scholar]
  22. Metzen, J.H.; Yatsura, M. Efficient certified defenses against patch attacks on image classifiers. arXiv 2021, arXiv:2102.04154. [Google Scholar] [CrossRef]
  23. Brendel, W.; Bethge, M. Approximating CNNs with bag-of-local-features models works surprisingly well on imagenet. arXiv 2019, arXiv:1904.00760. [Google Scholar]
  24. Zhang, Z.; Yuan, B.; McCoyd, M.; Wagner, D. Clipped bagnet: Defending against sticker attacks with clipped bag-of-features. In Proceedings of the IEEE Security and Privacy Workshops (SPW), San Francisco, CA, USA, 21–21 May 2020; pp. 55–61. [Google Scholar]
  25. Xiang, C.; Mittal, P. Patchguard++: Efficient provable attack detection against adversarial patches. arXiv 2021, arXiv:2104.12609. [Google Scholar]
  26. Xiang, C.; Mittal, P. Detectorguard: Provably securing object detectors against localized patch hiding attacks. In Proceedings of the ACM SIGSAC Conference on Computer and Communications Security, Virtual, 15–19 November 2021; pp. 3177–3196. [Google Scholar]
  27. Levine, A.; Feizi, S. (De) Randomized smoothing for certifiable defense against patch attacks. Adv. Neural Inf. Process. Syst. 2020, 33, 6465–6475. [Google Scholar]
  28. Lecuyer, M.; Atlidakis, V.; Geambasu, R.; Hsu, D.; Jana, S. Certified robustness to adversarial examples with differential privacy. In Proceedings of the IEEE Symposium on Security and Privacy (SP), San Francisco, CA, USA, 19–23 May 2019; pp. 656–672. [Google Scholar]
  29. Xiang, C.; Bhagoji, A.N.; Sehwag, V.; Mittal, P. PatchGuard: A provably robust defense against adversarial patches via small receptive fields and masking. In Proceedings of the 30th USENIX Security Symposium (USENIX Security 21), Online, 11–13 August 2021; pp. 2237–2254. [Google Scholar]
  30. Ji, N.; Feng, Y.; Xie, H.; Xiang, X.; Liu, N. Adversarial yolo: Defense human detection patch attacks via detecting adversarial patches. arXiv 2021, arXiv:2103.08860. [Google Scholar] [CrossRef]
  31. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common objects in context. In Proceedings of the Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014; pp. 740–755. [Google Scholar]
  32. Du, D.; Zhu, P.; Wen, L.; Bian, X.; Lin, H.; Hu, Q.; Peng, T.; Zheng, J.; Wang, X.; Zhang, Y.; et al. VisDrone-DET2019: The vision meets drone object detection in image challenge results. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, Seoul, Republic of Korea, 27–28 October 2019. [Google Scholar]
  33. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  34. Liu, A.; Wang, J.; Liu, X.; Cao, B.; Zhang, C.; Yu, H. Bias-based universal adversarial patch attack for automatic check-out. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XIII 16. pp. 395–410. [Google Scholar]
  35. Akhtar, N.; Mian, A.; Kardan, N.; Shah, M. Advances in adversarial attacks and defenses in computer vision: A survey. IEEE Access 2021, 9, 155161–155196. [Google Scholar] [CrossRef]
  36. Thys, S.; Van Ranst, W.; Goedemé, T. Fooling automated surveillance cameras: Adversarial patches to attack person detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Long Beach, CA, USA, 16–17 June 2019. [Google Scholar]
  37. Brown, T.B.; Mané, D.; Roy, A.; Abadi, M.; Gilmer, J. Adversarial patch. arXiv 2017, arXiv:1712.09665. [Google Scholar]
  38. Chakraborty, A.; Alam, M.; Dey, V.; Chattopadhyay, A.; Mukhopadhyay, D. Adversarial attacks and defences: A survey. arXiv 2018, arXiv:1810.00069. [Google Scholar] [CrossRef]
  39. Athalye, A.; Carlini, N.; Wagner, D. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. In Proceedings of the International Conference on Machine Learning, Stockholm, Sweden, 10–15 July 2018; pp. 274–283. [Google Scholar]
  40. Lei, X.; Cai, X.; Lu, C.; Jiang, Z.; Gong, Z.; Lu, L. Using frequency attention to make adversarial patch powerful against person detector. IEEE Access 2022, 11, 27217–27225. [Google Scholar] [CrossRef]
  41. Shrestha, S.; Pathak, S.; Viegas, E.K. Towards a robust adversarial patch attack against unmanned aerial vehicles object detection. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Detroit, MI, USA, 1–5 October 2023; pp. 3256–3263. [Google Scholar]
  42. Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255. [Google Scholar]
  43. Howard, J. A Smaller Subset of 10 Easily Classified Classes from Imagenet, and a Little More French. 2020. Available online: https://github.com/fastai/imagenette (accessed on 1 February 2024).
Figure 1. Illustration of the training of the proposed Ad_YOLO+ method for robust object detection. First, a series of transformation operations are conducted so that the adversarial patch accommodates physical dynamic conditions. Second, adversarial patches of the proper size and location are pasted on the targets to generate the adversarial examples. Third, the adversarial examples are mixed with the clean dataset to form the adversarial training dataset. Next, this adversarial training dataset is fed into the training of Ad_YOLO+ to minimize the loss function, which includes the adversarial objectness loss (L_obj), localization loss (L_localization), and classification loss (L_classification). Finally, these procedures are repeated until the end of the training process.
Figure 2. Vision-based object tracking subject to adaptive patch attacks. We first use an object detector to localize the target. Then, we overlay an adversarial patch onto the target to mislead the second victim object detector within the vision-based tracking system. Both the first and second detectors are based on YOLOv5 [33].
Figure 3. The evaluation framework based on AirSim simulation for Ad_YOLO+ in an object-tracking scenario.
Figure 4. Ad_YOLO+ performance for different patch sizes. In each row, provable robust accuracy, clean accuracy, and lost prediction are shown in sub-figures (a–c). The first row denotes the static object tracking case. The second row represents the dynamic object tracking scenario.
Figure 5. Vision-based static and dynamic object tracking with YOLOv5 and Ad_YOLO+ as detectors of the tracking system. Sub-figure (a) denotes the static object tracking case. Sub-figure (b) represents the dynamic object tracking scenario. In each sub-figure, we compare the tracking behavior with YOLOv5 and Ad_YOLO+ as detectors in the tracking systems.
Figure 6. (Top Row) Visualized adversarial patches generated with varied detector weights involved in the training optimization. (Bottom Row) The inference result of Ad_YOLO+.
Figure 7. The diagrams show how bounding box loss in sub-figure (a), object loss in sub-figure (b), and class loss in sub-figure (c) change as the training epoch increases.
Figure 8. (Top Row) Visualized adversarial patches generated from various datasets. (Bottom Row) The inference result of Ad_YOLO+.
Table 1. (Top Row) Conventional patch defense approaches, such as PatchZero [7], DIFFender [15], and Jedi [18], preprocess the attacked image; that is, they mask or inpaint the patches in the image as shown in the top row. On the contrary, our method does not require an image preprocessing stage. (Bottom Row) As shown in the bottom row images, Ad_YOLO+ (our approach) achieves detection performance with high confidence, while PatchZero [7], DIFFender [15], and Jedi [18] (comparison baselines) yield degraded inference accuracy when used with standard YOLOv5x [33] for object detection.
[Table 1 image grid] Top row: Attacked Image | PatchZero [7] | DIFFender [15] | Jedi [18]. Bottom row: Ad_YOLO+ | PatchZero [7] + YOLOv5x [33] | DIFFender [15] + YOLOv5x [33] | Jedi [18] + YOLOv5x [33].
Table 2. Attack efficiency for different YOLO model weights. We use red color to highlight the most effective attack in each column.
Weights \ Detector        YOLOv5n    YOLOv5s    YOLOv5m    YOLOv5l    YOLOv5x
Static object tracking:
YOLOv5n                   90.8%      0.15%      0.13%      0.34%      0.21%
YOLOv5s                   0.14%      79.8%      0.12%      0.16%      0.03%
YOLOv5m                   0.11%      0.23%      59.13%     0.17%      0.16%
YOLOv5l                   0.9%       0.07%      0.17%      78.91%     0.09%
YOLOv5x                   0.04%      0.12%      0.03%      0.12%      85.4%
Dynamic object tracking:
YOLOv5n                   83.90%     0.018%     0.02%      0.05%      0.17%
YOLOv5s                   0.16%      83.60%     0.03%      0.14%      0.012%
YOLOv5m                   0.08%      0.014%     87.63%     0.06%      0.31%
YOLOv5l                   0.014%     0.04%      0.019%     84.57%     0.13%
YOLOv5x                   0.013%     0.12%      0.04%      0.18%      85.89%
Table 3. Performance of Ad_YOLO+ model for patches shown in Figure 6.
Evaluation \ Weights      YOLOv5n    YOLOv5s    YOLOv5m    YOLOv5l    YOLOv5x
Clean Accuracy            74.10%     72.17%     69.41%     71.34%     75.21%
Robust Accuracy           64.34%     66.40%     65.12%     67.16%     68.03%
Lost Prediction           0.11%      0.23%      0.13%      0.17%      0.16%
Table 4. Results of YOLOv5x, Ad_YOLO and Ad_YOLO+ on COCO test set. Mean average precision and per-class average precision are shown. We use red color to have Ad_YOLO+ stand out from the comparison.
Class       Giraffe   Zebra   Bird    Boat    Bottle   Bus     Dog     Horse   Bike    Person   Plant
YOLOv5x     83.18     80.53   73.61   66.86   51.12    76.19   79.31   78.19   78.52   78.70    78.52
Ad_YOLO     82.31     79.36   71.05   68.82   50.19    76.40   78.18   76.14   77.93   77.67    48.90
Ad_YOLO+    83.67     92.22   59.22   55.72   62.60    84.74   79.59   89.42   64.94   81.29    56.77

Class       Bus       Car     Cat     Chair   Cow      Table   Sofa    Train   TV      Sheep    mAP
YOLOv5x     76.19     83.33   81.86   56.64   68.88    73.97   70.59   74.06   82.31   51.20    72.02
Ad_YOLO     76.40     81.44   83.50   53.42   68.13    72.82   73.94   85.98   71.98   68.76    72.35
Ad_YOLO+    84.74     71.44   90.02   57.43   85.51    45.85   73.94   90.72   82.85   80.50    74.63
Table 5. Results on impact of individual parameters of model configuration. We use red color to have Ad_YOLO+ stand out from the comparison.
Backbone Model    Model Convergence Efficiency    Image Processing Speed    mAP       Clean Accuracy
Ad_YOLOv5n        0.15 h/epoch                    60 fps                    0.02%     47.1%
Ad_YOLOv5s        0.24 h/epoch                    48 fps                    0.18%     56.8%
Ad_YOLOv5m        0.58 h/epoch                    35 fps                    59.13%    76.4%
Ad_YOLOv5l        5.14 h/epoch                    28 fps                    67.1%     78.91%
Ad_YOLOv5x        6.23 h/epoch                    18 fps                    74.63%    85.4%
