Article

Delving into YOLO Object Detection Models: Insights into Adversarial Robustness

by
Kyriakos D. Apostolidis
and
George A. Papakostas
*
MLV Research Group, Department of Informatics, Democritus University of Thrace, 65404 Kavala, Greece
*
Author to whom correspondence should be addressed.
Electronics 2025, 14(8), 1624; https://doi.org/10.3390/electronics14081624
Submission received: 12 March 2025 / Revised: 10 April 2025 / Accepted: 16 April 2025 / Published: 17 April 2025
(This article belongs to the Special Issue Feature Papers in Artificial Intelligence)

Abstract

This paper provides a comprehensive study of the security of the YOLO (You Only Look Once) model series for object detection, emphasizing their evolution, technical innovations, and performance on the COCO dataset. The robustness of YOLO models under adversarial attacks and image corruptions is analyzed in depth, offering insights into their resilience and adaptability. As real-time object detection plays an increasingly vital role in applications such as autonomous driving, security, and surveillance, this review aims to clarify the strengths and limitations of each YOLO iteration, serving as a valuable resource for researchers and practitioners aiming to optimize model selection and deployment in dynamic, real-world environments. The results reveal that YOLOX models, particularly their large variants, exhibit superior robustness compared to other YOLO versions, maintaining higher accuracy under challenging conditions. Our findings can guide practitioners in optimizing YOLO models for dynamic and adversarial real-world environments and inform future research toward developing more resilient object detection systems.

1. Introduction

Object detection is a pivotal task in computer vision, integral to a wide range of applications including autonomous vehicles [1], robotics [2], surveillance [3], and various fields like medicine [4] and agriculture [5]. Specifically, YOLO models have been successfully applied in medicine for tasks such as the localization and segmentation of lumbar spine vertebrae deformities [6] and detection of breast lesions [7]. In agriculture, YOLO has been employed for roadside monitoring and the detection of invasive alien plant species [8], as well as the real-time detection of invasive weed seedlings like Solanum rostratum Dunal [9]. Among the numerous algorithms developed for real-time object detection, YOLO (You Only Look Once) stands out due to its exceptional balance between accuracy and speed. Introduced by Redmon et al. [10] in 2015, YOLOv1 marked a significant advancement in the field. Since then, the YOLO algorithm has undergone continuous evolution, with new versions being released almost annually, culminating in the latest iteration, YOLOv11 (Figure 1).
Apart from YOLO, various object detectors have been proposed in recent years. R-CNN (region-based convolutional neural network) [11] is a pioneering model that uses selective search to propose regions and then classifies each region using a convolutional neural network. Fast R-CNN [12] is an improved version of R-CNN that integrates region proposal and classification into a single network, significantly speeding up the detection process. Furthermore, Faster R-CNN [13] advances Fast R-CNN by introducing a region proposal network (RPN) that shares features with the detection network, enabling end-to-end training and real-time performance. Another significant object detector is SSD (Single Shot MultiBox Detector) [14], which predicts bounding boxes and class scores for multiple objects in a single forward pass, facilitating real-time detection. Additionally, EfficientDet [15] is a family of object detection models that use EfficientNet [16] as the backbone and employ a compound scaling method to achieve a balance between accuracy and efficiency. Lastly, DETR (Detection Transformer) [17] is a novel model that leverages transformers for end-to-end object detection, eliminating the need for traditional components like region proposal networks and non-maximum suppression, and achieving competitive accuracy with a simpler architecture.
Modern deep neural networks (DNNs) are integral to the success of the aforementioned algorithms. However, numerous studies have shown that DNNs are susceptible to adversarial attacks involving imperceptible perturbations and noisy data [18,19,20]. This vulnerability poses a significant challenge to the deployment of object detectors in real-world conditions, potentially compromising their reliability and effectiveness. Adversarial attacks, which involve adding subtle perturbations to input images, can significantly degrade the performance of deep learning models, leading to incorrect predictions. These attacks exploit the vulnerabilities inherent in deep networks, raising critical concerns about the security of YOLO models in safety-sensitive applications. Alongside adversarial attacks, real-world image corruptions—such as noise, blur, and weather-induced distortions—further impact model performance. As such, robustness to these disruptions is crucial for deploying YOLO models in environments where pristine images are not guaranteed.
Several reviews on YOLO models exist, including [21,22,23], yet none have thoroughly examined their effectiveness under noisy data and adversarial examples. This work aims to address this gap by evaluating the robustness of YOLO model series under Fast Gradient Sign Method (FGSM) [24] and Projected Gradient Descent (PGD) [25] attacks, along with other image corruption techniques, to test their resilience. By analyzing the behavior of these models across different scenarios, we seek to identify the strengths and weaknesses of each YOLO variant. Our findings provide insights into the vulnerabilities of YOLO models, guiding future research towards more resilient object detection systems and informing practitioners on model suitability under challenging real-world conditions.
The remainder of this paper is organized as follows: Section 2 presents an overview of existing works that address robustness in YOLO algorithms, highlighting key approaches and gaps in the current literature. Section 3 provides a detailed overview of the dataset, experimental setup, and methodologies applied to evaluate the robustness of YOLO models against adversarial attacks and image corruptions. In Section 4, we discuss the evolution of the YOLO architectures, highlighting key advancements and innovations across each version. Section 5 presents our experimental results, analyzing the robustness of YOLO models under adversarial perturbations and image corruptions. In Section 6, we provide a comparative discussion of the findings and insights into the models’ resilience. Finally, Section 7 concludes the paper by summarizing our contributions and suggesting future research directions.

2. Robustness in YOLO Algorithms

Recent research has increasingly focused on assessing and improving the robustness of YOLO-based object detectors, particularly under adversarial attacks and visual corruptions. Several studies have revealed the susceptibility of YOLO models to a wide range of digital adversarial attacks, such as FGSM, PGD, and C&W; for example, Jain [26] demonstrated that YOLOv5 can be easily misled in traffic sign detection. To better exploit YOLO’s structure, Dai et al. [27] proposed an Adaptive Deformation Method (ADM) targeting bounding box predictions, achieving higher attack success rates on YOLOv4 and YOLOv5. In parallel, Bayer et al. [28] introduced “eigenpatches” (PCA-derived adversarial patches) that efficiently fool YOLOv7 with reduced computational cost. Addressing the physical domain, Schack et al. [29] and Lian et al. [30] emphasized the real-world limitations of patch attacks and provided comprehensive benchmarks (e.g., PADetBench) to evaluate robustness under realistic environmental transformations. Beyond adversarial attacks, several works evaluated YOLO under image corruptions. Liu et al. [31] benchmarked detectors on COCO-C and BDD100K-C, showing that YOLO models generally handle common corruptions better than two-stage counterparts. In aerial vision, Arsenos et al. [32] found YOLOv5 and YOLOX to be particularly robust under simulated fog, rain, and blur, while Ding and Luo [33] introduced SDNIA-YOLO, enhancing detection under extreme weather using image-adaptive enhancements and stylized data augmentation. Collectively, these studies highlight both the vulnerabilities and promising resilience of YOLO models in adverse visual conditions, motivating further research into robustness-aware training and evaluation.

3. Materials and Methods

Ensuring the robustness of object detection models is crucial for real-world applications, particularly in safety-critical domains such as autonomous driving and surveillance. This study evaluates the resilience of various YOLO models against adversarial attacks and image corruptions, aiming to identify the most robust architectures. To achieve this, we employed a standardized experimental setup, using the COCO dataset and a diverse selection of YOLO versions ranging from YOLOv3 to YOLOv11, including various model sizes (n, s, m, l, x). These models were selected to cover the full spectrum of advancements in the YOLO framework, from early iterations to the most recent state-of-the-art versions. By leveraging pre-trained weights from official repositories, we ensured that all models were tested in their standard configurations, allowing for a fair comparison. Testing multiple model sizes also provided insight into how architectural complexity influences robustness. Figure 2 shows the pipeline we followed for the experiments and the robustness evaluation. The following sections describe the dataset, attack methodologies, and evaluation metrics used to systematically assess each model’s resilience under different perturbation scenarios.

3.1. Evaluation Metrics and Dataset

In object detection tasks, the accurate evaluation of a model’s performance is critical for understanding its ability to identify and localize objects within an image. Two key metrics often employed for this purpose are Intersection over Union (IoU) and Average Precision (AP). IoU measures the overlap between the predicted and actual locations of objects, providing a quantitative assessment of localization accuracy. It serves as a fundamental criterion for determining how well a detection model aligns its predictions with the ground truth. On the other hand, AP offers a comprehensive measure of a model’s detection capabilities by considering the trade-off between precision and recall across varying confidence thresholds. Together, IoU and AP help in evaluating both the accuracy of object localization and the quality of object recognition, making them integral to benchmarking object detection models.
In YOLOv1 and YOLOv2, the PASCAL VOC 2007 [34] and VOC 2012 datasets [35] were employed for training and benchmarking purposes. Starting with YOLOv3, however, the Microsoft COCO (Common Objects in Context) [36] dataset was utilized.
To ensure a fair evaluation of the models, it is essential to assess them using the same dataset. Given that YOLOv1 and YOLOv2 were trained and evaluated on different datasets, they are excluded from this evaluation. We conducted the attacks on the COCO dataset, which is utilized by YOLOv3 and the subsequent versions.

AP and IoU Explanation

Average Precision (AP) is a widely used metric for evaluating the performance of classification models, particularly in object detection and information retrieval tasks. It is calculated as the area under the Precision–Recall (PR) curve, which represents the trade-off between precision and recall across different classification thresholds.
Precision is defined as the ratio of true positive predictions to the total number of positive predictions made by the model, capturing the accuracy of positive predictions. It is given by the following:
$$\mathrm{Precision} = \frac{\mathrm{True\ Positives}\ (TP)}{\mathrm{True\ Positives}\ (TP) + \mathrm{False\ Positives}\ (FP)}$$
Recall measures the proportion of actual positive instances that are correctly identified by the model. It is calculated as follows:
$$\mathrm{Recall} = \frac{\mathrm{True\ Positives}\ (TP)}{\mathrm{True\ Positives}\ (TP) + \mathrm{False\ Negatives}\ (FN)}$$
The Precision–Recall curve is generated by plotting precision against recall while varying the decision threshold of the model. As the threshold changes, the model’s predictions become more or less conservative, resulting in different precision and recall values. AP is computed as the integral of the Precision–Recall curve, which effectively measures the area under this curve (AUC). This can be expressed as follows:
$$AP = \sum_{n} \left(R_n - R_{n-1}\right) P_n$$
where $P_n$ is the precision and $R_n$ is the recall at the n-th threshold. Mean Average Precision (mAP) extends this concept by averaging the AP values across multiple classes. It is particularly useful when evaluating multi-class classification tasks, where the performance of each class is important. Given C classes, mAP is calculated as follows:
$$mAP = \frac{1}{C} \sum_{i=1}^{C} AP_i$$
where $AP_i$ represents the Average Precision for the i-th class.
This metric provides a balanced assessment of the model’s performance across all classes by considering both precision and recall. It is especially helpful in applications where false positives (which affect precision) and false negatives (which affect recall) have different impacts on the overall system. By summarizing the area under the Precision–Recall curve, AP and mAP capture the model’s ability to maintain high precision while achieving high recall.
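As an illustration, the following minimal Python sketch computes AP as the rectangle sum defined above and averages per-class AP values into mAP. It uses the raw summation form of the formula rather than COCO's 101-point interpolated variant, and the precision/recall values are hypothetical toy inputs.

```python
import numpy as np

def average_precision(precision, recall):
    """AP as the rectangle sum over the PR curve: AP = sum_n (R_n - R_{n-1}) * P_n."""
    recall = np.concatenate(([0.0], recall))   # prepend R_0 = 0 for the first rectangle
    return float(np.sum((recall[1:] - recall[:-1]) * precision))

def mean_average_precision(ap_per_class):
    """mAP = mean of the per-class AP values."""
    return float(np.mean(ap_per_class))

# Toy example: precision/recall sampled at three confidence thresholds.
p = np.array([1.0, 0.8, 0.6])
r = np.array([0.2, 0.5, 0.9])
print(average_precision(p, r))                   # 0.68
print(mean_average_precision([0.68, 0.5, 0.9]))  # ~0.693
```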
Intersection over Union (IoU) is a widely used metric for evaluating the localization accuracy of object detection models. It is especially effective for assessing how closely a predicted bounding box aligns with the ground truth bounding box for a given object. IoU provides a quantitative measure of the overlap between two regions, offering insights into the localization errors in object detection.
To compute the IoU between a predicted bounding box Bp and a ground truth bounding box Bg, we follow these steps:
Intersection: The intersection area is defined as the region where the predicted bounding box overlaps with the ground truth bounding box. This area is computed by identifying the coordinates that form the overlapping region between Bp and Bg.
Union: The union area is the total area covered by both the predicted and ground truth bounding boxes, without double-counting the overlapping region. It represents the sum of the areas of Bp and Bg minus the intersection area.
IoU Calculation: IoU is calculated as the ratio of the intersection area to the union area.
$$IoU = \frac{\mathrm{Area\ of\ Intersection}}{\mathrm{Area\ of\ Union}} = \frac{|B_p \cap B_g|}{|B_p \cup B_g|}$$
where $|B_p \cap B_g|$ denotes the area of the intersection between the predicted and ground truth bounding boxes, and $|B_p \cup B_g|$ represents the area of the union.
The resulting IoU score ranges from 0 to 1, where:
  • An IoU of 1 indicates perfect alignment between the predicted and ground truth bounding boxes.
  • An IoU of 0 indicates no overlap between the two bounding boxes.
  • Higher IoU values indicate better localization accuracy, while lower values signify larger discrepancies between the predicted and actual bounding boxes.
IoU is commonly used as a threshold-based metric in object detection models. For instance, predictions are often considered accurate if they achieve an IoU above a specified threshold (e.g., 0.5). By providing a reliable measure of how closely a model’s predictions match the actual object locations, IoU serves as a crucial criterion for model evaluation and comparison in tasks involving object detection and localization.
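A minimal Python sketch of this computation is shown below, assuming axis-aligned boxes in (x1, y1, x2, y2) corner format; the example boxes are hypothetical.

```python
def iou(box_p, box_g):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2) corner coordinates."""
    # Coordinates of the intersection rectangle.
    ix1, iy1 = max(box_p[0], box_g[0]), max(box_p[1], box_g[1])
    ix2, iy2 = min(box_p[2], box_g[2]), min(box_p[3], box_g[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    # Union = sum of both areas minus the intersection (counted once).
    area_p = (box_p[2] - box_p[0]) * (box_p[3] - box_p[1])
    area_g = (box_g[2] - box_g[0]) * (box_g[3] - box_g[1])
    union = area_p + area_g - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.143
```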
While these metrics provide a fundamental understanding of object detection performance, real-world conditions often introduce challenges such as adversarial perturbations and environmental corruption. In the following section, we examine the impact of these factors on YOLO models.

3.2. Image Attacks

3.2.1. Adversarial Attacks

In the realm of adversarial machine learning, the robustness of neural networks is frequently tested using adversarial attacks, which are carefully crafted perturbations added to the input data to deceive the model into making incorrect predictions. Among the various techniques developed to generate adversarial examples, FGSM and PGD are two of the most prominent and widely studied methods. These attacks not only reveal vulnerabilities in machine learning models but also help in understanding and improving model robustness. In the following sections, we delve into the mechanics of FGSM and PGD attacks, examining their methodologies and impact on model performance.
FGSM is a simple yet effective technique for generating adversarial examples, as introduced by Goodfellow et al. in 2014 [24]. The core idea behind FGSM is to perturb the input data in a direction that maximizes the loss of the model. This is achieved by adding a small, carefully calculated noise to the input data, which is computed based on the gradient of the loss with respect to the input. Given an input x, a true label y, a model θ, and a loss function J(θ,x,y), the adversarial example x′ is generated as follows:
$$x' = x + \epsilon \cdot \mathrm{sign}\left(\nabla_x J(\theta, x, y)\right)$$
Here, ϵ is a small scalar value representing the perturbation magnitude, and sign(∇_x J(θ, x, y)) is the sign of the gradient of the loss function with respect to the input. Figure 3 shows the impact of ϵ on images.
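The following is a minimal PyTorch sketch of the single-step FGSM update, assuming a differentiable model and loss function and inputs normalized to [0, 1]; it is an illustrative implementation, not the ART-based pipeline used in our experiments.

```python
import torch

def fgsm_attack(model, loss_fn, x, y, eps):
    """Single-step FGSM: x' = x + eps * sign(grad_x J(theta, x, y))."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = loss_fn(model(x_adv), y)    # forward pass and loss on the clean input
    loss.backward()                    # gradient of the loss w.r.t. the input
    with torch.no_grad():
        x_adv = x_adv + eps * x_adv.grad.sign()
        x_adv = x_adv.clamp(0.0, 1.0)  # keep pixels in the valid range
    return x_adv.detach()
```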
PGD is an iterative method for generating adversarial examples that extends the idea of FGSM. Introduced by Madry et al. in 2017 [25], PGD performs multiple steps of gradient ascent to find an adversarial example, with each step projecting the perturbed input back into the valid data space, typically constrained by an ℓp-norm ball around the original input. Starting from an initial adversarial example x_0 = x, PGD iteratively updates the adversarial example using the following rule:
$$x_{t+1} = \mathrm{Proj}_{B_\epsilon(x)}\left(x_t + \alpha \cdot \mathrm{sign}\left(\nabla_{x_t} J(\theta, x_t, y)\right)\right)$$
Here, x_t is the adversarial example at iteration t, and α is the step size. Proj_{B_ε(x)} denotes the projection operation that ensures the updated adversarial example remains within an ℓp-norm ball of radius ϵ centered around x. Figure 4 shows the clean and the attacked images with different ϵ values.
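A minimal PyTorch sketch of this iterative procedure under the L∞ constraint is given below, under the same assumptions as the FGSM sketch above (differentiable loss, inputs in [0, 1]); names and defaults are illustrative.

```python
import torch

def pgd_attack(model, loss_fn, x, y, eps, alpha, steps=10):
    """Iterative PGD within an L-infinity ball of radius eps around x."""
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = loss_fn(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            x_adv = x_adv + alpha * grad.sign()
            # Projection step: clip back into the eps-ball, then into the pixel range.
            x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0.0, 1.0)
    return x_adv.detach()
```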

3.2.2. Image Corruptions

To thoroughly evaluate the robustness of YOLO algorithms, it is essential to test their performance under various types of image corruptions (Figure 5) that mimic real-world challenges. In this study, we introduce a diverse set of image corruptions to assess the resilience of YOLO models. These corruptions include fog, which simulates reduced visibility conditions; pixelate, which degrades image resolution; Gaussian noise, which adds random variations in pixel intensity; impulse noise, which introduces sporadic intensity spikes; defocus blur, which mimics the effect of an out-of-focus camera; and motion blur, which replicates the effect of camera or object movement. By incorporating these common yet challenging corruptions, we aim to comprehensively test and understand the robustness of YOLO models in various adverse scenarios. All the above techniques were implemented according to a package provided by Hendrycks and Dietterich [19] in order to benchmark the robustness of neural networks.
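The corruptions above can be reproduced with the publicly released corruption benchmark code. The sketch below assumes the pip package imagecorruptions (which packages the corruption functions of Hendrycks and Dietterich [19]) and its corrupt(image, corruption_name, severity) interface; the input file name is hypothetical.

```python
import numpy as np
from PIL import Image
from imagecorruptions import corrupt  # assumed package name and interface

# The six corruption types used in this study.
CORRUPTIONS = ["fog", "pixelate", "gaussian_noise",
               "impulse_noise", "defocus_blur", "motion_blur"]

image = np.array(Image.open("example.jpg").convert("RGB"))  # HxWx3 uint8 array

for name in CORRUPTIONS:
    # Severity ranges from 1 (mild) to 5 (severe).
    corrupted = corrupt(image, corruption_name=name, severity=3)
    Image.fromarray(corrupted).save(f"example_{name}.jpg")
```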

3.3. Attack Settings and Preprocessing

We evaluated the robustness of YOLO models using 5000 test images from the COCO dataset. Each model was tested using its official implementation, ensuring that all results were based on clean, pre-trained models. Since YOLOv1 and YOLOv2 were trained and evaluated on different, smaller datasets, they are excluded from our experiments. Furthermore, we utilized all available sizes for each YOLO model to comprehensively evaluate their behavior in relation to model capacity. However, YOLOv3, YOLOv4, and YOLOR-CSP were tested as individual models without size variations. Additionally, we included both the official YOLOv3 and the Ultralytics implementation, referred to as YOLOv3u, to compare potential differences in robustness arising from distinct implementations.
This diverse selection allowed for a nuanced analysis of robustness across varying model capacities and implementations. From the perspective of adversarial robustness, we employed FGSM and PGD, two widely recognized and well-established attack methods that serve as strong baselines in the field. These attacks were selected due to their computational efficiency, ease of implementation, and broad acceptance in the literature as standard tools for benchmarking robustness. FGSM provides a fast, single-step perturbation useful for stress-testing models under minimal constraints, while PGD offers a more iterative and powerful variant that simulates a stronger adversary. Together, they allow for a clear and interpretable robustness evaluation, especially when applied across a large suite of pre-trained object detection models like the YOLO series. Importantly, these attacks were implemented under a black-box threat model, which closely mirrors real-world deployment conditions. In practical scenarios, particularly when models are integrated into proprietary systems or commercial APIs, attackers typically do not have access to the model’s internal parameters, architecture, or gradients. By evaluating robustness in this setting, our study aims to reflect realistic adversarial conditions and provide insights into how YOLO models perform when exposed to plausible external threats, rather than idealized white-box situations, making the evaluation closer to actual deployment conditions. We used the Adversarial Robustness Toolbox (ART) [37] to generate adversarial images. The FGSM and PGD attacks were applied with five different perturbation levels, denoted by ϵ values of 0.01, 0.03, 0.05, 0.07, and 0.1. For the PGD attack specifically, we utilized the L∞-norm constraint, a maximum of 10 iterations, and a step size α = 0.1. Table 1 shows the details of the adversarial attacks. Here, lower values of ϵ correspond to minimal noise, introducing subtle perturbations, while higher values of ϵ represent more significant noise levels, introducing larger perturbations. These varied ϵ levels allow for a controlled assessment of the model’s robustness under increasing intensities of adversarial noise. Furthermore, we resized all images to 640 × 640 pixels to apply the adversarial attacks. This resizing led to a degradation in mAP, so, for fairness, we treat the mAP on resized images as the clean baseline. However, since resizing itself introduces an additional transformation, we can consider it an implicit attack.
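For reference, a sketch of how these attack settings could be configured with ART is shown below. The FastGradientMethod and ProjectedGradientDescent attack classes are part of ART, while the PyTorchYolo estimator wrapper and its constructor arguments are assumptions about the surrounding setup rather than the exact code used in our experiments.

```python
import numpy as np
from art.attacks.evasion import FastGradientMethod, ProjectedGradientDescent
from art.estimators.object_detection import PyTorchYolo  # assumed ART wrapper for YOLO models

EPSILONS = [0.01, 0.03, 0.05, 0.07, 0.1]

def generate_adversarial_batches(model, images):
    """Generate FGSM and PGD adversarial versions of `images` for a YOLO `model`.

    `model` is a pre-trained YOLO torch.nn.Module and `images` an N x 3 x 640 x 640
    float array with pixel values in [0, 1].
    """
    detector = PyTorchYolo(model=model, input_shape=(3, 640, 640), clip_values=(0.0, 1.0))
    adversarial = {}
    for eps in EPSILONS:
        fgsm = FastGradientMethod(estimator=detector, eps=eps)
        pgd = ProjectedGradientDescent(estimator=detector, eps=eps,
                                       eps_step=0.1, max_iter=10, norm=np.inf)
        adversarial[("fgsm", eps)] = fgsm.generate(x=images)  # single-step attack
        adversarial[("pgd", eps)] = pgd.generate(x=images)    # 10-step iterative attack
    return adversarial
```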
For image corruption, we utilized a library developed by Hendrycks and Dietterich [19] and applied various types of noise, including Gaussian noise, impulse noise, defocus blur, motion blur, pixelation, and fog noise. These corruption types can be applied directly to images without necessitating resizing. We consider that the combination of adversarial attacks with image corruption provides a comprehensive evaluation of the robustness of YOLO models.

4. YOLO Model Series

YOLO (You Only Look Once) [10] is a real-time object detection method that frames object detection as a single regression problem rather than a series of region-based proposals. Unlike other approaches that generate multiple region proposals and classify each, YOLO directly predicts bounding boxes and class probabilities from entire images in one evaluation. This makes it highly efficient for applications requiring fast processing speeds.
YOLO divides an input image into an S×S grid, where each grid cell is responsible for predicting a fixed number of bounding boxes and their confidence scores, along with class probabilities for each box. Each prediction includes the bounding box coordinates (center x, center y, width, height), the confidence score that indicates the likelihood of the box containing an object, and the class probabilities for the detected object. The confidence score is a product of the IoU (Intersection over Union) between the predicted box and the ground truth, and the probability of the object’s presence in the box.
In the YOLO framework, multiple bounding boxes are predicted for each grid cell. During training, only one bounding box predictor is assigned to be responsible for detecting a particular object. This assignment is based on which predicted box has the highest Intersection over Union (IoU) with the ground truth box. This process allows each bounding box predictor to specialize, focusing on different object sizes, aspect ratios, or classes, which helps improve the model’s recall and detection accuracy.
A crucial technique used in YOLO models is Non-Maximum Suppression (NMS). NMS is a post-processing method that enhances the accuracy and efficiency of object detection by addressing the issue of multiple overlapping bounding boxes for the same object. When several predicted boxes overlap and represent the same object, NMS selects the box with the highest confidence score and removes others that are deemed redundant. This process ensures that each object is represented by a single, most accurate bounding box, improving the clarity of detection results.
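As a minimal illustration of this procedure, the greedy NMS sketch below keeps the highest-scoring box and discards boxes whose overlap with it exceeds an IoU threshold, repeating until no candidates remain. It reuses the iou helper from the sketch in Section 3.1 and uses plain Python lists for clarity, so it is illustrative rather than the NMS implementation shipped with any YOLO release.

```python
def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the highest-scoring box, drop strongly overlapping boxes, repeat.

    `boxes` is a list of (x1, y1, x2, y2) tuples and `scores` the matching confidence
    scores; returns the indices of the boxes that survive suppression.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # Discard remaining boxes whose overlap with the kept box exceeds the threshold.
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep
```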

4.1. YOLO v1

YOLO v1 consists of 24 convolutional layers followed by two fully connected layers. The first 20 convolutional layers were pre-trained on ImageNet at a resolution of 224 × 224 pixels. The architecture of YOLO v1 is inspired by GoogleNet [38] and Network in Network [39], which use 1 × 1 convolutional layers to decrease the number of feature maps and keep the number of parameters low. All layers employ a Leaky Rectified Linear Unit (LReLU), apart from the last layer, which uses a linear activation function. The model was fine-tuned on PASCAL VOC 2007 and 2012 at a resolution of 448 × 448 pixels. The Darknet framework was used for training.

4.2. YOLO v2

The architecture of YOLOv2 [40], named darknet-19, is similar to the VGG [41] concept and consists of 19 convolutional layers and five max pooling layers. Building upon YOLOv1, YOLOv2 introduces some new advancements.
  • Batch normalization (BN) is implemented to alleviate the Internal Covariate Shift and to improve convergence.
  • High Resolution Classifier. In YOLOv2, the classifier was pre-trained on ImageNet, like YOLOv1, at a resolution of 224 × 224 pixels. However, for the last 10 epochs of training, the resolution is increased to 448 × 448 pixels.
  • Anchor Boxes. YOLOv2 uses anchor boxes (or prior boxes) to predict bounding boxes. These anchor boxes represent different shapes and aspect ratios, and they are used as reference templates for the objects that the model might detect. In YOLOv2, five anchor boxes were used. The widths and heights of the anchor boxes should match the common object shapes in the training set; for that purpose, the authors used k-means clustering to choose their sizes (a minimal sketch of this clustering appears after this list).
  • Direct Location Prediction. Instead of predicting bounding box coordinates directly (as in YOLOv1), YOLOv2 predicts adjustments relative to these anchor boxes for each grid cell.
  • Fine Grained Features. YOLOv2 combines features from earlier and deeper layers to deal with the detection of small objects. This is carried out by using a passthrough layer that concatenates the higher-resolution features with the lower-resolution features.
  • Multi-Scale Training. Since YOLOv2 uses only convolutional and pooling layers, it can handle images of multiple sizes, making it robust to a variety of input dimensions. For that reason, during training, the network chooses a different image size every 10 batches.
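As referenced in the Anchor Boxes item above, the following sketch shows the general idea of selecting anchor sizes with k-means over ground-truth box widths and heights, using the 1 − IoU distance; the helper names and the toy box list are illustrative rather than the original implementation.

```python
import numpy as np

def iou_wh(wh, centroids):
    """IoU between one (w, h) pair and each centroid, assuming shared top-left corners."""
    inter = np.minimum(wh[0], centroids[:, 0]) * np.minimum(wh[1], centroids[:, 1])
    union = wh[0] * wh[1] + centroids[:, 0] * centroids[:, 1] - inter
    return inter / union

def kmeans_anchors(box_wh, k=5, iters=100, seed=0):
    """k-means over box widths/heights using the 1 - IoU distance."""
    rng = np.random.default_rng(seed)
    centroids = box_wh[rng.choice(len(box_wh), k, replace=False)].astype(float)
    for _ in range(iters):
        # Assign each box to the centroid with the smallest 1 - IoU distance.
        assign = np.array([np.argmin(1 - iou_wh(wh, centroids)) for wh in box_wh])
        new_centroids = np.array([
            box_wh[assign == c].mean(axis=0) if np.any(assign == c) else centroids[c]
            for c in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids

# Toy usage: widths/heights (in pixels) of ground-truth boxes from a training set.
anchors = kmeans_anchors(np.array([[30, 60], [35, 70], [120, 90], [110, 95], [300, 200]]), k=2)
```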

4.3. YOLO v3

YOLOv3 was introduced by Redmon et al. [42] in 2018 and proposed some important changes over YOLOv2. First of all, YOLOv3 uses a bigger backbone named Darknet-53, which consists of 53 convolutional layers with 3 × 3 and 1 × 1 convolutional filters and residual connections. Also, max-pooling layers were replaced by strided convolutions. Moreover, in YOLOv3, binary cross entropy is used for classification instead of softmax, which allows for multiple label assignments in the same box. In YOLOv3, objects are detected at three different stages/layers of the network. Assuming the input to the network is 416 × 416 pixels, the three feature maps obtained are 52 × 52, 26 × 26, and 13 × 13, responsible for detecting small, medium, and large objects, respectively. The network extracts features from each of the three scales using a similar concept to Feature Pyramid Networks (FPNs). Upsampling and concatenating features at different scales allows the network to learn more fine-grained information from the earlier feature maps and more meaningful semantic information from the upsampled later-layer feature maps. Similarly to YOLOv2, the authors use k-means clustering to choose the anchor boxes in YOLOv3, but instead of five, nine clusters are selected, divided evenly across the three scales, which means three anchors per scale per grid cell.

4.4. The Structure of Object Detectors

Henceforward, the architecture of modern object detectors is divided into three parts, the backbone, the neck, and the head, as shown in Figure 6.
The backbone part is responsible for extracting rich feature representations from the input image. It is typically a convolutional neural network (CNN) that is pre-trained on large-scale image classification datasets like ImageNet. Popular backbones include ResNet [43], VGG, Darknet-53, and more. The backbone’s purpose is to generate a feature map that encodes various spatial and semantic information about objects within the image.
The neck part further refines and processes these feature maps, often focusing on multi-scale feature aggregation. Techniques like Feature Pyramid Networks (FPNs) [44] or Path Aggregation Networks (PANet) [45] are commonly used to enhance feature maps by combining high resolution details (earlier layers) with deeper, semantically rich information (deeper layers). This multi-scale processing is crucial for detecting objects of varying sizes in an image, as it helps maintain both fine details and high-level context across different layers.
The head part is responsible for producing the final detection outputs from the processed feature maps. It includes layers that predict object bounding boxes, classify detected objects, and calculate objectness scores, indicating the presence of an object. The head often employs convolutional layers with regression for bounding box predictions and logistic regression for class probabilities. For models like YOLO and Faster R-CNN, the head generates bounding boxes and class scores for each anchor box at every spatial location in the feature map, allowing for the efficient detection of multiple objects.

4.5. YOLO v4

YOLOv4 was published in 2020 by Bochkovskiy et al. [46], a different team of authors from the previous versions. However, the main features and the philosophy remain the same. YOLOv4 introduces several new features that are divided into two main categories, Bag-of-Freebies and Bag-of-Specials. Bag-of-Freebies affect only the training procedure, increasing the training cost but without affecting inference time. On the other hand, Bag-of-Specials have an impact on inference by slightly increasing the inference cost but with an important improvement in accuracy.
The Bag-of-Specials leads to the following architecture. The authors tried several CNN architectures for the backbone, such as ResNeXt50 [47], EfficientNet-B3 [16], and Darknet-53. They finally chose a variation of Darknet-53 with cross-stage partial connections (CSPNet) [48] and the Mish activation function. CSP connections decrease the computation cost while preserving high accuracy. After the backbone, YOLOv4 uses spatial pyramid pooling (SPP) [49], which allows multi-scale prediction and increases the receptive fields of the network. A spatial attention module (SAM) [50] block was also added; SAM assists the model in emphasizing important spatial regions. SPP and SAM, along with a variation of PANet for feature aggregation, constitute the neck of YOLOv4. The original PANet algorithm adds the aggregated features, while the modified version concatenates them. As a head, YOLOv4 uses the same one as YOLOv3 with anchor-based detection.
The implementation of bag-of-freebies includes several steps. Strong data augmentation techniques such as mosaic and cutmix [51] were applied along with typical techniques like random brightness, contrast, scaling, cropping, flipping, and rotation. Also, class label smoothing and DropBlock [52] were used for regularization. Furthermore, at the detection stage, CIoU [53] and cross-mini-batch normalization were used. Finally, self-adversarial training was applied to make the model robust under adversarial attacks and perturbations.

4.6. YOLO v5

YOLOv5 [54] was introduced by Glen Jocher in 2020. It is the first YOLO model that was not developed in the Darknet framework but in Pytorch. It has many similarities with YOLOv4. A modified CSPDarknet53 is used as the backbone. The authors added the Stem layer, which is a strided convolution layer with a large window size to reduce the computational cost. The architecture of the neck uses a spatial pyramid pooling fast (SPPF) layer with a modified CSP-PAN. The SPPF layer is designed to accelerate network processing by merging features from different scales into a uniform-size feature map, enhancing computational efficiency. The head is similar to the YOLOv3 head.
Moreover, YOLOv5 uses a plethora of data augmentation techniques such as Mosaic, Copy-Paste, random affine transformations, MixUp, Albumentations, HSV, and random horizontal flip. In YOLOv5, various training strategies are implemented to enhance model performance and efficiency. Multi-scale training is utilized, where input images are randomly resized between 0.5 and 1.5 times their original size during training, enabling the model to adapt to objects of different scales. The AutoAnchor mechanism optimizes anchor boxes to better align with the characteristics of ground truth boxes in the training data, improving detection accuracy. Additionally, a warmup and cosine learning rate scheduler gradually adjusts the learning rate, helping to stabilize the training process and improve convergence. To ensure more stable model updates and reduce generalization error, an Exponential Moving Average (EMA) of model parameters was employed. Mixed precision training further contributes to faster training by performing computations in half-precision, leading to reduced memory usage. Finally, hyperparameter evolution automatically adjusts hyperparameters through an evolutionary approach, finding optimal configurations for better performance. These strategies collectively contribute to YOLOv5’s efficient training and accurate detection capabilities.

4.7. YOLOR

In 2021, YOLOR (You Only Learn One Representation) was introduced by Wang et al. [55]. YOLOR emerged as one of the most advanced YOLO models, introducing a multi-task learning approach, as the model was trained to perform classification, detection, and pose estimation simultaneously. The authors combined explicit and implicit knowledge to achieve joint learning. Explicit knowledge is gained by training a model in a supervised manner for a specific task, while implicit knowledge simulates subconscious learning, similar to how humans gain latent background knowledge without direct supervision. Even though joint learning usually leads to poor feature generation, the combination of implicit and explicit knowledge presents improvements in all tasks. The authors proposed three ways to integrate implicit with explicit knowledge.
  • Vector/Matrix/Tensor (z): Direct use of a vector z (or higher-dimensional tensor) as implicit knowledge. Simple but effective for tasks where prior knowledge can be directly applied.
  • Neural Network (Wz): Transforms the implicit knowledge vector using a neural network to allow for more complex, task-specific interaction with explicit features.
  • Matrix Factorization (ZTc): Combines multiple vectors of implicit knowledge through factorization, enabling a rich, flexible combination of implicit knowledge with explicit learning.

4.8. YOLOX

In 2021, Zheng Ge et al. from Megvii Technology released YOLOX [56]. The authors used YOLOv3, with its Darknet-53 backbone and SPP layer together with an FPN, as the baseline. However, they introduced five key modifications.
  • Decoupled head: In the previous YOLO versions, the head was responsible for both the classification and the regression task for bounding boxes. However, YOLOX uses a decoupled head, where two different branches are responsible for the classification and regression tasks independently (Figure 7); a minimal sketch of this design appears after this list.
  • Anchor-free: YOLOX proposes an anchor-free mechanism since it simplifies the training and decoding phase. Also, the anchor-free mechanism leads to a decreased complexity of detection heads and an increase in predictions in the image.
  • Multi-positives: In contrast to previous YOLO models, which selected only one positive sample per object, YOLOX assigns multiple predictions as positives, assuming that the predicted center is close to the center of the ground truth. Specifically, positive samples are those whose centers fall within a 3 × 3 area around the ground truth center.
  • SimOTA: It is a key technique used in YOLOX to improve how the model assigns positive samples during training. It is based on the Optimal Transport Assignment (OTA) [57] algorithm and it aims to minimize the matching cost between predictions and ground truth objects, allowing the model to focus on the most promising predictions. By adopting SimOTA, YOLOX enhances its training efficiency, reduces the positive–negative imbalance, and ultimately improves detection accuracy.
  • Strong augmentations: YOLOX used the MixUP and Mosaic data augmentations and disabled them for the last 15 epochs. Also, the authors concluded that when using these augmentation techniques, the impact of pre-training is no longer as substantial.
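As referenced in the decoupled head item above, the following minimal PyTorch sketch illustrates the idea of separate classification and regression/objectness branches; layer counts, channel sizes, and activation choices are illustrative rather than the exact YOLOX head.

```python
import torch
import torch.nn as nn

class DecoupledHead(nn.Module):
    """Minimal sketch of a decoupled detection head: separate branches for
    classification and for box regression/objectness instead of one shared branch."""

    def __init__(self, in_channels, num_classes):
        super().__init__()
        self.stem = nn.Conv2d(in_channels, in_channels, kernel_size=1)
        self.cls_branch = nn.Sequential(
            nn.Conv2d(in_channels, in_channels, 3, padding=1), nn.SiLU(),
            nn.Conv2d(in_channels, num_classes, 1),   # per-cell class scores
        )
        self.reg_branch = nn.Sequential(
            nn.Conv2d(in_channels, in_channels, 3, padding=1), nn.SiLU(),
            nn.Conv2d(in_channels, 4 + 1, 1),          # 4 box offsets + 1 objectness score
        )

    def forward(self, feat):
        feat = self.stem(feat)
        return self.cls_branch(feat), self.reg_branch(feat)

# Usage on a hypothetical 256-channel FPN feature map.
cls_out, reg_out = DecoupledHead(256, num_classes=80)(torch.randn(1, 256, 40, 40))
```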

4.9. YOLO v6

In 2022, YOLOv6 was released by Chuyi Li et al. [58], presenting improvements in both accuracy and inference speed. The development of this new version is driven by the need to address the practical challenges often encountered in industrial applications. The advancements of YOLOv6 are summarized as follows.
  • Network Design: The authors explored the RepVGG [59] architecture as the backbone, determining that it is well suited for small networks due to its ability to provide rich feature representations. Consequently, the smaller versions of YOLOv6 utilize RepBlock as their building block. However, in larger networks, RepBlock significantly increases computational costs, leading the authors to adopt an optimized CSP block, referred to as the CSPStackRep Block. Additionally, an enhanced PAN algorithm is employed as the neck, incorporating both RepBlocks and CSPStackRep Blocks. Finally, YOLOv6 features a decoupled head.
  • Label Assignment: YOLOv6 employs an advanced label assignment technique similar to YOLOX, as it is also anchor-free. The authors experimented with various label assignment methods, including SimOTA, and ultimately adopted TAL (Task-Aligned Labeling), which was introduced in TOOD [60].
  • Loss Function: The authors experimented with several loss functions for classification, box regression, and object loss, ultimately choosing VariFocal Loss [61] for classification and SIoU [62]/GIoU [63] for regression.
  • Industry-handy improvements: YOLOv6 incorporates practical enhancements like self-distillation and extended training epochs to boost model performance. Additionally, a cosine decay strategy adjusts the balance between soft and hard labels during training, helping the student model adapt throughout different training phases.
  • Quantization and deployment: For quantization and deployment, YOLOv6 uses RepOptimizer to produce Post-Training Quantization (PTQ)-friendly weights and combines quantization-aware training (QAT) with channel-wise distillation and graph optimization.

4.10. YOLO v7

In 2022, Wang et al. [64] published YOLO v7, which presents significant improvements in both accuracy and speed. The authors proposed several changes which are divided into architectural and trainable bag-of-freebies.
  • Architecture
    1. 
    E-ELAN: The authors introduced the E-ELAN block. ELAN [65] is a specific computational block designed to improve gradient flow by enabling more direct gradient paths and optimizing the way information is combined across layers. E-ELAN extends this concept, allowing for even better learning by structuring the network in a way that maintains a balance between model depth and the ability of gradients to travel through the network.
    2. 
    Model scaling for concatenation-based models: Model scaling in concatenation-based models is challenging because performing depth scaling causes width scaling too, which can lead to inefficiencies in terms of computational cost and memory usage. The authors introduced a new method for scaling concatenation-based models using a uniform factor, ensuring that the scaling happens in a more balanced way, keeping the network’s complexity under control while maintaining its capacity for learning.
  • Trainable Bag-of-Freebies
    1. 
    Planned re-parameterized convolution: Inspired by YOLOv6, YOLOv7 incorporates the RepConv architecture. However, the authors found that using this technique led to a significant decrease in the performance of models such as ResNet and DenseNet. To address this issue, they proposed RepConvN, which modifies RepConv by eliminating the identity connection, thus enhancing performance while maintaining the advantages of re-parameterized convolutions.
    2. 
    Auxiliary head coarse-to-fine: In YOLOv7, the head consists of a lead and an auxiliary component. The lead head focuses on final predictions, while the auxiliary head aids in training, guided by a coarse-to-fine label assignment. This design improves both precision and recall, allowing YOLOv7 to achieve high accuracy without added inference costs.
    3. 
    Other trainable Bag-of-Freebies: The authors used some other trainable tricks that they proposed. (1) Batch normalization in conv-bn-activation topology, (2) implicit knowledge, which was also used in YOLOR, and (3) Exponential Moving Average (EMA) as the final inference.

4.11. YOLO v8

In 2023, Ultralytics introduced YOLOv8 [66], which shares many similarities with YOLOv5, as both models were developed by the same company. YOLOv8 uses CSPDarknet as a backbone, like YOLOv5, but with a different module named C2f, which is a cross-stage partial bottleneck with two convolutions, replacing C3 in YOLOv5.
Another difference between these models is the fact that YOLOv8 is an anchor-free model since it does not use prior anchor boxes. Furthermore, a decoupled head was employed to process objectness score, classification, and regression independently. Finally, BCE was used for classification loss, while CIoU and DFL functions were used for bounding-box predictions. During the training process of YOLOv8, the mosaic data augmentation technique is employed, but it is deactivated during the final 10 epochs of training.

4.12. YOLO NAS

YOLO-NAS [67] is an advanced object detection architecture developed by Deci-AI using Automated Neural Architecture Construction (AutoNAC), which automates the design and optimization of deep neural networks. It delivers state-of-the-art performance in tasks like object detection by balancing speed, accuracy, and computational efficiency. Key features of YOLO-NAS include its adaptability for edge devices through hybrid quantization, which selectively reduces the precision of different parts of the neural network, ensuring smaller model size and lower power consumption without sacrificing significant accuracy. In addition to quantization, YOLO-NAS incorporates advanced machine learning techniques such as Post-Training Quantization (PTQ) and quantization-aware training (QAT). PTQ reduces the model’s precision after training, enabling faster inference and lower resource consumption, while QAT trains the network to adapt to quantization effects, maintaining high accuracy even after reducing the model size. Furthermore, YOLO-NAS integrates Distribution Focal Loss (DFL), a loss function that improves the detection of objects of varying sizes, making it highly adaptable to different object detection tasks, from small-scale object recognition to large-scale scenarios.

4.13. YOLO v9

In 2024, YOLOv9 was released by Wang et al. [68], introducing two important innovations, the Programmable Gradient Information (PGI) and the Generalized Efficient Layer Aggregation Network (GELAN).
PGI: Programmable Gradient Information (PGI) is introduced to mitigate the Information Bottleneck (IB) phenomenon, which often occurs in deep neural networks. PGI consists of three key components: (1) the main branch: the architecture responsible for inference; (2) the Auxiliary Reversible Branch: used only during the training phase, this branch ensures the preservation of reliable gradients through the use of reversible functions; and (3) Multi-Level Auxiliary Information: a mechanism that provides multiple levels of supervision, guiding the learning process in the main branch by supervising intermediate layers.
GELAN: Inspired by CSPNet and ELAN, the authors proposed the GELAN architecture, optimized for low complexity, fast inference, and high accuracy. GELAN enhances ELAN’s ability to stack multiple convolutional blocks by enabling the stacking of any computational block. In YOLOv9, CSPNet and the planned RepConv blocks are used as computational units to further improve the model’s performance.

4.14. YOLO v10

YOLOv10 was published by Wang et al. [69] and it is built on the Ultralytics package. The proposed model introduced several advancements improving both accuracy and computational cost and can be divided into three categories: Consistent Dual Assignments for NMS-free Training, efficiency-driven model design, and accuracy-driven model design.
Consistent Dual Assignments for NMS-free Training: Previous YOLO models used one-to-many label assignment strategies, meaning that multiple predicted bounding boxes could be assigned to a single ground truth object. While this approach allows the model to learn richer features and representations, it requires the use of a post-processing method, Non-Maximum Suppression (NMS), which increases computational cost. On the other hand, one-to-one label assignment is more computationally efficient but does not provide the same level of representation learning. YOLOv10 combines these methods to leverage the benefits of both. During training, both the one-to-one and one-to-many heads are optimized, but at inference, the one-to-many head is discarded, reducing the need for NMS. To ensure optimal training, a consistent matching metric is applied to align the one-to-one head’s supervision with the learning objectives of the one-to-many head.
Efficiency-Driven Model Design: YOLOv10 employs efficiency-driven changes in the classification head, the downsampling layers, and the basic building blocks of each stage. The authors stated that the regression branch carries more impactful information than the classification branch, which is why they introduced a lightweight architecture in the classification head without sacrificing significant performance. Typical YOLO architectures use convolutions with stride 2 for spatial downsampling while also doubling the number of channels. Instead, in YOLOv10, this procedure consists of two stages: first, a pointwise convolution is applied to change the number of channels, followed by a depthwise convolution for spatial downsampling. Finally, YOLOv10 introduces a rank-guided block design to enhance computational efficiency and optimize stage-specific architecture. By conducting intrinsic rank analysis, the team identified redundancies in using uniform blocks across all stages. To address this, they developed the Compact Inverted Block (CIB), combining depthwise and pointwise convolutions for spatial and channel mixing, respectively, and integrated it into the Efficient Layer Aggregation Network (ELAN) for improved efficiency in object detection.
Accuracy-Driven model Design: YOLOv10 introduces large kernel convolution, which enhances the receptive field and as a consequence the capability of the model. Nevertheless, large kernels have a negative impact on small object detection. That is why the authors use large kernels (7 × 7 instead of 3 × 3) only in deeper stages in the CIB. They further apply structural reparameterization [70] to introduce an additional 3 × 3 depthwise convolution branch, helping to address optimization challenges without adding inference overhead. Furthermore, YOLOv10 enhances global feature modeling with Partial Self-Attention (PSA). Channels are split post-1 × 1 convolution, applying self-attention to only one subset, then recombining them with another 1 × 1 convolution. The query and key dimensions in MHSA are halved, and PSA is placed only after Stage 4 to reduce computational costs.
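A minimal PyTorch sketch of the partial self-attention idea described above is shown below: channels are split after a 1 × 1 convolution, self-attention is applied to one half only, and the halves are fused with another 1 × 1 convolution. The head count and the omission of the halved query/key dimensions are simplifications, so this illustrates the concept rather than the YOLOv10 module.

```python
import torch
import torch.nn as nn

class PartialSelfAttention(nn.Module):
    """Minimal sketch: split channels after a 1x1 conv, attend over one half only,
    then fuse both halves with another 1x1 conv. `channels` must be even and
    `channels // 2` divisible by `num_heads`."""

    def __init__(self, channels, num_heads=4):
        super().__init__()
        self.pre = nn.Conv2d(channels, channels, kernel_size=1)
        self.attn = nn.MultiheadAttention(channels // 2, num_heads, batch_first=True)
        self.post = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):
        x = self.pre(x)
        a, b = x.chunk(2, dim=1)                 # attended half / untouched half
        n, c, h, w = a.shape
        tokens = a.flatten(2).transpose(1, 2)    # (N, H*W, C/2) token sequence
        attended, _ = self.attn(tokens, tokens, tokens)
        a = attended.transpose(1, 2).reshape(n, c, h, w)
        return self.post(torch.cat([a, b], dim=1))

# Usage on a hypothetical deep-stage feature map.
out = PartialSelfAttention(256)(torch.randn(1, 256, 20, 20))
```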

4.15. YOLO v11

YOLOv11 is the latest model at this time and it was released by Ultralytics [71] who also released YOLOv8 and YOLOv5. YOLOv11 introduces several advancements in its backbone, neck, and head parts to enhance object detection efficiency and accuracy. The backbone employs a new C3k2 block, replacing the older C2f block from YOLOv8, which optimizes feature extraction with two smaller convolutions, improving computational efficiency. The SPPF block is retained, and a new Cross Stage Partial with Spatial Attention (C2PSA) block is added to enhance spatial attention, allowing the model to focus on important regions. In the neck, C3k2 improves feature aggregation, further enhancing speed and processing efficiency. The head utilizes multiple C3k2 and CBS (Convolution-BatchNorm-Silu) blocks, with C3k2 modules adaptable for different convolutional sizes to refine multi-scale features. The final detect layer produces bounding boxes and class scores, leveraging these innovations for faster and more precise object localization and classification.
Figure 8 represents the evolution of YOLO object detection architectures, from YOLOv3 to YOLOv11. Each block represents a specific YOLO version, highlighting its backbone network, architectural improvements, and key innovations.

5. Experimental Study in Robustness

Evaluating the robustness of YOLO object detection models under adversarial attacks and corrupted images is crucial for their deployment in real-world applications, particularly in safety-critical systems like autonomous vehicles and surveillance. Adversarial attacks, which involve adding subtle perturbations to images, can mislead YOLO models into misclassifying or failing to detect objects, posing severe safety risks. Similarly, common image corruption types, such as blurring, noise, or variations in brightness, can degrade model performance, affecting detection accuracy and reliability. Robustness evaluation helps identify the vulnerabilities of YOLO models under these conditions and drives the development of defense mechanisms to enhance their resilience. While YOLO development teams focus on improving accuracy and reducing computational cost, robustness evaluation is essential to ensuring that YOLO models maintain high accuracy and reliability across various conditions, ultimately making them safer and more dependable for deployment in unpredictable environments.

Results

In this section, we present the evaluation results of the YOLO models under the previously described attacks. Table 2 provides the nominal performance metrics for each model, including their parameters. In subsequent tables, we document the degradation in accuracy observed under each attack relative to the baseline (clean) accuracy.
Table 3 and Table 4 present the results from the FGSM and PGD attacks, respectively. The mAP column reports the nominal performance metrics of each YOLO model on the COCO dataset, while the clean mAP column shows the models’ performance on resized, unaltered images. The remaining columns represent the degradation in mAP relative to the clean mAP for each corresponding epsilon value.
YOLOX-X and larger YOLO versions consistently showed lower accuracy degradation across various perturbation levels. The superior robustness of YOLOX models can be attributed to their anchor-free design and decoupled heads. The anchor-free structure inherently reduces complexity, minimizing the potential vulnerabilities associated with anchor box predictions. Furthermore, the decoupled head separates classification from bounding box regression, limiting adversarial impact because perturbations typically exploit coupled predictions.
YOLOX-X again demonstrated superior resilience compared to other models, notably due to its SimOTA positive sample assignment strategy, which optimizes the training by focusing on predictions closest to the ground truth. This selective optimization reduces susceptibility to gradient-based adversarial perturbations, such as those produced by PGD attacks. Larger variants like YOLOv11-X and YOLOv9-E also exhibited robustness due to deeper architectures and improved feature aggregation, facilitating more stable internal representations even when perturbed.
Table 5 shows the degradation models experienced under image corruptions such as fog, pixelation, Gaussian noise, impulse noise, defocus blur, and motion blur, as well as under adversarial attacks from FGSM and PGD with ϵ = 0.01.
YOLOv3 and YOLOv4 outperformed newer models in specific corruptions like pixelation and Gaussian noise due to their comparatively simpler feature extraction pipelines, which avoid excessive sensitivity to fine-grained perturbations. Conversely, YOLOX performed strongly against fog and blur effects because of its optimized multi-scale feature aggregation, enabling better adaptation to texture and spatial distortions. Models with explicit data augmentation strategies such as YOLOv7 and YOLOv8 maintained decent robustness but exhibited specific weaknesses to pixelation noise, indicating a trade-off between robustness to naturally occurring distortions versus artificial adversarial perturbations.
Conversely, YOLO models relying heavily on anchor boxes, particularly YOLOv5 and earlier variants, show greater sensitivity to adversarial attacks, as perturbations often significantly disrupt anchor-dependent predictions. Older architectures like YOLOv3, despite lacking modern efficiency features, display surprising robustness in specific corruptions due to their straightforward structure, which inadvertently limits the exploitation of complex model features by adversarial attacks. Models employing heavy data augmentation (e.g., YOLOv8, YOLOv11) showed robustness to environmental distortions but vulnerability to pixel-level adversarial perturbations, highlighting the nuanced trade-offs in model training strategies.
Figure 9 is a heatmap that visualizes the average performance degradation (AVG Degradation) of various YOLO models under adversarial attacks or image corruptions, with lighter colors indicating higher robustness. The models are ordered by degradation level, facilitating a clear comparison of their relative resilience.
To gain a more nuanced understanding of model robustness beyond overall mAP degradation, we conducted an area-wise breakdown based on object sizes for the most recent YOLO models and their largest version, including YOLOX, which also demonstrated the highest overall robustness. Following COCO standards, objects were categorized into small (area < 32² pixels), medium (32² ≤ area < 96² pixels), and large (area ≥ 96² pixels). Across most YOLO variants, the small object category consistently exhibited the greatest performance degradation under FGSM attacks, which aligns with the intuition that small objects are more susceptible to slight pixel-level perturbations. For instance, YOLOX, YOLOv8, and YOLOv9 all showed their highest FGSM-induced mAP drop in the small object subset. Interestingly, under PGD, a more iterative and powerful attack, medium-sized objects experienced the greatest degradation in YOLOX, YOLOv8, YOLOv10, and YOLOv11. YOLOv9 was an exception, with small objects remaining the most vulnerable even under PGD. Notably, YOLOv10 showed an unusual pattern, with large objects being the most affected under FGSM. This divergence may indicate model-specific weaknesses in handling scale-invariant features under adversarial noise. The summarized results of area-based vulnerability across models and attack types are provided in Table 6.
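The area-wise breakdown described above follows the standard COCO evaluation protocol; a sketch of how the small/medium/large AP values can be read from pycocotools is given below. The annotation and detection file names are hypothetical, and the stats indices follow the standard COCOeval summary layout.

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

# Hypothetical file names: COCO val2017 ground truth and the detections
# produced by a YOLO model on the attacked (or corrupted) images.
coco_gt = COCO("instances_val2017.json")
coco_dt = coco_gt.loadRes("yolo_adversarial_detections.json")

evaluator = COCOeval(coco_gt, coco_dt, iouType="bbox")
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()

# In the standard summary layout, stats[3:6] hold AP for small, medium, and
# large objects (area thresholds of 32^2 and 96^2 pixels).
ap_small, ap_medium, ap_large = evaluator.stats[3:6]
```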
A class-level analysis further reveals how adversarial attacks affect semantic categories differently. The results highlight that some classes remain remarkably stable even under severe attacks, while others experience substantial degradation. For instance, YOLOX and YOLOv8 identified “toothbrush” as the most robust class under FGSM, while YOLOv9, YOLOv10, and YOLOv11 found “toaster” among the most robust across both FGSM and PGD, even outperforming their clean performance in YOLOv9. This could be attributed to their distinct shapes or relatively less visual ambiguity. In contrast, classes like “bear”, “microwave”, “suitcase”, and “hair drier” were among the most vulnerable, showing substantial mAP degradation. YOLOv11 notably identified “suitcase” as the most degraded under both FGSM and PGD, indicating potential sensitivity to structured textures or occlusions. These findings suggest that robustness is not uniform across semantic categories and that certain classes could act as stress tests when evaluating model vulnerability. They also point to opportunities for targeted data augmentation or architecture refinement to improve resilience in weaker categories. An overview of the most robust and vulnerable classes for each model and attack type is presented in Table 7.
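The per-class figures behind Table 7 can be extracted from COCOeval's accumulated precision array. The following sketch assumes a `coco_eval` object on which evaluate() and accumulate() have already been run; it illustrates the procedure rather than reproducing the exact evaluation script used here.

```python
import numpy as np


def per_class_ap(coco_eval, coco_gt):
    """Return {class_name: AP@[.50:.95]} from COCOeval's precision array.

    The precision array has shape
    (iou_thresholds, recall_thresholds, classes, area_ranges, max_dets).
    """
    precision = coco_eval.eval["precision"]
    results = {}
    for k, cat_id in enumerate(coco_eval.params.catIds):
        # area-range index 0 = "all", max-dets index 2 = 100 detections
        p = precision[:, :, k, 0, 2]
        p = p[p > -1]                                  # -1 marks undefined entries
        name = coco_gt.loadCats(cat_id)[0]["name"]
        results[name] = float(np.mean(p)) if p.size else float("nan")
    return results
```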

6. Discussion

Robustness in object detectors is crucial, since they are used in critical domains such as autonomous driving and surveillance. This study highlights the relative robustness of different YOLO model versions when exposed to adversarial attacks, such as FGSM and PGD, and to image corruptions such as fog, pixelate, Gaussian noise, impulse noise, defocus blur, and motion blur. This section discusses the results of the conducted simulations and analyzes the behavior of the models under attack.
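The corruptions listed above correspond to the common-corruption benchmark of Michaelis et al. [19]; a minimal sketch of how such corrupted copies can be generated with the accompanying `imagecorruptions` package is shown below. The severity level and the assumption of uint8 HxWx3 inputs are illustrative choices, not the exact settings of our pipeline.

```python
from imagecorruptions import corrupt

CORRUPTIONS = ("fog", "pixelate", "gaussian_noise",
               "impulse_noise", "defocus_blur", "motion_blur")


def corrupt_copies(image, severity=3):
    """image: HxWx3 uint8 array. Returns one corrupted copy per corruption type."""
    return {name: corrupt(image, corruption_name=name, severity=severity)
            for name in CORRUPTIONS}
```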
Specifically, under the FGSM attack, the X-Large YOLOX model demonstrated the highest robustness. While the large versions of YOLOv5 and YOLOv8 maintained their accuracy at an epsilon value of 0.01, they experienced greater degradation than YOLOX at higher epsilon values. After YOLOX, the YOLOv4, YOLOv3, and YOLOv11x models demonstrated a notable degree of robustness. A similar pattern is observed under PGD attacks: the large YOLOv5 model maintains its accuracy at an epsilon value of 0.01, but YOLOX models exhibit lower degradation across the remaining epsilon values. After YOLOX, YOLOv3, YOLOv11x, YOLOv9-e, and YOLOv4 perform well, in that order. Furthermore, the results indicate that all sizes of YOLOX models exhibit superior robustness compared to the other models.
When images are degraded by image corruptions rather than adversarial attacks, the results are slightly different. Each corruption method introduces a distinct type of noise, which explains why a model that is robust against one type of noise may not exhibit the same resilience against others. Nevertheless, the top-performing models across each type of corruption are consistently YOLOX, YOLOv3, and YOLOv4. More precisely, the results for each corruption are as follows.
Fog: Fog causes the smallest performance degradation among all corruptions. The large YOLOX (YOLOX-l) achieved the best performance, losing only 0.8 mAP, while several other models, such as YOLOX-x, YOLOv9-e, YOLOv8m, and YOLOv7x, behave similarly and maintain a high mAP. On the other hand, YOLOv11n, YOLOv9-s, and YOLOv5n presented the worst performance.
Pixelate: Under pixelate corruption, YOLOv3 surprisingly loses only 0.7% mAP, while the other models show a considerably larger decrease in performance. YOLOv6n and YOLOv5n were the second-best models, with 3.3% and 3.4% degradation, respectively. YOLOv7, YOLOR-CSP, and all sizes of YOLOv11 except for the nano achieved the worst performance.
Gaussian Noise (GN): Under Gaussian noise, YOLOv3 is again the most robust model, with YOLOv4 and YOLOX-x performing nearly as well. YOLOv5n, YOLOv8n, and YOLOv11n were the least robust models against this type of noise.
Impulse Noise (IN): YOLOv4 and YOLOX-x were the most robust models under impulse noise, each degrading by 8.9% mAP. Impulse noise was the most damaging corruption, causing the largest degradation across models. YOLOv11, YOLOv5, YOLOv8, and YOLOv9 present some of the worst results.
Defocus Blur (DB): The defocus blur degradation scores range from 3.7% to around 8.7%. YOLOv4 has the best performance under defocus blur with a score of 3.7%, indicating it is the most robust model against this specific corruption type. Other models with relatively high robustness for this type of noise include YOLOv3 (5.5%) and YOLOX-s (6.5%), which suggests that these models are also more resilient to defocus blur, albeit with slightly higher degradation than YOLOv4. Models such as YOLOv9-m (7.8%) and YOLOv11m (7.9%) show higher degradation, indicating they are more susceptible to performance drops under defocus blur.
Motion Blur (MB): YOLOv4 achieved the best performance under MB noise, followed closely by YOLOv3, and the YOLOX models also perform relatively well. In contrast, all sizes of YOLOv7, YOLOv9, and YOLOv11 appear to be more susceptible to MB.
In Figure 10, the overall behavior of the models is visualized. As shown, the blue frame contains the models with high clean mAP performance, while the red frame includes the models with the least mAP degradation. The intersection of these two frames highlights the models that achieve high mAP performance on both clean and attacked images. The size of the X marker in the figure is proportional to the number of parameters of each model; larger markers indicate models with higher parameter counts. It is observed that YOLOX-x and YOLOX-l are the only models that achieve both goals.
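A Figure 10-style plot can be reproduced with a simple scatter in which marker size encodes parameter count. The sketch below uses three illustrative rows whose degradation values are plain means over the eight attack/corruption columns of Table 5; the exact averaging behind Figures 9 and 10 may differ, so this is an assumption for illustration rather than a reproduction of the figure.

```python
import matplotlib.pyplot as plt

# name: (clean mAP, approximate average degradation, parameters in millions)
models = {
    "YOLOX-x": (51.1, 5.2, 99.1),
    "YOLOv8xu": (54.0, 6.9, 68.2),
    "YOLOv3": (28.7, 4.8, 65.3),
}

fig, ax = plt.subplots()
for name, (clean_map, degradation, params) in models.items():
    ax.scatter(clean_map, degradation, s=params * 5, marker="x")  # marker size ~ parameters
    ax.annotate(name, (clean_map, degradation))
ax.set_xlabel("Clean mAP (%)")
ax.set_ylabel("Average degradation (mAP points)")
plt.show()
```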
YOLOX models (particularly YOLOX-l and YOLOX-x) consistently exhibit low degradation across a range of attacks, such as FGSM and PGD, as well as various corruption types like Gaussian noise (GN) and pixelation. This suggests a high level of robustness in the YOLOX architecture, especially for large versions, making them suitable for environments where adversarial robustness is critical. YOLOX and certain versions of YOLOv3, YOLOv4, and YOLOv9-e are among the top-performing models across multiple corruption types. For example, they perform well under Gaussian noise (GN), defocus blur (DB), and motion blur (MB). This consistent robustness across diverse noise types implies that these models are better at handling various types of data imperfections, a key feature for applications in dynamic and uncontrolled environments.
Decoupled head: YOLOX is the first YOLO variant to adopt a decoupled head architecture, separating classification and localization into independent branches. This design improves robustness by minimizing task interference—allowing each head to specialize without being impacted by conflicting gradients. Under adversarial conditions or image corruptions, this separation helps the model maintain more stable predictions, as it reduces the risk of compounded errors due to misalignment between object classification and localization. Although later YOLO models also adopt decoupled heads, our results indicate that they do not achieve comparable robustness to YOLOX, suggesting that decoupling alone is not sufficient, but plays a complementary role within YOLOX’s overall architecture.
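For illustration, a minimal PyTorch sketch of such a decoupled head is given below; channel widths and activation choices are assumptions and do not reproduce the exact YOLOX head.

```python
import torch
import torch.nn as nn


class DecoupledHead(nn.Module):
    """Classification and regression/objectness are predicted by separate branches."""

    def __init__(self, in_channels=256, num_classes=80):
        super().__init__()
        self.stem = nn.Conv2d(in_channels, 256, kernel_size=1)
        self.cls_branch = nn.Sequential(
            nn.Conv2d(256, 256, 3, padding=1), nn.SiLU(),
            nn.Conv2d(256, num_classes, 1),            # per-class scores
        )
        self.reg_branch = nn.Sequential(
            nn.Conv2d(256, 256, 3, padding=1), nn.SiLU(),
        )
        self.box_pred = nn.Conv2d(256, 4, 1)            # box regression (x, y, w, h)
        self.obj_pred = nn.Conv2d(256, 1, 1)            # objectness

    def forward(self, x):
        x = self.stem(x)
        reg_feat = self.reg_branch(x)
        return self.cls_branch(x), self.box_pred(reg_feat), self.obj_pred(reg_feat)
```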
Strong Data Augmentation: YOLOX employs extensive data augmentation techniques during training, such as Mosaic and MixUp, which expose the model to a broader distribution of visual patterns. This improves robustness by helping the model generalize across different lighting, textures, and occlusion conditions—characteristics that are common in real-world noise and perturbations. However, as observed in our experiments, subsequent models like YOLOv7 also utilize strong data augmentation but do not exhibit similar robustness under adversarial attacks. This implies that while augmentation enhances generalization, it does not fully explain YOLOX’s superior resilience to perturbations and adversarial input.
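As a concrete illustration of one of these augmentations, the sketch below implements a basic image-level MixUp; the Beta-distribution mixing ratio and the assumption that both images share the same shape are simplifications relative to the actual YOLOX training pipeline.

```python
import numpy as np


def mixup(img_a, img_b, boxes_a, boxes_b, alpha=8.0):
    """Blend two equally sized training images and concatenate their boxes."""
    lam = np.random.beta(alpha, alpha)                         # mixing ratio
    mixed = (lam * img_a.astype(np.float32)
             + (1.0 - lam) * img_b.astype(np.float32)).astype(img_a.dtype)
    return mixed, np.concatenate([boxes_a, boxes_b], axis=0)
```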
Anchor-free Training: YOLOX replaces traditional anchor-based detection with an anchor-free mechanism, which simplifies the object detection process by eliminating predefined anchor box configurations. This not only reduces computational complexity but also improves robustness by avoiding overfitting to fixed anchor sizes and aspect ratios. Such flexibility enables the model to better detect objects that are distorted or misaligned due to adversarial noise or corruptions. Nonetheless, our experimental results show that models like YOLOv6 and YOLOv8, which also adopt anchor-free training, fail to match YOLOX’s robustness. This suggests that anchor-free design, while helpful, is likely not the primary factor contributing to YOLOX’s performance in adverse conditions.
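To make the anchor-free formulation concrete, the following sketch decodes raw head outputs into boxes by adding predicted offsets to each grid-cell location and exponentiating predicted sizes, in the spirit of YOLOX-style heads; tensor shapes and scaling are illustrative assumptions.

```python
import torch


def decode_anchor_free(reg, stride):
    """reg: (N, 4, H, W) raw head output -> (N, H*W, 4) boxes as (cx, cy, w, h)."""
    n, _, h, w = reg.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float()        # (2, H, W) cell coordinates
    cxcy = (reg[:, :2] + grid) * stride                # centre = cell + predicted offset
    wh = torch.exp(reg[:, 2:]) * stride                # sizes predicted in log scale
    boxes = torch.cat((cxcy, wh), dim=1)               # (N, 4, H, W)
    return boxes.flatten(2).permute(0, 2, 1)           # (N, H*W, 4)
```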
Multi-Positives: YOLOX incorporates a multi-positive label assignment strategy, which allows multiple predicted boxes to be matched to a single ground truth object during training. This is particularly beneficial in cluttered scenes or when objects are partially occluded—scenarios where standard one-to-one matching may fail. By learning from multiple correct predictions, the model becomes more resilient to variations in object positioning and overlap. However, this technique is also used in later models like YOLOv6 and v7, which do not achieve the same robustness performance in our evaluations. Therefore, multi-positive assignment contributes to robustness but likely serves as a supporting mechanism rather than the core driver of YOLOX’s resilience.
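A minimal sketch of the multi-positive idea is shown below: instead of a single grid cell, a small neighbourhood of cells around the object centre is treated as positive. The 3×3 radius and the grid geometry are assumptions for illustration.

```python
import itertools


def multi_positive_cells(cx, cy, stride, grid_w, grid_h, radius=1):
    """Grid cells treated as positives for an object centred at (cx, cy) in pixels."""
    gx, gy = int(cx // stride), int(cy // stride)
    cells = []
    for dx, dy in itertools.product(range(-radius, radius + 1), repeat=2):
        x, y = gx + dx, gy + dy
        if 0 <= x < grid_w and 0 <= y < grid_h:
            cells.append((x, y))
    return cells
```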
SimOTA: SimOTA (Simplified Optimal Transport Assignment) is the only one of these features that other YOLO models do not use. SimOTA assigns predicted bounding boxes to ground-truth objects using a cost matrix that evaluates each prediction on two factors: IoU (Intersection over Union), which measures spatial alignment, and classification confidence, which reflects the likelihood of a correct class match. It computes the total cost for each possible match and dynamically selects only the top-k predictions with the lowest cost for each ground-truth object, where k itself adapts to the quality of the predictions in each image. Unlike traditional methods that rely on fixed matching criteria, this dynamic thresholding focuses training on the clearest, highest-quality matches and reduces the influence of noisy or ambiguous predictions. As a result, the model learns representations that capture essential, generalizable features of objects, making it more effective when deployed in varied real-world scenarios and better equipped to handle new environments and unexpected variations in object appearance. This dual focus on classification quality and spatial accuracy is what makes SimOTA particularly conducive to robustness. A simplified sketch of this assignment procedure is given below.
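The sketch condenses the procedure into a small, self-contained function: it builds the cost matrix from a classification cost and an IoU cost, derives a dynamic k per ground-truth box from its top candidate IoUs, and keeps only the lowest-cost matches. The constants and exact cost terms differ from the official YOLOX implementation; this is an illustration of the mechanism, not a drop-in replacement.

```python
import torch


def simota_assign(cls_cost, ious, lambda_iou=3.0, topk_candidates=10):
    """cls_cost, ious: (num_gt, num_preds) tensors.
    Returns a (num_gt, num_preds) 0/1 assignment matrix."""
    cost = cls_cost + lambda_iou * (-torch.log(ious + 1e-8))   # lower cost = better match
    # dynamic k: estimate how many predictions each ground truth deserves
    topk_ious, _ = torch.topk(ious, k=min(topk_candidates, ious.shape[1]), dim=1)
    dynamic_ks = torch.clamp(topk_ious.sum(dim=1).int(), min=1)
    assignment = torch.zeros_like(cost)
    for gt_idx in range(cost.shape[0]):
        k = min(int(dynamic_ks[gt_idx]), cost.shape[1])
        _, pred_idx = torch.topk(cost[gt_idx], k=k, largest=False)  # lowest-cost predictions
        assignment[gt_idx, pred_idx] = 1.0
    # a prediction matched to several ground truths keeps only its lowest-cost match
    multiple = assignment.sum(dim=0) > 1
    if multiple.any():
        best_gt = cost[:, multiple].argmin(dim=0)
        assignment[:, multiple] = 0.0
        assignment[best_gt, torch.nonzero(multiple).squeeze(1)] = 1.0
    return assignment
```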
In summary, although later YOLO models adopt several of YOLOX’s innovations, only YOLOX consistently demonstrates superior robustness. This implies that robustness is not the result of any single feature but likely emerges from the specific combination of architectural and methodological choices—with SimOTA potentially being the most impactful based on our findings.
Also, it is observed that some small models behave better under certain types of noise. Under pixelate noise, motion blur, and defocus blur, the nano versions of YOLOv5, v6, v8, v10, and v11 degrade noticeably less than their larger counterparts. This may be because larger models, with more parameters, have a higher capacity to learn fine details in the training data. While this allows them to capture intricate patterns in clean images, it also makes them more sensitive to high-frequency details and pixel-level structures. Although larger model capacity generally leads to greater robustness against adversarial attacks, this is not necessarily the case for image corruption. This can also be seen in Figure 11, which illustrates the overall performance of each model in relation to its corresponding size.
Despite their overall robustness, even the top models experience variability in their resilience depending on the type of corruption. Pixelation, in particular, leads to a notable drop in performance for many models, but YOLOX-x and YOLOX-l handle it relatively well compared to others. This indicates that while robustness is high in general, certain noise types still pose challenges, and specialized defenses might be needed for pixelation-type noise. Based on these results, YOLOX models, especially in their larger variants, are likely the best choice for applications requiring high robustness against adversarial attacks and corruption. While YOLOv3 and YOLOv4 could still be considered viable options, their relatively poor performance on clean images suggests they may not be the optimal choice for applications where accuracy is a critical requirement. Moreover, since YOLOv3 and YOLOv4 already perform relatively poorly on clean images compared to more advanced models, the degradation they experience under various corruptions may appear less pronounced.
In terms of architectural foundations, YOLOX incorporates components that were already explored in previous YOLO versions, but it also introduces distinct advancements that appear to enhance its robustness. YOLOX leverages CSPNet as its backbone, a structure that is also used in YOLOv5, while utilizing an FPN-based neck, as found in YOLOv3 and YOLOv3u. Despite these commonalities, YOLOv5 and YOLOv3u do not exhibit notable robustness against adversarial attacks and image corruption, suggesting that the robustness of YOLOX is likely attributable to additional factors beyond these shared architectural elements. YOLOX builds upon YOLOv3 but makes five critical modifications: (1) decoupled head, (2) strong data augmentation, (3) anchor-free training, (4) multi-positives, and (5) SimOTA. The analysis of these key architectural and methodological improvements, presented above, was conducted precisely to investigate the underlying factors contributing to the robustness of YOLOX.
Almost all models exhibit an upward trend in performance as their model size increases, with a characteristic curve beyond their smallest version (Figure 11). However, YOLOX deviates from this pattern, showing a downward trend as model size increases. Additionally, smaller models generally achieve relatively low mean Average Precision (mAP) scores. Designed with fewer parameters and simpler architectures, they often lack the depth and feature extraction ability needed to capture intricate patterns in the data; because their baseline performance is already constrained by this reduced capacity, they start from a relatively low mAP, which leaves less room for a noticeable drop.
A limitation of our study is that we evaluated pre-trained models on a fixed test set without retraining. As a result, conventional statistical significance tests or confidence interval analyses could not be directly applied, since model-level variance across multiple training runs was not available. Future work involving retraining models from scratch or using cross-validation would allow for more rigorous statistical validation of the observed robustness differences.
Additionally, the scope of adversarial attacks in this study was limited to FGSM and PGD due to their wide adoption and interpretability, particularly in benchmarking scenarios. More recent and powerful attacks, such as AutoAttack, APGD, or Square Attack, were not included, mainly due to the high computational cost associated with testing a large number of YOLO variants. Moreover, our experiments focused on a black-box threat model, which better reflects real-world conditions where model internals are typically inaccessible to attackers. However, we acknowledge that evaluating under white-box settings and incorporating stronger adversarial attacks would offer a more comprehensive assessment of model robustness. We consider this an important direction for future research.

7. Conclusions

This work has explored the robustness of YOLO models under adversarial attacks and image corruption, focusing on key architectural advancements and their impact on model resilience. YOLO models are widely used in fields such as autonomous driving, surveillance, and medical imaging, where even slight performance degradation under challenging conditions—such as adversarial attacks or image corruption—could lead to significant risks or errors. Ensuring that YOLO models are robust is not just a technical goal but a practical necessity for these high-stakes environments. Among the models reviewed, YOLOX emerges as the most robust, demonstrating a superior ability to retain performance in the face of adversarial perturbations and noisy data. This robustness may be largely attributed to the integration of the SimOTA algorithm, which optimizes label assignment and enhances model generalization by prioritizing high-quality predictions during training. By underscoring the robustness of YOLOX and identifying SimOTA as a potential key factor, this paper offers a foundation for future YOLO development focused on enhancing resilience against real-world adversarial conditions. Further investigation into SimOTA and similar algorithms could unlock new pathways to improve YOLO models, achieving a more effective balance among accuracy, speed, and robustness.

Author Contributions

Conceptualization, K.D.A. and G.A.P.; methodology, K.D.A. and G.A.P.; software, K.D.A.; validation, K.D.A. and G.A.P.; formal analysis, K.D.A.; investigation, K.D.A.; resources, K.D.A.; data curation, K.D.A.; writing—original draft preparation, K.D.A.; writing—review and editing, G.A.P.; visualization, K.D.A.; supervision, G.A.P.; project administration, G.A.P. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The original data presented in the study are openly available in the dataset COCO at https://cocodataset.org/#home (accessed on 11 March 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
YOLO: You Only Look Once
COCO: Common Objects in Context
DNN: Deep neural network
CNN: Convolutional neural network
FPN: Feature Pyramid Network
PANet: Path Aggregation Network
R-CNN: Region-based CNN
DETR: Detection Transformer
SSD: Single Shot MultiBox Detector
FGSM: Fast Gradient Sign Method
PGD: Projected Gradient Descent
CSP: Cross-Stage Partial connections
SAM: Spatial Attention Module
IoU: Intersection over Union
AP: Average Precision
NMS: Non-Maximum Suppression
BN: Batch normalization
EMA: Exponential Moving Average
SimOTA: Simplified Optimal Transport Assignment
QAT: Quantization-Aware Training
PTQ: Post-Training Quantization
ELAN: Efficient Layer Aggregation Network
GELAN: Generalized Efficient Layer Aggregation Network
PGI: Programmable Gradient Information
ART: Adversarial Robustness Toolbox
SPP: Spatial pyramid pooling
DFL: Distribution Focal Loss
IB: Information Bottleneck
CIB: Compact Inverted Block
PSA: Partial Self-Attention

References

  1. Liang, S.; Wu, H.; Zhen, L.; Hua, Q.; Garg, S.; Kaddoum, G.; Hassan, M.M.; Yu, K. Edge YOLO: Real-Time Intelligent Object Detection System Based on Edge-Cloud Cooperation in Autonomous Vehicles. IEEE Trans. Intell. Transport. Syst. 2022, 23, 25345–25360. [Google Scholar] [CrossRef]
  2. Yue, X.; Li, H.; Shimizu, M.; Kawamura, S.; Meng, L. YOLO-GD: A Deep Learning-Based Object Detection Algorithm for Empty-Dish Recycling Robots. Machines 2022, 10, 294. [Google Scholar] [CrossRef]
  3. Zhu, J.; Li, X.; Jin, P.; Xu, Q.; Sun, Z.; Song, X. MME-YOLO: Multi-Sensor Multi-Level Enhanced YOLO for Robust Vehicle Detection in Traffic Surveillance. Sensors 2021, 21, 27. [Google Scholar] [CrossRef] [PubMed]
  4. Qureshi, R.; Ragab, M.G.; Abdulkader, S.J.; Muneer, A.; Alqushaib, A.; Sumiea, E.H.; Alhussian, H. A Comprehensive Systematic Review of YOLO for Medical Object Detection (2018 to 2023). IEEE Access 2024, 12, 57815–57836. [Google Scholar]
  5. Alif, M.A.R.; Hussain, M. YOLOv1 to YOLOv10: A Comprehensive Review of YOLO Variants and Their Application in the Agricultural Domain. arXiv 2024, arXiv:2406.10139. [Google Scholar]
  6. Mushtaq, M.; Akram, M.U.; Alghamdi, N.S.; Fatima, J.; Masood, R.F. Localization and Edge-Based Segmentation of Lumbar Spine Vertebrae to Identify the Deformities Using Deep Learning Models. Sensors 2022, 22, 1547. [Google Scholar] [CrossRef]
  7. Baccouche, A.; Garcia-Zapirain, B.; Castillo Olea, C.; Elmaghraby, A.S. Breast Lesions Detection and Classification via YOLO-Based Fusion Models. Comput. Mater. Contin. 2021, 69, 1407–1425. [Google Scholar] [CrossRef]
  8. Dyrmann, M.; Mortensen, A.K.; Linneberg, L.; Høye, T.T.; Bjerge, K. Camera Assisted Roadside Monitoring for Invasive Alien Plant Species Using Deep Learning. Sensors 2021, 21, 6126. [Google Scholar] [CrossRef]
  9. Wang, Q.; Cheng, M.; Huang, S.; Cai, Z.; Zhang, J.; Yuan, H. A Deep Learning Approach Incorporating YOLO v5 and Attention Mechanisms for Field Real-Time Detection of the Invasive Weed Solanum Rostratum Dunal Seedlings. Comput. Electron. Agric. 2022, 199, 107194. [Google Scholar] [CrossRef]
  10. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. arXiv 2016, arXiv:1506.02640. [Google Scholar]
  11. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014. [Google Scholar]
  12. Girshick, R. Fast R-CNN. arXiv 2015, arXiv:1504.08083. [Google Scholar]
  13. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1137–1149. [Google Scholar] [CrossRef] [PubMed]
  14. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. SSD: Single Shot MultiBox Detector. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016, Proceedings, Part I 14; Springer International Publishing: Cham, Switzerland, 2016; Volume 9905, pp. 21–37. [Google Scholar] [CrossRef]
  15. Tan, M.; Pang, R.; Le, Q.V. EfficientDet: Scalable and Efficient Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020. [Google Scholar]
  16. Tan, M.; Le, Q.V. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. arXiv 2020, arXiv:1905.11946. [Google Scholar]
  17. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-End Object Detection with Transformers. In European Conference on Computer Vision; Springer International Publishing: Cham, Switzerland, 2020. [Google Scholar]
  18. Apostolidis, K.D.; Papakostas, G.A. A Survey on Adversarial Deep Learning Robustness in Medical Image Analysis. Electronics 2021, 10, 2132. [Google Scholar] [CrossRef]
  19. Michaelis, C.; Mitzkus, B.; Geirhos, R.; Rusak, E.; Bringmann, O.; Ecker, A.S.; Bethge, M.; Brendel, W. Benchmarking Robustness in Object Detection: Autonomous Driving When Winter Is Coming. arXiv 2019, arXiv:1907.07484. [Google Scholar]
  20. Maliamanis, T.; Papakostas, G. Adversarial Computer Vision: A Current Snapshot. In Proceedings of the Twelfth International Conference on Machine Vision (ICMV 2019); Osten, W., Nikolaev, D.P., Eds.; SPIE: Amsterdam, The Netherlands, 2020; p. 121. [Google Scholar]
  21. Hussain, M. YOLOv1 to v8: Unveiling Each Variant–A Comprehensive Review of YOLO. IEEE Access 2024, 12, 42816–42833. [Google Scholar] [CrossRef]
  22. Terven, J.; Cordova-Esparza, D. A Comprehensive Review of YOLO Architectures in Computer Vision: From YOLOv1 to YOLOv8 and YOLO-NAS. Mach. Learn. Knowl. Extr. 2023, 5, 1680–1716. [Google Scholar] [CrossRef]
  23. Hussain, M.; Khanam, R. In-Depth Review of YOLOv1 to YOLOv10 Variants for Enhanced Photovoltaic Defect Detection. Solar 2024, 4, 351–386. [Google Scholar] [CrossRef]
  24. Goodfellow, I.J.; Shlens, J.; Szegedy, C. Explaining and Harnessing Adversarial Examples. arXiv 2015, arXiv:1412.6572. [Google Scholar]
  25. Madry, A.; Makelov, A.; Schmidt, L.; Tsipras, D.; Vladu, A. Towards Deep Learning Models Resistant to Adversarial Attacks. arXiv 2019, arXiv:1706.06083. [Google Scholar]
  26. Jain, S. Adversarial Attack on Yolov5 for Traffic and Road Sign Detection. In Proceedings of the 2024 4th International Conference on Applied Artificial Intelligence (ICAPAI), Halden, Norway, 16 April 2024; pp. 1–5. [Google Scholar]
  27. Dai, L.; Wang, J.; Yang, B.; Chen, F.; Zhang, H. An Adversarial Example Attack Method Based on Predicted Bounding Box Adaptive Deformation in Optical Remote Sensing Images. PeerJ Comput. Sci. 2024, 10, e2053. [Google Scholar] [CrossRef]
  28. Bayer, J.; Becker, S.; Münch, D.; Arens, M. Eigenpatches—Adversarial Patches from Principal Components. In Proceedings of the Advances in Visual Computing; Bebis, G., Ghiasi, G., Fang, Y., Sharf, A., Dong, Y., Weaver, C., Leo, Z., LaViola, J.J., Jr., Kohli, L., Eds.; Springer Nature: Cham, Switzerland, 2023; pp. 274–284. [Google Scholar]
  29. Shack, J.; Petrovic, K.; Saukh, O. Breaking the Illusion: Real-World Challenges for Adversarial Patches in Object Detection. arXiv 2024, arXiv:2410.19863. [Google Scholar]
  30. Lian, J.; Pan, J.; Wang, L.; Wang, Y.; Chau, L.-P.; Mei, S. PADetBench: Towards Benchmarking Physical Attacks against Object Detection. arXiv 2024, arXiv:2408.09181. [Google Scholar]
  31. Liu, J.; Wang, Z.; Ma, L.; Fang, C.; Bai, T.; Zhang, X.; Liu, J.; Chen, Z. Benchmarking Object Detection Robustness against Real-World Corruptions. Int. J. Comput. Vis. 2024, 132, 4398–4416. [Google Scholar] [CrossRef]
  32. Arsenos, A.; Karampinis, V.; Petrongonas, E.; Skliros, C.; Kollias, D.; Kollias, S.; Voulodimos, A. Common Corruptions for Enhancing and Evaluating Robustness in Air-to-Air Visual Object Detection. arXiv 2024, arXiv:2405.06765. [Google Scholar] [CrossRef]
  33. Ding, Y.; Luo, X. SDNIA-YOLO: A Robust Object Detection Model for Extreme Weather Conditions. arXiv 2024, arXiv:2406.12395. [Google Scholar]
  34. The PASCAL Visual Object Classes Challenge 2007 (VOC2007). Available online: http://host.robots.ox.ac.uk/pascal/VOC/voc2007/ (accessed on 7 November 2024).
  35. The PASCAL Visual Object Classes Challenge 2012 (VOC2012). Available online: http://host.robots.ox.ac.uk/pascal/VOC/voc2012/ (accessed on 7 November 2024).
  36. Lin, T.-Y.; Maire, M.; Belongie, S.; Bourdev, L.; Girshick, R.; Hays, J.; Perona, P.; Ramanan, D.; Zitnick, C.L.; Dollár, P. Microsoft COCO: Common Objects in Context. In Computer Vision—ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014, Proceedings, Part V; Springer International Publishing: Cham, Switzerland, 2014. [Google Scholar]
  37. Nicolae, M.-I.; Sinn, M.; Tran, M.N.; Buesser, B.; Rawat, A.; Wistuba, M.; Zantedeschi, V.; Baracaldo, N.; Chen, B.; Ludwig, H.; et al. Adversarial Robustness Toolbox v1.0.0. arXiv 2019, arXiv:1807.01069. [Google Scholar]
  38. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going Deeper with Convolutions. arXiv 2014, arXiv:1409.4842. [Google Scholar]
  39. Lin, M.; Chen, Q.; Yan, S. Network in Network. arXiv 2013, arXiv:1312.4400. [Google Scholar]
  40. Redmon, J.; Farhadi, A. YOLO9000: Better, Faster, Stronger. arXiv 2016, arXiv:1612.08242. [Google Scholar]
  41. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv 2015, arXiv:1409.1556. [Google Scholar]
  42. Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
  43. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  44. Lin, T.-Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
  45. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path Aggregation Network for Instance Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018. [Google Scholar]
  46. Bochkovskiy, A.; Wang, C.-Y.; Liao, H.-Y.M. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv 2020, arXiv:2004.10934. [Google Scholar]
  47. Xie, S.; Girshick, R.; Dollar, P.; Tu, Z.; He, K. Aggregated Residual Transformations for Deep Neural Networks. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 5987–5995. [Google Scholar]
  48. Wang, C.-Y.; Mark Liao, H.-Y.; Wu, Y.-H.; Chen, P.-Y.; Hsieh, J.-W.; Yeh, I.-H. CSPNet: A New Backbone That Can Enhance Learning Capability of CNN. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Seattle, WA, USA, 14–19 June 2020; pp. 1571–1580. [Google Scholar]
  49. He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 1904–1916. [Google Scholar] [CrossRef]
  50. Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Computer Vision—ECCV 2018; Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y., Eds.; Lecture Notes in Computer Science; Springer International Publishing: Cham, Switzerland, 2018; Volume 11211, pp. 3–19. ISBN 978-3-030-01233-5. [Google Scholar]
  51. Yun, S.; Han, D.; Chun, S.; Oh, S.J.; Yoo, Y.; Choe, J. CutMix: Regularization Strategy to Train Strong Classifiers With Localizable Features. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6022–6031. [Google Scholar]
  52. Ghiasi, G.; Lin, T.-Y.; Le, Q.V. DropBlock: A Regularization Method for Convolutional Networks. In Proceedings of the Advances in Neural Information Processing Systems; Curran Associates, Inc.: Montréal, QC, Canada, 2018; Volume 31. [Google Scholar]
  53. Zheng, Z.; Wang, P.; Liu, W.; Li, J.; Ye, R.; Ren, D. Distance-IoU Loss: Faster and Better Learning for Bounding Box Regression. AAAI 2020, 34, 12993–13000. [Google Scholar] [CrossRef]
  54. Ultralytics YOLOv5. Available online: https://docs.ultralytics.com/models/yolov5 (accessed on 7 November 2024).
  55. Wang, C.-Y.; Yeh, I.-H.; Liao, H.-Y.M. You Only Learn One Representation: Unified Network for Multiple Tasks. arXiv 2021, arXiv:2105.04206. [Google Scholar]
  56. Ge, Z.; Liu, S.; Wang, F.; Li, Z.; Sun, J. YOLOX: Exceeding YOLO Series in 2021. arXiv 2021, arXiv:2107.08430. [Google Scholar]
  57. Ge, Z.; Liu, S.; Li, Z.; Yoshie, O.; Sun, J. OTA: Optimal Transport Assignment for Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021. [Google Scholar]
  58. Li, C.; Li, L.; Jiang, H.; Weng, K.; Geng, Y.; Li, L.; Ke, Z.; Li, Q.; Cheng, M.; Nie, W.; et al. YOLOv6: A Single-Stage Object Detection Framework for Industrial Applications. arXiv 2022, arXiv:2209.02976. [Google Scholar]
  59. Ding, X.; Zhang, X.; Ma, N.; Han, J.; Ding, G.; Sun, J. RepVGG: Making VGG-Style ConvNets Great Again. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 13728–13737. [Google Scholar]
  60. Feng, C.; Zhong, Y.; Gao, Y.; Scott, M.R.; Huang, W. TOOD: Task-Aligned One-Stage Object Detection. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 3490–3499. [Google Scholar]
  61. Zhang, H.; Wang, Y.; Dayoub, F.; Sunderhauf, N. VarifocalNet: An IoU-Aware Dense Object Detector. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 8510–8519. [Google Scholar]
  62. Gevorgyan, Z. SIoU Loss: More Powerful Learning for Bounding Box Regression. arXiv 2022, arXiv:2205.12740. [Google Scholar]
  63. Rezatofighi, H.; Tsoi, N.; Gwak, J.; Sadeghian, A.; Reid, I.; Savarese, S. Generalized Intersection Over Union: A Metric and a Loss for Bounding Box Regression. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 658–666. [Google Scholar]
  64. Wang, C.-Y.; Bochkovskiy, A.; Liao, H.-Y.M. YOLOv7: Trainable Bag-of-Freebies Sets New State-of-the-Art for Real-Time Object Detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023. [Google Scholar]
  65. Wang, C.-Y.; Liao, H.-Y.M.; Yeh, I.-H. Designing Network Design Strategies Through Gradient Path Analysis. arXiv 2022, arXiv:2211.04800. [Google Scholar]
  66. Ultralytics YOLOv8. Available online: https://docs.ultralytics.com/models/yolov8 (accessed on 7 November 2024).
  67. Ultralytics YOLO-NAS (Neural Architecture Search). Available online: https://docs.ultralytics.com/models/yolo-nas (accessed on 7 November 2024).
  68. Wang, C.-Y.; Yeh, I.-H.; Liao, H.-Y.M. YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information. In European Conference on Computer Vision; Springer Nature Switzerland: Cham, Switzerland, 2024; pp. 1–21. [Google Scholar]
  69. Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J.; Ding, G. YOLOv10: Real-Time End-to-End Object Detection. Adv. Neural Inf. Process. Syst. 2024, 37, 107984–108011. [Google Scholar]
  70. Ding, X.; Zhang, X.; Han, J.; Ding, G. Scaling Up Your Kernels to 31x31: Revisiting Large Kernel Design in CNNs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022. [Google Scholar]
  71. YOLO11 🚀 NEW—Ultralytics YOLO Docs. Available online: https://docs.ultralytics.com/models/yolo11/ (accessed on 7 November 2024).
Figure 1. Evolution of YOLO models.
Figure 2. Pipeline for robustness evaluation. (a) Adversarial attacks, (b) image corruption.
Figure 3. Images under FGSM attack. (a) Clean image, (b) ϵ = 0.01, (c) ϵ = 0.03, (d) ϵ = 0.05, (e) ϵ = 0.07, (f) ϵ = 0.1.
Figure 4. Images under PGD attack. (a) Clean image, (b) ϵ = 0.01, (c) ϵ = 0.03, (d) ϵ = 0.05, (e) ϵ = 0.07, (f) ϵ = 0.1.
Figure 5. Example of image corruption. (a) Clean image, (b) defocus blur, (c) fog, (d) impulse noise, (e) motion blur, (f) pixelate, and (g) Gaussian noise.
Figure 6. The structure of the modern YOLO object detectors, illustrating the three main components: the backbone for feature extraction, the neck for multi-scale feature aggregation, and the head for final bounding box prediction and classification.
Figure 7. The head of YOLOX [43] in contrast to previous YOLO.
Figure 8. Evolution of YOLO object detection architectures.
Figure 9. Heatmap with average degradation across all attacks.
Figure 10. The average degradation across all attacks with clean mAP.
Figure 11. Model performance by size.
Table 1. Configurations of adversarial attacks.
Attack Type | Norm Constraint | Perturbation Strength (ϵ) | Step Size (α) | Iterations (max_iter)
FGSM | L∞ | 0.01, 0.03, 0.05, 0.07, 0.1 | - | 1
PGD | L∞ | 0.01, 0.03, 0.05, 0.07, 0.1 | 0.1 | 10
Table 2. YOLO models with their nominal mAP (%) and corresponding parameters.
Model | mAP | Parameters (M) | Model | mAP | Parameters (M)
YOLOv3 | 28.70 | 65.30 | YOLOv8mu | 50.10 | 25.90
YOLOv3u | 51.70 | 103.80 | YOLOv8lu | 52.90 | 43.70
YOLOv4 | 28.20 | 63.90 | YOLOv8xu | 54.00 | 68.20
YOLOv5nu | 34.30 | 2.70 | YOLOv9-t | 46.80 | 2.00
YOLOv5su | 43.00 | 9.20 | YOLOv9-s | 51.40 | 7.20
YOLOv5mu | 49.00 | 25.10 | YOLOv9-m | 53.00 | 20.10
YOLOv5lu | 52.20 | 53.20 | YOLOv9-c | 53.00 | 25.50
YOLOv5xu | 53.20 | 97.30 | YOLOv9-e | 55.60 | 58.10
YOLOX-s | 40.50 | 9.00 | YOLO NAS S | 46.50 | 19.00
YOLOX-m | 46.90 | 25.30 | YOLO NAS M | 50.40 | 51.10
YOLOX-l | 49.70 | 54.20 | YOLO NAS L | 51.30 | 66.90
YOLOX-x | 51.10 | 99.10 | YOLOv10n | 38.50 | 2.30
YOLOR csp | 51.10 | 99.80 | YOLOv10s | 46.30 | 7.20
YOLOv6-N | 37.50 | 4.65 | YOLOv10m | 51.10 | 15.40
YOLOv6-S | 45.10 | 18.54 | YOLOv10l | 53.20 | 24.40
YOLOv6-M | 50.00 | 34.86 | YOLOv10x | 54.40 | 29.50
YOLOv6-L | 52.80 | 59.61 | YOLOv11n | 39.50 | 2.60
YOLOv7 | 51.20 | 36.90 | YOLOv11s | 47.00 | 9.40
YOLOv7x | 53.10 | 71.30 | YOLOv11m | 51.50 | 20.10
YOLOv8nu | 37.10 | 3.20 | YOLOv11l | 53.40 | 25.30
YOLOv8su | 44.70 | 11.20 | YOLOv11x | 54.70 | 56.90
Table 3. YOLO model’s performance under FGSM attack (bold numbers represent the best performance in each case).
Table 3. YOLO model’s performance under FGSM attack (bold numbers represent the best performance in each case).
ModelmAPClean mAP ϵ  = 0.01 ϵ = 0.03 ϵ = 0.05 ϵ  = 0.07 ϵ = 0.1
YOLOv328.7027.800.100.702.403.806.40
YOLOv3u51.7047.900.201.203.305.208.90
YOLOv428.2028.900.200.802.403.706.20
YOLOv5nu34.3031.100.201.404.306.7011.10
YOLOv5su43.0039.200.101.404.006.5011.50
YOLOv5mu49.0045.600.301.103.605.9011.00
YOLOv5lu52.2048.600.000.903.105.009.00
YOLOv5xu53.2049.900.200.902.904.608.60
YOLOX-S40.5027.100.100.702.303.305.60
YOLOX-M46.9032.100.100.601.903.005.00
YOLOX-L49.7034.400.200.801.903.105.00
YOLOX-X51.1035.400.200.601.602.404.30
YOLOR51.1047.300.201.002.904.708.30
YOLOv6-N37.5035.300.200.903.505.409.40
YOLOv6-S45.1041.800.201.103.505.409.30
YOLOv6-M50.0046.500.201.003.004.608.20
YOLOv6-L52.8049.700.200.902.704.207.60
YOLOv751.2047.200.101.103.305.109.30
YOLOv7x53.1049.500.201.103.304.907.60
YOLOv8nu37.1033.800.301.404.006.4011.30
YOLOv8su44.7041.400.201.504.006.2010.80
YOLOv8mu50.1046.900.201.003.104.808.80
YOLOv8lu52.9049.700.000.902.704.407.80
YOLOv8xu54.0050.700.200.902.604.007.20
YOLOv9-t38.3035.400.101.404.606.9011.00
YOLOv9-s46.8043.700.201.303.905.709.90
YOLOv9-m51.4047.800.201.003.004.808.40
YOLOv9-c53.0049.700.201.103.105.108.90
YOLOv9-e55.6052.500.200.702.303.807.00
YOLO NAS S46.5042.800.301.203.505.409.50
YOLO NAS M50.4046.900.100.903.104.708.00
YOLO NAS L51.3047.500.300.902.604.107.60
YOLOv10n38.5036.100.101.304.306.4010.80
YOLOv10s46.3043.500.201.303.805.9010.20
YOLOv10m51.1047.900.201.003.105.008.90
YOLOv10l53.2050.000.100.802.704.308.00
YOLOv10x54.4051.000.200.802.604.007.50
YOLOv11n39.5036.800.301.404.506.8012.10
YOLOv11s47.0043.800.301.404.106.2010.50
YOLOv11m51.5047.700.201.103.205.009.00
YOLOv11l53.4050.000.201.003.004.507.90
YOLOv11x54.7051.600.200.602.203.606.50
Table 4. YOLO model’s performance under PGD attack (bold numbers represent the best performance in each case).
Table 4. YOLO model’s performance under PGD attack (bold numbers represent the best performance in each case).
ModelmAPClean mAP ϵ = 0.01 ϵ = 0.03 ϵ = 0.05 ϵ = 0.07 ϵ = 0.1
YOLOv328.7027.800.100.501.702.203.90
YOLOv3u51.7047.900.200.802.603.206.20
YOLOv428.2028.900.200.802.002.604.60
YOLOv5nu34.3031.100.201.003.704.608.00
YOLOv5su43.0039.200.200.803.204.308.20
YOLOv5mu49.0045.600.300.802.803.607.40
YOLOv5lu52.2048.600.000.502.304.105.90
YOLOv5xu53.2049.900.100.702.302.905.50
YOLOX-S40.5027.100.100.401.702.303.80
YOLOX-M46.9032.100.200.401.501.803.30
YOLOX-l49.7034.400.200.601.501.903.40
YOLOX-X51.1035.400.100.401.301.502.80
YOLOR51.1047.300.200.702.303.005.70
YOLOv6-N37.5035.300.200.902.803.806.30
YOLOv6-S45.1041.800.100.702.703.606.50
YOLOv6-M50.0046.500.200.702.402.805.40
YOLOv6-L52.8049.700.200.502.102.805.20
YOLOv751.2047.200.100.702.403.106.10
YOLOv7x53.1049.500.200.702.402.905.80
YOLOv8nu37.1033.800.301.003.504.307.80
YOLOv8su44.7041.400.100.903.004.007.20
YOLOv8mu50.1046.900.100.602.403.206.00
YOLOv8lu52.9049.700.100.702.202.705.20
YOLOv8xu54.0050.700.200.702.102.604.90
YOLOv9-t38.3035.400.201.103.704.608.00
YOLOv9-s46.8043.700.200.903.103.906.60
YOLOv9-m51.4047.800.200.602.403.105.60
YOLOv9-c53.0049.700.200.702.603.205.80
YOLOv9-e55.6052.500.100.502.002.604.50
YOLO NAS S46.5042.800.300.802.803.206.30
YOLO NAS M50.4046.900.100.602.503.005.40
YOLO NAS L51.3047.500.400.702.002.404.80
YOLOv10n38.5036.100.201.003.504.607.40
YOLOv10s46.3043.500.201.003.304.207.10
YOLOv10m51.1047.900.200.702.503.206.00
YOLOv10l53.2050.000.200.602.102.605.30
YOLOv10x54.4051.000.100.501.802.404.80
YOLOv11n39.5036.800.201.103.504.508.20
YOLOv11s47.0043.800.201.003.404.207.50
YOLOv11m51.5047.700.200.802.503.205.90
YOLOv11l53.4050.000.100.702.402.705.40
YOLOv11x54.7051.600.100.101.602.104.00
Table 5. Degradation of YOLO models in image corruptions and adversarial attacks with ϵ = 0.1 (bold numbers represent the best performance in each case).
Model | Clean | Fog | Pixelate | GN | IN | DB | MB | FGSM | PGD
YOLOv3 | 28.70 | 1.30 | 0.70 | 4.40 | 11.80 | 5.50 | 4.30 | 6.40 | 3.90
YOLOv3u | 51.70 | 1.40 | 9.20 | 7.50 | 12.20 | 6.70 | 7.80 | 8.90 | 6.20
YOLOv4 | 28.20 | 1.70 | 7.70 | 4.70 | 8.90 | 3.70 | 4.20 | 6.20 | 4.60
YOLOv5nu | 34.30 | 3.80 | 3.40 | 10.30 | 12.90 | 5.40 | 7.30 | 11.10 | 8.00
YOLOv5su | 43.00 | 3.60 | 8.10 | 9.70 | 15.60 | 6.60 | 8.60 | 11.50 | 8.20
YOLOv5mu | 49.00 | 1.70 | 8.40 | 8.00 | 16.40 | 6.80 | 8.40 | 11.00 | 7.40
YOLOv5lu | 52.20 | 1.30 | 7.30 | 7.30 | 14.30 | 6.60 | 8.00 | 9.00 | 5.90
YOLOv5xu | 53.20 | 1.20 | 7.90 | 6.70 | 13.50 | 6.20 | 8.00 | 8.60 | 5.50
YOLOX-s | 40.50 | 1.50 | 6.70 | 7.00 | 10.40 | 6.50 | 6.60 | 5.60 | 3.80
YOLOX-m | 46.90 | 1.30 | 7.10 | 6.40 | 10.50 | 6.70 | 7.00 | 5.00 | 3.30
YOLOX-l | 49.70 | 0.80 | 6.80 | 5.80 | 10.40 | 6.90 | 7.00 | 5.00 | 3.40
YOLOX-x | 51.10 | 0.90 | 6.30 | 5.00 | 8.90 | 6.70 | 6.90 | 4.30 | 2.80
YOLOR csp | 51.10 | 1.20 | 10.40 | 8.00 | 13.30 | 7.80 | 8.10 | 8.30 | 5.70
YOLOv6-N | 37.50 | 2.80 | 3.30 | 7.80 | 10.00 | 5.70 | 6.30 | 9.40 | 6.30
YOLOv6-S | 45.10 | 3.20 | 4.90 | 8.30 | 12.70 | 7.60 | 7.50 | 9.30 | 6.50
YOLOv6-M | 50.00 | 1.70 | 9.20 | 7.30 | 13.40 | 8.30 | 8.60 | 8.20 | 5.40
YOLOv6-L | 52.80 | 1.20 | 8.00 | 6.60 | 11.60 | 7.90 | 7.70 | 7.60 | 5.20
YOLOv7 | 51.20 | 1.30 | 11.70 | 7.20 | 13.90 | 8.30 | 8.70 | 9.30 | 6.10
YOLOv7x | 53.10 | 1.20 | 12.00 | 7.20 | 13.50 | 8.70 | 8.90 | 7.60 | 5.80
YOLOv8nu | 37.10 | 3.40 | 5.10 | 9.90 | 14.20 | 5.50 | 7.40 | 11.30 | 7.80
YOLOv8su | 44.70 | 3.50 | 9.10 | 9.40 | 15.70 | 6.50 | 8.10 | 10.80 | 7.20
YOLOv8mu | 50.10 | 1.10 | 9.20 | 7.20 | 15.10 | 6.30 | 7.40 | 8.80 | 6.00
YOLOv8lu | 52.90 | 1.30 | 8.10 | 6.50 | 12.40 | 6.10 | 7.80 | 7.80 | 5.20
YOLOv8xu | 54.00 | 1.20 | 9.30 | 6.60 | 12.40 | 6.10 | 7.50 | 7.20 | 4.90
YOLOv9-t | 38.30 | 3.60 | 4.80 | 9.30 | 13.50 | 6.70 | 7.80 | 11.00 | 8.00
YOLOv9-s | 46.80 | 3.70 | 7.70 | 9.00 | 14.80 | 7.60 | 8.50 | 9.90 | 6.60
YOLOv9-m | 51.40 | 1.30 | 8.40 | 7.60 | 16.40 | 7.80 | 8.80 | 8.40 | 5.60
YOLOv9-c | 53.00 | 1.70 | 9.90 | 7.40 | 15.70 | 6.60 | 8.60 | 8.90 | 5.80
YOLOv9-e | 55.60 | 1.10 | 7.80 | 5.80 | 11.70 | 6.20 | 8.00 | 7.00 | 4.50
YOLO NAS S | 46.50 | 2.40 | 9.50 | 8.20 | 14.00 | 7.50 | 8.20 | 9.50 | 6.30
YOLO NAS M | 50.40 | 1.40 | 8.80 | 6.60 | 13.90 | 7.90 | 8.20 | 8.00 | 5.40
YOLO NAS L | 51.30 | 1.50 | 7.60 | 6.60 | 13.60 | 8.10 | 8.20 | 7.60 | 4.80
YOLOv10n | 38.50 | 3.80 | 4.80 | 9.20 | 12.80 | 6.40 | 6.90 | 10.80 | 7.40
YOLOv10s | 46.30 | 3.60 | 6.40 | 9.30 | 14.60 | 7.90 | 8.10 | 10.20 | 7.10
YOLOv10m | 51.10 | 1.80 | 9.10 | 8.00 | 13.90 | 8.30 | 7.90 | 8.90 | 6.00
YOLOv10l | 53.20 | 1.70 | 7.90 | 6.70 | 12.30 | 8.40 | 8.10 | 8.00 | 5.30
YOLOv10x | 54.40 | 1.70 | 9.90 | 6.80 | 13.20 | 8.70 | 8.50 | 7.50 | 4.80
YOLOv11n | 39.50 | 4.00 | 5.60 | 10.40 | 15.00 | 6.10 | 7.80 | 12.10 | 8.20
YOLOv11s | 47.00 | 2.30 | 10.80 | 9.10 | 16.90 | 6.40 | 8.70 | 10.50 | 7.50
YOLOv11m | 51.50 | 1.50 | 12.00 | 7.30 | 16.80 | 7.00 | 8.80 | 9.00 | 5.90
YOLOv11l | 53.40 | 1.60 | 12.00 | 7.10 | 16.60 | 6.90 | 8.80 | 7.90 | 5.40
YOLOv11x | 54.70 | 1.20 | 10.30 | 6.20 | 14.30 | 6.60 | 8.80 | 6.50 | 4.00
Table 6. Most affected object size category by attack type across YOLO models.
Model | FGSM (Most Affected Area) | PGD (Most Affected Area)
YOLOX | Small | Medium
YOLOv8 | Small | Medium
YOLOv9 | Small | Small
YOLOv10 | Large | Medium
YOLO11 | Medium | Medium
Table 7. Most robust and most vulnerable object classes under FGSM and PGD attacks.
Model | FGSM Robust | PGD Robust | FGSM Vulnerable | PGD Vulnerable
YOLOX | Toothbrush | Toothbrush | Bear | Hair drier
YOLOv8 | Toothbrush | Kite | Suitcase | Toaster
YOLOv9 | Toaster | Toaster | Mouse | Hot dog
YOLOv10 | Kite | Kite | Microwave | Toaster
YOLO11 | Toaster | Toaster | Suitcase | Suitcase
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
