To guarantee a fair comparison with other methods, the loss-function weights are kept fixed and all methods are evaluated with the same uniform metrics.
4.2. Evaluation Indicators
Disturbance Impact Index: In previous work on classification models and classification datasets, such as ImageNet [34], gradient-based attacks are global attacks, so a norm constraint can reflect the perturbation limit imposed on the image. In contrast, in the COCO dataset the background occupies far more pixels than the target pixels. A norm constraint restricts every pixel of the image as a whole, whereas in an actual attack we do not need to perturb all pixels. Therefore, we introduce the SSIM and PSNR metrics, which are closer to human perception, to judge the magnitude of the interference.
Structural Similarity Index Measure (SSIM) [15]: Measures the similarity of images from three aspects: luminance, contrast, and structure. Because SSIM is a perceptual model, it aligns more closely with the intuitive judgment of the human eye.
Peak Signal-to-Noise Ratio (PSNR): An engineering term for the ratio between the maximum possible power of a signal and the power of the corrupting noise that affects its representation accuracy. The PSNR value is commonly used to judge whether the image quality after processing is satisfactory.
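To make these two similarity metrics concrete, the sketch below computes SSIM and PSNR between a clean image and its adversarial counterpart with scikit-image; the file paths are placeholders, not the paper's actual data.

```python
from skimage import io
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

# Load a clean image and its adversarial counterpart (placeholder paths).
clean = io.imread("clean.png")
adv = io.imread("adv.png")

# PSNR: ratio of the maximum signal power to the power of the perturbation noise.
psnr = peak_signal_noise_ratio(clean, adv, data_range=255)

# SSIM: perceptual similarity based on luminance, contrast, and structure;
# channel_axis selects the color axis so the score is averaged over channels.
ssim = structural_similarity(clean, adv, data_range=255, channel_axis=-1)

print(f"PSNR = {psnr:.2f} dB, SSIM = {ssim:.4f}")
```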
Performance Indicators: For the evaluation metrics of object detection, we use COCO's object recognition evaluation criteria: P is the precision, which measures the percentage of correct predictions among all predictions; R is the recall, the number of correct predictions as a percentage of all ground-truth targets; the AP metric considers both precision and recall, so the area under the P-R curve is used to represent the performance of the object detection model on this dataset, and its default IoU threshold range is (0.5:0.95). The AP50 metric represents the AP performance of the model when the IoU threshold is greater than 0.5; AP75 is the AP performance for the more stringent case of IoU greater than 0.75; APS, APM, and APL, respectively, represent the detection performance of small (area < 32²), medium (32² < area < 96²), and large (area > 96²) targets in the IoU range (0.5, 0.95).
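These metrics can be reproduced with the official COCO evaluation API; the sketch below is a minimal example, assuming a ground-truth annotation file and a JSON file of the detector's predictions (on clean or adversarial images), both given as placeholder paths.

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

# Placeholder paths: COCO ground truth and the detector's predictions.
coco_gt = COCO("instances_val2017.json")
coco_dt = coco_gt.loadRes("detections.json")

evaluator = COCOeval(coco_gt, coco_dt, iouType="bbox")
# maxDets keeps only the top-confidence boxes per image when computing
# recall (the max = 1 / 10 / 100 settings referenced in Table 2).
evaluator.params.maxDets = [1, 10, 100]
evaluator.evaluate()
evaluator.accumulate()
# Prints AP@[0.5:0.95], AP50, AP75, APS/APM/APL, and the recall metrics.
evaluator.summarize()
```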
Table 1 shows the AP metrics of the adversarial samples we generated with YOLOR, compared with the I-FGSM and PGD methods.
Table 2 shows the recall performance comparison of the above methods.
4.3. Generalizability Comparisons
Because existing object detection attack methods are, in practice, only applicable to evading detection, to demonstrate the effectiveness of our method more intuitively, we first reproduce I-FGSM and PGD, which were designed as classification attacks, and apply them to attacking the object detection model. To ensure fairness, all of our experiments attack the confidence, the bounding regression box, and the classification results of the object detection model, and they all use the same hyperparameter selection of (0.7, 0.05, 0.3) for the three loss weights. For comparability, we also use the same number of iterations, 10. The clean-model performance metrics used in this section are those reported by the original authors, against which we compare the effectiveness of our work. To support our conclusions more rigorously, all our experiments are built on these replicated implementations.
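For reference, the following is a minimal PGD-style sketch of how such an iterative gradient attack could be directed at a detection loss; `model`, `targets`, and `detection_loss` are assumed stand-ins for the detector, its annotations, and the combined confidence/box/classification loss, not the exact implementation used in the paper.

```python
import torch

def pgd_detection_attack(model, image, targets, detection_loss,
                         eps=8 / 255, alpha=2 / 255, steps=10):
    """Iterative gradient attack on a detector under an L-infinity bound.

    `model`, `targets`, and `detection_loss` are placeholders for the
    detector, its ground-truth annotations, and the combined
    confidence + box-regression + classification loss.
    """
    adv = image.clone().detach()
    for _ in range(steps):
        adv.requires_grad_(True)
        loss = detection_loss(model(adv), targets)   # loss to be maximized
        grad = torch.autograd.grad(loss, adv)[0]
        adv = adv.detach() + alpha * grad.sign()     # step along the gradient sign
        adv = image + torch.clamp(adv - image, -eps, eps)  # project into the L-inf ball
        adv = torch.clamp(adv, 0.0, 1.0)             # keep a valid image
    return adv.detach()
```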
Attack effect experiment: From the first row of our main experiment results in Table 1, we can see that YOLOR-CSP obtains a robust AP performance of 49.2% on clean images. The second and third rows show the impact on YOLOR of the adversarial samples generated by I-FGSM and PGD, which we use as references. Both use an L∞-norm constraint with a bound of 8/255. It is evident that both methods generate adversarial samples with extremely high similarity, with an average SSIM close to 0.9, as well as a high peak signal-to-noise ratio; however, they have no significant impact on YOLOR. As our method is an object-focused attack with no norm constraint on the perturbation, our GLH method with parameter setting 1 still achieves higher SSIM and PSNR than the previous two. Compared with I-FGSM and PGD, GLH improves the attack effect by 18.2% and 17.4%, respectively. This indicates that our attacks are not only superior to other methods but also that the perturbations generally have less impact on the images.
To achieve stronger attacks, we remove the L∞ constraints of I-FGSM and PGD, and their attack performance is significantly enhanced; however, YOLOR still retains 23.1% and 21.6% of its performance, respectively. Because these methods only consider the direction of the gradient and do not quantify the magnitude of the perturbation, their attacks against the object detection model remain far from satisfactory. Our method with parameter setting 2 is 17.7% and 12.0% more effective than I-FGSM and PGD, respectively, while its similarity is still higher than both.
To achieve the strongest attack effect, we use parameter setting 3, which heavily degrades the model's performance while ensuring that the SSIM is not lower than 0.7. Our attack causes the model to lose 90% of its AP performance, essentially disabling the model's judgment. Our later transferability experiments all use this setting for the attack.
Recall attack experiment: In Table 2, we test the effect of the adversarial samples on the recall. Here, max = n means retaining the top n prediction boxes in the confidence ranking on each image of the test set. Our method still outperforms I-FGSM and PGD in every metric. At max = 1, our GLH with parameter setting 3 almost completely defeats the object detection module: the recall drops from 37.6% to 8.8%, a direct reduction of the model's recall metric by 76.5%. Under the more tolerant settings max = 10 and max = 100, our method reduces the model's recall by 71.19% and 68.89%, respectively. In addition, for small targets, the recall metric after our attack is only 5.8%, which means that our method makes the model ignore almost all small targets. For medium targets, only a 19.9% recall remains, so the model nearly loses its ability to judge them. As for the recall metric for large targets, although we only reduce it to 34.9%, this is still a 56.9% drop compared with clean images, an extremely significant attack effect. The reason is that our attacks focus on objects, so they are especially effective against small and medium targets; for large objects, more perturbation is needed to interfere with the model's judgment, so the perturbation of the image increases accordingly, which still yields a strong attack effect.
Transferability experiments: The adversarial samples we generated with YOLOR are also highly transferable. To support this view, we chose recent and representative models for testing, and we find that our adversarial samples also achieve surprising results in black-box attacks, as shown in Table 3: our YOLOR-based adversarial samples obtain quite high transferability to YOLO models with different backbones. From the table, we can see that for YOLOv5, we reduce its performance from 37.40% to 15.30%, which corresponds to a performance loss of 59.09%. YOLOX [37] and YOLOv4 [38], which share the same backbone as YOLOv5, suffer performance losses of 54.29% and 73.9%, respectively. As for the different-backbone models YOLOv6 [39] and YOLOv7 [40], which are the newest and strongest models in the YOLO family, they lose 57.30% and 68.09% of their performance, respectively, under the black-box attacks with our generated samples.
More importantly, our attacks also exhibit strong transfer attack performance against non-YOLO models, as shown in Table 4: our adversarial samples likewise produce a high-intensity black-box attack effect on the detection performance of DETR [41] and EfficientDet [42].
Module ablation: To verify the effectiveness of each module, rigorous ablation experiments are conducted, as shown in Table 5. To better demonstrate the effectiveness of our work, we use GLH with parameter setting 3 as the configuration for the ablation experiments. After using only our generalized object detection attack module (AGA), the attack effect is especially powerful, but the similarity is only 0.668, because the gradient information of individual images differs considerably. When we inspect the quality of the generated adversarial samples, most of them are perturbed too severely: samples with excessive initial gradients end up, after iteration, with perturbations that deviate from normal adversarial samples, and the image distortion is already noticeable to the human eye. For such adversarial samples, this level of perturbation is unacceptable.
After adding the LFA module, we found that the similarity between the attacked image and the original image improves substantially. To present the performance of this module more intuitively, we counted the number of samples in each similarity range, as shown in Figure 6. In the gradient attack without the LFA module, there are extraordinarily many samples with an SSIM below 0.5. Such images are easily distinguishable by the human eye, because the unrestricted perturbation is extraordinarily destructive to the image. Although this obtains an amazing attack effect, we consider this kind of adversarial sample meaningless: an adversarial sample should balance the attack effect with the overall similarity distribution. After adding the LFA module, the similarity of the images mostly exceeds 0.5.
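The per-range counts plotted in Figure 6 can be obtained by bucketing the SSIM score of each adversarial sample; a minimal sketch, assuming `ssim_values` holds one precomputed score per sample (the values shown are placeholders).

```python
import numpy as np

# One SSIM score per adversarial sample (placeholder values).
ssim_values = np.array([0.42, 0.61, 0.73, 0.88, 0.35, 0.91])

# Count how many samples fall into each similarity range.
bins = [0.0, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
counts, _ = np.histogram(ssim_values, bins=bins)
for lo, hi, n in zip(bins[:-1], bins[1:], counts):
    print(f"SSIM in [{lo:.1f}, {hi:.1f}): {n} samples")
```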
IoU ablation: In the adaptive gradient attack module, we set an IoU parameter that represents the criterion for measuring how accurately the corresponding object is detected, and all our experiments take this parameter as GIoU. As shown in Table 6, the GIoU variant we selected achieves better performance while sacrificing only a little similarity, whereas EIoU achieves extremely powerful performance but sacrifices too much similarity. Because adversarial samples must first guarantee good similarity before we consider performance improvements, our parameter therefore takes GIoU.
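For reference, a minimal sketch of the GIoU criterion between two axis-aligned boxes in (x1, y1, x2, y2) format; this follows the standard GIoU definition rather than the paper's specific implementation.

```python
def giou(box_a, box_b):
    """Generalized IoU between two boxes given as (x1, y1, x2, y2)."""
    # Intersection area.
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    # Union area.
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    iou = inter / union

    # Smallest enclosing box; GIoU penalizes its empty area outside the union.
    cx1, cy1 = min(box_a[0], box_b[0]), min(box_a[1], box_b[1])
    cx2, cy2 = max(box_a[2], box_b[2]), max(box_a[3], box_b[3])
    enclose = (cx2 - cx1) * (cy2 - cy1)
    return iou - (enclose - union) / enclose
```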
Parametric ablation: For the third part of Equation (5), we set three hyperparameters: the weights of the confidence loss, the bounding regression box loss, and the classification loss. We adjusted the value of each parameter separately and analyzed its effect on the results, as shown in Figure 7. For the confidence loss weight and the bounding regression box loss weight, the attack effect increases significantly as the parameters increase; however, the similarity of the images is also affected, and the SSIM starts to decrease as the parameter values grow. This is because these weights amplify the output of the corresponding loss terms and thereby increase the amount of perturbation. For the classification loss weight, it is apparent that as it increases, the attack effect is also enhanced, but its effect on the image is enormous: for classification, more of the image needs to be disturbed to push the category into another class. Nevertheless, in terms of effect, it is more cost-effective to attack the confidence and the bounding regression boxes.
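As an illustration of how these weighted terms could be combined into the attack objective of Equation (5), here is a minimal sketch; the weight-to-loss assignment in the example call is an assumption based on the (0.7, 0.05, 0.3) setting reported above, and the individual loss values are placeholders.

```python
def attack_loss(conf_loss, box_loss, cls_loss, w_conf, w_box, w_cls):
    """Weighted attack objective: the confidence, bounding-box regression,
    and classification losses, each scaled by its own weight."""
    return w_conf * conf_loss + w_box * box_loss + w_cls * cls_loss

# Example call with the reported (0.7, 0.05, 0.3) weight tuple; which weight
# belongs to which loss term is assumed here, not confirmed by the paper.
total = attack_loss(conf_loss=1.2, box_loss=0.8, cls_loss=0.5,
                    w_conf=0.7, w_box=0.05, w_cls=0.3)
```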
For a more intuitive presentation of the functionality of the LFA module, we verified the effect of its parameter on the results, testing values from 10 to 100. As illustrated in Figure 8, the effect of the parameter is already greatly reduced once it reaches 40, and the downward trend levels off. To avoid operating in the region before the curve levels off (i.e., below 40), our main experiments use a value of 50, which avoids the influence of this parameter.