1. Introduction
Object detection in remote sensing images (RSIs) has broad applications in urban and agricultural fields. In recent years, there has been significant progress in convolutional-neural-network-based (CNN-based) detectors. Compared with traditional methods, CNN-based detectors have better feature extraction ability, faster speeds, and higher accuracy [1,2]. Adversarial attacks introduce small, imperceptible texture perturbations into clean images through gradient descent or adversarial generation, producing adversarial examples. CNNs are vulnerable to such attacks because they make their final predictions by building semantic understanding from the texture and color of the input images. Adversarial attacks disrupt this texture information and destroy the semantic understanding of the network, eventually leading to incorrect predictions [3]. Therefore, it is important to improve the robustness of CNN-based detectors and mitigate the performance degradation caused by adversarial attacks, especially to ensure the reliable deployment of security-sensitive applications.
Early works on adversarial attacks [4,5] focus on classifiers and aim to invalidate classification models. These works generate the most effective adversarial example while making minimal changes to the original image by adding disturbances to the input, thereby misleading the network and greatly reducing classifier accuracy. Specifically, gradient-based methods compute the minimum necessary perturbation by maximizing the prediction error of the network to produce adversarial examples. Transfer-based attacks craft adversarial examples on a surrogate model trained on similar data and exploit their transferability to attack the target model. As an extension of the classification task, object detection [6,7,8] outputs the classification label and the location at the same time, which makes the detection process more complex and more difficult to attack. Nevertheless, some works [9,10] significantly reduce the accuracy of models by generating adversarial patches, exposing the weak robustness of object detection under adversarial attacks.
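For illustration, a single gradient-sign step of this kind can be written as follows (a generic FGSM-style formulation, given here as an illustrative sketch rather than the exact method of the cited works):

$$x_{adv} = x + \epsilon \cdot \operatorname{sign}\big(\nabla_{x} J(\theta, x, y)\big),$$

where $x$ is the clean input with label $y$, $J$ is the training loss of the network with parameters $\theta$, and the budget $\epsilon$ bounds the perturbation magnitude.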
Compared with natural images, RSIs have a larger size, higher spatial resolution, and more detailed spectral information. A small noise perturbation can successfully attack the model and change its output [11]. Specifically, adversarial attacks in the digital world generate adversarial patches to cover important parts of the image [12]; such patches can also be applied in the physical world to evade or deceive object detectors. Defenses against such attacks have often been ignored, especially in military scenarios. Attackers apply adversarial perturbations to the target (such as an aircraft or ship); the resulting adversarial examples can directly deceive the CNN-based model, introducing serious risks to the object detection system.
Current defense methods used in object detection to improve robustness can be generally classified into two categories. The first type includes methods developed for model vulnerability [13], which locate the potential adversarial patch area and recover it for detection. Unfortunately, these methods face considerable challenges regarding the authenticity and accuracy of data recovery, which makes them inapplicable to remote sensing scenarios. The second, more widely used type obtains robust models through training. Through adversarial training [14], the MTD model learns from both clean and adversarial images, pays more attention to the robust features in adversarial samples, and ignores the non-robust features. However, adversarial training always leads to a robustness bottleneck [15]. Specifically, to resist adversarial attacks, the detector has to sacrifice performance on clean images in order to detect adversarial images. This is mainly caused by the introduction of adversarial features: the opposing training effects of the clean image and the adversarial image make it harder to distinguish clean and adversarial features, resulting in a decrease in detection performance. As shown in Figure 1, we use a classical single-shot multibox detector (SSD) [6], a robust detector using multi-task domains (MTD) [14], a robust detector with class-wise adversarial training (CWAT) [16], and our proposed detector to test the performance before and after the Projected Gradient Descent (PGD) adversarial attack [17]. The results show that current robust detectors improve accuracy under attack to some extent, but their performance on clean images inevitably degrades.
This paper provides effective solutions to the above issues. Firstly, a multi-dimensional convolution kernel is proposed to learn the robust features of the clean image and the corresponding adversarial image efficiently and dynamically. It obtains rich context information from RSIs and significantly improves the feature extraction ability of the detector. Furthermore, a regularization loss is designed to constrain the consistent feature reconstruction process from the perspective of the internal distribution of the image. It reduces the interference of adversarial attacks and further increases the robustness of the object detector. The key contributions of this paper are summarized as follows:
A multi-dimensional adversarial convolution (MAConv) kernel is proposed to extract features adaptively and dynamically. It effectively extracts both shared features and specific features from clean images and adversarial images under attacks and thus enhances the ability of feature extraction.
From the perspective of image distribution, a consistency regularization loss is proposed to extract consistent features from the mixture distribution under attacks. By reconstructing clean features for detection, our method successfully improves the robustness of the object detector.
We performed extensive experiments under different adversarial attacks on three public datasets, HRSC, UCAS-AOD, and DIOR. The experimental results show superior performance in both single-class and multi-class object detection compared with current robust object detectors.
4. Experiments and Analysis
4.1. Datasets
In this work, three multi-resolution RSI datasets, covering one-class, two-class, and multi-class detection, were employed to validate detection robustness and to evaluate the effectiveness of our proposed method. The details of the three datasets are as follows:
The HRSC [47] dataset contains ship images from two scenarios: offshore and nearshore. The data are all obtained from Google Earth and contain 2976 targets in total. The image resolution ranges from 0.4 to 2 m, and the image sizes range from 300 × 300 to 1500 × 900 pixels. The dataset is split into training, validation, and test sets containing 436, 181, and 444 images, respectively.
UCAS-AOD [48] contains 1510 aerial images with 14,596 instances in two categories: vehicle and plane. The image size is about 659 × 1280 pixels. We randomly selected 1110 images for training and 400 for testing.
DIOR [49] is a large-scale, publicly available benchmark for object detection in remote sensing scenarios. It contains 23,463 images and 192,472 instances in total, with an image size of 800 × 800 pixels. The dataset covers 20 object classes: airplane, airport, baseball field, basketball court, bridge, chimney, dam, expressway service area, expressway toll station, golf course, ground track field, harbor, overpass, ship, stadium, storage tank, tennis court, train station, vehicle, and windmill. Our experiments selected 11,725 images as the training dataset and 11,738 as the test dataset.
4.2. Implementation Details
Our experiments were implemented in PyTorch and run on NVIDIA V100 GPUs. Experiments on three typical RSI datasets were conducted to evaluate the universality of our method. To show the applicability of the model, we also conducted an experiment on the general dataset PASCAL VOC [50]. For robustness evaluation, all of the settings in this paper are consistent with those of current robust detectors such as MTD [14], CWAT [16], and RobustDet [15]. Specifically, our proposed method is embedded in the one-stage detector SSD [6] with VGG16 as the backbone. We used three common attack algorithms for comparison: PGD [17], CWA [16], and DAG [9]. The visualizations of the attacks on different datasets are shown in Figure 6. Due to the complex background of remote sensing images, the changes in the images cannot easily be noticed with the naked eye; here, the texture changes are clearly visible in the ‘sea’ region of the HRSC examples. The PGD-20 attacker is set with a budget ϵ = 8 to generate adversarial examples. For the DAG attack, we conducted 150 steps for an effective attack. We employed stochastic gradient descent (SGD) with a learning rate of . The PASCAL VOC mean average precision (mAP) with an IoU threshold of 0.5 was used as the evaluation metric.
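For reference, the PGD attack used here can be sketched as follows (a minimal PyTorch implementation under our assumptions: pixel values in [0, 255], a budget of ϵ = 8 as stated above, and a hypothetical `detector_loss` standing in for the SSD classification-plus-localization loss):

```python
import torch

def pgd_attack(model, images, targets, detector_loss, eps=8.0, steps=20, alpha=2.0):
    """PGD-20 sketch: maximize the detector loss within an L-inf ball of radius eps."""
    adv = images.clone().detach()
    adv = adv + torch.empty_like(adv).uniform_(-eps, eps)   # random start
    adv = adv.clamp(0.0, 255.0)
    for _ in range(steps):
        adv.requires_grad_(True)
        loss = detector_loss(model(adv), targets)
        grad, = torch.autograd.grad(loss, adv)
        with torch.no_grad():
            adv = adv + alpha * grad.sign()                  # ascend the loss
            adv = images + (adv - images).clamp(-eps, eps)   # project onto the budget
            adv = adv.clamp(0.0, 255.0)
        adv = adv.detach()
    return adv
```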
4.3. Ablation Study
We performed a series of ablation experiments to evaluate each component of our method independently. ‘Clean’ represents the detector trained normally on clean images. Since object detection consists of two tasks, classification and location regression, we divided the PGD attack into a classification attack and a localization attack.
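This split can be sketched as follows (our reading of the setup; `ssd_losses` is a hypothetical helper returning the SSD classification and localization loss terms separately), with each variant plugged into the `detector_loss` argument of the PGD sketch above:

```python
def cls_attack_loss(outputs, targets):
    """Classification-only attack objective (`ssd_losses` is hypothetical)."""
    loss_cls, _ = ssd_losses(outputs, targets)
    return loss_cls

def loc_attack_loss(outputs, targets):
    """Localization-only attack objective."""
    _, loss_loc = ssd_losses(outputs, targets)
    return loss_loc
```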
4.3.1. Ablation Test of MAConv
We assessed the effectiveness of MAConv on HRSC and employed the common DynamicConv [40] used in the current RobustDet for comparison. As shown in Table 1, compared with the original DynamicConv, the performance on clean images and under the different attacks increased by 1.21%, 3.55%, 4.07%, 4.03%, and 3.90%, respectively. In addition, MAConv effectively controls the parameter scale and maintains a proper balance between model accuracy and size: the number of convolution layers increased from 155 to 187 using MAConv, while the trainable parameters decreased from 114.73 M to 41.63 M. As MAConv is utilized in the early stage of the network to optimize the convolution, it can extract both shared and specific features from clean and adversarial images, which reduces the training difficulty and improves the feature extraction ability. By increasing the attention dimensions instead of simply stacking network parameters and layers, the accuracy is greatly improved while the model magnitude is kept under control. These results demonstrate that MAConv is highly effective at extracting feature information from RSIs.
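The idea of attending over several kernel dimensions rather than over candidate kernels alone can be sketched as follows (a minimal PyTorch sketch of a multi-dimensional dynamic convolution in the spirit of MAConv; the module name, chosen dimensions, and attention heads are our assumptions, not the paper's exact design):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiDimDynamicConv(nn.Module):
    """Mixes K candidate kernels with sample-conditioned attentions over the
    kernel, input-channel, and output-channel dimensions, instead of the
    single kernel-wise attention used in plain DynamicConv."""
    def __init__(self, in_ch, out_ch, k=3, num_kernels=4, reduction=4):
        super().__init__()
        self.out_ch, self.k = out_ch, k
        self.weight = nn.Parameter(torch.randn(num_kernels, out_ch, in_ch, k, k) * 0.02)
        hidden = max(in_ch // reduction, 8)
        self.fc = nn.Sequential(nn.Linear(in_ch, hidden), nn.ReLU(inplace=True))
        self.attn_kernel = nn.Linear(hidden, num_kernels)  # over candidate kernels
        self.attn_in = nn.Linear(hidden, in_ch)            # over input channels
        self.attn_out = nn.Linear(hidden, out_ch)          # over output channels

    def forward(self, x):
        b, c, h, w = x.shape
        ctx = self.fc(x.mean(dim=(2, 3)))                  # global context per sample
        a_k = F.softmax(self.attn_kernel(ctx), dim=1)      # (b, K)
        a_in = torch.sigmoid(self.attn_in(ctx))            # (b, in_ch)
        a_out = torch.sigmoid(self.attn_out(ctx))          # (b, out_ch)
        # Mix candidate kernels per sample, then modulate channel dimensions.
        wt = torch.einsum('bn,noihw->boihw', a_k, self.weight)
        wt = wt * a_out[:, :, None, None, None] * a_in[:, None, :, None, None]
        # Apply the per-sample kernels in one grouped convolution.
        y = F.conv2d(x.reshape(1, b * c, h, w),
                     wt.reshape(b * self.out_ch, c, self.k, self.k),
                     padding=self.k // 2, groups=b)
        return y.reshape(b, self.out_ch, h, w)
```

For example, `MultiDimDynamicConv(64, 64)` could stand in for a standard 3 × 3 convolution in an early backbone layer; the extra attention heads add few parameters relative to stacking additional candidate kernels.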
4.3.2. Ablation Test of Consistency Regularization Loss
We analyzed the effect of the consistency regularization loss under different attacks on the HRSC dataset. Table 2 shows the experimental results. Without regularization, the model exhibits poor robustness, reaching only 10% to 14% mAP under attacks. In contrast, although slight degradation occurs on clean images, the detector achieves substantial performance improvements under all kinds of attacks when the regularization loss is used. In particular, the highest improvement reached 35.87% under the DAG attack, at a cost of only 2.95% on clean images. This demonstrates that the model predicts the correct distribution under the constraint of the regularization loss, which effectively improves the robustness of the detector. Compared with MAConv, which requires multiple layers to extract features, the consistency regularization loss imposes a more direct and stronger constraint on the reconstructed features used for subsequent detection, leading to a larger precision increase.
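A minimal sketch of such a consistency term is given below (our assumed formulation for illustration; the paper's exact loss may differ). It penalizes the divergence between the feature distributions obtained from a clean image and from its adversarial counterpart, pulling the adversarial branch toward the clean one:

```python
import torch.nn.functional as F

def consistency_loss(feat_clean, feat_adv, tau=1.0):
    """KL divergence between pooled feature distributions of the clean and
    adversarial branches; feat_* are (B, C, H, W) backbone feature maps."""
    target = F.softmax(feat_clean.mean(dim=(2, 3)).detach() / tau, dim=1)  # clean side as target
    pred = F.log_softmax(feat_adv.mean(dim=(2, 3)) / tau, dim=1)           # adversarial side
    return F.kl_div(pred, target, reduction='batchmean')

# Total objective (sketch): detection losses on both branches plus the
# weighted consistency term, e.g.
#   loss = det_clean + det_adv + lam * consistency_loss(f_clean, f_adv)
```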
4.3.3. Evaluation Using Different Network Architectures
Based on our proposed methods, we evaluated the model using different backbone networks. Besides the original VGG16 backbone of the SSD, we also utilized a ResNet50 backbone in our experiments. The results in Table 3 show that, with the VGG16 backbone, our model increases the accuracy of the robust detector by 3.55% to 4.07% under attacks, while the mAP on clean images also increases slightly on the HRSC dataset, by about 1.21% compared with the original DynamicConv. For the ResNet50 backbone, MAConv was applied in the base layers of the SSD. The results show that adversarial attacks still cause performance degradation after changing the backbone. Compared with the original mAP values of 84.21%, 12.16%, 12.93%, 12.79%, and 14.77%, our model improves the performance by 1.36%, 4.38%, 4.16%, 4.22%, and 3.78%, respectively. These results illustrate that our model consistently enhances the adversarial robustness of object detectors across different backbone networks.
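As an illustration of such a substitution (a hypothetical sketch: the paper does not specify the replaced layers beyond "the base layers of the SSD"), a MAConv-style module can be dropped into an early stage of a torchvision ResNet50:

```python
import torchvision

# Replace an early 3x3 convolution with the MultiDimDynamicConv sketch from
# Section 4.3.1 (hypothetical placement; channel sizes match ResNet50's layer1).
backbone = torchvision.models.resnet50(weights=None)
backbone.layer1[0].conv2 = MultiDimDynamicConv(64, 64, k=3)
```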
4.4. Overall Comparison
To evaluate the comprehensive performance, our detector was tested on the natural image set VOC07 and three remote sensing datasets: HRSC, UCAS-AOD, and DIOR. For fairness, all experiments were performed under the same conditions. Lastly, the evaluation results were comprehensively analyzed.
4.4.1. Results for the VOC07 Dataset
To verify the effectiveness and generality of the proposed model, we conducted experiments on the VOC07 dataset of natural images. As shown in Table 4, the adversarial attacks lower the accuracy of the clean SSD detector to less than 5%. The current robust detectors improve the performance under attacks but also degrade considerably on clean images. Our proposed model is, in principle, also effective at enhancing detection robustness for natural images. The experimental results show improvements of 2.8% and 2.7% under the PGD attacks compared with the clean SSD, and of 2.4% and 2.6% under the CWA and DAG attacks, respectively. At the same time, it achieves an increase of 0.8% on clean images.
4.4.2. Results for the HRSC Dataset
Rectangular ship targets are the main objects in HRSC. The results in Table 5 show a stable performance improvement for our model under different types of attack. The state-of-the-art robust detector RobustDet achieves mAPs of 80.98%, 10.85%, 11.57%, 11.12%, and 13.54% in the different cases, whereas our model shows enhancements of 2.29%, 4.57%, 4.99%, 4.53%, and 5.93%, respectively. At the same time, it also improves the performance on clean images by 2.21%. With performance on clean images comparable to that of the SSD, this proves the robustness of the proposed model against different attacks while surpassing current robust detectors. Figure 7 visualizes an example of the detection results of the model on HRSC, where we can see that the attack may lead to inaccurate localization of the densely arranged targets, resulting in overlapping bounding boxes.
4.4.3. Results for the UCAS-AOD Dataset
Table 6 presents the results of current robust detectors on the UCAS-AOD dataset under mainstream adversarial attacks. The attacks significantly degrade the detectors' performance. In particular, the accuracy for small objects such as vehicles is worse than that for aircraft, which reduces the overall detection accuracy. The results in Figure 8 demonstrate that adversarial attacks also make object localization more difficult, especially for densely distributed small targets such as cars, with several bounding boxes overlapping near the same target. For the clean SSD model, adversarial attacks decrease the mAP to 1.36%, rendering the detector almost ineffective. Robust detectors are therefore required to improve performance under attacks while maintaining accuracy on clean images. Current robust detectors alleviate this problem to some extent. The results indicate that the proposed model increases the performance by up to 4.34% under different attacks compared with RobustDet, demonstrating excellent performance on RSIs.
4.4.4. Results for the DIOR Dataset
The DIOR dataset has more complex scenes and multiple target categories, containing dense and sparse objects of different scales. This not only increases the difficulty of detection but also makes adversarial training challenging. The performance results are shown in Table 7. Despite the poor accuracy for some small objects, our proposed method still maintains its performance and robustness advantages over other detectors. Compared with RobustDet, our model improves the mAP on clean images by 2.28%, and under the different attacks the mAP increases by 1.47%, 1.32%, 1.58%, and 1.61%, respectively. The visualization results are presented in Figure 9, which shows that, even for small targets in a normal arrangement, the attack may make bounding boxes overlap and lead to inaccurate localization. In general, our model achieves higher overall performance across datasets of different sizes and with various types of objects.
5. Discussion
The results of this study show that our methods are effective at enhancing detection robustness. Compared with the latest robust detector RobustDet, the proposed method achieves maximum improvements of 5.93%, 4.34%, and 1.61% on the remote sensing datasets HRSC, UCAS-AOD, and DIOR under different attacks. These results show that the proposed MAConv and consistency regularization loss play a key role in the robust detector. However, several issues remain to be discussed for real-world applications. Firstly, in remote sensing scenarios, the coverage of an attack is limited, and adversarial attacks on remote sensing targets often appear in the form of patches, which are inconspicuous and difficult to defend against. The specific implementations of such attacks are still under development, and defense and robustness studies against them need to be carried out further in the future. In addition, for large datasets with many categories, and especially for small targets with only a few samples, accurate detection remains difficult, which complicates the adversarial training of robust detectors. Effectively optimizing the adversarial training process and improving the feature learning strategy are also possible directions for future work.
6. Conclusions
In this work, we present a robust object detector for multi-scale objects in remote sensing data. To alleviate the robustness bottleneck under adversarial attacks, we proposed a multi-dimensional convolution to dynamically learn the specific and consistent features in clean and adversarial images. This method also enriches the feature extraction process for subsequent detection while controlling the network scale. Furthermore, to eliminate the interference of adversarial attacks on accuracy, the reconstruction of clean features was constrained with a regularization loss from the perspective of image distribution. The regularization loss contributes to extracting consistent features from the mixture distribution and eliminating the interference introduced by adversarial attacks, enabling accurate object detection in complex remote sensing scenarios. The experimental results illustrate that our proposed method successfully increases accuracy under adversarial attacks while maintaining performance on clean images. Thus, our method effectively enhances the robustness of the object detector and outperforms current robust detectors on RSIs. However, for remote sensing applications, although current robust detectors can resist attacks to some extent, the accuracy under interference still falls short of application-level requirements. Balancing the accuracy and robustness of remote sensing object detectors for actual deployment is an important future research direction. Additionally, we will conduct research on robust out-of-distribution detection to further improve the reliability of object detectors for RSIs.