1. Introduction
Object detection in remote sensing is a challenging task due to the arbitrary orientations of objects and the unbalanced distribution of objects within a single image. For instance, one image may contain hundreds of vehicles, while another may only have a single tennis court.
Detectors based on CNNs [1,2,3,4,5,6,7,8,9,10,11,12,13,14] have achieved significant results in object detection tasks for optical remote sensing images. These methods can be divided into two categories based on the presence of anchors: anchor-based and anchor-free. Anchor-based methods [2,7,8] work by pre-setting anchor boxes and then adjusting them during prediction to obtain the final results. In contrast, anchor-free methods [11,12,13,14] do not predict the offsets of anchor boxes directly; instead, they often use the center point as a reference, obtaining predictions through horizontal, vertical, or rotational adjustments. Based on the stage of detection, they can be further divided into single-stage and two-stage detectors. Single-stage detectors [7,8,9,10] directly yield prediction results, offering fast speeds but generally lower accuracy. Two-stage detectors [1,2,4,5] first generate proposals through a region proposal network (RPN); these proposals are then classified and further refined by subsequent networks to produce predictions, resulting in slower speeds but higher accuracy.
However, these methods [1,2,3,4,5,6,7,8,9,10,11,12,13,14], whether starting from rotating boxes or center points, generate a large number of background samples that interfere with the detector's judgment, and they require manually designed preprocessing and postprocessing components to meet the needs of optical remote sensing object detection. These components not only increase the complexity of model design but also limit the universality and flexibility of the model.
Recently, transformer-based detectors, like DETR [15], have brought revolutionary changes to object detection tasks, especially by discarding the manually designed components and directly predicting the categories and bounding boxes of objects through an end-to-end training approach.
To address the issue of slow convergence, researchers have improved various aspects of DETR [15]: using deformable attention to reduce the amount of computation [16], disentangling the semantic and positional information of queries [17], employing denoising training on noise-added ground truth [18,19], and utilizing auxiliary training heads [20] to accelerate convergence.
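To make the first of these ideas concrete, the following is a minimal single-level, single-head sketch of deformable-attention sampling in PyTorch. The tensor layout and the omission of multi-scale and multi-head logic are simplifying assumptions for illustration, not the reference implementation of [16]:

```python
import torch
import torch.nn.functional as F

def deformable_sample(value, ref_points, offsets, weights):
    """Simplified deformable attention: each query attends only to a few
    sampled points around its reference point instead of the whole map.

    value:      (B, C, H, W) encoder feature map
    ref_points: (B, Q, 2) reference points, normalized to [0, 1], (x, y) order
    offsets:    (B, Q, P, 2) predicted sampling offsets (normalized units)
    weights:    (B, Q, P) softmax-normalized attention weights
    """
    # Absolute sampling locations in [0, 1], mapped to grid_sample's [-1, 1].
    loc = ref_points[:, :, None, :] + offsets               # (B, Q, P, 2)
    grid = 2.0 * loc - 1.0
    sampled = F.grid_sample(value, grid, mode='bilinear',
                            align_corners=False)            # (B, C, Q, P)
    out = (sampled * weights[:, None, :, :]).sum(dim=-1)    # (B, C, Q)
    return out.transpose(1, 2)                              # (B, Q, C)
```

Because only P points per query are sampled instead of all H × W positions, the attention cost drops from O(Q·HW) to O(Q·P).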
However, existing DETR-based detectors still have some limitations in rotated object detection tasks, and the number of queries can seriously affect the amount of computation. When facing arbitrary-oriented object detection tasks, introducing the angle of bounding boxes into the transformer-based detector (with DINO [19] as the baseline in this paper) still presents the following three challenges: (1) the choice of an appropriate format to define rotating boxes; (2) the adjustment of the method for calculating reference points in deformable attention to accommodate rotating boxes; and (3) the iterative correction of the rotating box angles in the decoder when the values of $(x, y, w, h)$ are all between 0 and 1.
In this paper, we aim to adapt transformer-based detectors for oriented object detection and specifically enhance the accuracy and speed in arbitrary-oriented object detection tasks.
For adapting the transformer-based detector to oriented object detection, we propose a method to define rotating boxes in the $(x, y, w, h, \theta)$ format, which is a natural extension of the $(x, y, w, h)$ format. We designed an algorithm for rotating reference points to ensure that the interaction reference points generated from the encoder's output for rotating box proposals remain within the box. We developed activation and inverse activation functions specific to rotation, similar to mapping $(x, y, w, h)$ to the 0–1 range to match image predictions, to accommodate different standards of rotating angle descriptions. Furthermore, to address the limitations of computational cost and slow convergence in transformer-based detectors, inspired by RT-DETR [21], we employed a hybrid encoder [21] to reduce the computational cost and number of parameters while maintaining the same level of accuracy. Then, for the loss of memory positional information, we propose an Image Feature Reconstruction (IFR) module to supervise the memory obtained through feature interactions via self-attention in the encoder. By restoring the memory to multi-layer features and upsampling them to the original image size for feature reconstruction, we can effectively compensate for the loss of spatial positioning information caused by the flattening operation necessary for self-attention. Finally, for the issue of bipartite graph matching and the limit on the number of queries, we propose a method to select auxiliary dynamic training queries for the decoder, which improves the quality of the top-k proposals selected by the encoder and mitigates an issue encountered during prediction, where using only the top-k scores to obtain proposals can result in high classification scores but low-quality bounding boxes.
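As an illustration of the rotation-specific activation and inverse activation functions just described, here is a minimal PyTorch sketch. The assumed angle range $[-\pi/2, \pi/2)$ follows one common OpenCV-style ("le90") convention and is an assumption for illustration; the same construction applies to any other angle definition by changing the two bounds:

```python
import torch

# Assumed angle range (OpenCV-style 'le90' convention): [-pi/2, pi/2).
ANGLE_MIN, ANGLE_MAX = -torch.pi / 2, torch.pi / 2

def angle_activation(logit):
    """Map an unbounded logit to an angle, mirroring how sigmoid maps
    box logits for (x, y, w, h) into the 0-1 range."""
    return torch.sigmoid(logit) * (ANGLE_MAX - ANGLE_MIN) + ANGLE_MIN

def angle_inverse_activation(theta, eps=1e-5):
    """Inverse mapping, analogous to inverse_sigmoid for boxes; needed when
    angles are fed back into the decoder for iterative refinement."""
    t = (theta - ANGLE_MIN) / (ANGLE_MAX - ANGLE_MIN)
    t = t.clamp(eps, 1 - eps)
    return torch.log(t / (1 - t))
```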
In conclusion, the main contributions of this paper can be summarized as follows:
We adapted a transformer-based detector to accommodate arbitrary-oriented object detection tasks, using enhancements in deformable attention, iterative corrections in the decoder, and methods of adding noise to rotating bounding boxes;
By employing a hybrid encoder, we effectively reduced the computational cost and number of parameters while maintaining accuracy, thereby alleviating the limitations of computational cost and slow convergence in transformer-based detectors;
We introduced an image feature reconstruction (IFR) module to supervise the memory obtained through feature interactions via self-attention within the hybrid encoder. By restoring the memory to multi-layer features and upsampling them to the original image size for feature reconstruction, we can effectively compensate for the loss of spatial positioning information;
We developed a method to select auxiliary dynamic training queries for the decoder, enhancing the quality of the top-k proposals generated by the encoder. It can mitigate issues encountered during prediction when there are high classification scores but low-quality bounding boxes.
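To make the last contribution above concrete, the following is a hedged sketch of how auxiliary dynamic training queries could be selected. The joint scoring rule and the function itself are illustrative assumptions, not our exact implementation; `box_quality` stands for a training-time signal such as the IoU between a proposal and its assigned ground truth:

```python
import torch

def select_decoder_queries(cls_scores, box_quality, k_base, k_aux):
    """Sketch: augment the standard top-k (by classification score) proposal
    selection with auxiliary training queries ranked by a joint score, so
    proposals with high class scores but poor boxes are down-weighted.

    cls_scores:  (N,) max classification score per encoder proposal
    box_quality: (N,) training-time box quality, e.g. IoU with assigned GT
    """
    topk_idx = cls_scores.topk(k_base).indices     # standard DETR-style pick
    joint = cls_scores * box_quality               # quality-aware ranking
    aux_idx = joint.topk(k_aux).indices            # auxiliary training queries
    return torch.cat([topk_idx, aux_idx]).unique()
```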
2. Related Works
CNN Architecture-Based Detectors. Detection algorithms based on CNN architectures primarily focus on improvements in three key areas: feature alignment, positive and negative sample matching, and the regression of rotated bounding boxes.
Han et al. [7] proposed a single-shot alignment network (S²A-Net) consisting of two modules, a feature alignment module (FAM) and an oriented detection module (ODM), to alleviate the inconsistency between classification score and localization accuracy. Yang et al. [8] proposed an end-to-end refined single-stage rotation detector for fast and accurate object detection using a progressive regression approach from coarse to fine granularity. Hou et al. [12] proposed novel flexible shape-adaptive selection (SA-S) and shape-adaptive measurement (SA-M) strategies for oriented object detection, which comprise an SA-S strategy for sample selection and an SA-M strategy for the quality estimation of positive samples. Li et al. [14] proposed an effective adaptive point learning approach to aerial object detection by taking advantage of the adaptive point representation, which can capture the geometric information of arbitrary-oriented instances.
Although detectors based on CNN architectures have achieved significant results, they still require complex preprocessing and postprocessing.
Transformer Architecture-Based Detectors. Transformer-based detectors can be divided into two categories. The first category combines self-attention, cross-attention, and CNNs to enhance the network’s capability for image feature extraction and interaction. The second category includes DETR-like detectors, which are applied to tasks involving arbitrary-oriented object detection.
For the first category, Li et al. [22] proposed a method that combines a transformer with a transfer CNN for object detection in remote sensing images. The transformer is used to process a feature pyramid of the image, while the CNN is used to extract features. Zhang et al. [23] introduced GANsformer, a detection network that combines a convolutional network with a transformer for aerial image analysis. The transformer is employed as a branch network to improve the CNN's ability to encode global features. Tang et al. [24] proposed a method that utilizes feature sampling and grouping for scene text detection in remote sensing images. Their approach combines a transformer with a CNN to effectively detect text in complex scenes. Liu et al. [25] proposed a hybrid network architecture called TransConvNet, which combines the advantages of CNNs and transformers by aggregating global and local information. They also designed an adaptive feature fusion network to capture information from multiple resolutions. Pu et al. [26] introduced an Adaptive Rotated Convolution (ARC) module to identify and locate objects in images with arbitrary orientation.
For the second category, Zheng et al. [27] developed ADT-Det, an adaptive dynamic refined single-stage transformer detector for arbitrary-oriented object detection in satellite optical imagery. Their approach utilizes a transformer-based architecture to achieve accurate detection results. Dai et al. [28] introduced RODFormer, a high-precision design for rotating object detection with transformers. Their method utilizes a transformer-based architecture to accurately detect and localize rotating objects in remote sensing images. Ma et al. [29] introduced a novel approach to oriented object detection that leverages transformers to bypass complex rotated anchors and incorporates a memory-efficient encoder with depthwise separable convolution. Lee et al. [30] proposed a transformer-based oriented object detector named Rotated DETR with oriented bounding box (OBB) labeling. They employed a scoring network for background token reduction and an innovative proposal generator with iterative refinement for precise angle-aware proposals. Dai et al. [31] proposed an Arbitrary-Oriented Object DEtection TRansformer framework, termed AO2-DETR, which incorporates oriented proposal generation, adaptive proposal refinement for rotation-invariant features, and a rotation-aware set matching loss within a transformer framework. Zhou et al. [32] introduced dynamic queries for efficiency without loss in performance and deployed a novel label re-assignment strategy. Their framework is based on DETR, with the box regression head replaced by a point prediction head. Hu et al. [33] introduced a Reassigned Bipartite Graph Matching (RBGM) to filter high-quality negative samples, an Ignored Sample Predicted Head (ISPH) for precise negative sample prediction, and a Reassigned Hungarian Loss to enhance model training with high-quality negative samples. Pu et al. [34] introduced Rank-DETR, an enhanced DETR-based object detection framework that prioritizes high localization accuracy in bounding box predictions to improve ranking accuracy and overall object detection performance, especially under high Intersection over Union (IoU) thresholds.
Our method belongs to the second category, employing a relatively simple strategy to apply DETR to arbitrary-oriented object detection, and achieving competitive results.
Label Assignment Strategy in Transformers. Since the introduction of the transformer architecture into object detection tasks by DETR [15], the Hungarian one-to-one matching and set prediction approach of the DETR architecture has remained the mainstream in DETR-like algorithms. This matching method allows object detection to forego the NMS operation inherent in CNN architectures. As a result, the preprocessing and postprocessing stages of detection algorithms have been greatly simplified. However, Hungarian matching introduces new challenges in object prediction. It increases the instability of the queries, as the same query often corresponds to different objects during the iterative process across multiple layers of the decoder, thereby reducing the network's convergence speed. Additionally, when the number of objects in an image approaches the number of queries, prediction accuracy can drop significantly. DAB-DETR [17] improved detection accuracy through iterative correction across multi-layer decoder iterations. DN-DETR (De-Noising DETR) [18] introduced denoising of noise-added ground truth to enhance the decoder's predictive capability for queries. However, neither of these methods adopts the approach of Co-DETR [20], which utilizes an additional matching method to improve the iterative pattern of queries. The positive and negative sample allocation method proposed in this paper is similar to that of Co-DETR [20], but it does not employ additional bounding box and class prediction branches. Instead, it merely involves the reallocation of the encoder's output. By utilizing the effective positive and negative sample allocation methods found in CNN architectures, our approach provides more positive samples to aid in the convergence of queries.
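For reference, the following is a minimal single-level sketch of the ATSS-style adaptive threshold underlying such CNN-style one-to-many assignment (full ATSS also performs per-level candidate selection and an inside-box test, which are omitted here as simplifying assumptions):

```python
import torch

def atss_like_assign(ious, distances, topk=9):
    """One-to-many assignment sketch: unlike Hungarian one-to-one matching,
    each ground truth may receive several positive proposals.

    ious:      (num_props, num_gt) IoU between proposals and ground truths
    distances: (num_props, num_gt) center distances
    Returns a boolean (num_props, num_gt) positive-sample mask.
    """
    # Candidates: the topk proposals with the closest centers per GT.
    cand = distances.topk(topk, dim=0, largest=False).indices  # (topk, num_gt)
    cand_ious = ious.gather(0, cand)                           # (topk, num_gt)
    # Adaptive per-GT IoU threshold: mean + std over its candidates.
    thr = cand_ious.mean(dim=0) + cand_ious.std(dim=0)         # (num_gt,)
    pos = torch.zeros_like(ious, dtype=torch.bool)
    pos.scatter_(0, cand, cand_ious >= thr[None, :])
    return pos
```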
4. Experiment
This section presents the details of the experimental settings, including the dataset, the experimental environment, and the evaluation indicators. Each subsection elaborates on a specific aspect of the experimental setup.
4.1. Dataset
To demonstrate the effectiveness of our algorithm, this section first introduces the hyperparameter settings and training environment configurations for the DETR-ORD detector. Subsequently, we evaluate the optimized detector on multiple public optical remote sensing datasets, including DOTA [37] (DOTA-v1.0 and DOTA-v1.5 [37]), DIOR-R [38], and HRSC2016 [39]. The results are then compared against the leading detection algorithms on the respective datasets.
DOTA [37] is one of the largest datasets for oriented object detection, with three main versions currently available: DOTA-v1.0, DOTA-v1.5, and DOTA-v2.0. In this paper, we conduct comparative experiments using versions 1.0 and 1.5. Due to the large size of the DOTA-v2.0 dataset, our hardware resources are insufficient to meet its training requirements, and thus experiments on version 2.0 were not conducted. DOTA-v1.0 consists of 2806 large aerial images with pixel sizes ranging from 800 × 800 to 20,000 × 20,000, containing objects of various sizes, orientations, and shapes. The release time, number of categories, number of images, and number of instances for the three versions of the dataset are shown in Table 1.
DOTA-v1.0 and DOTA-v1.5 have the same number of images, with 1411 images in the training set, 458 images in the validation set, and 937 images in the test set. DOTA-v1.0 contains 188,282 annotated instances covering 15 common categories: Plane (PL), Baseball Diamond (BD), Bridge (BR), Ground Track Field (GTF), Small Vehicle (SV), Large Vehicle (LV), Ship (SH), Tennis Court (TC), Basketball Court (BC), Storage Tank (ST), Soccer-Ball Field (SBF), Roundabout (RA), Harbor (HA), Swimming Pool (SP), and Helicopter (HC) [37]. DOTA-v1.5 adds one more category, Container Crane (CC), and includes 402,089 annotated instances. Due to the varying sizes of images in the DOTA dataset, we followed the official data preprocessing method, splitting the images into 1024 × 1024 patches with a 200-pixel overlap (gap) between adjacent patches. The split images were then used for detection, and the results were merged. After splitting, the DOTA-v1.0 and DOTA-v1.5 training sets contain 15,749 images, and the validation sets contain 5297 images. In our experiments on the DOTA dataset, we merged the training and validation sets for training and performed inference on the test set, submitting the results to the official DOTA evaluation server. The distribution of the number of annotations per image for DOTA-v1.5 is shown in Figure 4. The DOTA-v1.5 dataset exhibits a long-tail distribution, with most images having 0–50 annotations. To evaluate the detector's performance on the DOTA dataset, we conducted experiments on DOTA-v1.0 and DOTA-v1.5, setting the number of queries to 400 and the number of denoising queries to 100, using ResNet50 as the backbone network and training for 36 epochs.
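For illustration, a minimal sketch of the sliding-window origins produced by this splitting scheme (assuming, as in the official tool, a patch size of 1024 and a gap of 200, i.e., a step of 824 pixels; the padding applied to images smaller than a patch is omitted):

```python
def split_windows(width, height, subsize=1024, gap=200):
    """Top-left origins of DOTA-style patches: fixed-size windows that
    overlap by `gap` pixels, with the last window clamped to the image
    border so every pixel is covered."""
    step = subsize - gap
    xs = list(range(0, max(width - subsize, 0) + 1, step))
    ys = list(range(0, max(height - subsize, 0) + 1, step))
    if xs[-1] + subsize < width:
        xs.append(width - subsize)
    if ys[-1] + subsize < height:
        ys.append(height - subsize)
    return [(x, y) for y in ys for x in xs]

# A 2048 x 2048 image yields origins (0, 824, 1024) per axis, i.e. 9 patches.
```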
The DIOR-R [38] dataset is a re-annotated version of the DIOR dataset's images. The DIOR-R dataset contains a total of 23,463 images and 192,518 ground-truth annotations. Each image in the dataset has a size of 800 × 800 pixels, with spatial resolutions ranging from 0.5 m to 30 m. The training and validation sets combined consist of 11,725 images, while the test set includes 11,738 images. The dataset covers 20 categories: Airplane (APL), Airport (APO), Baseball Field (BF), Basketball Court (BC), Bridge (BR), Chimney (CH), Expressway Service Area (ESA), Expressway Toll Station (ETS), Dam (DAM), Golf Course (GF), Ground Track Field (GTF), Harbor (HA), Overpass (OP), Ship (SH), Stadium (STA), Storage Tank (STO), Tennis Court (TC), Train Station (TS), Vehicle (VE), and Windmill (WM) [38]. The distribution of ground-truth annotations per image in the dataset is shown in Figure 5.
To validate the detector’s performance on the DIOR-R dataset, we used 200 queries and 50 denoising queries, scaled the images to a resolution of 512 × 512, and employed the ResNet50 backbone network, training for 36 epochs.
HRSC2016 [39] comprises images of ships collected from six well-known ports. The dataset contains only one category: ship. Image sizes range from 300 × 300 to 1500 × 900. The HRSC2016 dataset includes a total of 1061 images (436 for training, 181 for validation, and 444 for testing). The target distribution per image is shown in Figure 6. We used 20 queries and 5 denoising queries, scaled the images to a resolution of 512 × 512, and employed the ResNet50 backbone network, training for 36 epochs.
4.2. Experimental Environment
Due to the inconsistency in target number distribution across different datasets, the number of queries used for detection and denoising is set differently; these specific settings are detailed separately for each dataset, while the other hyperparameter settings remain consistent. Several of these hyperparameters, such as the learning rate, the optimizer, and the number of queries, follow the most common choices for DETR detectors. They are crucial for ensuring that the model learns effectively and performs optimally across object detection tasks. For example, an appropriate learning rate helps the model converge quickly and accurately, avoiding instability from too high a rate or slow convergence from too low a rate. The optimizer determines the specific algorithm for parameter updates. The number of queries directly affects the model's output capabilities and computational demands. The hyperparameter settings for the detector in this paper are shown in Table 2, and the training hardware platform utilized is presented in Table 3.
4.3. Evaluation Indicators
In terms of evaluation indicators for the detector, the commonly used VOC metrics for rotated bounding boxes are adopted. To assess the effectiveness of detection, it is desirable for the detected results to closely match the ground truth. Therefore, the concepts of precision and recall are introduced. Precision refers to the proportion of detected targets that are true positives, while recall refers to the proportion of true targets that are correctly detected by the detector. The formulas for calculating precision and recall are shown in Equation (26):

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN} \tag{26}$$

Here, $TP$ represents the number of detection boxes for the current predicted category with an intersection over union (IoU) greater than the specified IoU threshold, $FP$ denotes the number of detection boxes for the current predicted category with an IoU less than the specified threshold, and $FN$ indicates the number of true annotations in the current category that were not detected. The mAP mentioned in this paper uses an IoU threshold of 0.5, and the mAP calculated using this threshold is referred to as [email protected]. The formula for calculating mAP is shown in Equation (27):

$$\mathrm{AP} = \frac{1}{11}\sum_{r \in \{0,\,0.1,\,\ldots,\,1\}} p_{\mathrm{interp}}(r), \qquad \mathrm{mAP} = \frac{1}{N}\sum_{i=1}^{N} \mathrm{AP}_i \tag{27}$$

Here, $\mathrm{AP}_i$ represents the average precision for a single category, $p_{\mathrm{interp}}(r)$ denotes the interpolated precision at recall $r$, and $N$ represents the number of categories. According to the VOC07 evaluation standard, interpolation is performed at the 11 recall points shown above, whereas in the VOC12 evaluation standard, AP is based on the area under the precision-recall curve.

In addition to mAP, our paper also introduces the F1 score. The F1 score is the harmonic mean of precision and recall, and its calculation formula is shown in Equation (28):

$$F_1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{28}$$
In the subsequent experiments, DOTA-v1.0, DOTA-v1.5, and DIOR-R use the VOC07 [email protected] evaluation standard. HRSC2016 uses both the VOC07 and VOC12 [email protected] evaluation standards.
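For clarity, the two AP variants can be computed from a ranked precision-recall curve as follows (a standard reference implementation of the VOC07 11-point and VOC12 area-based metrics, not code from our detector):

```python
import numpy as np

def voc_ap(recall, precision, use_07_metric=True):
    """AP from a precision-recall curve, where `recall` and `precision` are
    arrays accumulated over detections ranked by confidence."""
    if use_07_metric:
        # VOC07: average the max precision at 11 evenly spaced recall points.
        ap = 0.0
        for t in np.arange(0.0, 1.1, 0.1):
            p = precision[recall >= t].max() if np.any(recall >= t) else 0.0
            ap += p / 11.0
        return float(ap)
    # VOC12: make precision monotonically decreasing, then take the area
    # under the precision-recall curve.
    mrec = np.concatenate(([0.0], recall, [1.0]))
    mpre = np.concatenate(([0.0], precision, [0.0]))
    for i in range(mpre.size - 1, 0, -1):
        mpre[i - 1] = max(mpre[i - 1], mpre[i])
    idx = np.where(mrec[1:] != mrec[:-1])[0]
    return float(np.sum((mrec[idx + 1] - mrec[idx]) * mpre[idx + 1]))
```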
4.4. Ablation Study
In the ablation study, to verify the effectiveness of the dynamic query method, we first analyzed the distribution of the number of targets per image in the DOTA-v1.0 dataset. We then selected query numbers below or close to the average of this distribution to probe how the limit on the number of queries constrains the detector's performance, and query numbers exceeding the target distribution for an ablation study of the dynamic query algorithm.
The impact of different query numbers on the detection results is shown in Table 4. The experimental results presented do not incorporate the denoising and image feature reconstruction modules. The model was trained on the DOTA-v1.0 dataset for 12 epochs, and the results were validated on the validation set at the 3rd, 6th, 9th, and 12th epochs. As can be seen from the table, the introduction of dynamic queries significantly enhanced the performance of the detector when the number of queries was close to or less than the average number of targets. Specifically, with 50 queries, the mAP of the dynamic query variant improved by 6.5% compared to the baseline model. With 100 queries, there was a 3.2% improvement, demonstrating the effectiveness of the dynamic query algorithm in addressing query number limitations. When the number of queries was sufficient for the model's predictions, at 400, the dynamic query variant still achieved a 2.1% improvement, showing that the dynamic query algorithm can not only remove the restrictions on query numbers but also enhance model performance.
The results of the DETR-ORD ablation study are shown in Table 5. Note that in the ablation study, the number of queries used is 400, the number of denoising queries is 100, and ATSS selects up to a maximum of 400 queries. The 1024 × 1024 images are resized to 512 × 512 for training. From the results of the ablation study, it can be observed that there is no significant change in accuracy when comparing the hybrid encoder with the traditional encoder. After introducing the IFR module, the mAP increased by 0.8%. With the introduction of ATSS-assisted dynamic queries, the mAP increased by 2.8%. The dynamic query comparison results are shown in Figure A1, and the IFR module comparison results are shown in Figure A2.
4.5. Implementation Details
In terms of code implementation, we utilize the open-source deep learning detection toolboxes MMDetection [40] and MMRotate [41], and introduce rotation into DINO [19], which we refer to as R-DINO and use as our baseline. These toolboxes offer a comprehensive, flexible, and extensible framework for object detection tasks, including support for various state-of-the-art models and algorithms. MMDetection provides a rich collection of object detection and instance segmentation methods, while MMRotate extends these capabilities to efficiently handle rotated objects, which is crucial for aerial image analysis. By leveraging these toolboxes, we were able to significantly streamline our development process, enabling rapid experimentation with different models and configurations to optimize our detection performance on optical remote sensing datasets.
Taking DOTA-v1.0 [37] as an example, the images are split with a patch size of 1024 and a gap of 200. In the ablation study section, we trained for 12 epochs on the training set using eight NVIDIA TITAN RTX GPUs and compared metrics on the validation set. In the comparison with state-of-the-art (SOTA) methods, we utilized the same hardware setup and trained for 36 epochs on the combined training and validation sets. The performance on the test set is evaluated on the official DOTA evaluation server.
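For orientation, a hypothetical MMEngine-style configuration skeleton is sketched below. The field names follow MMDetection 3.x/MMRotate 1.x conventions, but the registered model name `RDINO`, the base file, and all values are illustrative assumptions, not our released configuration:

```python
# Hypothetical config skeleton; names and values are illustrative only.
_base_ = ['./_base_/default_runtime.py']          # assumed base config

model = dict(
    type='RDINO',                                 # assumed registered name
    num_queries=400,                              # DOTA setting in this paper
    backbone=dict(type='ResNet', depth=50),
)
train_cfg = dict(type='EpochBasedTrainLoop', max_epochs=36, val_interval=3)
optim_wrapper = dict(
    optimizer=dict(type='AdamW', lr=1e-4, weight_decay=1e-4))  # assumed values
```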
6. Discussion
In optical remote sensing image analysis tasks, challenges remain when dealing with rotated or arbitrarily oriented targets in complex scenes, together with problems such as inconsistent target size distributions and uneven image quality; our proposed model effectively addresses these issues.
6.1. Comparison with Other Models
In this paper, we compare the performance of the DETR-ORD model before and after the improvements, as well as against the state-of-the-art models on each dataset. The improved detector achieves competitive overall results on every dataset. Experimental results on several remote sensing datasets show that, in the task of optical remote sensing image analysis, the proposed DETR-ORD model improves mAP by 2.02% over the pre-improvement model on the DOTA-v1.0 dataset. On the DOTA-v1.5 dataset, DETR-ORD improves mAP by 0.28% over the best competing algorithm and by 5.2% over the pre-improvement model. On the DIOR-R dataset, DETR-ORD improves mAP by 2.19% over the best competing algorithm and by 2.9% over the pre-improvement model. On the HRSC2016 dataset, DETR-ORD improves mAP by 1.41% over the pre-improvement model. The figures indicate that the improved detector maintains high precision, accuracy, and recall in scenarios with large variations in target size, densely packed small targets, and overlapping targets. The improved detector achieved the best overall results on the DOTA-v1.5 and DIOR-R datasets and competitive results on the DOTA-v1.0 and HRSC2016 datasets.
6.2. Future Directions: Multi-Scene Optical Oriented Target Detection Task
This paper focuses on a transformer-based DETR-like detector applied to the task of oriented target detection in optical remote sensing images. Future research could further explore the application of transformer-based DETR-like detectors on low-computing-power devices and extend the model to more application scenarios, such as video object detection and multi-object tracking, in order to improve practicality and adaptability and to increase detector speed while maintaining detection accuracy. Our model has also been extended and validated on the retail merchandise image dataset SKU110K-R, the scene text detection dataset MSRA-TD500, and a private rubber forest dataset, with competitive results.
6.3. Limitations
Like most research, while the DETR-ORD model demonstrates improvements in oriented object detection, there are still conditions under which our model cannot obtain more effective results. In certain scenarios with a dense distribution of image targets, the DETR-ORD model sometimes fails to obtain the best detection results across all kinds of targets, and there are still computational efficiency issues when processing large-scale datasets. In future work, we will optimize the structure of the DETR-ORD model to improve efficiency and accuracy in oriented remote sensing object detection tasks and apply it to more detection scenarios.
7. Conclusions
In our paper, in light of the limitations of existing DETR-based detectors, which are unsuitable for arbitrary-oriented object detection, the issue of positional information loss due to the transformer architecture, and the constraints on detection performance in dense target scenarios, we propose an oriented object detector based on image structure reconstruction and dynamic queries. This detector optimizes the transformer-based DETR detector by integrating rotational detection, image feature reconstruction, and dynamic query algorithms. The resulting design is an efficient and precisely oriented object detector suitable for multi-task scenarios, demonstrating strong adaptability and practical value.
We presented significant advancements in oriented object detection by integrating the prediction of rotation into the transformer architecture, specifically enhancing the DETR detector. Our adoption of a hybrid encoder, as opposed to the conventional encoder, notably decreased the computational complexity and the model's parameter count without sacrificing accuracy. This efficiency is achieved through a novel representation and prediction method for rotating boxes, utilizing the $(x, y, w, h, \theta)$ format, and the introduction of an algorithm for rotating reference points to ensure that the encoder-generated interaction reference points for rotating box proposals remain accurate. Moreover, the development of activation and inverse activation functions tailored for rotation accommodates varying standards of rotating angle descriptions, streamlining the prediction process.
Further, we introduced the IFR module, an addition that supervises the memory from feature interactions via self-attention in the encoder. By restoring this memory to multi-layer features and upsampling them back to the original image size, our method effectively counters the loss of spatial positioning information, a common issue in the flattening operation required for self-attention. This module boosts the detector’s performance by enhancing feature reconstruction.
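A minimal sketch of how such a reconstruction head could look is given below; the module structure, the concatenation-based fusion, and the L2 pixel loss are assumptions for illustration, not the exact IFR implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class IFRHead(nn.Module):
    """Sketch of an Image Feature Reconstruction head: the flattened encoder
    memory is restored to multi-level maps, upsampled to the image size,
    fused, and trained to reconstruct the input image, re-injecting the
    spatial information lost by flattening."""

    def __init__(self, dim=256, num_levels=3):
        super().__init__()
        self.fuse = nn.Conv2d(dim * num_levels, dim, kernel_size=1)
        self.predict = nn.Conv2d(dim, 3, kernel_size=3, padding=1)  # RGB out

    def forward(self, memory, spatial_shapes, image_size):
        # memory: (B, sum(H_l * W_l), C); spatial_shapes: [(H_l, W_l), ...]
        maps, start = [], 0
        for h, w in spatial_shapes:
            level = memory[:, start:start + h * w].transpose(1, 2)  # (B, C, hw)
            maps.append(level.reshape(-1, level.shape[1], h, w))
            start += h * w
        ups = [F.interpolate(m, size=image_size, mode='bilinear',
                             align_corners=False) for m in maps]
        return self.predict(self.fuse(torch.cat(ups, dim=1)))

# Assumed training-time supervision: an L2 pixel loss against the input image,
# e.g. loss_ifr = F.mse_loss(head(memory, shapes, img.shape[-2:]), img)
```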
Additionally, we introduced dynamic query by integrating the ATSS method to supplement the Hungarian matching assignment. This innovation includes an extra training branch that allocates more positive samples to each ground truth, significantly improving the quality of the top-k proposals selected by the encoder. This approach addresses the challenge of obtaining high-quality bounding boxes, which has been a problem when relying solely on top-k scores for proposal selection, leading to proposals with high classification scores but low bounding box quality.
In the validation experiments, our improved DETR-based detector demonstrates improvements in optical remote sensing image analysis applications. On the DOTA-v1.0 dataset, it achieves a 2.02% increase in mAP compared to its previous version. On the DOTA-v1.5 dataset, it surpasses leading algorithms by 0.28% mAP and improves 5.2% over its previous version. On the DIOR-R dataset, it exceeds top-performing algorithms by 2.19% mAP and shows a 2.9% improvement over its previous version. For the HRSC2016 dataset, there is a mAP improvement of 1.41% compared to its previous version. These results demonstrate that the improved detector has strong practical value and broad applicability across various scenarios and applications.