1. Introduction
With the swift progress of deep learning, the enhanced feature extraction ability of convolutional neural networks has further facilitated advancements in remote sensing object detection [1,2,3,4,5]. As one of the core research directions in computer vision, remote sensing object detection plays a crucial role in real-world applications such as disaster monitoring, urban planning, and military reconnaissance. The primary objective is to accurately identify, classify, and localize targets of interest within remote sensing imagery [6,7,8], enabling timely decision-making and situational awareness across various domains. In remote sensing images, targets mostly appear in arbitrary orientations and are densely arranged [9,10], which limits the adaptability of general object detection frameworks. Horizontal anchor boxes often enclose substantial background interference and frequently cover several targets at once, leading to a significant decline in detection performance. Therefore, effectively detecting oriented objects in remote sensing images has become a current research hotspot. Recently, several works have aimed to enhance the representation of oriented bounding boxes in remote sensing object detection, mainly by developing specialized detection frameworks, such as R3Det [11], Rotated RetinaNet [12], and RoI Transformer [13], as well as oriented box encoding techniques, such as sliding vertex offset [14], short side offset [15], and midpoint offset box encoding [16]. In addition, to further improve the performance of these methods, researchers have proposed a variety of loss functions, including CSL [17], KLD [18], and KFIoU [19].
Although the above methods exhibit high precision in detecting most targets, the detection accuracy for certain specific types of targets (e.g., bridges, ports, ships, etc.) is still unsatisfactory. We believe that the root of the problem lies in the fact that existing methods overlook the differences in shape information between different types of targets, especially the distinction between high-aspect-ratio targets and regularly shaped targets. Here, the shape information refers to the ratio of the short side to the long side of the ground-truth bounding box of the target object, which can be expressed as

$SI = \dfrac{\min(w, h)}{\max(w, h)},$

where $w$ and $h$ represent the width and height of the ground-truth bounding box, respectively, and the value range of $SI$ is (0,1]. We classify all objects into two categories: targets whose $SI$ exceeds a preset threshold are regarded as regularly shaped targets, and targets whose $SI$ is at or below that threshold are regarded as high-aspect-ratio targets, as shown in Figure 1.
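For illustration, the shape information and the resulting two-way grouping can be computed with a few lines of code; the sketch below uses a placeholder threshold `tau`, since the concrete split value is not restated here.

```python
def shape_information(w: float, h: float) -> float:
    """Shape information SI = short side / long side of a ground-truth box, in (0, 1]."""
    return min(w, h) / max(w, h)

def is_high_aspect_ratio(w: float, h: float, tau: float = 0.5) -> bool:
    """Classify a box as high-aspect-ratio (SI <= tau) or regularly shaped (SI > tau).

    NOTE: `tau` is a placeholder threshold for illustration only; the preset
    split value used in the paper is not reproduced here.
    """
    return shape_information(w, h) <= tau

# Example: a 300 x 30 bridge-like box vs. a 40 x 35 vehicle-like box.
print(shape_information(300, 30))  # 0.1   -> high-aspect-ratio target
print(shape_information(40, 35))   # 0.875 -> regularly shaped target
```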
Existing methods perform well in detecting regularly shaped targets, but when faced with rotated targets with large aspect ratio variations, the detection accuracy significantly decreases. There are two main reasons for this:
(1) Feature misalignment: The convolutional features of traditional backbone networks are usually aligned based on fixed receptive field orientations, making it difficult to adapt to directionally oriented targets with high aspect ratio differences, resulting in poor feature extraction effects. Even if convolutional alignment or anchor box alignment operations are introduced in subsequent steps, they cannot compensate for the loss of local edge information of the target caused by the initial fixed convolution method, thereby affecting the overall quality of feature extraction. This is because all subsequent operations, such as feature fusion and resampling, are based on the feature maps extracted by the initial backbone network. Therefore, in the object detection framework, the backbone network that initially extracts target features is crucial for improving model accuracy.
(2) Static label allocation: In the process of anchor box regression, high-aspect-ratio rotated targets are extremely sensitive to angle regression. Even a slight angular deviation can cause a large discrepancy between the predicted and ground-truth boxes, especially when the shape information value is small, as shown in Figure 2. This situation increases false negatives during sample selection: even anchors with high classification scores may, owing to poor regression, have an Intersection over Union (IoU) with the ground-truth box below the preset threshold, so targets that should be positive samples are misclassified as negative samples. This misclassification results in an imbalance between positive and negative samples, which in turn degrades overall detection performance.
To address this challenge, we start from two aspects: high-quality feature extraction and rational label allocation. We believe that high-quality feature extraction is the foundation of oriented object detection, while rational label allocation can further enhance detection performance. We propose a single-stage Shape-Aware Dynamic Alignment Network (SADA-Net), which consists of two modules and one matching strategy: the Dynamic Refined Rotation Convolution Module (DRRCM), the Anchor Refinement Feature Alignment Module (ARFAM), and the Shape-Aware Quality Assessment (SAQA) matching strategy. Specifically, the DRRCM in the backbone network predicts the weights and angles of the rotational convolution kernels using the Data-Enhanced Spatial Attention Module (DESAM). The predicted parameters are then combined to generate a convolution kernel that adapts to the pose of the oriented target, achieving accurate alignment with the target features and producing direction-sensitive feature maps. The ARFAM in the detection head quickly generates high-quality refined prediction anchor boxes on the direction-sensitive feature map through the regression branch, and these boxes guide the dynamic adjustment of the feature sampling points, further achieving precise feature alignment. Through the collaboration of these two alignment convolution modules, high-quality direction-adaptive features are extracted. In the label allocation process, based on the high-quality feature maps and refined prediction anchor boxes generated above, the SAQA method dynamically adjusts the IoU threshold according to the target's shape information for training sample selection and calculates a centroid-adaptive distance to attach quality information to the selected positive samples, thereby optimizing training sample selection. Experiments conducted on the commonly used public datasets HRSC2016, DOTA, and UCAS-AOD demonstrate that our method achieves strong detection performance.
The main contributions of this paper can be summarized as follows:
1. We propose a novel oriented object detection framework, SADA-Net, which generates high-quality direction-adaptive features and refined prediction anchor boxes and couples them with an efficient sample selection strategy to achieve excellent detection performance.
2. A flexible dynamic rotation convolution module is proposed, which can be easily embedded into the backbone networks of many detectors to extract high-quality basic features for oriented targets.
3. We design a reasonable sample matching strategy that uses target shape information and potential sample quality to optimize the selection of training samples, thereby alleviating the inconsistency between classification and regression.
3. Methodology
The overall architecture of SADA-Net is illustrated in
Figure 3. First, it adopts the DRRCM as the backbone network to dynamically adjust convolutional kernels for aligning arbitrarily oriented objects, thereby extracting preliminary aligned features. Subsequently, the ARFAM performs anchor refinement to guide convolutions in acquiring more precise aligned features. Finally, leveraging the high-quality feature maps and refined anchor boxes generated through the above stages, the SAQA strategy dynamically assigns labels and evaluates sample quality, ensuring optimal matching between candidate boxes and their corresponding ground-truth labels. This approach effectively avoids the inconsistency between regression and classification, thereby enhancing detection performance. The detailed implementation of SADA-Net is elaborated below.
3.1. Dynamic Refined Rotation Convolution Module
In most existing remote sensing object detectors, the convolutional structures used in the backbone network adopt axis-aligned or preset fixed rotation angles for feature extraction of targets. However, objects in natural scenes are often placed at arbitrary angles. Therefore, standard convolutional kernels struggle to precisely match the contours of non-axis-aligned targets, making it difficult to effectively extract high-quality features from these arbitrarily oriented objects.
To avoid the fixed sampling pattern of standard convolution and enhance the representation of oriented targets, thereby achieving accurate object detection, we propose a Dynamic Refined Rotation Convolution Module (DRRCM). The overall structure of DRRCM is shown in Figure 4.
We designed a Data-Enhanced Spatial Attention Module (DESAM) to generate a spatial mask through pooling, concatenation, convolution transformation, and sigmoid activation function to weight the fused features and highlight the important spatial regions. Then, the weighted fused features are average pooled and input into the kernel angle prediction branch and the kernel weight prediction branch. This module can make the network more accurately focus on the key feature positions in rotating object detection and then accurately generate the predicted weights and angles of the rotating kernels.
Specifically, we first capture spatial relationships efficiently by applying channel-wise average pooling and max pooling (denoted as $S_{avg}$ and $S_{max}$, respectively) to the feature map $F$ obtained from depthwise convolution:

$S_{avg} = \mathrm{AvgPool}(F), \quad S_{max} = \mathrm{MaxPool}(F),$

where $S_{avg}$ and $S_{max}$ are the spatial feature descriptors obtained through average pooling and max pooling, respectively. To facilitate the information interaction between the two spatial descriptors, the concatenated pooled features (with 2 channels) are transformed into $C_{in}$ spatial attention maps $A_i$ using a convolutional layer $f$:

$[A_1, \dots, A_{C_{in}}] = f([S_{avg}; S_{max}]),$

where $C_{in}$ represents the number of input channels. For each spatial attention map $A_i$, a sigmoid activation function is applied to obtain an individual spatial mask $M_i = \sigma(A_i)$ for each convolutional kernel. Then, the spatial masks are used to weight the features after depthwise convolution, and the weighted features are subsequently compressed into a $C$-dimensional feature vector $v$ through global average pooling:

$v = \mathrm{GAP}(M \odot F).$
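For clarity, the spatial-mask step of DESAM can be sketched as the following PyTorch snippet; it is an illustrative reading of the description above, and the kernel size of the convolutional layer is an assumed hyperparameter.

```python
import torch
import torch.nn as nn

class SpatialMaskSketch(nn.Module):
    """Illustrative sketch: channel-wise avg/max pooling, concatenation, a
    convolution producing C_in spatial attention maps, sigmoid masking, and
    global average pooling into a feature vector."""

    def __init__(self, c_in: int, kernel_size: int = 3):  # kernel size is an assumption
        super().__init__()
        self.conv = nn.Conv2d(2, c_in, kernel_size, padding=kernel_size // 2)

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        # f: (B, C_in, H, W) feature map from depthwise convolution
        s_avg = f.mean(dim=1, keepdim=True)        # channel-wise average pooling
        s_max = f.max(dim=1, keepdim=True).values  # channel-wise max pooling
        attn = self.conv(torch.cat([s_avg, s_max], dim=1))  # (B, C_in, H, W)
        masks = torch.sigmoid(attn)                # one spatial mask per kernel/channel
        v = (masks * f).mean(dim=(2, 3))           # global average pooling -> (B, C_in)
        return v
```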
Then, the pooled feature vector is fed into two prediction branches.
The first branch predicts the rotation kernel angles: the feature vector is processed through Dropout, a linear layer, Softsign activation, and multiplication by a scale factor to obtain a set of angles $\{\theta_i\}$:

$\{\theta_i\} = s \cdot \mathrm{Softsign}\big(W_{\theta}\,\mathrm{Dropout}(v)\big),$

where $W_{\theta}$ is the linear layer without a bias term, ensuring that angle prediction depends solely on variations in the input features and avoids learning biased angles, and $s$ is a scaling factor used to expand the rotation range and adjust the angle range, with a default value of 40.
The second branch predicts the rotation kernel weights: the feature vector is processed through Dropout, a linear layer, and Sigmoid activation to obtain a set of weights $\{\lambda_i\}$:

$\{\lambda_i\} = \mathrm{Sigmoid}\big(W_{\lambda}\,\mathrm{Dropout}(v) + b\big),$

where $b$ is the bias of the linear layer, introduced to improve the flexibility of the model.
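A minimal sketch of the two prediction branches is given below; only the Dropout–Linear–Softsign/Sigmoid structure and the default scale of 40 follow the description above, while the layer sizes, dropout rate, and number of kernels are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class KernelParamBranches(nn.Module):
    """Illustrative sketch of the angle and weight prediction branches."""

    def __init__(self, c_in: int, n_kernels: int, scale: float = 40.0, p: float = 0.2):
        super().__init__()
        self.scale = scale                                       # default angle scale of 40
        self.drop = nn.Dropout(p)                                # dropout rate is an assumption
        self.fc_angle = nn.Linear(c_in, n_kernels, bias=False)   # no bias for the angle branch
        self.fc_weight = nn.Linear(c_in, n_kernels, bias=True)   # bias kept for flexibility

    def forward(self, v: torch.Tensor):
        # v: (B, C_in) pooled feature vector from the spatial-mask step
        thetas = self.scale * F.softsign(self.fc_angle(self.drop(v)))  # angles, roughly in (-40, 40)
        lambdas = torch.sigmoid(self.fc_weight(self.drop(v)))          # kernel weights in (0, 1)
        return thetas, lambdas
```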
DESAM is initialized from a truncated normal distribution with a mean of zero and a standard deviation of 0.2 in order to assist the model in converging more rapidly and to reduce instability at the beginning of training. The implementation of the rotational convolution function is elaborated on below.
The rotation angles $\theta_i$ predicted by DESAM are used to reparameterize the weights inside the convolution kernel, allowing the kernel to adjust dynamically to different input feature maps and achieve adaptive rotation:

$(x', y')^{T} = R(\theta_i)\,(x, y)^{T}, \quad W'_i(x', y') = B\big(W_i, (x', y')\big),$

where $(x, y)$ represents the coordinates of the original sampling points, $(x', y')$ represents the new sampling point coordinates after the original point $(x, y)$ is rotated counterclockwise by the angle $\theta_i$ to align the convolution kernel with the target features, $R(\theta_i)$ is the corresponding rotation matrix, $W'_i$ represents the reparameterized convolution kernel, and $B(\cdot)$ denotes bilinear interpolation, used to compute the weight values at the new positions after the kernel is rotated.

The reparameterized convolution kernels are then multiplied by the corresponding weights $\lambda_i$, summed, and convolved with the input feature map to generate high-quality direction-aware features $Y$:

$Y = X \ast \Big(\sum_{i} \lambda_i\, W'_i\Big),$

where $X$ denotes the input feature map and $\ast$ denotes convolution.
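The reparameterization idea can be sketched as follows: the kernel's sampling grid is rotated and the weights at the new positions are obtained by bilinear interpolation, after which the rotated kernels are combined with the predicted weights. This is a simplified stand-in built on grid_sample, not the exact implementation.

```python
import math
import torch
import torch.nn.functional as F

def rotate_kernel(weight: torch.Tensor, theta: float) -> torch.Tensor:
    """Re-sample a (C_out, C_in, k, k) kernel on a grid rotated by `theta` radians,
    with bilinear interpolation handled by grid_sample (simplified sketch)."""
    c_out = weight.shape[0]
    c, s = math.cos(theta), math.sin(theta)
    # 2x3 affine matrix (pure rotation), one copy per output channel
    mat = torch.tensor([[c, -s, 0.0], [s, c, 0.0]]).unsqueeze(0).repeat(c_out, 1, 1)
    grid = F.affine_grid(mat, size=list(weight.shape), align_corners=True)
    return F.grid_sample(weight, grid, mode="bilinear", align_corners=True)

def combine_kernels(kernels: torch.Tensor, lambdas: torch.Tensor) -> torch.Tensor:
    """Weighted sum of n rotated kernels (n, C_out, C_in, k, k) with weights (n,);
    the result can then be convolved with the input feature map, e.g. via F.conv2d."""
    return (lambdas.view(-1, 1, 1, 1, 1) * kernels).sum(dim=0)
```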
Through the DRRCM, the convolution kernel can adaptively adjust according to the different orientations of objects in the input feature map, thereby efficiently capturing the features of multi-directional objects in the image. Especially when detecting objects in aerial images that are densely arranged and have large scale differences, high-quality feature extraction is crucial for accurate classification and precise localization.
3.2. Anchor Refinement Feature Alignment Module
Based on the high-quality feature maps generated by the DRRCM in the backbone network, the Anchor Refinement Feature Alignment Module (ARFAM) further refines anchors through a regression operation. The refined anchor parameters are then used to compute an offset field, enabling dynamic adjustment of the aligned convolution sampling points. This process generates feature representations that are more precisely aligned with the target object, as shown in Figure 5.
First, unlike most dense anchor sampling methods, we only preset a single initial square anchor at each position on the feature map. This anchor is then refined into a high-quality oriented anchor through the regression branch, thus reducing the need to preset a large number of anchors on the feature map, which helps to reduce computational complexity. The offset for predicting the anchor box regression target is as follows:
$\Delta x_g = \frac{1}{w}\big[(x_g - x)\cos\theta + (y_g - y)\sin\theta\big], \quad \Delta y_g = \frac{1}{h}\big[(y_g - y)\cos\theta - (x_g - x)\sin\theta\big],$

$\Delta w_g = \ln\frac{w_g}{w}, \quad \Delta h_g = \ln\frac{h_g}{h}, \quad \Delta\theta_g = \frac{1}{K}\,(\theta_g - \theta),$

where $(x, y, w, h, \theta)$ represent the center coordinates, width, height, and angle of the initial anchor, and $(x_g, y_g, w_g, h_g, \theta_g)$ represent the corresponding parameters of the ground-truth bounding box. $(\Delta x_g, \Delta y_g, \Delta w_g, \Delta h_g, \Delta\theta_g)$ represent the offsets between the ground-truth bounding box and the initial anchor. By regressing these offsets, the model adjusts the initial anchor into a refined predicted anchor box that is closer to the ground-truth bounding box. $R(\theta)$ denotes the rotation transformation matrix that converts the center coordinates of the ground-truth bounding box into the coordinate system of the initial anchor, yielding the expressions for $\Delta x_g$ and $\Delta y_g$ above. $K$ is the scaling factor used to normalize the angle term so that the rotation angle remains within a reasonable range.
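A minimal sketch of this offset encoding is given below, assuming boxes in (x, y, w, h, θ) format with θ in radians; the normalization constant for the angle term is an illustrative assumption.

```python
import math

def encode_rotated_offsets(anchor, gt, k_angle=math.pi):
    """Sketch: regression targets between an initial anchor and a ground-truth box,
    both given as (x, y, w, h, theta). `k_angle` is an assumed angle normalizer."""
    x, y, w, h, t = anchor
    xg, yg, wg, hg, tg = gt
    cos_t, sin_t = math.cos(t), math.sin(t)
    # Project the center offset into the anchor's rotated coordinate system
    dx = ((xg - x) * cos_t + (yg - y) * sin_t) / w
    dy = ((yg - y) * cos_t - (xg - x) * sin_t) / h
    dw = math.log(wg / w)
    dh = math.log(hg / h)
    dt = (tg - t) / k_angle
    return dx, dy, dw, dh, dt

# Example: a square initial anchor vs. a rotated, elongated ground-truth box.
print(encode_rotated_offsets((100, 100, 32, 32, 0.0), (110, 98, 120, 20, 0.6)))
```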
Secondly, in order to extract features for the oriented target, we adjust the positions of the feature sampling points to realize dynamic convolution alignment guided by the refined prediction anchor box. This method adds an offset $o_r$, calculated from the refined prediction anchor box, to the original sampling points of the standard convolution:

$o_r = \frac{1}{S}\Big((c_x, c_y) + \frac{1}{k}\big(r \odot (w, h)\big)\,R(\theta)\Big) - (p + r), \quad r \in R.$

Here, $S$ denotes the stride, $k$ denotes the kernel size, and $(c_x, c_y)$, $w$, $h$, and $\theta$ represent the refined predicted anchor box's center coordinates, width, height, and angle. $p + r$ denotes the position of a conventional sampling point of standard convolution, with $p$ and $r$ representing the two-dimensional coordinates of the feature-map location and the relative offset within the kernel, respectively. $R$ signifies the regular grid of standard convolution; for instance, $R = \{(-1,-1), (-1,0), \dots, (1,1)\}$ would indicate a 3 × 3 convolution kernel with a dilation rate of 1.
Then, the dynamic alignment convolution combines the offsets with the input features $x$, so that the sampling locations are adjusted according to the shape and orientation of the predicted anchor box and better match the geometry of the actual target:

$Y(p) = \sum_{r \in R} W(r)\, x(p + r + o_r),$

where $Y(p)$ represents the value of the output feature map at position $p$, $W(r)$ is the weight of the convolutional kernel at position $r$, and $x$ represents the input feature map.
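To make the anchor-guided sampling concrete, the sketch below computes the offset field for a single feature-map location and a 3 × 3 kernel; in practice the offsets for all locations would be fed to a deformable-style convolution. The details (e.g., dividing the anchor extent by the kernel size) are illustrative assumptions.

```python
import math
import numpy as np

def align_conv_offsets(anchor, p, stride, kernel_size=3):
    """Sketch: offsets o_r for one feature-map location p = (px, py), guided by a
    refined anchor (cx, cy, w, h, theta) given in image coordinates."""
    cx, cy, w, h, theta = anchor
    k = kernel_size
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    offsets = []
    for ry in (-1, 0, 1):            # regular grid R of a 3x3 kernel, dilation 1
        for rx in (-1, 0, 1):
            # Sampling point spread over the rotated anchor, mapped to feature scale
            ox, oy = rx * w / k, ry * h / k
            sx = (cx + ox * cos_t - oy * sin_t) / stride
            sy = (cy + ox * sin_t + oy * cos_t) / stride
            # Offset relative to the standard sampling point p + r
            offsets.append((sx - (p[0] + rx), sy - (p[1] + ry)))
    return np.array(offsets)         # shape (9, 2): one (dx, dy) per kernel location

print(align_conv_offsets((130.0, 70.0, 90.0, 18.0, 0.5), p=(16, 9), stride=8))
```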
The model is able to extract high-quality feature representations from the input image through a dual-layer feature alignment technique. Its backbone network dynamically refines the preliminary feature extraction of rotational convolution, and the anchor point refinement feature alignment convolution in the detection head further refines the features. These features not only contain the basic information of the target but are also enhanced at the detail level, providing an effective feature basis for subsequent positioning.
3.3. Shape-Aware Quality Assessment
In the previous sections, we introduced how SADA-Net extracts direction-sensitive features that encode orientation information and further refines the aligned features after anchor box correction. However, during actual training we found that the inconsistency between the regression and classification tasks remains, i.e., a high classification score does not guarantee accurate localization. This problem has been widely studied [35,36,37], and some discussions trace it back to the uncertainty of bounding box regression and localization [37]. We believe that the bias between classification and regression primarily stems from the unreasonable selection of training samples, and we address this problem from the perspective of utilizing target shape information and evaluating sample quality.
Most existing detectors select positive anchors for training according to a fixed IoU threshold between the anchor and the ground-truth bounding box [29]. However, such sample selection methods often ignore the shape information of the target and fail to distinguish the potential quality of the selected positive samples.
To address the above issue, we employ the shape-aware quality assessment (SAQA) method in the training stage. The implementation of this method is introduced in detail in the following parts of this section.
Specifically, the IoU threshold is first adjusted adaptively according to the shape information of the target, realizing dynamic sample selection:

$T_g = (m_g + v_g) \cdot SI^{\alpha},$

where $T_g$ represents the IoU threshold for dynamically selecting samples, $m_g$ represents the mean IoU value of the candidate samples, $v_g$ represents the standard deviation of the candidate IoU values, $\alpha$ is the weighting parameter used to control the influence of the aspect ratio on the weight factor, and $SI$ represents the shape information of the target.
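An illustrative sketch of such a shape-aware threshold is given below; the exact way the statistics are combined with SI, and the value of α, are assumptions made for illustration.

```python
import numpy as np

def dynamic_iou_threshold(candidate_ious: np.ndarray, si: float, alpha: float = 0.3) -> float:
    """Sketch: ATSS-style statistic (mean + std of candidate IoUs) scaled by the
    target's shape information, so elongated targets get a lower threshold."""
    return (candidate_ious.mean() + candidate_ious.std()) * si ** alpha

ious = np.array([0.35, 0.42, 0.50, 0.28, 0.60])
print(dynamic_iou_threshold(ious, si=0.10))  # bridge-like target -> lower threshold
print(dynamic_iou_threshold(ious, si=0.90))  # car-like target    -> higher threshold
```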
Then, the selected positive samples are refined by introducing a centroid-adaptive distance that evaluates their locations and attaches quality information. In detail, the centroid-adaptive distance $d_{ij}$ is calculated from the Euclidean distance between the sample point and the object center together with the side-length information of the object:

$d_{ij} = \dfrac{\sqrt{(x_i - x_j)^2 + (y_i - y_j)^2}}{\sqrt{w_i^2 + h_i^2}},$

where $(x_i, y_i)$ represents the center coordinates of the ground-truth bounding box, $(x_j, y_j)$ represents the coordinates of the sample point, and $h_i$ and $w_i$ represent the height and width of the ground-truth bounding box, respectively.

Next, after obtaining the centroid-adaptive distance value, the quality score $Q_{ij}$ of each positive sample is calculated from this distance.
In this way, the quality of samples is distinguished by introducing centroid-adaptive distance. The smaller the distance value, the closer the sample point is to the target center, and the higher its quality score. Therefore, using this distance value allows for a more accurate evaluation of each positive sample’s quality, thereby optimizing the sample selection process.
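A minimal sketch of the centroid-adaptive distance and a monotonically decreasing quality score is shown below; the diagonal normalization and the exponential score are illustrative assumptions consistent with the behavior described above.

```python
import math

def centroid_adaptive_distance(gt_center, gt_size, sample_point) -> float:
    """Sketch: Euclidean distance from a sample point to the ground-truth center,
    normalized by the box's side lengths (here, its diagonal)."""
    (xi, yi), (wi, hi), (xj, yj) = gt_center, gt_size, sample_point
    return math.hypot(xi - xj, yi - yj) / math.hypot(wi, hi)

def quality_score(d: float) -> float:
    """Sketch: a monotonically decreasing score; points nearer the center score higher."""
    return math.exp(-d)

d = centroid_adaptive_distance((50.0, 50.0), (100.0, 20.0), (52.0, 51.0))
print(d, quality_score(d))
```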
The SAQA method fully utilizes the geometric properties of the target and the potential quality of the samples, ensuring the reasonable selection of positive samples during the training process. This improves the consistency between regression and classification, thereby enhancing the overall performance of the detector.
5. Results and Analysis
(1) Results on HRSC2016: The HRSC2016 dataset contains a large number of rotated ship instances with high aspect ratios, multiple scales, and arbitrary orientations, which makes it well suited to verifying the detection performance of our model on high-aspect-ratio oriented targets. Our method achieves competitive performance on the HRSC2016 dataset. As shown in
Table 6, using R-101-DRRCM as the backbone network and resizing the input image to 512 × 800 pixels, our method achieves the highest mean average precision (mAP) of 90.05%. Even with the lighter R-50-DRRCM, our method still achieves an mAP of 89.58%. It is worth noting that our method uses only one square anchor at each position of the feature map, yet it still outperforms frameworks that preset a large number of rotated anchors at every point of the feature map; for example, R2CNN presets 21 anchors and R3Det presets 126 anchors. Compared with their best detection results, our method uses only one anchor and no data augmentation for training and testing, yet achieves improvements of 16.51% and 0.32%, respectively. These results show that it is not necessary to preset a large number of rotated anchor boxes of different scales for oriented object detection; what matters more is extracting high-quality basic features, optimizing high-quality predicted anchor boxes, and, on this basis, selecting reasonable training samples for target recognition. Some qualitative results are shown in
Figure 7.
(2) Results on UCAS-AOD: To further validate the effectiveness of the proposed SADA-Net, a series of experiments were conducted on the UCAS-AOD dataset. The results presented in
Table 7 demonstrate that our method outperforms the other detectors in terms of performance, achieving an mAP of 90.00%. Specifically, the detection results for the categories of cars and airplanes are 89.42% and 90.57%, respectively, both reaching the highest detection accuracy for these categories. This demonstrates that the proposed method exhibits strong robustness for densely arranged small objects, further validating its superior performance. Some qualitative results are shown in
Figure 8.
(3) Results on DOTA: As shown in
Table 8, our proposed method demonstrates superior detection performance compared to other advanced methods, achieving an mAP of 79.60%. Using the proposed SI criterion, we divide the 15 target categories in the dataset into two typical groups: high-aspect-ratio targets (BR, GTF, LV, SH, and HA) and regularly shaped targets (PL, BD, SV, TC, BC, ST, SBF, RA, SP, and HC). The experimental data show that our method excels in detecting high-aspect-ratio targets, achieving the best accuracy in all five of these subcategories; for regularly shaped targets, it achieves the best results among the compared methods in seven of the ten subcategories, which fully validates its robustness to diverse target orientations and shape variability. It is worth emphasizing that the accuracy improvements for targets with extreme aspect ratios, such as BR and LV, are particularly significant, with gains of 4.64% and 4.59%, respectively, over the second-best method. The qualitative visualization results shown in
Figure 9 further demonstrate that this method has a significant advantage in detection effects under different scales, dense arrangement, and complex background conditions.
6. Conclusions
This paper proposes a single-stage Shape-Aware Dynamic Alignment Network (SADA-Net) to address oriented object detection in remote sensing images. The framework optimizes feature representation, anchor refinement, and training sample selection, significantly improving detection performance. Specifically, SADA-Net extracts high-quality orientation-sensitive features by adaptively adjusting convolution kernel parameters, effectively capturing the rotation characteristics of targets; it then optimizes the anchor generation mechanism to ensure the spatial accuracy of the predicted boxes, significantly improving feature alignment; finally, its sample selection strategy combines target shape information with sample quality evaluation to realize dynamic selection of positive samples, thereby enhancing the consistency between the regression and classification tasks. Experimental results show that SADA-Net achieves excellent detection performance on three benchmark datasets, HRSC2016, DOTA, and UCAS-AOD, with mAP reaching 90.05%, 79.60%, and 90.00%, respectively, fully verifying the effectiveness and advancement of the proposed method for oriented object detection in remote sensing images.