1. Introduction
Traditional aerial target detection methods [
1], such as hand-crafted feature extractors and sliding-window strategies, have made considerable progress but often suffer from limited computational efficiency, weak feature representation capability, and insufficient robustness. Driven by advances in deep learning, one-stage and two-stage detectors have become the prevailing architectures in modern object detection research. One-stage frameworks such as SSD [
2], RetinaNet [
3], and the YOLO series [
4,
5] achieve end-to-end detection by directly learning discriminative features from images and predicting results without generating region proposals in advance. Relying on densely distributed preset anchors in the input image, these models can quickly output object categories and bounding box positions, enabling efficient target detection. In contrast, two-stage detectors represented by Faster R-CNN [
6], Mask R-CNN [
7], and other improved variants [
8,
9,
10], adopt a hierarchical detection strategy. The Region Proposal Network is first utilized to locate and extract potential object-containing Regions of Interest (RoIs) from the image. Afterward, the detection head performs in-depth feature processing on these RoIs to refine bounding-box regression and accurately perform object classification. Although such methods have achieved prominent improvements in feature representation ability, they seldom fully account for the adaptive matching between bounding boxes and diverse object shapes during the regression process. Consequently, the overall detection performance is constrained to some extent.
In recent years, the refinement of bounding box representations has evolved into a core research direction in remote sensing object detection [
11,
12,
13]. Ding et al. [
14] put forward a strategy that transforms horizontal regions of interest (HRoI) into rotated regions of interest (RRoI), which markedly decreases the quantity of predefined anchors relative to conventional anchor-based pipelines. This improvement not only boosts computational efficiency but also strengthens the model’s adaptability to targets with arbitrary orientations. The Gliding Vertex method [
15] characterizes objects via quadrilateral structures, supporting a more accurate depiction of target contour features. Approaches including PAA [
16] and IQDet [
17] rely on the Gaussian Mixture Model (GMM) to model the distribution patterns of targets. By dynamically optimizing the selection of positive and negative samples, these schemes allow bounding box regression to better absorb and utilize critical information from ground-truth annotations. Although representative works such as [
18,
19,
20] have achieved remarkable progress in bounding box optimization and effectively improved detection accuracy and robustness, considerable challenges still remain.
For targets with large aspect ratios, such as slender bridges and ships whose aspect ratios exceed a certain threshold, a slight angular deviation may result in a significant decline in IoU, as illustrated in
Figure 1. The capability of bounding boxes to capture the features of large aspect ratio targets efficiently and precisely relies on three core procedures [
21,
22]: (1) the representativeness of the learned samples; (2) the accuracy and efficacy of the matching strategy; (3) the optimality of the designed objective function. These three aspects jointly determine the efficiency and precision of bounding box representation within the model. Taking the aforementioned procedures into account, we have identified two key challenges in the accurate feature capture of elongated targets via bounding boxes.
To begin with, in the sample selection phase, assignment strategies based on fixed IoU thresholds often fail to retain high-quality positive samples for elongated targets. Second, during the regression and loss-function design stages, the bounding box regression task for elongated targets is more intricate than that for ordinary targets, and the drastic fluctuations in the gradients further contribute to unstable model training. In the subsequent sections, we elaborate on these two challenges in detail.
(1) Missing out on high-quality sample anchors:
Object detection tasks face distinctive difficulties when identifying elongated targets such as ships and bridges. Relying exclusively on the Intersection over Union (IoU) metric to assess the quality of predicted anchors often fails to comprehensively capture the key characteristics of the target, as shown in
Figure 2. This problem is further exacerbated by the high aspect ratio of slender targets. A slight localization deviation can lead to a considerable drop in IoU, as demonstrated in
Figure 3. As a result, anchor boxes (marked in red) that contain key point information might be ignored, which prevents the model from deeply learning the essential features. Such biases not only undermine the model’s ability to accurately locate the bounding boxes of elongated targets but also may reduce the overall detection precision.
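To make the sensitivity of IoU concrete, the following minimal Python example (our own illustration, not part of the proposed pipeline) compares the IoU drop caused by the same 5-pixel shift for a slender 100 × 10 box and a square 100 × 100 box.

```python
def iou_axis_aligned(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# A slender 100 x 10 box shifted by 5 px along its short side...
slender_gt, slender_pred = (0, 0, 100, 10), (0, 5, 100, 15)
# ...versus a square 100 x 100 box shifted by the same 5 px.
square_gt, square_pred = (0, 0, 100, 100), (0, 5, 100, 105)

print(iou_axis_aligned(slender_gt, slender_pred))  # ~0.33: falls below a typical 0.5 threshold
print(iou_axis_aligned(square_gt, square_pred))    # ~0.90: barely affected
```

Under a fixed 0.5 threshold, the slender box's anchor would be discarded as a negative sample even though it covers the target almost entirely, which is exactly the omission problem described above.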
(2) Unstable training caused by drastic changes in the gradient of the regression function:
Models typically prioritize anchors that generate larger loss gradients to refine predictions more accurately. However, the boundaries of elongated targets introduce unique challenges. Firstly, as illustrated in
Figure 1, even minor coordinate errors can trigger a sharp surge in loss function gradients, necessitating loss functions sensitive to such subtle variations. Nevertheless, many existing loss function designs are not fully adapted to this characteristic—they fail to effectively prompt the model to allocate sufficient attention to these small errors, thereby hindering the achievement of precise localization of slender targets during training.
Furthermore, as illustrated in
Figure 3, anchors that have minimal overlap with the ground-truth (the yellow anchors in the figure) can also generate substantial gradient losses during the process of coordinate regression. Such low-quality samples are capable of misleading model training, often resulting in unstable backpropagation and a subsequent decline in model performance.
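As a hedged numerical illustration of this point, the snippet below evaluates the standard smooth-L1 loss and its gradient (used here only as a generic stand-in for a regression loss, not as the loss adopted in this paper) at small and large coordinate errors.

```python
def smooth_l1(err: float, beta: float = 1.0) -> float:
    """Standard smooth-L1 loss on a scalar coordinate error."""
    e = abs(err)
    return 0.5 * e * e / beta if e < beta else e - 0.5 * beta

def smooth_l1_grad(err: float, beta: float = 1.0) -> float:
    """Gradient magnitude of smooth-L1 with respect to the error."""
    e = abs(err)
    return e / beta if e < beta else 1.0

# A well-aligned anchor with a tiny residual contributes a tiny gradient, while a
# poorly overlapping anchor with a large residual contributes the capped maximum
# gradient of 1.0 on every coordinate.
for err in (0.05, 0.2, 5.0, 20.0):
    print(f"error={err:5.2f}  loss={smooth_l1(err):7.3f}  grad={smooth_l1_grad(err):4.2f}")
```

Because many such poorly matched anchors surround an elongated target, their numerous capped gradient contributions can outweigh the small gradients of the few well-aligned anchors, which is the instability that motivates a gradient-aware loss design.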
To tackle these issues, it is necessary to devise more advanced regression strategies and loss functions, allowing bounding boxes to more effectively adapt to the characteristics of slender targets. In order to address the aforementioned challenges, we argue that an optimal bounding box representation ought to have the following attributes: (1) Precise sampling strategy: It should supply adequately representative samples for targets with varied orientations and shapes—especially elongated ones—to ensure that the training dataset contains key information. (2) Efficient regression evaluation criteria: Evaluation metrics and loss functions need to be designed to precisely reflect the performance of bounding box regression, particularly regarding accuracy and robustness in dealing with elongated targets. (3) Efficient deployment capability: To guarantee practical efficiency, the bounding box representation should reduce the computational burden on the detection head to the greatest extent possible while maintaining accuracy. This demands the integration of simplified algorithms to promote rapid deployment and real-time processing in practical applications. To achieve this goal, the methods are as follows:
(1) Backbone: By employing a multi-branch parallel framework to integrate local contextual features with the direction-specific characteristics captured by square convolution and large-kernel strip convolution, we weaken the interference of background noise and realize accurate extraction of key features for targets with varying aspect ratios (an illustrative sketch follows this list).
(2) Label assignment strategy: To address the omission of high-quality sample anchors caused by slender targets, this paper proposes a label assignment strategy named SC-LA. On the basis of the IoU metric, this strategy incorporates two additional evaluation indices: angular difference and aspect ratio. Specifically, when a target has a high aspect ratio and the angular deviation between samples and ground truths is small, the IoU threshold is dynamically reduced. This ensures that high-quality samples containing key information are assigned as positive samples, enabling the model to more thoroughly learn their characteristics and thereby improve detection accuracy (see the sketch after this list).
(3) Loss function: To enhance localization capability for slender targets, we put forward the GDE-Loss function. Through fine-grained adjustment of the loss function gradient, our GDE-Loss can effectively mitigate gradient instability during the regression of large-aspect-ratio targets, promoting better model convergence.
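To make point (1) concrete, the following PyTorch sketch shows one way a multi-branch block can combine a square convolution with horizontal and vertical large-kernel strip convolutions. It is only an illustration under our own assumptions (the channel count, the kernel size of 11, and the summation-plus-1×1 fusion are ours), not the exact module used in SOBA-Net.

```python
import torch
import torch.nn as nn

class StripConvBranchBlock(nn.Module):
    """Illustrative multi-branch block: square conv plus large-kernel strip convs."""
    def __init__(self, channels: int, strip_kernel: int = 11):
        super().__init__()
        pad = strip_kernel // 2
        # Local context via an ordinary 3x3 (square) convolution.
        self.square = nn.Conv2d(channels, channels, 3, padding=1)
        # Direction-specific context via 1xK and Kx1 strip convolutions.
        self.strip_h = nn.Conv2d(channels, channels, (1, strip_kernel), padding=(0, pad))
        self.strip_v = nn.Conv2d(channels, channels, (strip_kernel, 1), padding=(pad, 0))
        self.fuse = nn.Conv2d(channels, channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Parallel branches are summed and fused by a 1x1 convolution; a residual
        # connection preserves the original features.
        return self.fuse(self.square(x) + self.strip_h(x) + self.strip_v(x)) + x

feat = torch.randn(1, 64, 128, 128)
print(StripConvBranchBlock(64)(feat).shape)  # torch.Size([1, 64, 128, 128])
```

For point (2), the sketch below shows one plausible way to relax the IoU threshold for positive assignment when a target is elongated and the candidate anchor is well aligned in angle. The weighting function and its constants are our assumptions (the restraint constant k = 3 merely echoes the value found best in the ablation of Section 4.3); the paper's actual SC-LA formulation may differ.

```python
import math

def sc_la_iou_threshold(aspect_ratio: float,
                        angle_diff_deg: float,
                        base_thr: float = 0.5,
                        min_thr: float = 0.3,
                        k: float = 3.0) -> float:
    """Hypothetical dynamic IoU threshold: relaxed for high-aspect-ratio targets
    whose candidate anchors show only a small angular deviation."""
    elongation = max(aspect_ratio, 1.0 / aspect_ratio)    # >= 1, larger for slender targets
    angle_term = math.cos(math.radians(angle_diff_deg))   # close to 1 when well aligned
    relax = (1.0 - 1.0 / elongation) * max(angle_term, 0.0) / k
    return max(min_thr, base_thr - relax)

def is_positive(iou: float, aspect_ratio: float, angle_diff_deg: float) -> bool:
    """A candidate becomes a positive sample if its IoU clears the shape-aware threshold."""
    return iou >= sc_la_iou_threshold(aspect_ratio, angle_diff_deg)

# A slender ship (aspect ratio 8) with a well-aligned candidate (2 degrees off)
# is accepted at IoU 0.42, whereas a square target would still require 0.5.
print(is_positive(0.42, aspect_ratio=8.0, angle_diff_deg=2.0))   # True
print(is_positive(0.42, aspect_ratio=1.0, angle_diff_deg=2.0))   # False
```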
Experimental results indicate that our detection head achieves superior performance compared to other state-of-the-art approaches on the benchmark datasets DOTA, UCAS-AOD, and HRSC2016. This outstanding outcome demonstrates the considerable potential of the proposed method in the domains of object detection and remote sensing image analysis.
4. Experiments
Experiments were conducted on typical public datasets, i.e., DOTA, HRSC2016, and UCAS-AOD. Detailed information about the datasets, method implementation, and experimental results is presented in the following subsections.
4.1. Datasets
DOTA [
33] serves as the largest-scale aerial dataset, consisting of 2806 high-resolution images with varying dimensions collected from various remote sensing platforms (e.g., satellites and unmanned aerial vehicles (UAVs)). The height and width of these images range from 800 to 4000 pixels. Annotated with rotating bounding boxes, the dataset contains 188,282 target instances that exhibit significant scale differences and dense distribution. These annotated instances fall into 15 categories, including roundabouts (RA), harbours (HA), swimming pools (SP), helicopters (HC), planes (PL), and others.
The HRSC2016 [
34] dataset is a high-resolution satellite imagery dataset. It comprises 436 training images, 181 validation images for performance evaluation, and 444 test images. These images exhibit a wide range of sizes, from small 300 × 300-pixel images to large 1500 × 900-pixel ones, thus covering a broad spectrum of ship scales. Additionally, the dataset includes targets with diverse distributions—encompassing ships of different types, sizes, and colors—set against complex backgrounds (e.g., clouds and coastlines).
The UCAS-AOD [
35] is a benchmark tailored to aerial object detection, focusing specifically on car and aircraft detection. Its images capture various types of cars and aircraft under different environmental conditions, including variations in weather and season. For the purposes of this research, 1057 images were randomly selected for model training, while the remaining 302 images were used as the test set. Notably, the dataset offers rich diversity in target distribution: it includes cars and aircraft of different models and colors, distributed across various background environments such as airport runways, parking lots, and urban streets.
To clarify the data distribution for reproducibility, we explicitly state the train/val/test splits for all datasets used in this study: the DOTA dataset is divided into 1964 training images (70%), 281 validation images (10%), and 561 test images (20%); the HRSC2016 dataset into 436 training images (41%), 181 validation images (17%), and 444 test images (42%); and the UCAS-AOD dataset consists of 1057 training images (78%) and 302 test images (22%) without an independent validation set.
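The percentages above follow directly from the stated image counts; the short script below (plain Python, independent of the project code) recomputes them for verification.

```python
splits = {
    "DOTA":     {"train": 1964, "val": 281, "test": 561},
    "HRSC2016": {"train": 436,  "val": 181, "test": 444},
    "UCAS-AOD": {"train": 1057, "val": 0,   "test": 302},
}
for name, parts in splits.items():
    total = sum(parts.values())
    ratios = ", ".join(f"{k}: {100 * v / total:.0f}%" for k, v in parts.items() if v)
    print(f"{name} (total {total}): {ratios}")
# DOTA (total 2806): train: 70%, val: 10%, test: 20%
# HRSC2016 (total 1061): train: 41%, val: 17%, test: 42%
# UCAS-AOD (total 1359): train: 78%, test: 22%
```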
4.2. Implementation Details
In the present study, two baseline models were established: an anchor-based framework adopting ResNet-101 and an anchor-free architecture based on RepPoints. Both models utilize a backbone for feature extraction and integrate two detection heads to refine predictive outputs. During training, the stochastic gradient descent (SGD) optimizer was employed, with its initial learning rate, momentum, and weight decay set to 0.012, 0.9, and 0.0001, respectively.
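In plain PyTorch terms, the stated optimizer setting corresponds to the following snippet; the model here is only a placeholder, since the actual detectors are configured through MMDetection/MMRotate config files rather than direct calls.

```python
import torch

model = torch.nn.Conv2d(3, 16, 3)   # placeholder standing in for the detector

optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.012,            # initial learning rate
    momentum=0.9,        # SGD momentum
    weight_decay=0.0001,
)
print(optimizer)
```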
To evaluate the performance of the models, training was conducted on the HRSC2016 and UCAS-AOD datasets for 100 and 120 epochs, respectively. In the anchor-free RepPoints model, the number of point sets was set to 12. The weighting parameter
was adjusted on both datasets based on experimental experience. Additionally, experiments were conducted using the MMDetection-1.1, PyTorch-1.3, and MMRotate frameworks, with hardware configurations consisting of 11G RAM and six GPUs with a total memory of 62 GB. Data augmentation strategies included random flipping and random rotation. Experimental results for the baseline models and our proposed method, presented in
Table 1 and
Table 2, were obtained under the same multi-scale training and data augmentation settings to ensure a fair comparison with other methods.
4.3. Ablation Studies
Evaluation of Each Proposed Component. To validate the effectiveness of the proposed modules, we performed corresponding ablation experiments.
Table 1 presents the experimental results of the anchor-free RepPoints method on the two datasets. The baseline model achieves mAP scores of only 85.63% and 86.0%, owing to its tendency to ignore key features of targets with large aspect ratios. When the SC-LA strategy is integrated, the detector's performance improves by 3.09%. This enhancement indicates that the dynamic weighting function in the SC-LA strategy can adaptively lower the IoU threshold according to target shapes. This mechanism facilitates more thorough model learning, ensuring the model fully captures the key features of targets and thus improving detection accuracy.
Furthermore, we introduced the GDE-Loss function, which enhances the model’s sensitivity to samples with minor errors through gradient fine-tuning while alleviating the excessive focus on hard samples. This strategy effectively stabilizes the model’s gradient during the regression process, leading to further performance improvements—specifically, mAP increases by 1.08% and 3.32% on the two datasets, respectively.
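One simple loss family with exactly this behaviour is L(e) = ln(1 + τ|e|)/τ, whose gradient magnitude 1/(1 + τ|e|) is largest for small residuals and decays for large ones. We stress that this is only our own stand-in for intuition, not the actual GDE-Loss formulation; τ merely plays a role analogous to the gradient parameter tuned in the sensitivity study below.

```python
import math

def gde_like_loss(err: float, tau: float = 15.0) -> float:
    """Illustrative gradient-damped loss ln(1 + tau*|e|)/tau (not the paper's GDE-Loss)."""
    return math.log1p(tau * abs(err)) / tau

def gde_like_grad(err: float, tau: float = 15.0) -> float:
    """Its gradient magnitude 1/(1 + tau*|e|), largest for small errors."""
    return 1.0 / (1.0 + tau * abs(err))

for err in (0.01, 0.1, 1.0, 5.0):
    print(f"error={err:5.2f}  grad={gde_like_grad(err):5.3f}")
# grad ~= 0.870 at error 0.01 but only ~= 0.013 at error 5.0: small residuals keep
# strong supervision while poorly matched (hard) samples no longer dominate updates.
```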
Additionally, we carried out experiments on the anchor-based ResNet-101 model, and similar performance improvements were also observed on both datasets in
Table 2. These experimental results fully confirm that the proposed strategy is effective in enhancing target detection performance, particularly when dealing with challenging large-aspect-ratio targets.
In contrast to utilizing a single module, a network architecture incorporating multiple stacked modules yields superior performance. Notably, the integration of the SC-LA and GDE-Loss modules enriches the training process with samples containing critical features and provides efficient, precise regression guidance, all without introducing excessive computational complexity. This directly contributes to improved regression and classification outcomes. When employing the anchor-based ResNet-101 backbone, the model attains its optimal performance, with mean average precision (mAP) values of 90.02% and 90.17%.
Evaluation of the parameters within the module: Using the anchor-based ResNet-101 backbone on HRSC2016, with SC-LA included on its own, sensitivity experiments were conducted on its parameter to test the effect of SC-LA, as reported in
Table 3.
As observed from the table, when the parameter is set below 3, the smaller its value, the lower the mAP. This indicates that the angular difference then has too weak an influence on the weight function: for high-aspect-ratio targets, the IoU threshold is lowered excessively, which may cause some low-quality samples to be misclassified as positive and thus introduces redundant information into the positive sample set. Conversely, when the parameter exceeds 3, the larger its value, the lower the mAP, because an excessively large value overly restrains the reduction of the IoU threshold; as a result, even some samples containing key target features may be erroneously excluded from the positive set. Such over-restraint limits the model's ability to learn key target features, thereby impairing detection performance.
However, when the parameter equals 3, the highest mAP of 91.029% is achieved. This indicates that under this setting, the SC-LA strategy adapts well to target shapes and effectively learns target features: it neither excessively suppresses high-quality samples with angular differences nor misclassifies low-quality samples as positive samples.
Furthermore, with the SC-LA strategy adopted (its parameter set to 3), we conducted sensitivity experiments on the GDE-Loss parameter to verify the impact of the GDE-Loss.
It can be observed that when the GDE-Loss parameter is less than 15, the mAP value declines as a result of the characteristics of elongated ship targets (shown in Table 4). Specifically, at values of 0.5, 1, 5, and 10, the loss does not adequately reflect the sharp change that a high aspect ratio induces as the coordinate error increases. Such settings may cause the model to concentrate excessively on unimportant samples, thereby limiting its detection accuracy. When the parameter is set to 15, the mAP reaches a peak of 91.29%, indicating that the gradient settings achieve an appropriate balance on the HRSC2016 dataset. Under this condition, the model avoids excessive attention to unimportant samples and thus realizes accurate detection of elongated ship targets.
Complexity analysis of SOBA-Net: Table 5 compares the number of parameters, FLOPs, and inference time of SOBA-Net with those of the baseline models under the same hardware environment (an NVIDIA Tesla V100 GPU). With the addition of the three core modules, SOBA-Net decreases the number of parameters by 10%, FLOPs by 50%, and inference time by 4%, maintaining computational efficiency comparable to the baseline models and verifying the practicality of the method.
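For reference, the quantities reported in Table 5 can be measured with standard tooling; the sketch below assumes the fvcore package for FLOP counting and uses a small placeholder network in place of the actual detector.

```python
import time
import torch
from fvcore.nn import FlopCountAnalysis  # pip install fvcore

model = torch.nn.Sequential(              # placeholder standing in for SOBA-Net / baselines
    torch.nn.Conv2d(3, 64, 3, padding=1),
    torch.nn.ReLU(),
    torch.nn.Conv2d(64, 64, 3, padding=1),
).eval()
dummy = torch.randn(1, 3, 1024, 1024)

params = sum(p.numel() for p in model.parameters())   # parameter count
flops = FlopCountAnalysis(model, dummy).total()       # FLOPs for one forward pass

with torch.no_grad():                                  # rough inference latency
    start = time.perf_counter()
    for _ in range(10):
        model(dummy)
    latency_ms = (time.perf_counter() - start) / 10 * 1000

print(f"params={params / 1e6:.2f} M  flops={flops / 1e9:.2f} G  latency={latency_ms:.1f} ms")
```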
4.4. Comparative Experiments
An exhaustive comparison between SOBA-Net and other approaches was conducted on the DOTA dataset in
Table 6. As shown in the table, our proposed SOBA-Net attains the highest AP scores for the bridge (BR), ship (SH), soccer-ball field (SBF), roundabout (RA), swimming pool (SP), and helicopter (HC) categories, with respective values of 63.54, 89.35, 72.41, 73.04, 83.24, and 81.26. Moreover, we secured the best average performance across all categories, achieving a mAP of 81.74. The traditional R3Det enables more precise alignment between feature representations and the positions of predicted boxes, thereby effectively enhancing detection accuracy. However, this method fails to fully capture the global features of objects, leading to slightly lower accuracy than our approach when detecting high-aspect-ratio targets such as bridges and ships.
LSKNet can more accurately model the differentiated contextual-information demands of various object types by dynamically adjusting the network's large spatial receptive field. Nevertheless, its use of dilated convolutions can result in sparsity of key features, leading to a 0.53% gap in mAP compared to our proposed SOBA-Net. In contrast, SOBA-Net achieves superior accuracy by simultaneously extracting long-range contextual features via SCCM. Additionally, it realizes precise prediction of rotated anchor parameters via a decoupled network guided by a dynamic progressive activation mask.
Visualized detection results from selected DOTA dataset samples are presented in
Figure 7. Through decoupled parameter prediction, the proposed SOBA-Net precisely delineates the boundaries of diverse targets, thereby achieving accurate detection of densely packed small objects. For targets with arbitrary orientations (e.g., aircraft), this network exhibits the ability to accurately capture their spatial pose, enabling adaptation to rotations across all angular ranges. Additionally, SOBA-Net yields detection results that are more closely aligned with the ground-truth annotations of these objects. Even in scenarios where foreground-background contrast is low (e.g., the rightmost column of images), our method successfully detects bridges and ships with vague textural details, underscoring its strong generalization capability. Notably, SOBA-Net mitigates the incidence of missed detections and false positives, adapts effectively to substantial variations in object scale and aspect ratio, and generates rotating bounding boxes that better conform to target contours. These experimental findings validate that SOBA-Net can robustly capture the spatial location and shape characteristics of objects, facilitating high-precision prediction of object orientations.
Results on HRSC2016
This dataset comprises various ship types, anchored in harbors and on the open seas, offering abundant test scenarios to validate the effectiveness of our method. Through the integration of the proposed modules, our method achieves a mAP of 91.29%, surpassing all other existing detectors listed in
Table 7. Of particular note is that our method outperforms S2ANet (90.10%), a dedicated ship detector, by 1.19% in mAP. This result underscores the efficiency of our method, which features fewer parameters and lower computational complexity while delivering better performance.
Visualized detection results, presented in
Figure 8, fully illustrate the exceptional performance of our model when confronted with slender ships featuring varied angular orientations. This is especially evident in complex environments adjacent to piers or harbors, as well as in dense side-by-side berthing scenarios, where conventional detectors often struggle. Our detection head handles these cases by adopting the GDE-Loss regression training strategy and implementing gradient fine-tuning, which substantially enhances the model's localization accuracy for long, thin targets.
The images in the third row further validate the adaptability of our model in handling targets with vastly differing scales. In these images, the ships exhibit substantial length variations, with some differing in length by as much as 10-fold. By leveraging the SC-LA (Sample-Specific) strategy, our model comprehensively captures the key features of the ships and can adaptively recognize and precisely locate these vessels of vastly varying sizes.
Results on UCAS-AOD
For the target detection task conducted on the UCAS-AOD dataset, our model surpasses all existing two-stage and one-stage detectors, achieving an impressive mAP score of 91.34% (shown in
Table 8). The visualized detection outcomes, presented in
Figure 9 and
Figure 10, further validate the exceptional detection capability of our model in handling targets with varying aspect ratios. Within this dataset, the majority of targets—including vehicles and aircraft—have aspect ratios of 1 or 1.5, yet our model still maintains precise localization for these objects. This superior performance is credited to the SC-LA (Sample-Specific) strategy, which dynamically captures target features through its adaptive weight-function shape, demonstrating remarkable adaptability and generalization ability for targets with different aspect ratios.
Of particular significance is the dense side-by-side aircraft parking scenario illustrated in
Figure 9 and
Figure 10: even when the noses are intertwined with the fuselages of adjacent ones, our model can still effectively localize each individual target. This achievement stems from the SC-LA strategy’s ability to accurately capture key target information, such as the nose and tail of aircraft. Meanwhile, for vehicles and aircraft distributed in arbitrary directions, traditional detection methods often struggle to achieve effective localization due to unstable training gradients. In contrast, our GDE-Loss regression strategy provides the model with more stable and precise parameter update directions during training, which substantially enhances the localization accuracy.
Analysis of Failure Cases
For the failure-case analysis, we identify three representative scenarios with visual examples and underlying causes (as shown in
Figure 11). First, for severely occluded targets (occlusion rate ≥ 50%), such as densely moored ships in harbors where adjacent hulls overlap extensively or bridges partially blocked by vegetation, the SC-LA strategy struggles to screen effective positive samples—key features (e.g., ship bows/sterns or bridge piers) are obscured by overlapping regions, leading to missed detections or imprecise bounding boxes. Second, extremely small high-aspect-ratio targets (pixel size < 30 × 3), such as slender components of power lines or narrow waterways in remote sensing images, fail to be fully covered by the receptive field of the SCCM module, resulting in incomplete feature extraction and unstable regression. Third, low-contrast environments (grayscale difference between target and background < 10%), e.g., ships on foggy seas or bridges against cloudy skies, lead to a low signal-to-noise ratio for features; although GDE-Loss mitigates gradient instability, it cannot fully offset the interference from background noise, causing bounding box deviation.
These failures primarily stem from the limitations of current modules in handling extreme spatial constraints and environmental interference: the SC-LA’s key feature screening relies on partial overlap, which is ineffective under heavy occlusion; the SCCM’s multi-scale strip convolution cannot adapt to targets smaller than the minimum receptive field; and the GDE-Loss lacks adaptive adjustment to feature quality. Visualizations of these cases (supplemented as
Figure 11) clearly show that occluded targets are misclassified as background, tiny targets are omitted, and low-contrast targets have bounding boxes deviating from ground truth. Future improvements will integrate attention mechanisms to enhance feature extraction in occluded regions, optimize the SCCM’s receptive field adaptation for ultra-small targets, and fuse multi-modal features to improve robustness in low-contrast scenarios.