1. Introduction
Remote sensing object detection (RSOD) aims to classify and localize objects of interest (e.g., aircraft, ships, and vehicles) in remote sensing images, serving as a critical technology for scene interpretation in the remote sensing field. It is widely applied across various domains, including military reconnaissance, disaster relief, environmental monitoring, and urban planning. In recent years, with the rapid development of deep learning technology, various deep learning-based methods have gradually become mainstream in remote sensing object detection due to their strong feature representation capabilities and have achieved breakthroughs in detection performance.
Deep learning-based RSOD methods can be broadly categorized into two-stage algorithms and one-stage algorithms. Two-stage algorithms first generate a series of regions of interest (RoIs) that may contain objects, and then perform object/background classification and bounding box regression on these proposals [1,2]. However, remote sensing images contain numerous rotated objects. Traditional methods typically use horizontal anchor boxes to generate candidate regions, which leads to misalignment between targets and anchors, thereby introducing significant background interference. To address this, Ma et al. and Ding et al. [3,4,5,6] incorporated target angle information into the Region Proposal Network (RPN) and designed specialized angle prediction networks, achieving precise localization with rotated bounding boxes. Enhancing the feature representation of targets is another primary approach to improving detection performance. For example, Fu et al. [7] enhanced feature representation through multi-scale information fusion; CAD-Net [8] strengthened the connection between object features and their corresponding scenes by learning the global and local context of regions of interest, thereby enhancing the network's feature representation; ReDet [9] uses group convolution to generate rotation-equivariant features and then applies rotation-invariant RoI alignment to extract rotation-invariant features from them, achieving accurate detection of rotated objects. One-stage algorithms do not need to generate candidate regions; instead, they directly regress the bounding box and category of the object from multiple locations in the image [10,11]. The optimization strategies for one-stage detectors can likewise be divided into two parts: efficient feature extraction in the network, and bounding box classification and regression. Xu et al. [12] employed a densely connected network to enhance feature extraction capability. AF-SSD [13] enhances key features by introducing spatial and channel dual-path attention. AFRE-Net [14] first creates a fine-grained feature pyramid network (FG-FPN) to provide richer spatial and semantic features and then generates stronger feature representations through a feature enhancement module. OASL [15] integrates orientation-aware spatial information into the classification and localization branches to enhance feature diversity, thereby establishing a solid foundation for improved detection performance.
Whether for a two-stage or a one-stage algorithm, high-quality feature extraction is a prerequisite for the final detection quality. As analyzed above, the network's feature extraction can be enhanced through various methods such as multi-scale feature fusion, incorporating contextual information, or adding attention mechanisms. Additionally, deformable convolution can be employed to dynamically adjust the receptive field, thereby improving feature adaptability. However, these methods almost universally rely on small convolutional kernels as the basic module for extracting image features. In remote sensing images, targets exhibit significant variations in pose, manifested as large scale differences, high aspect ratios, and diverse orientations. For backbone feature extraction, small square-kernel convolution is constrained by its limited receptive field, capturing only localized information while maintaining a fixed receptive field across different objects. Although deformable convolutions can dynamically adjust receptive fields, they require meticulous parameter tuning to learn the offset values and fail to consistently deliver large receptive fields. Consequently, both approaches lack an overall perception of large-scale or slender objects. Therefore, the use of large-kernel convolution to extract features has begun to attract attention. Recently, LSKNet [16] first introduced large-kernel convolution into remote sensing object detection; it concatenates the feature maps generated by large convolution kernels along the channel dimension to form features with rich scene-context information. PKINet [17] further places large-kernel convolutions of multiple sizes in parallel to extract dense texture features over different receptive fields, thereby further improving the network's ability to detect objects with multi-scale variations. CPMFNet [18] adds a parallel dilated-convolution layer structure to the backbone network and adjusts the effective kernel size through dilation rates, providing a larger receptive field at equal computational cost.
In summary, the aforementioned methods expand the receptive field by employing large-kernel convolutions or dilated convolutions, capturing more contextual information from the scene and modeling long-range dependencies for large-sized and high-aspect-ratio targets. Consequently, they significantly improve the detection performance of remote sensing objects without complex designs. However, in the task of detecting slender and rotated objects, existing networks face conflicts between feature representation and bounding box regression due to the coupling of different attribute parameters, which manifests as the following:
(1) Noise Introduction and Soaring Computational Load: Firstly, due to their large receptive fields, large-kernel convolutions inevitably capture more background information, especially for small and slender targets, thereby introducing significant background noise. Secondly, their computational cost grows quadratically with the kernel size, far exceeding that of small-kernel convolutions. While dilated (atrous) convolutions can reduce computational costs, their sparse feature sampling struggles to accurately extract fine boundary features.
(2) Bounding Box Angle Discontinuity: When calculating the offset loss for rotated bounding boxes, the rotation angle is a critical regression parameter. However, angular values are periodic, leading to discontinuities at the boundaries of the definition range. Specifically, when an angle approaches a boundary, the predicted box and the ground truth box may be nearly equivalent in physical space, but the regression paths differ significantly depending on the rotation direction, resulting in large deviations in the loss computation. (Taking the long-edge representation as an example, if the ground truth angle is 89° (clockwise) and the predicted angle is 91° (clockwise), the physical deviation is only 2°. However, since the predicted angle exceeds the defined range, it is recalculated as −89° (counterclockwise), causing the actual angular deviation in the loss computation to become 178°.)
(3) Coupling of Different Attribute Parameters: Traditional remote sensing detection networks only consider the feature differences required by the classification and regression tasks, and only decouple the classification and regression branches at the start of the detection head to avoid the interference caused by shared features. However, beyond the significant feature differences between the branches, the bounding box regression branch itself also exhibits variations among the target's attribute parameters. Specifically, the position and scale of the object need to be predicted from rotation-invariant features, whereas the rotation angle is predicted from rotation-equivariant features, so a feature conflict still remains.
To address the aforementioned issues, we must not only achieve more appropriate feature extraction but also maintain the consistency between features and tasks, as well as the continuity of the loss in subsequent regression calculations; that is, the design and optimization of the network must be considered from a holistic perspective. Based on this, we propose SODE-Net, which primarily consists of a backbone network with an MSSO module, a decoupled detection head, and a phase-continuous encoding module for rotation angles. First, we designed the MSSO module in the backbone network to replace stages based on large square-kernel convolutions. This module naturally captures long-range dependencies of objects through multi-scale orthogonal strip-shaped receptive fields without introducing excessive background noise, thereby enabling the network to extract more precise features. Second, we construct a two-stage, finely decoupled detection head. It first employs an oriented detection module to generate orientation-sensitive features, and then extracts rotation-invariant features from them to perform bounding box regression and classification separately. Furthermore, in the regression branch, we use two parallel sets of convolutional layers to predict the position and shape of the rotated bounding box and its angle, respectively, achieving a secondary decoupling that avoids feature conflicts between the two. Finally, we introduce a phase-continuous encoding module to independently encode the rotation angle, converting the angle value into its phase cosine value. This value remains continuous under periodic angle variations, thereby resolving the discontinuity near the boundary of the angle definition domain. The proposed SODE-Net is a general joint solution that combines rotated feature extraction and rotation angle regression. In summary, our contributions include the following:
(1) We designed a backbone network incorporating a multi-scale fusion and spatially orthogonal (MSSO) convolution module, which combines the advantages of square-kernel convolutions and strip convolutions. It can efficiently extract features of objects with varying aspect ratios without introducing excessive background noise, and it particularly excels at capturing elongated objects.
(2) We designed a detection head with a multi-level decoupled architecture. It first employs rotational filters to generate orientation-sensitive features, and then extracts rotation-invariant features from them to perform bounding box regression and classification separately. In the regression branch, two parallel convolutional groups are used to independently process the regression of box position/shape and angle, achieving secondary decoupling to further separate feature conflicts between them.
(3) We introduce a phase-continuous encoding module in the angle regression branch, which converts the rotation angle values of bounding boxes into cosine phase values. These values remain continuous throughout angular periodic variations, thereby eliminating discontinuity and instability in regression loss caused by the periodicity of angle changes.
(4) The experimental results on the large remote sensing datasets DOTAv1.0, HRSC2016, UCAS-AOD, and DIOR-R show that our proposed SODE-Net outperforms other rotated object detection methods for remote sensing images and achieves state-of-the-art (SOTA) results.
3. Methodology
In this paper, we systematically analyze the challenges faced by existing networks in slender rotated object detection, including difficulties in holistic object perception, the introduction of background noise, and the conflicts in feature and boundary representation caused by the coupling of different attribute parameters. Based on this, we designed a layer incorporating receptive fields of multiple shapes as the backbone stage for feature extraction, and decoupled detection heads for parameter regression in the detector. Additionally, we introduced a rotary angle encoder to address the angle jump issue. The overall architecture of SODE-Net is illustrated in Figure 1. Specifically, we designed a multi-scale fusion and spatially orthogonal feature extraction (MSSO) module that stacks multiple MSSO blocks in series with residual connections to extract features through expanded, multi-scale combined receptive fields, as shown in the lower part of Figure 1. Based on these features, the Region Proposal Network (RPN) generates RoIs for the detection head. Subsequently, within the detection head, we employ a multi-level decoupling module to separate classification, bounding box regression, and rotation angle regression into three distinct branches, each processing features independently to avoid the feature coupling issues caused by shared features, as shown in the right part of Figure 1. Finally, in the angle regression branch, we map the angle value to its phase cosine value through the rotary angle encoder, so that the predicted output becomes a continuous value. All modules are described in detail below.
3.1. Multi-Scale Fusion and Spatially Orthogonal (MSSO) Module
In RSOD tasks, objects with large-scale variations and high aspect ratios are widely present. Conventional convolutions, due to their limited receptive fields, can only focus on local information, thus lacking a holistic perception of large-scale or elongated objects. Based on this, some methods introduce large-kernel convolution to expand the receptive fields and capture more contextual scene information. However, this approach may introduce significant background noise, adversely affecting detection performance.
To address the aforementioned issues, this section proposes the MSSO module. Within this module, we first employ the patch-embedding method to divide the feature map into multiple patches and then sequentially connect two MSSO blocks with residual connections. Specifically, each MSSO block first utilizes a large square-kernel convolution to extract local contextual information, followed by multi-scale spatially orthogonal strip-shaped convolutions that capture long-range dependencies across different orientations while reducing background noise interference. The generated feature maps are subsequently applied as attention weights to the input features, thereby enhancing the discriminative feature representation of key regions while suppressing background or redundant information. Compared to previous approaches, the strip convolutions can effectively extract the fundamental features of objects with varying aspect ratios, and the multi-scale fusion design accommodates feature extraction for objects at different scales. The sequential architecture effectively combines the advantages of square convolution and strip-shaped convolution without requiring an additional information fusion module, and the pipeline is illustrated in Figure 2. The detailed implementation is as follows.
Given an input feature X, with $X \in \mathbb{R}^{C \times H \times W}$, we first apply a square-kernel convolution $\mathrm{Conv}_{k \times k}$ to extract local contextual features, producing the feature map Z defined as

$Z = \mathrm{Conv}_{k \times k}(X),$

where $H \times W$ and $k$ denote the feature map size and the convolution kernel size, respectively. The standard convolution layer primarily extracts local contextual information from the input feature.
Next, we use multi-scale horizontal and vertical orthogonal large-size strip-shaped convolutions to extract features along both spatial axes, where $k_{h_1}$ and $k_{h_2}$ denote the kernel sizes of the horizontal convolutions, while $k_{v_1}$ and $k_{v_2}$ represent the kernel sizes of the vertical convolutions. The combination of the two sets of orthogonal strip-shaped convolutions collects features across the two spatial axes and is also effective at capturing long-range dependencies in the features.
To further enhance the interaction of features across the channel dimension, we apply a 1 × 1 pointwise convolution to the output of the orthogonal convolution layer, obtaining the feature map Y. Feature map Y is then used as attention weights applied to the original input X, yielding the final output feature A, expressed as follows:

$A = Y \odot X,$

where $\odot$ denotes elementwise multiplication. Benefiting from the feature extraction of the multi-scale orthogonal convolutions and the channelwise feature aggregation via the pointwise convolution, each position in feature map Y encodes both horizontal and vertical characteristics across a broad spatial region. By applying Y as attention weights to the input X, the network enhances its representation of elongated or narrow structures in the spatial dimension while reinforcing feature emphasis in object regions.
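For illustration, a minimal PyTorch-style sketch of one MSSO block is given below. The specific kernel sizes (5 for the square kernel, 7 and 11 for the strips), the use of depthwise convolutions, and the summation of the two strip-convolution scales are assumptions made for this example rather than the exact configuration of SODE-Net; only the overall order of operations (square convolution, orthogonal strip convolutions, pointwise convolution, attention re-weighting) follows the description above.

import torch
import torch.nn as nn

class MSSOBlock(nn.Module):
    """Sketch of one MSSO block: square-kernel conv -> multi-scale orthogonal
    strip convs -> 1x1 pointwise conv -> attention re-weighting of the input."""
    def __init__(self, channels, square_k=5, strip_ks=(7, 11)):  # kernel sizes assumed
        super().__init__()
        # Local context via a square kernel (depthwise, to limit computation).
        self.square = nn.Conv2d(channels, channels, square_k,
                                padding=square_k // 2, groups=channels)
        # Two scales of orthogonal (horizontal + vertical) strip convolutions.
        self.strips = nn.ModuleList()
        for k in strip_ks:
            self.strips.append(nn.Sequential(
                nn.Conv2d(channels, channels, (1, k), padding=(0, k // 2), groups=channels),
                nn.Conv2d(channels, channels, (k, 1), padding=(k // 2, 0), groups=channels),
            ))
        # Pointwise conv for channel-wise interaction; its output acts as the attention map Y.
        self.pointwise = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        z = self.square(x)                           # local contextual features
        y = sum(strip(z) for strip in self.strips)   # fusion over scales (summation assumed)
        y = self.pointwise(y)                        # channel aggregation -> attention map Y
        return y * x                                 # A = Y ⊙ X (elementwise re-weighting)

if __name__ == "__main__":
    block = MSSOBlock(64)
    print(block(torch.randn(1, 64, 128, 128)).shape)  # torch.Size([1, 64, 128, 128])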
3.2. Deep Fine-Grained Decoupling Detection Head
Previous object detectors predominantly employ shared fully connected layers in their detection heads for both classification and localization tasks. However, the spatial correlation of fully connected layers is inherently limited, making them insensitive to positional variations and thus unsuitable for precise localization. To address this, we decouple the classification and localization tasks by introducing an oriented detection module (ODM) to mitigate the inherent conflicts between these two objectives. Furthermore, within the localization task itself, there is also feature coupling between position prediction and angle prediction. Consequently, we implement a secondary decoupling of these two subtasks. The overall workflow is illustrated in Figure 3.
ODM: This module first employs active rotating filters (ARFs) [34] to encode orientation information. The ARF is a K × K × N filter that actively rotates N − 1 times during the convolution process, generating feature maps with N orientation channels (N is 8 by default). For an input feature map X and an ARF F, the output of the i-th orientation channel of feature Y can be expressed as

$Y^{(i)} = \sum_{n=0}^{N-1} F_{\theta_i}^{(n)} * X^{(n)}, \qquad \theta_i = i \cdot \frac{2\pi}{N}, \quad i = 0, 1, \dots, N-1,$

where $F_{\theta_i}$ is the filter F rotated clockwise by $\theta_i$, and $F_{\theta_i}^{(n)}$ and $X^{(n)}$ are the features of the n-th orientation channel in $F_{\theta_i}$ and X, respectively. By applying ARFs to convolutional layers, we obtain orientation-sensitive features with explicitly encoded orientation information. While bounding box regression benefits from such orientation-sensitive features, object classification requires rotation-invariant features. To extract them, we perform rotation-invariant pooling over the orientation-sensitive features, simply selecting the strongest response across all orientation channels as the output feature $\hat{Y}$:

$\hat{Y} = \max_{0 \le n < N} Y^{(n)}.$

In this way, we can align the features of objects with varying orientations to achieve robust object classification. Compared to orientation-sensitive features, orientation-invariant features are more compact and require fewer parameters in subsequent layers. For instance, an H × W × 256 feature map with 8 orientation channels is reduced to H × W × 32 after max-pooling. Finally, we feed the orientation-sensitive features and the orientation-invariant features into two subnetworks dedicated to bounding box regression and classification, respectively.
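As a concrete illustration of the rotation-invariant pooling described above, the following short Python/PyTorch sketch selects the strongest response over the orientation channels. The assumed memory layout (the N orientation channels stored contiguously within each output channel group) is an implementation choice for the example, not a detail specified here.

import torch

def rotation_invariant_pooling(feat, num_orient=8):
    """Select the strongest response across orientation channels.

    feat: (B, C, H, W) orientation-sensitive features, where C = C_out * num_orient
          and each group of `num_orient` consecutive channels holds the responses
          of one output channel under the N filter orientations (layout assumed).
    Returns: (B, C // num_orient, H, W) orientation-invariant features.
    """
    b, c, h, w = feat.shape
    feat = feat.view(b, c // num_orient, num_orient, h, w)
    return feat.max(dim=2).values  # strongest response over the N orientations

if __name__ == "__main__":
    y = torch.randn(2, 256, 32, 32)             # e.g., 32 output channels x 8 orientations
    print(rotation_invariant_pooling(y).shape)  # torch.Size([2, 32, 32, 32])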
Regression Parameter Decoupling: In the bounding box regression subtask, the prediction of box coordinates requires rotation-invariant features, while angle prediction demands rotation-equivariant features, which still presents an issue of coupled feature representation. Moreover, the subsequent phase-continuous encoding module only involves angle regression. Therefore, we further decompose the regression process of rotated bounding boxes into multiple branches. For different parameters within the bounding box, we group them according to their characteristics and assign separate branches in the detection head to predict each group. This approach enables independent and interference-free regression for distinct parameters, thereby achieving more accurate rotated object detection. The details are as follows.
The parameters of the rotated bounding box are divided into two groups: the box position and size $(x, y, w, h)$, and the rotation angle $\theta$. In the bounding box regression subtask, we extract distinct feature vectors $f_{xywh}$ and $f_{\theta}$ from two sets of feature maps, which are then processed by fully connected layers to generate the final predictions. The formula is as follows:

$(\delta_x, \delta_y, \delta_w, \delta_h) = W_1 f_{xywh}, \qquad \delta_\theta = W_2 f_{\theta},$

where $(\delta_x, \delta_y)$, $(\delta_w, \delta_h)$, and $\delta_\theta$ are the predicted offsets for the positional, shape, and angular parameters, respectively, and $W_1$ and $W_2$ are the learnable fully connected layer parameters. We construct separate loss functions for the two sets of parameters predicted from the grouped features $f_{xywh}$ and $f_{\theta}$, then use backpropagation to update the convolutional layer parameters. This process continuously enhances the convolutional layers' ability to extract features corresponding to these parameters, thereby achieving feature decoupling.
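The secondary decoupling in the regression branch can be sketched as follows. The depth and width of each convolution group, the use of tanh to constrain the encoded angle outputs, and the number of phase-shifting steps (3) are placeholder assumptions for the example, not the settings of SODE-Net; the point is the two parallel convolution groups feeding separate fully connected layers.

import torch
import torch.nn as nn

class DecoupledRegressionHead(nn.Module):
    """Two parallel conv groups followed by separate FC layers: one path predicts
    the box position/shape offsets, the other predicts the (encoded) angle."""
    def __init__(self, in_channels=256, num_phase_steps=3):
        super().__init__()
        def conv_group():  # placeholder depth/width
            return nn.Sequential(
                nn.Conv2d(in_channels, in_channels, 3, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(in_channels, in_channels, 3, padding=1), nn.ReLU(inplace=True),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            )
        self.xywh_convs = conv_group()    # features f_xywh for position/shape
        self.theta_convs = conv_group()   # features f_theta for the angle
        self.fc_xywh = nn.Linear(in_channels, 4)                  # (dx, dy, dw, dh)
        self.fc_theta = nn.Linear(in_channels, num_phase_steps)   # encoded angle values

    def forward(self, roi_feat):
        delta_xywh = self.fc_xywh(self.xywh_convs(roi_feat))
        encoded_theta = torch.tanh(self.fc_theta(self.theta_convs(roi_feat)))  # keep in [-1, 1]
        return delta_xywh, encoded_theta

if __name__ == "__main__":
    head = DecoupledRegressionHead()
    d, t = head(torch.randn(2, 256, 7, 7))
    print(d.shape, t.shape)  # torch.Size([2, 4]) torch.Size([2, 3])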
3.3. Phase-Continuous Encoding (PCE) Module
As mentioned in the previous section, the objects detected in remote sensing exhibit diverse orientations and are often characterized by high aspect ratios, making horizontal bounding boxes inadequate for accurate representation. Consequently, current mainstream approaches predominantly employ rotated object detectors, which obtain high-quality detection boxes that tightly enclose objects by incorporating rotation angles. However, rotated bounding boxes suffer from the drawback of discontinuous regression loss due to the periodicity inherent in the orientation angles of the bounding boxes.
Based on this, we introduce a phase-continuous encoding module that encodes the angle information into continuous cosine phase values and then decodes the phase information back into a discrete angle prediction through the decoder, thus solving the problem of the angle jump at the boundary. In this way, the network further avoids the discontinuity of the angle value when computing the loss on features extracted by the MSSO module, thereby enhancing its precision in capturing slender objects and the stability of regression under multi-angle variations of objects. The specific workflow is as follows:
Adopting the "long edge 90°" angle definition for the rotated bounding box, the rotation angle $\theta$ satisfies $\theta \in [-\pi/2, \pi/2)$; the angle-encoding formula is

$x_n = \cos\left(2\theta + \frac{2\pi n}{N_{step}}\right), \qquad n = 1, 2, \dots, N_{step},$

where $N_{step}$ denotes the number of phase-shifting steps. This encoder maps angles to cosine values, and since the cosine is continuous over the range of angle variations, it resolves the discontinuity issue caused by direct angle regression. Correspondingly, the angle-decoding formula can be expressed as

$\hat{\theta} = \frac{1}{2}\arctan 2\left(-\sum_{n=1}^{N_{step}} \hat{x}_n \sin\frac{2\pi n}{N_{step}},\ \sum_{n=1}^{N_{step}} \hat{x}_n \cos\frac{2\pi n}{N_{step}}\right),$

where the output angle $\hat{\theta}$ is within the range of $[-\pi/2, \pi/2)$ and is uniquely determined.
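To make the encoding/decoding round trip concrete, a small NumPy sketch of a phase-shifting-style coder consistent with the formulas above is given below. The doubling of the angle (to account for the π-periodicity of the long-edge definition) and the sign convention in the decoder follow our reconstruction of the formulas and should be read as illustrative assumptions.

import numpy as np

def encode_angle(theta, num_steps=3):
    """Map an angle theta in [-pi/2, pi/2) to num_steps cosine phase values.
    The factor 2 accounts for the pi-periodicity of the long-edge definition (assumed)."""
    n = np.arange(1, num_steps + 1)
    return np.cos(2.0 * theta + 2.0 * np.pi * n / num_steps)

def decode_angle(x, num_steps=3):
    """Recover the angle from the phase values via the standard phase-shifting
    inverse (sign convention assumed)."""
    n = np.arange(1, num_steps + 1)
    phase = np.arctan2(-np.sum(x * np.sin(2 * np.pi * n / num_steps)),
                       np.sum(x * np.cos(2 * np.pi * n / num_steps)))
    return phase / 2.0  # back to the long-edge angle range

# The encoding is continuous across the boundary: 89 deg and -89 deg (i.e., 91 deg)
# yield nearly identical codes, so the regression loss no longer jumps there.
print(encode_angle(np.deg2rad(89)))
print(encode_angle(np.deg2rad(-89)))
# Round trip: an angle of 35 deg is recovered after encode/decode.
print(np.rad2deg(decode_angle(encode_angle(np.deg2rad(35)))))  # ~35.0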
The rotated bounding box is represented by five parameters $(x, y, w, h, \theta)$, which are the center coordinates, width, height, and angle of the box. According to the encoding formula, the encoded output lies in the range [−1, 1]. In order to make the training more stable, we convert the output features of the convolutional layer into the predicted encoded values $\hat{x}_n$, which are constrained to the range [−1, 1]. Subsequently, the L1 loss is applied to compute the loss for the angle regression branch:

$L_{\theta} = \sum_{n=1}^{N_{step}} \left| \hat{x}_n - x_n^{gt} \right|,$

where $x_n^{gt}$ is obtained by applying the phase encoding to the rotation angle of the ground truth (GT) bounding box.
Based on the above analysis, the total loss of the network comprises three components: the classification loss $L_{cls}$, the bounding box location loss $L_{loc}$, and the rotation angle loss $L_{\theta}$. The overall loss function can be expressed as follows:

$L = L_{cls} + \lambda_1 L_{loc} + \lambda_2 L_{\theta},$

where $\lambda_1$ and $\lambda_2$ are the balancing weights for bounding box localization and rotation, respectively.