Deformable 1D Directional Convolution with Bidirectional Offsets for Oriented Object Detection

Li, Ying; Li, Xuemei; Zhang, Caiming

doi:10.3390/rs18060934

by Ying Li, Xuemei Li * and Caiming Zhang

School of Software, Shandong University, Jinan 250101, China

* Author to whom correspondence should be addressed.

Remote Sens. 2026, 18(6), 934; https://doi.org/10.3390/rs18060934

Submission received: 30 January 2026 / Revised: 10 March 2026 / Accepted: 17 March 2026 / Published: 19 March 2026

(This article belongs to the Section Remote Sensing Image Processing)

Highlights

What are the main findings?

A deformable 1D directional convolution is proposed to implement rotated 1D convolution for adaptively extracting the features of oriented objects.
A tri-branch convolution layer is designed by combining the deformable 1D directional convolution with the standard square-shaped convolution in a parallel manner.
An orientation-aware feature pyramid network is presented by integrating the tri-branch convolution layer with the feature pyramid network.

What are the implications of the main findings?

The proposed deformable 1D directional convolution only requires simple bidirectional offsets to efficiently implement a rotated 1D convolution, avoiding the time-consuming rotation operation.
The oriented features of objects can be effectively extracted by the orientation-aware feature pyramid network.
The performance of oriented object detection can be improved by adopting the orientation-aware feature pyramid network.

Abstract

Oriented object detection is an important and challenging task in the field of image processing and computer vision. The main challenge in detecting oriented objects comes from their high aspect ratios and arbitrary orientations. Various methods have been developed to handle this issue. However, most existing works rely on time-consuming rotation and interpolation operations to align the feature representations of oriented objects. To avoid these operations, in this paper, we first introduce a simple yet effective deformable 1D directional convolution (D1DD-Conv), which implements a rotated convolution by deforming the 1D convolution kernel with horizontal and vertical offsets. Based upon this directional convolution, we then design a tri-branch convolution layer and integrate D1DD-Conv into the feature pyramid network for extracting the directional features of objects. Furthermore, we present a deep model to deal with the oriented object detection task. By allowing offsets only along the horizontal and vertical directions, D1DD-Conv essentially corresponds to a rotated 1D convolution but without any rotation operations. This simple design is beneficial for efficiently capturing the orientation features of different oriented objects, leading to accurate prediction of the oriented bounding box of each oriented object. Experiments on three popular datasets show that our model can achieve superior detection performance.

Keywords:

deformable 1D convolution; directional offset; feature pyramid network; oriented bounding box; oriented object detection

1. Introduction

Oriented object detection, as a fundamental task in the image processing and computer vision community, has become a hot research topic and achieved great progress in the past decade [1,2,3,4]. Unlike horizontal object detection that assumes object instances to be aligned with image axes [5], the goal of oriented object detection is to produce some oriented bounding boxes (OBBs) for the objects with arbitrary orientations, which has wide applications, such as aerial image recognition [6], scene text detection [7], smart retail [8], and face detection [9]. Over the past several years, some notable efforts have been made to design oriented object detectors from different perspectives.

A natural approach for predicting OBBs is to directly estimate a tuple $(x, y, w, h, \theta)$ by adding a rotation angle $\theta$ to the traditional four-parameter tuple used in horizontal bounding boxes (HBBs) [10]. However, this strategy suffers from the boundary discontinuity issue when the objects rotate near the boundary angle, which is caused by the exchangeability of edges and the periodicity of the angle [11]. The direct manifestation of this issue is a sharp loss increase at the boundary angle during model training. To eliminate this negative effect of the boundary angle, researchers have proposed many smooth loss functions, such as PSC L1 loss [11], scaled smooth L1 loss [12], PIoU loss [13], IoU-smooth L1 loss [14], GWD loss [15], KLD loss [16], and KFIoU [17]. These functions can result in more accurate angle estimation.
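
The exchangeability of edges mentioned above can be illustrated with a minimal sketch. It assumes an angle measured in radians about the box center; the helper names `obb_corners` and `same_box` are hypothetical and not part of any cited method. Two parameter tuples that swap $w$ and $h$ and shift $\theta$ by 90 degrees describe the same rectangle, yet a naive distance between the tuples is large, which is one way a regression loss can jump near the boundary angle.

```python
import math

def obb_corners(x, y, w, h, theta):
    """Corners of an OBB with center (x, y), size (w, h), and angle theta in radians."""
    c, s = math.cos(theta), math.sin(theta)
    half = [(-w / 2, -h / 2), (w / 2, -h / 2), (w / 2, h / 2), (-w / 2, h / 2)]
    # Rotate each half-extent vector by theta, then translate to the center.
    return [(x + dx * c - dy * s, y + dx * s + dy * c) for dx, dy in half]

def same_box(a, b, tol=1e-9):
    """True if every corner of a coincides with some corner of b."""
    return all(any(math.dist(p, q) < tol for q in b) for p in a)

box_a = (10.0, 20.0, 8.0, 2.0, 0.0)
box_b = (10.0, 20.0, 2.0, 8.0, math.pi / 2)  # edges swapped, angle shifted by 90 degrees

# Both tuples describe the same rectangle in the image plane.
identical = same_box(obb_corners(*box_a), obb_corners(*box_b))
# Naive L1 distance between the two parameter tuples is nevertheless large.
param_gap = sum(abs(u - v) for u, v in zip(box_a, box_b))
print(identical, round(param_gap, 2))
```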

An alternative solution that avoids the discontinuity problem is to define a new representation of oriented objects. Instead of directly estimating the angle $\theta$, Xu et al. [18] suggested representing an oriented object by gliding the vertices of the HBB along its sides. This method regresses four length ratios (denoted as $\alpha_{1}$, $\alpha_{2}$, $\alpha_{3}$, and $\alpha_{4}$) that describe the relative gliding offset of each vertex on its corresponding side. As a result, an eight-parameter vector $(x, y, w, h, \alpha_{1}, \alpha_{2}, \alpha_{3}, \alpha_{4})$ is adopted to depict an oriented object. Built upon this representation, oriented object detection is achieved by estimating the offsets of each vertex. In [1], a midpoint offset is introduced to represent oriented objects, which only uses two offsets (i.e., one offset relative to the midpoint of the top side of the HBB and the other offset relative to the midpoint of the right side of the HBB). Additionally, double horizontal rectangles [19], convex hulls [20], and outer rectangular projection [21] are also adopted to represent oriented objects, resulting in discontinuity-free detection.
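
As a rough illustration of the eight-parameter representation, the sketch below turns $(x, y, w, h, \alpha_{1}, \alpha_{2}, \alpha_{3}, \alpha_{4})$ into four vertices. The choice of reference corner on each side and the image convention (the $y$ axis pointing down, so the top side has the smallest $y$) are assumptions made here for illustration; the function name `glide_vertices` is hypothetical.

```python
def glide_vertices(x, y, w, h, alphas):
    """Place one vertex on each side of the HBB from four gliding ratios in [0, 1]."""
    a1, a2, a3, a4 = alphas
    x_min, x_max = x - w / 2, x + w / 2
    y_min, y_max = y - h / 2, y + h / 2
    return [
        (x_min + a1 * w, y_min),  # top side, measured from the top-left corner (assumed)
        (x_max, y_min + a2 * h),  # right side, measured from the top-right corner (assumed)
        (x_max - a3 * w, y_max),  # bottom side, measured from the bottom-right corner (assumed)
        (x_min, y_max - a4 * h),  # left side, measured from the bottom-left corner (assumed)
    ]

# Ratios of 0 reproduce the HBB corners; ratios of 0.5 give a diamond inside the HBB.
print(glide_vertices(10.0, 10.0, 8.0, 4.0, (0.0, 0.0, 0.0, 0.0)))
print(glide_vertices(10.0, 10.0, 8.0, 4.0, (0.5, 0.5, 0.5, 0.5)))
```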

Although most previous works mainly focus on more suitable oriented object representations and loss functions, extracting more discriminative features is equally important for boosting the detection performance. So far, only a few efforts have been made to explore how to effectively extract oriented features for detecting rotated objects accurately. One solution is to extract oriented features by adopting the rotated convolution equipped with multiple different rotated kernels [8,22,23], in which rotation matrices with learnable angles are adopted to define rotated convolution kernels. Similarly, alignment convolution presented in [24] can extract orientation-sensitive features by using additional oriented offsets. In order to enhance the adaptability of rotated convolution, Pu et al. [5] introduced an adaptively rotated convolution, in which the convolution kernel can adaptively adjust the rotation angle generated by employing a routing function. These convolutions based on rotated kernels can significantly enhance the detection performance of oriented objects. However, they generally require time-consuming rotation and interpolation resampling operations and introduce interpolation artifacts caused by the convolution kernel rotation.

To address these issues, we present a simple yet effective deformable 1D directional convolution for capturing oriented features, which is equipped with bidirectional offsets, i.e., a horizontal offset and vertical offset. We use this deformable 1D directional convolution to replace the 2D rotated convolution widely used in several existing detection backbones for efficiently and effectively extracting the feature representations of oriented objects, resulting in accurate detection. Three core contributions of our work are summarized as follows:

We design a simple and effective deformable 1D directional convolution with bidirectional offsets (D1DD-Conv), which can work as a rotated 1D convolution to adaptively extract the features of oriented objects. Different from existing rotated convolutions, our D1DD-Conv only requires a simple bidirectional offset while avoiding time-consuming rotation and interpolation operations.
We present an orientation-aware feature pyramid network by integrating D1DD-Conv with the commonly used feature pyramid network (FPN) and also design a tri-branch convolution layer by combining the standard $3 \times 3$ convolution with a $1 \times 3$ horizontal convolution and a $3 \times 1$ vertical convolution in a parallel manner. Building upon them, we propose an effective two-stage model for oriented object detection.
We introduce an additional deep supervision loss function for model training, which adopts the angles of oriented objects estimated by the offsets of D1DD-Conv. In addition, some experiments are conducted on three popular datasets to verify the effectiveness of the aforementioned components.

The remainder of our paper is organized as follows. Section 2 briefly discusses related work. In Section 3, we introduce the proposed D1DD-Conv in detail. Section 4 evaluates the performance of our proposed model through experiments. Finally, the conclusion is drawn in Section 5.

2. Related Work

2.1. CNN-Based Oriented Object Detection

In the past two decades, object detection has achieved remarkable progress with many successful detectors such as YOLO [25], R-CNN [26], and DETR [27]. These detectors are designed to locate natural objects with HBBs. However, HBBs cannot accurately represent the oriented objects in scene text images and remote sensing images. To deal with this issue, some well-designed detectors have been presented to locate oriented objects by leveraging OBBs with orientation information.

Owing to the success of convolutional neural networks (CNNs) for various detection tasks [28,29,30], most oriented object detectors use CNNs as the basic architecture. A representative work is the Oriented R-CNN [1] (a two-stage detector for oriented objects), which innovatively represents OBBs by introducing two additional offsets relative to the midpoints of the top and right sides of HBBs. It first utilizes an oriented region proposal network to output several high-quality OBB proposals and then classifies these proposals by employing the features extracted by the backbone, as well as refines their spatial locations. A well-known single-stage oriented detector is R³Det [31], which introduces feature refinement and regression refinement to the detection head for more accurate OBB localization. Though R³Det achieves competitive results compared to two-stage detectors, it explicitly regresses the rotation angle and therefore faces the discontinuity problem. Different from Oriented R-CNN and R³Det, which need to predefine dense anchors, the anchor-free oriented proposal generator [32] first outputs coarse oriented boxes and then refines these oriented boxes, producing the final results by using an R-CNN head. The above detectors generally suffer from excessive negative samples caused by the mismatches between RoIs. Regarding other methods, some very recent detectors exploit the transformer or diffusion model as the backbone to boost the detection accuracy [33,34]. However, they face high computational complexity, which limits their applications in resource-constrained platforms. Therefore, our work focuses on designing a more effective CNN-based detector.

To deal with the discontinuity problem, researchers have developed various representation methods for OBBs, such as relative gliding offset of each vertex of HBBs [18], double horizontal rectangles [19], eight-dimensional vectors based on distance and offset parameters [35], orthogonal vector sets [36], polar radius vectors [37], and vector decomposition [38]. In order to avoid an explicit angle regression, several works explored the implicit representation of the rotation angle via angular encoding techniques, e.g., the phase-shifting coder [11] and ACM coder [6]. Meanwhile, using some smooth loss functions [14,15,16,17] can also eliminate the discontinuity problem, which aims to smooth the sharp loss increase at the boundary angle. To address the issue of excessive negative samples, Ding et al. [10] introduced the RoI Transformer, which applies an RoI learner module to transform horizontal proposals into rotated proposals. The RoI Transformer can significantly suppress the number of negative samples during training by matching between rotated RoIs and ground truths. DCFL [39] alleviates the influence of negative samples by adopting a coarse-to-fine assigner with a dynamic prior for sample-label assignments.

2.2. Oriented Convolution

Most existing oriented object detectors directly adopt the standard convolution with square-shaped kernels, e.g., $3 \times 3$ convolution and $5 \times 5$ convolution, as the backbone to extract features. This implies that the convolutional features are not well aligned with oriented objects [40,41]. Thus, it becomes a natural choice to employ oriented convolution kernels so as to capture the oriented features. The design of oriented kernels dates back to steerable filters [42] that are synthesized by linear combinations of basis filters, which make it possible to adaptively steer the filters to arbitrary orientations. Inspired by this, Weiler et al. [43] proposed to learn oriented filters with weights shared over different orientations. These learned filters are steerable by defining them as linear combinations of atomic steerable filters. A similar idea has also been presented in [44], where the oriented filters are achieved by constraining them to a family of circular harmonics (i.e., steerable filters) and linearly combining them.

Unlike the kernel decomposition strategy mentioned above, there have been several efforts to explore oriented convolutions that adaptively adjust the shape of the convolution kernel to capture the orientation features. In [45], Dai et al. presented a deformable convolution, in which additional learnable offset parameters are introduced to optimize the selection of spatial sampling locations for convolution operations. A contemporaneous work is the active convolution [46], which also introduces the position parameters in the convolution and learns them through backpropagation during training. The crucial difference between deformable convolution and active convolution is that in the former, the offsets are dynamic and vary per location, while in the latter, the position parameters are static once learned during training. In the domain of oriented object detection, alignment convolution was proposed in [24] to deal with the misalignment between standard convolutional features and oriented objects by using additional offset fields. The offset field can be obtained by calculating the differences between regular sampling locations and anchor-based rotated sampling ones. To further capture high-quality oriented features of rotated objects, Pu et al. [5] proposed an adaptive rotated convolution, in which the convolutional kernels can adaptively rotate to better capture features of oriented objects with varying orientations. To achieve this goal, the convolutional kernels are rotated by sampling new weights from a rotated kernel space generated by interpolating the standard convolutional weights. Although the adaptive rotated convolution is more flexible to produce high-quality orientation features of oriented objects, interpolating multiple kernel spaces and rotating them incur high calculation costs, which reduces its computational efficiency. To address this issue, oriented 1D kernels were introduced in [47] by rotating a filter with different angles. A 2D kernel that captures orientation features can be approximately implemented by combining multiple oriented 1D kernels. This approximation strategy can improve efficiency while preserving accuracy. However, it still needs to perform the expensive rotation operation. Regarding other methods, 1D rotated convolution with multiple fixed kernels was presented in [48], which implements the 1D convolution operation by sampling the points in twelve fixed orientations.

To further reduce the computational cost of oriented convolutions, in this work, we introduce a novel deformable 1D directional convolution with bidirectional offsets that can efficiently and effectively extract oriented features, integrating them into the FPN framework for handling oriented object detection. Different from the existing deformable convolution that learns arbitrary-oriented 2D offsets to augment spatial sampling locations, our proposed 1D convolution only uses bidirectional offsets, i.e., a horizontal offset and vertical offset, to control the direction of the convolution kernel, resulting in an efficient 1D oriented convolution. Based on this convolution and FPN, we further present an effective network framework for oriented object detection.

3. Method

In this section, we elaborate on the detailed architecture of our proposed method. We first overview the proposed network structure in Section 3.1, which mainly consists of an orientation-aware FPN and detection head. Then, in Section 3.2, we introduce a tri-branch convolution layer and apply it to build an orientation-aware FPN. The key component of the tri-branch convolution layer is the deformable 1D directional convolution (D1DD-Conv) with bidirectional offsets, which is detailed in Section 3.3. Next, the detection head composed of an oriented RPN module and oriented detection module is presented in Section 3.4. Finally, we further introduce a new loss function for training the proposed detector in Section 3.5.

3.1. Architecture Overview

We choose Oriented R-CNN [1], a representative two-stage oriented detector, as our baseline model and propose an orientation-aware FPN by replacing the standard convolution layer used in the original FPN with a tri-branch convolution layer. The orientation-aware FPN is designed as our backbone network to capture multi-scale features sensitive to object orientation, in which the D1DD-Conv is proposed to enhance the orientation-sensitive capability of the backbone. An overview of the proposed architecture is illustrated in Figure 1, which is mainly composed of two pathways with lateral connections, i.e., a top-down pathway and bottom-up pathway. Each pathway is built on the widely used ResNet building blocks. In addition, a new loss function is adopted to further improve the detection accuracy.

3.2. Orientation-Aware Feature Pyramid Network

The feature pyramid network (FPN) [49], originally designed for generic object detection, leverages the multi-scale and pyramidal hierarchy of the classic ResNet [50] to construct feature pyramids. Specifically, equipped with lateral connections, the FPN leverages a dual-pathway framework to produce high-level semantic feature maps across all scales. It consists of a bottom-up pathway and a top-down pathway. The former performs feed-forward computations using the backbone network to produce a feature hierarchy composed of multi-scale feature maps with a scaling factor of 2, while the latter outputs high-resolution features by upsampling high-level semantic feature maps from upper pyramid levels and further boosts feature representation quality by fusing these upsampled features with corresponding outputs from the bottom-up pathway through lateral connections. Both pathways are built upon the ResNet building blocks. These blocks mainly consist of a number of $3 \times 3$ convolution layers. However, the standard $3 \times 3$ convolution is an orientation-insensitive operation, which is not conducive to capturing the orientation characteristics of oriented objects.

To mitigate the aforementioned limitation of FPN, we present an orientation-aware FPN, in which the building block is enhanced by introducing a tri-branch convolution layer. Figure 2 illustrates the structure of a tri-branch convolution layer. We apply two additional D1DD-Convs (i.e., a $3 \times 1$ vertical convolution and a $1 \times 3$ horizontal convolution) in a parallel manner to enhance the ability of the building block to extract the oriented features. To be specific, we employ 1D horizontal and vertical convolutions to compute the horizontal and vertical features and equip them with bidirectional offsets to make them deformable. These D1DD-Convs are orientation-adaptive operators that can capture the oriented features of objects. This modification makes FPN more orientation-sensitive than its original design. The details of the proposed D1DD-Conv with bidirectional offsets are elaborated in Section 3.3.
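
The parallel arrangement of the three branches can be sketched for a single-channel feature map as follows. This is only an illustration under stated assumptions: the branch outputs are fused by element-wise summation (the text does not specify the fusion rule here), the borders are zero-padded, and the 1D branches are shown without offsets, i.e., in their undeformed state. The helper names `conv2d_same` and `tri_branch` are hypothetical.

```python
def conv2d_same(x, kernel):
    """Zero-padded 'same' cross-correlation of a 2D list with an odd-sized 2D kernel."""
    H, W = len(x), len(x[0])
    kh, kw = len(kernel), len(kernel[0])
    ph, pw = kh // 2, kw // 2
    out = [[0.0] * W for _ in range(H)]
    for i in range(H):
        for j in range(W):
            acc = 0.0
            for a in range(kh):
                for b in range(kw):
                    r, c = i + a - ph, j + b - pw
                    if 0 <= r < H and 0 <= c < W:
                        acc += kernel[a][b] * x[r][c]
            out[i][j] = acc
    return out

def tri_branch(x, k_square, k_horizontal, k_vertical):
    """Run a 3x3, a 1x3, and a 3x1 branch in parallel and sum them (assumed fusion)."""
    branches = [
        conv2d_same(x, k_square),                   # standard square-shaped branch
        conv2d_same(x, [list(k_horizontal)]),       # 1x3 horizontal branch
        conv2d_same(x, [[v] for v in k_vertical]),  # 3x1 vertical branch
    ]
    H, W = len(x), len(x[0])
    return [[sum(b[i][j] for b in branches) for j in range(W)] for i in range(H)]

# A unit impulse makes the footprint of each branch visible in the output.
impulse = [[1.0 if (i, j) == (2, 2) else 0.0 for j in range(5)] for i in range(5)]
zeros_3x3 = [[0.0] * 3 for _ in range(3)]
response = tri_branch(impulse, zeros_3x3, [1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
```

With the square kernel set to zero, the response is a cross centered on the impulse, which shows that the two 1D branches add row-wise and column-wise context on top of whatever the square kernel contributes.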

3.3. D1DD-Conv with Bidirectional Offsets

As discussed in the previous section, oriented convolutions provide essential orientation information, which is important for accurately predicting OBBs. However, existing oriented convolutions heavily rely on the time-consuming 2D rotation operation. To improve the efficiency of oriented convolution, inspired by the success of employing 1D horizontal and vertical convolutions on visual recognition [51], we propose a D1DD-Conv with bidirectional offsets. Different from the traditional deformable 2D convolution [45] that adopts arbitrary-oriented offsets, our proposed D1DD-Conv only allows the offsets to shift along the horizontal and vertical directions, resulting in an efficient and effective 1D directional convolution. We point out that although using arbitrary-oriented offsets can obtain better performance than bidirectional offsets, the arbitrary-oriented offsets need to perform rotation or interpolation operations with higher computational cost. Figure 3 illustrates the differences between them.

For an arbitrary input feature map $X$, let $R$ be a horizontal/vertical sampling region, as shown in Figure 3c,d (i.e., line-shaped boxes in blue and green). For each reference location $p$, the output of performing a standard 1D horizontal/vertical convolution on $X$ is defined as a weighted summation of sampled values along the horizontal/vertical direction, which can be mathematically expressed as

$$Y(p) = \sum_{s \in R} w(s) \cdot X(p + s). \tag{1}$$

Here, $Y$ represents the output feature map, $w$ represents the learnable 1D convolutional kernel, and $s$ enumerates the sampling positions in the horizontal/vertical window $R$.

Different from the commonly used 1D convolution defined above, the proposed deformable 1D directional convolution (i.e., D1DD-Conv) is defined by

$$Y(p) = \sum_{s \in R} w(s) \cdot X(p + s + \Delta s). \tag{2}$$

$\Delta s \in R$ denotes the sampling offset that only shifts along both vertical and horizontal directions. As the sampling offset $\Delta s$ is typically fractional, the bilinear interpolation is generally employed to estimate $X(p + s + \Delta s)$ [45,52]. Although this interpolation-based strategy works well, its computational burden is relatively high. To address this issue, we simply constrain the offset $\Delta s$ to be an integer number, which is implemented by the round-up operation. With appropriate values for the offsets, the D1DD-Conv with bidirectional offsets is identical to an oriented convolution, as described in Figure 3c,d. As a result, the proposed D1DD-Conv not only captures the orientation information but also avoids the time-consuming rotation operation. This is the main difference between existing deformable convolutions and ours.
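
Equations (1) and (2) can be sketched at a single reference location as follows. This is a minimal illustration, not the authors' implementation: it assumes a single-channel feature map stored as a list of rows, positions written as (row, column), zero padding outside the map, and the round-up operation read as a ceiling. The function name `d1dd_conv_at` and its arguments are hypothetical.

```python
import math

def d1dd_conv_at(X, p, weights, region, offsets):
    """Evaluate Eq. (2) at one reference location p = (row, col).

    region  : sampling positions s = (s_row, s_col), e.g. a 1x3 horizontal line
    offsets : one fractional offset (d_row, d_col) per sampling position
    """
    H, W = len(X), len(X[0])
    total = 0.0
    for w, (sr, sc), (dr, dc) in zip(weights, region, offsets):
        # Rounding the offsets to integers avoids bilinear interpolation.
        r = p[0] + sr + math.ceil(dr)
        c = p[1] + sc + math.ceil(dc)
        if 0 <= r < H and 0 <= c < W:  # zero padding outside the map (assumption)
            total += w * X[r][c]
    return total

X = [[10 * r + c for c in range(5)] for r in range(5)]
horizontal = [(0, -1), (0, 0), (0, 1)]   # 1x3 sampling region R
weights = [1.0, 2.0, 3.0]

# Zero offsets reduce Eq. (2) to the standard 1D convolution of Eq. (1).
plain = d1dd_conv_at(X, (2, 2), weights, horizontal, [(0.0, 0.0)] * 3)
# Vertical offsets of +1, 0, -1 tilt the line so that it samples along a diagonal.
tilted = d1dd_conv_at(X, (2, 2), weights, horizontal, [(0.4, 0.0), (0.0, 0.0), (-1.0, 0.0)])
print(plain, tilted)  # 134.0 114.0
```

In the tilted case the three taps read `X[3][1]`, `X[2][2]`, and `X[1][3]`, so a horizontal kernel behaves like a 1D kernel rotated by 45 degrees without any kernel rotation.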

3.4. Detection Head

This subsection details the detection head that includes two parts: an oriented RPN module and an oriented detection module; these are depicted in Figure 1. The first module is designed to yield high-quality OBB proposals with low computational complexity, while the second one is introduced for proposal classification and regression.

3.4.1. Oriented RPN Module

Given the feature maps from the orientation-aware FPN, the oriented RPN module is employed to produce a set of oriented proposals, which is implemented in the form of a lightweight fully convolutional network. Specifically, for this module, we employ the same structure and setting as the Oriented R-CNN described in [1] but replace the $3 \times 3$ convolution layer with our proposed tri-branch convolution layer, which aims to improve the capability of oriented feature extraction. For each spatial location within the feature maps, we allocate three anchors of varying aspect ratios, and each of these anchors is represented as a 4D vector, i.e., $a = (x_{a}, y_{a}, w_{a}, h_{a})$. Here, $(x_{a}, y_{a})$ represents the anchor’s center coordinate; $w_{a}$ and $h_{a}$ are the anchor’s width and height. The module is equipped with two branches composed of $1 \times 1$ convolutional layers. One branch aims to regress the offsets of the proposals relative to their corresponding anchors, and the other is for predicting the objectness confidence score of each proposal. Note that the regressed offset is a six-element tuple $(\Delta x_{a}, \Delta y_{a}, \Delta w_{a}, \Delta h_{a}, \Delta \alpha, \Delta \beta)$, where $\Delta \alpha$ and $\Delta \beta$ denote the midpoint offsets relative to the top and right sides of the HBB, as in [1]. We can obtain the oriented proposals by decoding these offsets in the same manner as in the Oriented R-CNN.
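
The decoding step is not restated in this paper, so the sketch below assumes the midpoint-offset convention commonly attributed to Oriented R-CNN [1]: center and size are decoded as in a standard RPN, and the two extra outputs are scaled by the decoded width and height to shift the top and right midpoints. The function name `decode_midpoint_offsets` is hypothetical, image coordinates with the $y$ axis pointing down are assumed, and the result is the raw four-vertex proposal before any rectification.

```python
import math

def decode_midpoint_offsets(anchor, deltas):
    """Decode a six-element regression output into four proposal vertices."""
    xa, ya, wa, ha = anchor
    dx, dy, dw, dh, da, db = deltas
    # Center and size of the enclosing horizontal box (standard RPN-style decoding).
    x, y = xa + dx * wa, ya + dy * ha
    w, h = wa * math.exp(dw), ha * math.exp(dh)
    # Midpoint offsets, scaled by the decoded width and height (assumed convention).
    d_alpha, d_beta = da * w, db * h
    return [
        (x + d_alpha, y - h / 2),  # vertex on the top side
        (x + w / 2, y + d_beta),   # vertex on the right side
        (x - d_alpha, y + h / 2),  # vertex on the bottom side
        (x - w / 2, y - d_beta),   # vertex on the left side
    ]

anchor = (50.0, 50.0, 20.0, 10.0)
print(decode_midpoint_offsets(anchor, (0, 0, 0, 0, 0.0, 0.0)))
print(decode_midpoint_offsets(anchor, (0, 0, 0, 0, 0.25, 0.2)))
```

With zero midpoint offsets the vertices sit at the side midpoints of the anchor; nonzero values slide the top and bottom vertices horizontally and the right and left vertices vertically in opposite directions.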

3.4.2. Oriented Detection Module

This module, consisting of a feature alignment block and a dual-branch sub-network, takes the feature maps from the orientation-aware FPN as well as the oriented proposals provided by the oriented RPN module as inputs, and it outputs the predicted OBBs with their object classes. The feature alignment block first applies the rotated RoI alignment operation [1] to reshape each predicted proposal into an oriented rectangle, further corrects the features by using a tri-branch convolution layer without bidirectional offsets, and then extracts rotation-invariant features from each rectangular region to form a fixed-sized feature vector by average pooling. Finally, each feature vector is fed into a dual-branch sub-network to classify the corresponding object category and also regress the OBBs of oriented objects. Note that each branch of this sub-network is composed of three fully connected (FC) layers, and the two branches share the first two FC layers. We would like to point out that due to its adaptivity of orientation, the proposed D1DD-Conv can effectively extract the object features, resulting in accurate object classification and OBB regression.

3.5. Loss Functions

The proposed oriented object detection model is a multitask one for joint regression and classification. Thus, we utilize a hybrid loss function for training the model, which includes two parts: one loss for training the orientation-aware FPN and the oriented RPN module and the other loss for training the oriented detection module and fine-tuning the orientation-aware FPN.

As a by-product of the D1DD-Conv, the angle of oriented objects can be roughly estimated by the offsets of D1DD-Conv. Therefore, in order to accelerate convergence, we also introduce deep supervision for the orientation-aware FPN and adopt the smooth $L_{1}$ function (i.e., $L_{smooth}$) to calculate the angle loss between the predicted angle $\angle$ and the corresponding ground truth $\angle^{*}$. This loss is calculated as $\frac{1}{M} \sum_{j = 1}^{M} L_{smooth}(\angle_{j}, \angle_{j}^{*})$, where $M$ is the number of primary orientations.

As discussed above, the oriented RPN module is used to produce oriented proposals, which needs to classify all anchors and output the corresponding offsets of the proposal relative to each anchor. Thus, we leverage the widely used cross-entropy loss function (i.e., $L_{ce}$) and the smooth $L_{1}$ function (i.e., $L_{smooth}$) for supervising anchor classification and offset regression, respectively. To be specific, the loss used for the first stage training is defined as follows:

$$L_{1st} = \frac{1}{N} \sum_{i = 1}^{N} L_{ce}(p_{i}^{*}, p_{i}) + \frac{1}{N} \sum_{i = 1}^{N} p_{i}^{*} L_{smooth}(o_{i}, o_{i}^{*}) + \frac{1}{M} \sum_{j = 1}^{M} L_{smooth}(\angle_{j}, \angle_{j}^{*}). \tag{3}$$

Here, $N$ represents the number of samples, $p_{i}$ denotes the classification result of the oriented RPN module, and $p_{i}^{*}$ represents the $i$-th anchor’s binary ground-truth label, i.e., $p_{i}^{*} \in \{0, 1\}$. $p_{i}^{*} = 1$ indicates that the anchor is a positive sample, while $p_{i}^{*} = 0$ corresponds to a negative sample, so that only positive anchors contribute to the regression term. $o_{i}$ denotes the regression result of the oriented RPN module, and $o_{i}^{*}$ is the ground-truth offset, which can be calculated by the anchors’ coordinates and their corresponding ground-truth boxes.
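
The three terms of Equation (3) can be written out as a small sketch. It assumes a binary cross-entropy for the anchor classification term, a smooth $L_{1}$ with threshold 1 summed over the components of each offset vector, and a ground-truth label of 1 for positive anchors; none of these details are fixed by the text, and the function names are hypothetical.

```python
import math

def smooth_l1(pred, target, beta=1.0):
    """Smooth L1 summed over vector components (beta = 1 is an assumed default)."""
    total = 0.0
    for a, b in zip(pred, target):
        d = abs(a - b)
        total += 0.5 * d * d / beta if d < beta else d - 0.5 * beta
    return total

def binary_ce(label, prob, eps=1e-12):
    """Binary cross-entropy between a 0/1 label and a predicted probability."""
    p = min(max(prob, eps), 1.0 - eps)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))

def first_stage_loss(probs, labels, offsets, gt_offsets, angles, gt_angles):
    """Classification + label-gated regression + angle supervision, as in Eq. (3)."""
    n, m = len(probs), len(angles)
    cls = sum(binary_ce(l, p) for l, p in zip(labels, probs)) / n
    # The ground-truth label gates the regression term, so negatives contribute zero.
    reg = sum(l * smooth_l1(o, g) for l, o, g in zip(labels, offsets, gt_offsets)) / n
    ang = sum(smooth_l1([a], [g]) for a, g in zip(angles, gt_angles)) / m
    return cls + reg + ang

loss = first_stage_loss(
    probs=[0.5, 0.5],
    labels=[1, 0],
    offsets=[[0, 0, 0, 0, 0, 2.0], [9.0] * 6],
    gt_offsets=[[0.0] * 6, [0.0] * 6],
    angles=[0.5],
    gt_angles=[0.0],
)
```

In this toy case the second anchor has a large offset error but a label of 0, so it affects only the classification term.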

Similarly, we also utilize the above $L_{ce}$ and $L_{smooth}$ to define the following loss for jointly training the oriented detection module and fine-tuning the orientation-aware FPN, which is formulated as

$$L_{2nd} = L_{ce}(p^{*}, p) + L_{smooth}(t_{i}, t_{i}^{*}). \tag{4}$$

Here, $p = (p_{0}, p_{1}, \dots, p_{K})$ represents the classification score over $K + 1$ categories, and $p^{*}$ is the corresponding ground truth. $t_{i} = (t_{x}^{i}, t_{y}^{i}, t_{w}^{i}, t_{h}^{i}, t_{\theta}^{i})$ defines the regression offsets for the $i$-th object class, and $t_{i}^{*}$ is the ground truth of the regression target. Note that $\theta$ is the rotation angle of OBBs.

4. Results and Discussion

To comprehensively assess the performance of the proposed detection approach based on D1DD-Conv, we performed some experiments on the datasets for the oriented object detection task, and we report the detection accuracy of different detection methods in terms of their mean average precision (mAP) results.

4.1. Experimental Setup

4.1.1. Datasets

All the experiments were performed on three popular oriented object detection datasets: DOTA-v1.0 [53], HRSC2016 [54], and UCAS-AOD [55]. DOTA-v1.0 is a large-scale aerial image dataset for detecting oriented objects, which has 2806 images and 188,282 instances of 15 object categories. HRSC2016 is a commonly used and challenging ship detection dataset, which consists of 1061 aerial images sourced from Google Earth and has more than 20 classes of arbitrary-oriented ships in various appearances. UCAS-AOD is composed of 1510 images of cars and airplanes with a size of $659 \times 1280$. For the DOTA-v1.0 dataset, the image resolutions vary within the range of $800 \times 800$ to $4000 \times 4000$, while the image size of the HRSC2016 dataset ranges from $300 \times 300$ to $1500 \times 900$.

4.1.2. Implementation Details

We employed ResNet [50] as our backbone architecture, which was first pre-trained on ImageNet [56] and subsequently fine-tuned on the downstream aerial image datasets. We implemented our proposed model based upon MMRotate [57] and trained it using the SGD optimizer with an initial learning rate of 0.005 and a batch size of 2. The momentum and weight decay were set to 0.9 and 0.0001, respectively. For fair comparisons, we also employed MMRotate to implement the baseline models with the default parameter settings suggested in the original works. Both training and inference were conducted on the same computer equipped with a Core i9-13900K 3.0-GHz CPU, 128-GB RAM, and an RTX 3090 GPU.
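
For orientation, the stated hyperparameters would look roughly as follows in a Python-style configuration fragment. The key names follow the dictionary-based config style used by MMDetection-2.x-era toolboxes, which is an assumption about the MMRotate version; the authors' actual configuration file is not given, and only the numeric values come from the text.

```python
# Hypothetical config fragment; only the numeric values are taken from the paper.
optimizer = dict(
    type="SGD",
    lr=0.005,             # initial learning rate
    momentum=0.9,
    weight_decay=0.0001,
)
# Batch size of 2 on a single GPU (key name assumed from MMDetection-2.x-style configs).
data = dict(samples_per_gpu=2)
```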

4.1.3. Performance Evaluation Metrics

Following [1], for the DOTA-v1.0 and UCAS-AOD datasets, we adopted the commonly used mean average precision (mAP) as the metric to assess the detection performance of different competing approaches. The PASCAL VOC2007 (mAP₂₀₀₇) and PASCAL VOC2012 (mAP₂₀₁₂) metrics were used for performance evaluation on the HRSC2016 dataset.
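
The two PASCAL VOC variants differ in how the per-class average precision (AP) is computed from a precision-recall curve: the VOC2007 protocol averages the interpolated precision at 11 equally spaced recall levels, while the later protocol integrates the area under the monotonically decreasing precision envelope. The sketch below assumes the recall and precision lists are already computed from ranked detections and sorted by increasing recall; the function names are hypothetical, and mAP is the mean of the per-class AP values.

```python
def ap_voc2007(recalls, precisions):
    """11-point interpolated AP (recall thresholds 0.0, 0.1, ..., 1.0)."""
    ap = 0.0
    for k in range(11):
        t = k / 10
        # Best precision achieved at any recall of at least t, or 0 if none.
        candidates = [p for r, p in zip(recalls, precisions) if r >= t]
        ap += max(candidates, default=0.0) / 11
    return ap

def ap_voc2012(recalls, precisions):
    """Area under the monotonically decreasing precision envelope."""
    r = [0.0] + list(recalls) + [1.0]
    p = [0.0] + list(precisions) + [0.0]
    # Make precision non-increasing from right to left.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    return sum((r[i + 1] - r[i]) * p[i + 1] for i in range(len(r) - 1))

recalls, precisions = [0.5, 1.0], [1.0, 0.5]
print(round(ap_voc2007(recalls, precisions), 4))  # 0.7727
print(ap_voc2012(recalls, precisions))            # 0.75
```

The same curve yields different values under the two protocols, which is why both numbers are usually reported side by side for HRSC2016.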

4.2. Ablation Study

To validate the effectiveness of the components used in our detection model, we performed several experiments on DOTA-v1.0 and HRSC2016. We utilized the ResNet-101 based FPN model as the backbone and adopted the Oriented R-CNN model as the baseline. The three main components, including the orientation-aware FPN, tri-branch convolution layer, and deep supervision loss function, were progressively added to the baseline, verifying the effects of these components on the detection quality.

Table 1 shows the quantitative results of the baseline and the proposed model with different settings. It can be seen that when the detection model was equipped with only the orientation-aware FPN, it obtained consistent improvements, with a 0.67 gain in terms of mAP on DOTA-v1.0, as well as a 0.04 gain in mAP₂₀₀₇ and a 0.06 gain in mAP₂₀₁₂ on HRSC2016. A similar improvement was also observed on the UCAS-AOD dataset. These gains indicate that integrating our D1DD-Conv into the FPN (i.e., the orientation-aware FPN) is beneficial for capturing the orientation characteristics of various oriented objects. From the results listed in the third row of Table 1, we can also observe that when the tri-branch convolution layer was utilized in the oriented RPN module, the detection quality was further improved on all the datasets, especially on DOTA-v1.0. Meanwhile, from the last row of Table 1, it is found that applying the deep supervision loss to the training of the orientation-aware FPN achieved a 0.44 improvement in mAP on DOTA-v1.0 compared with the model without the deep supervision loss. This means that the additional angle loss (i.e., the third term of the $L_{1st}$ loss) is helpful for improving the model’s detection performance. Figure 4 shows the visual results of the proposed model with different settings on a Tennis Court image, in which the orientation-aware FPN, tri-branch convolution layer, and deep supervision loss were progressively added to the baseline. From this figure, we can see that, as the components were progressively adopted, the oriented objects located at the upper-right corner of the image were detected with more precise OBBs and more accurate objectness confidence scores.

In addition, our model, as an improved variant of Oriented R-CNN, slightly increases the parameter count compared to the original version, which mainly comes from the use of D1DD-Conv in the tri-branch convolution layer. Concretely, the parameter counts of our model and Oriented R-CNN are 61.3 M and 61.1 M, respectively. It is worth noting that this marginal increase in parameters had a negligible impact on the inference speed, with our model nearly matching that of Oriented R-CNN (26.3 ms vs. 26.2 ms with $800 \times 800$ input).

4.3. Experimental Results

In this subsection, we compare our proposed model with four popular detection models based on ResNet, including Oriented R-CNN [1], Gliding Vertex [18], RoI Transformer [10], and ReDet [58]. For a fair comparison, all four competing methods were trained and tested in the single-scale setting.

4.3.1. Comparisons on DOTA-v1.0 Dataset

Table 2 tabulates the quantitative comparison results of various detection approaches on different object classes of DOTA-v1.0. Note that the effectiveness of the detection approaches was assessed using the average precision (AP) of each category and the mean average precision (mAP). From this table, we can observe that the proposed model outperformed all competing detection approaches and achieved the best overall detection performance. To be specific, except for four object classes (i.e., PL, GTF, LV, and RA), our model based on D1DD-Conv produced the best precision on all the other object classes. For the PL and RA object classes, Gliding Vertex is better than the others, and our model takes second place, remaining superior to the recent SOTA Oriented R-CNN. Overall, our model obtained a 78.02% mAP, a gain of 1.76 points (a relative improvement of 2.31%) over the second-best model, Oriented R-CNN (76.26%).
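The relation between the per-class AP values and the reported mAP can be checked directly from Table 2. The short sketch below averages the 15 per-class AP values of the proposed model and compares the result with the reported mAP of Oriented R-CNN; the variable names are illustrative, and the numbers are copied from the table rather than recomputed from detections.

```python
# Per-class AP (%) of the proposed model, copied from Table 2.
ours_ap = {
    "PL": 88.89, "BD": 85.27, "BR": 55.88, "GTF": 76.51, "SV": 78.67,
    "LV": 83.22, "SH": 87.99, "TC": 90.87, "BC": 87.69, "ST": 86.85,
    "SBF": 66.23, "RA": 67.11, "HA": 76.12, "SP": 74.29, "HC": 64.71,
}
oriented_rcnn_map = 76.26  # reported mAP (%) of the second-best model

# mAP is the unweighted mean of the per-class AP values.
ours_map = sum(ours_ap.values()) / len(ours_ap)
absolute_gain = ours_map - oriented_rcnn_map
relative_gain = 100.0 * absolute_gain / oriented_rcnn_map

print(f"mAP = {ours_map:.2f}")                       # mAP = 78.02
print(f"absolute gain = {absolute_gain:.2f} points")  # absolute gain = 1.76 points
print(f"relative gain = {relative_gain:.2f}%")        # relative gain = 2.31%
```

The absolute margin is 1.76 mAP points, while the 2.31% figure corresponds to the gain relative to the Oriented R-CNN score.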

For visual comparisons, Figure 5 exhibits several visual results of different detection approaches on the DOTA-v1.0. From these visual results, we can see that our model produced more accurate OBBs for various oriented objects, surpassing all competing methods. For example, although all the detection methods could locate all five basketball courts in the Basketball Court image (see the first row of Figure 5), our model could predict the orientation of each basketball court more accurately. Similarly, for the Harbor and Large Vehicle images that contain many oriented objects with extreme aspect ratios, the proposed model located these objects with accurate orientation information, which means that the model can estimate the rotation angles of oriented objects well. In addition, from the visual results on the Plane and Ship images (i.e., the fifth and sixth rows of Figure 5), we also observe that our model is significantly superior to the Oriented R-CNN that achieved the second place in the quantitative results. Oriented R-CNN missed several salient planes in the Plane image and produced inaccurate OBBs for some ships in the Ship image. Figure 6 shows more visual results of various oriented objects by the proposed method, which can locate most of the oriented objects with accurate OBBs and direction information.

4.3.2. Comparisons on HRSC2016 Dataset

For the HRSC2016 dataset, following [1], we adopted the commonly used PASCAL VOC2007 (mAP₂₀₀₇) and PASCAL VOC2012 (mAP₂₀₁₂) metrics to evaluate the detection quality of all methods. Table 3 provides the quantitative comparison results of the competing detection methods. Again, our proposed model obtained the best detection quality in terms of both metrics. Several visual detection results on this dataset are given in Figure 7. From this figure, we can observe that, compared with the competitors, our approach precisely locates each oriented object with an accurate OBB. Meanwhile, although Oriented R-CNN has a higher mAP₂₀₀₇ than ReDet, its visual results are slightly worse than those of ReDet. Several obvious oriented objects are missed by Oriented R-CNN, but our method can accurately detect them.

4.3.3. Comparisons on UCAS-AOD Dataset

Table 4 reports the quantitative performance of different oriented object detectors on UCAS-AOD. Our proposed model consistently outperformed the competing methods on both object classes, achieving the best quantitative performance. It surpassed Oriented R-CNN (the second-best model) by 0.24 mAP and RoI Transformer (a popular baseline model) by 1.32 mAP.

5. Conclusions

In this work, we first introduced D1DD-Conv for adaptively capturing the orientation information of oriented objects. Different from existing detection approaches that often depend upon time-consuming rotation and interpolation operations to align feature representations, D1DD-Conv implements a rotated 1D convolution by allowing offsets only along the horizontal and vertical directions. Built upon D1DD-Conv, we then proposed the orientation-aware FPN and the tri-branch convolution layer for accurately predicting the OBB of each oriented object. Furthermore, we employed the offsets of D1DD-Conv to roughly estimate the angles of oriented objects and adopted them in an additional deep supervision loss function to train the detection model. The experimental results on three popular datasets demonstrate that our proposed model exhibits superior detection performance. In future work, we will consider applying low-rank approximation techniques [59,60] to improve the model’s generalization capability.
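The idea that horizontal and vertical offsets alone can stand in for a rotation can be illustrated geometrically. The sketch below is one possible reading and not the formulation of Section 3.3: it assumes a horizontal 1 × k kernel centred at the origin and a known angle `theta`, computes the (dx, dy) offsets that place each tap on the rotated line, and recovers a rough angle from those offsets. The function names are hypothetical, and in a deformable convolution the offsets are predicted from the features rather than computed from a given angle.

```python
import math


def rotated_1d_offsets(kernel_size, theta):
    """Offsets (dx, dy) that move the taps of a horizontal 1 x k kernel
    onto a line through the centre at angle theta (radians)."""
    r = kernel_size // 2
    offsets = []
    for i in range(-r, r + 1):
        # Regular tap sits at (i, 0); the rotated tap sits at
        # (i * cos(theta), i * sin(theta)).
        dx = i * (math.cos(theta) - 1.0)
        dy = i * math.sin(theta)
        offsets.append((dx, dy))
    return offsets


def estimate_angle(kernel_size, offsets):
    """Recover a rough angle from the offsets of the outermost tap."""
    r = kernel_size // 2
    dx, dy = offsets[-1]
    return math.atan2(dy, r + dx)


offsets = rotated_1d_offsets(5, math.pi / 6)
print(round(math.degrees(estimate_angle(5, offsets)), 1))
```

Under these assumptions no feature map is rotated: only the sampling positions move, and the angle is read back from the same offsets, which is the kind of signal an angle-based supervision term could use.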

Author Contributions

Conceptualization, Y.L. and X.L.; methodology, Y.L.; software, Y.L.; validation, Y.L. and X.L.; formal analysis, C.Z.; investigation, X.L.; resources, X.L.; data curation, Y.L.; writing—original draft preparation, Y.L.; writing—review and editing, X.L. and C.Z.; visualization, Y.L.; supervision, Y.L. and C.Z.; project administration, Y.L.; funding acquisition, C.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (NSFC) Joint Fund with Zhejiang Integration of Informatization and Industrialization under Key Project, grant numbers U22A2033 and U24A20219, and Shandong Provincial Natural Science Foundation, grant number ZR2025LQX006.

Data Availability Statement

The DOTA-v1.0 dataset is available at https://captain-whu.github.io/DOTA (accessed on 10 March 2026). The HRSC2016 dataset is available at https://ieee-dataport.org/documents/hrsc2016 (accessed on 10 March 2026). The UCAS-AOD Dataset is available at https://github.com/Lbx2020/UCAS-AOD-dataset (accessed on 10 March 2026).

Conflicts of Interest

The authors declare no conflicts of interest.

References

[1] Xie, X.; Cheng, G.; Wang, J.; Li, K.; Yao, X.; Han, J. Oriented R-CNN and beyond. Int. J. Comput. Vis. 2024, 132, 2420–2442.
[2] Wang, X.; Han, C.; Huang, L.; Nie, T.; Liu, X.; Liu, H.; Li, M. AG-Yolo: Attention-guided yolo for efficient remote sensing oriented object detection. Remote Sens. 2025, 17, 1027.
[3] Zhang, C.; Chen, Z.; Xiong, B.; Ji, K.; Kuang, G. EOOD: End-to-end oriented object detection. Neurocomputing 2025, 621, 129251.
[4] Ahmad, I.; Lu, W.; Chen, S.-B.; Tang, J.; Luo, B. Lightweight oriented object detection with dynamic smooth feature fusion network. Neurocomputing 2025, 628, 129725.
[5] Pu, Y.; Wang, Y.; Xia, Z.; Han, Y.; Wang, Y.; Gan, W.; Wang, Z.; Song, S.; Huang, G. Adaptive rotated convolution for rotated object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 6589–6600.
[6] Xu, H.; Liu, X.; Xu, H.; Ma, Y.; Zhu, Z.; Yan, C.; Dai, F. Rethinking boundary discontinuity problem for oriented object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 17406–17415.
[7] Ma, J.; Shao, W.; Ye, H.; Wang, L.; Wang, H.; Zheng, Y. Arbitrary-oriented scene text detection via rotation proposals. IEEE Trans. Multimed. 2018, 20, 3111–3122.
[8] Pan, X.; Ren, Y.; Sheng, K.; Dong, W.; Yuan, H.; Guo, X.; Ma, C.; Xu, C. Dynamic refinement network for oriented and densely packed object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11207–11216.
[9] Yang, X.; Yan, J. On the arbitrary-oriented object detection: Classification based approaches revisited. Int. J. Comput. Vis. 2022, 130, 1340–1365.
[10] Ding, J.; Xue, N.; Long, Y.; Xia, G.-S.; Lu, Q. Learning RoI transformer for oriented object detection in aerial images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 2849–2858.
[11] Yu, Y.; Da, F. On boundary discontinuity in angle regression based arbitrary oriented object detection. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 6494–6508.
[12] Wei, L.; Zheng, C.; Hu, Y. Oriented object detection in aerial images based on the scaled smooth l₁ loss function. Remote Sens. 2023, 15, 1350.
[13] Chen, Z.; Chen, K.; Lin, W.; See, J.; Yu, H.; Ke, Y.; Yang, C. PIoU loss: Towards accurate oriented object detection in complex environments. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 195–211.
[14] Yang, X.; Yang, J.; Yan, J.; Zhang, Y.; Zhang, T.; Guo, Z.; Sun, X.; Fu, K. SCRDet: Towards more robust detection for small, cluttered and rotated objects. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 8232–8241.
[15] Yang, X.; Yan, J.; Ming, Q.; Wang, W.; Zhang, X.; Tian, Q. Rethinking rotated object detection with Gaussian Wasserstein distance loss. In Proceedings of the 38th International Conference on Machine Learning, Virtual, 18–24 July 2021; pp. 11830–11841.
[16] Yang, X.; Yang, X.; Yang, J.; Ming, Q.; Wang, W.; Tian, Q.; Yan, J. Learning high-precision bounding box for rotated object detection via Kullback-Leibler divergence. In Proceedings of the 35th Conference on Neural Information Processing Systems, Virtual, 6–14 December 2021; pp. 18381–18394.
[17] Yang, X.; Zhou, Y.; Zhang, G.; Yang, J.; Wang, W.; Yan, J.; Zhang, X.; Tian, Q. The KFIoU loss for rotated object detection. In Proceedings of the International Conference on Learning Representations, Kigali, Rwanda, 1–5 May 2023.
[18] Xu, Y.; Fu, M.; Wang, Q.; Wang, Y.; Chen, K.; Xia, G.-S.; Bai, X. Gliding vertex on the horizontal bounding box for multi-oriented object detection. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 43, 1452–1459.
[19] Nie, G.; Huang, H. Multi-oriented object detection in aerial images with double horizontal rectangles. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 4932–4944.
[20] Guo, Z.; Zhang, X.; Liu, C.; Ji, X.; Jiao, J.; Ye, Q. Convex-hull feature adaptation for oriented and densely packed object detection. IEEE Trans. Circuits Syst. Video Technol. 2022, 32, 5252–5265.
[21] Zhang, M.; Ouyang, Y.; Yang, M.; Guo, J.; Li, Y. ORPSD: Outer rectangular projection-based representation for oriented ship detection in SAR images. Remote Sens. 2025, 17, 1511.
[22] Zhou, S.; Liu, Z.; Luo, H.; Qi, G.; Liu, Y.; Zuo, H.; Zhang, J.; Wei, Y. GCA2Net: Global-consolidation and angle-adaptive network for oriented object detection in aerial imagery. Remote Sens. 2025, 17, 1077.
[23] Dang, M.; Xu, Q.; Liu, G.; Li, H.; Wang, X. RA²Net: Rotated alignment and aggregation network for oriented object detection in aerial images. Neurocomputing 2026, 659, 131798.
[24] Han, J.; Ding, J.; Li, J.; Xia, G.-S. Align deep features for oriented object detection. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5602511.
[25] Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788.
[26] Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149.
[27] Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 213–229.
[28] Sun, H.; Liu, T.; Zhang, X.; Guo, Q. Resolution-aware criss-cross attention detector for small object detection in aerial images. In Proceedings of the International Conference on Multimedia Retrieval, Chicago, IL, USA, 30 June–3 July 2025; pp. 1219–1227.
[29] Hui, S.; Guo, Q.; Geng, X.; Zhang, C. Multi-guidance CNNs for salient object detection. ACM Trans. Multimed. Comput. Commun. Appl. 2023, 19, 117.
[30] Zhao, M.; Guo, Q. Reconstruction-based distillation for anomaly detection. Comput. Graph. 2025, 132, 104328.
[31] Yang, X.; Yan, J.; Feng, Z.; He, T. R³Det: Refined single-stage detector with feature refinement for rotating object. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 2–9 February 2021; pp. 3163–3171.
[32] Cheng, G.; Wang, J.; Li, K.; Xie, X.; Lang, C.; Yao, Y.; Han, J. Anchor-free oriented proposal generator for object detection. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5625411.
[33] Zhao, J.; Ding, Z.; Zhou, Y.; Zhu, H.; Du, W.-L.; Yao, R. OrientedFormer: An end-to-end transformer-based oriented object detector in remote sensing images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5640816.
[34] Zhao, J.; Ding, Z.; Zhou, Y.; Zhu, H.; Du, W.-L.; Yao, R. ReDiffDet: Rotation-equivariant diffusion model for oriented object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 10–17 June 2025; pp. 24429–24439.
[35] Ou, Z.; Chen, Z.; Shen, S.; Fan, L.; Yao, S.; Song, M.; Hui, P. Free³Net: Gliding free, orientation free, and anchor free network for oriented object detection. IEEE Trans. Multimed. 2023, 25, 7089–7100.
[36] Wang, L.; Zhan, Y.; Liu, W.; Yu, B.; Tao, D. Bounding box vectorization for oriented object detection with tanimoto coefficient regression. IEEE Trans. Multimed. 2024, 26, 5181–5193.
[37] Dang, M.; Liu, G.; Li, H.; Wang, D.; Pan, R. PRA-Det: Anchor-free oriented object detection with polar radius representation. IEEE Trans. Multimed. 2025, 27, 145–157.
[38] Zhou, K.; Zhang, M.; Dong, Y.; Tan, J.; Zhao, S.; Wang, H. Vector decomposition-based arbitrary-oriented object detection for optical remote sensing images. Remote Sens. 2023, 15, 4738.
[39] Xu, C.; Ding, J.; Wang, J.; Yang, W.; Yu, H.; Yu, L.; Xia, G.-S. Dynamic coarse-to-fine learning for oriented tiny object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 7318–7328.
[40] Yuan, X.; Zheng, Z.; Li, Y.; Liu, X.; Liu, L.; Li, X.; Hou, Q.; Cheng, M.-M. Strip R-CNN: Large strip convolution for remote sensing object detection. arXiv 2025, arXiv:2501.03775.
[41] Chen, C.; Ling, Q. Adaptive convolution for object detection. IEEE Trans. Multimed. 2019, 21, 3205–3217.
[42] Freeman, W.T.; Adelson, E.H. The design and use of steerable filters. IEEE Trans. Pattern Anal. Mach. Intell. 1991, 13, 891–906.
[43] Weiler, M.; Hamprecht, F.A.; Storath, M. Learning steerable filters for rotation equivariant CNNs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 849–858.
[44] Worrall, D.E.; Garbin, S.J.; Turmukhambetov, D.; Brostow, G.J. Harmonic networks: Deep translation and rotation equivariance. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 5028–5037.
[45] Dai, J.; Qi, H.; Xiong, Y.; Li, Y.; Zhang, G.; Hu, H.; Wei, Y. Deformable convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 764–773.
[46] Jeon, Y.; Kim, J. Active convolution: Learning the shape of convolution for image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4201–4209.
[47] Kirchmeyer, A.; Deng, J. Convolutional networks with oriented 1d kernels. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 6222–6232.
[48] Dang, M.; Liu, G.; Kong, A.W.-K.; Zheng, Z.; Luo, N. RO²-DETR: Rotation-equivariant oriented object detection transformer with 1D rotated convolution kernel. ISPRS J. Photogramm. Remote Sens. 2025, 228, 166–178.
[49] Lin, T.-Y.; Dollár, P.; Girshick, R. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125.
[50] He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
[51] Ding, X.; Guo, Y.; Ding, G.; Han, J. ACNet: Strengthening the kernel skeletons for powerful CNN via asymmetric convolution blocks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1911–1920.
[52] Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable DETR: Deformable transformers for end-to-end object detection. In Proceedings of the International Conference on Learning Representations, Virtual, 3–7 May 2021.
[53] Xia, G.-S.; Bai, X.; Ding, J.; Zhu, Z.; Belongie, S.; Luo, J.; Datcu, M.; Pelillo, M.; Zhang, L. DOTA: A large-scale dataset for object detection in aerial images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 3974–3983.
[54] Liu, Z.; Wang, H.; Weng, L.; Yang, Y. Ship rotated bounding box space for ship extraction from high-resolution optical satellite images with complex backgrounds. IEEE Geosci. Remote Sens. Lett. 2016, 13, 1074–1078.
[55] Zhu, H.; Chen, X.; Dai, W.; Fu, K.; Ye, Q.; Jiao, J. Orientation robust object detection in aerial images using deep convolutional neural network. In Proceedings of the IEEE International Conference on Image Processing, Quebec City, QC, Canada, 27–30 September 2015; pp. 3735–3739.
[56] Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255.
[57] Zhou, Y.; Yang, X.; Zhang, G.; Wang, J.; Liu, Y.; Hou, L.; Jiang, X.; Liu, X.; Yan, J.; Lyu, C.; et al. MMRotate: A rotated object detection benchmark using PyTorch. In Proceedings of the ACM International Conference on Multimedia, Lisboa, Portugal, 10–14 October 2022; pp. 7331–7334.
[58] Han, J.; Ding, J.; Xue, N.; Xia, G.-S. ReDet: A rotation-equivariant detector for aerial object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 2786–2795.
[59] Guo, Q.; Zhang, Y.; Qiu, S.; Zhang, C. Accelerating patch-based low-rank image restoration using kd-forest and Lanczos approximation. Inf. Sci. 2021, 556, 177–193.
[60] Pu, X.; Xu, F. Low-rank adaption on transformer-based oriented object detector for satellite onboard processing of remote sensing images. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5202213.

Figure 1. Overall structure of the proposed detection model. Orientation-aware FPN is a modified version of the classical FPN model that replaces the standard convolution layer with a tri-branch convolution layer that is equipped with the D1DD-Conv with bidirectional offsets, which are employed to extract multi-scale oriented features. Oriented RPN module used in the detection head produces a set of oriented proposals, and oriented detection module produces the probability of object classes and the coordinates of OBBs.

Figure 2. Structure of tri-branch convolution layer.

Figure 3. Illustration of the sampling positions and offsets in different convolutions. Green points and blue points represent regular sampling locations and deformed sampling locations, respectively. Arrows denote the offsets and their directions.

Figure 4. Ablation study of our method on the Tennis Court class of the DOTA-v1.0 dataset. From left to right: orientation-aware FPN, tri-branch convolution layer, and deep supervision loss are progressively added.

Figure 5. Visual detection results on DOTA-v1.0 by different methods. The primary object class from top to bottom: Basketball Court, Harbor, Helicopter, Large Vehicle, Plane, and Ship.

Figure 6. More detection results on various classes of the DOTA-v1.0 by the proposed method.

Figure 7. Visual detection results on HRSC2016 by different methods.

Table 1. Effectiveness study of different components in our model.

Method	Backbone	Orientation-Aware FPN	Tri-Branch Convolution Layer	Deep Supervision Loss	DOTA-v1.0 mAP	UCAS-AOD mAP	HRSC2016 mAP₂₀₀₇	HRSC2016 mAP₂₀₁₂
Oriented R-CNN	ResNet-101	✗	✗	✗	76.26	90.11	90.50	97.60
Ours	ResNet-101	✓	✗	✗	76.93	90.24	90.54	97.66
Ours	ResNet-101	✓	✓	✗	77.58	90.29	96.56	97.71
Ours	ResNet-101	✓	✓	✓	78.02	90.35	90.57	97.74

Table 2. Quantitative comparisons with competing oriented detection approaches on DOTA-v1.0.

Method	Backbone	PL	BD	BR	GTF	SV	LV	SH	TC	BC	ST	SBF	RA	HA	SP	HC	mAP
RoI Transformer	ResNet-101	88.57	78.21	43.51	75.77	68.64	73.56	83.31	90.66	77.00	81.29	58.31	53.40	62.65	58.71	47.68	69.42
Gliding Vertex	ResNet-101	89.61	84.87	52.06	77.13	72.81	73.07	86.65	90.59	78.82	86.63	59.41	70.68	72.76	70.64	57.18	74.86
ReDet	ReRNet-50	88.68	82.61	53.85	74.02	78.06	84.00	87.97	90.74	87.69	85.57	61.77	60.53	75.99	68.01	63.31	76.19
Oriented R-CNN	ResNet-101	88.81	83.46	55.25	76.81	74.19	82.06	87.50	90.87	85.43	85.35	65.50	66.79	74.37	70.16	57.30	76.26
Ours	ResNet-101	88.89	85.27	55.88	76.51	78.67	83.22	87.99	90.87	87.69	86.85	66.23	67.11	76.12	74.29	64.71	78.02

Here BD, BR, BC, GTF, HC, HA, LV, PL, RA, SH, SP, SBF, ST, SV, and TC are the abbreviations of the following object classes: Baseball Diamond, Bridge, Basketball Court, Ground Track Field, Helicopter, Harbor, Large Vehicle, Plane, Roundabout, Ship, Swimming Pool, Soccer-ball Field, Storage Tank, Small Vehicle, and Tennis Court, respectively. Note that we highlight the best results in bold.

Table 3. Quantitative comparisons with competing oriented detection methods on HRSC2016.

Method	Backbone	mAP₂₀₀₇	mAP₂₀₁₂
RoI Transformer	ResNet-101	86.00	–
Gliding Vertex	ResNet-101	88.20	–
ReDet	ReRNet-50	90.46	97.63
Oriented R-CNN	ResNet-101	90.50	97.60
Ours	ResNet-101	90.57	97.74

The best results are highlighted in bold.

Table 4. Quantitative comparisons with competing oriented detection methods on UCAS-AOD.

Method	Backbone	Airplane	Car	mAP
RoI Transformer	ResNet-101	90.01	88.03	89.03
Gliding Vertex	ResNet-101	90.15	89.42	89.97
ReDet	ReRNet-50	90.24	89.73	90.02
Oriented R-CNN	ResNet-101	90.32	89.90	90.11
Ours	ResNet-101	90.47	90.22	90.35

The best results are highlighted in bold.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
