BIF-RCNN: Fusing Background Information for Rotated Object Detection

Zhao, Jianbin; Xu, Xing; Wang, Shaoying; Zhang, Pengfei; Shen, Shengyi; Zeng, Hui; Bu, Xiangshuai; Shen, Yiran; Xue, Kaiwen; Zong, Ping; Zhang, Guoxin; Ou, Zhonghong; Song, Meina; Zhu, Yifan

doi:10.3390/a19020139

Open AccessArticle

BIF-RCNN: Fusing Background Information for Rotated Object Detection

by

Jianbin Zhao

¹,

Xing Xu

¹

,

Shaoying Wang

¹,

Pengfei Zhang

¹,

Shengyi Shen

²,

Hui Zeng

²,

Xiangshuai Bu

²,

Yiran Shen

²,

Kaiwen Xue

²,

Ping Zong

²,

Guoxin Zhang

²

,

Zhonghong Ou

^3,*

,

Meina Song

²

and

Yifan Zhu

²

¹

State Grid Hebei Information and Telecommunication Branch, Shijiazhuang 050011, China

²

School of Computer Science (National Pilot Software Engineering School), Beijing University of Posts and Telecommunications, Beijing 100876, China

³

State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, Beijing 100876, China

^*

Author to whom correspondence should be addressed.

Algorithms 2026, 19(2), 139; https://doi.org/10.3390/a19020139

Submission received: 5 January 2026 / Revised: 2 February 2026 / Accepted: 4 February 2026 / Published: 9 February 2026

(This article belongs to the Section Algorithms and Mathematical Models for Computer-Assisted Diagnostic Systems)

Download

Browse Figures

Versions Notes

Abstract

Rotated object detection aims to achieve precise localization by strictly aligning bounding boxes with object orientations, thereby minimizing background interference. Existing methods predominantly focus on extracting intra-object features within rotated bounding boxes. However, these approaches often overlook the discriminative contextual information from the surrounding background, leading to classification ambiguity when internal features are indistinguishable. To address this limitation, we propose Background Information Fusion R-CNN (BIF-RCNN), a novel rotated object detection framework that strategically re-integrates the background context from the object’s horizontal enclosing region to validate its category, turning previously discarded “noise” into auxiliary discriminative cues. Specifically, we introduce a dual-level rotation-horizontal feature fusion module (DFM), which leverages horizontal bounding boxes enclosing the rotated objects to extract contextual background features. These features are then adaptively fused with the internal object features to enhance the overall representation capability of the model. In addition, we design a Prediction Difference and Entropy-Constrained Loss (PDE Loss), which guides the model to focus on hard-to-classify samples that are prone to confusion due to similar feature representations. This loss function improves the model’s robustness and discriminative power. Extensive experiments conducted on the DOTA benchmark dataset demonstrate the effectiveness of the proposed method. Notably, our approach achieves up to a 4.02% AP improvement in single-category detection performance compared to a strong baseline, highlighting its superiority in rotated object detection tasks.

Keywords:

background information; joint loss; remote sensing; rotated object detection

1. Introduction

Object detection methods annotate objects using horizontal bounding boxes, and feature extraction backbones built on standard convolutions can effectively capture axis-aligned object representations [1,2,3,4,5,6]. However, in real-world scenarios, objects are not always arranged horizontally; instead, they often exhibit rotations of varying scales and angles [7,8]. Because conventional feature extractors rely on horizontally symmetric convolutional operations, they struggle to accommodate the diverse orientation changes of rotated objects. This mismatch leads to poor alignment between extracted features and the true object geometry, thereby limiting the accuracy of category classification and localization in horizontal object detection. To address these issues, rotated object detection has emerged as a distinct research direction, attracting extensive attention and investigation in the academic community.

A lot of work [9] has been devoted to rotated object detection, and one of the most actively studied issues is the discrepancy between features extracted using axis-aligned convolutions and the true characteristics of rotated objects. Existing studies [9] largely focus on extracting intra-object features. For example, ROI Transformer [10] and Gliding Vertex [11] estimate more suitable localization priors for rotated targets through different strategies. Building on RetinaNet [12], S2ANet [13] introduces a rotation-aligned detection framework that explicitly addresses the misalignment between features and oriented objects. ARS-DETR [14] designs a rotated deformable attention mechanism that adaptively aligns region features with oriented objects, effectively reducing feature misalignment in rotated object detection. LSKNet [15] incorporates large selective kernels and dynamically adjusts receptive-field sizes, enabling the network to accommodate varying contextual ranges required by different object categories. ReDet [16] learns rotation-invariant features and improves detection accuracy. Overall, these approaches alleviate performance degradation caused by feature misalignment to varying degrees.

However, while mainstream studies [10,11,13,14,15,16,17] predominantly focus on maximizing the alignment between extracted features and the target’s geometry, we note an aspect that has been largely overlooked: the background surrounding a rotated object may contain latent yet crucial cues that can assist detection. This insight is motivated by our qualitative analysis. We observe that for targets with weak discriminative appearances, relying exclusively on intra-object features can limit classification performance, whereas contextual cues from the surrounding background may provide complementary evidence. As shown in Figure 1, when only the surface appearance of the bridge deck or the roadway is considered, whether by human visual inspection or by feature extraction on the corresponding feature maps, the two categories can be easily confused due to their highly similar intra-object characteristics. In contrast, once the surrounding background is taken into account, contextual cues such as the ocean and land are incorporated. Because these background types exhibit pronounced visual differences, they provide strong complementary evidence and can substantially facilitate the correct identification of the target’s true class.

In addition, refs. [18,19] report that a considerable portion of samples located outside the ground-truth bounding boxes can still exhibit strong regression performance. As shown in Figure 2, approximately 28% of the predicted boxes with an Intersection over Union (IoU) ≥ 0.5 originate from regions outside the annotated targets. This observation suggests that features beyond the object boundary have not been effectively utilized, even though they may contain key contextual cues related to the object itself. Consequently, how to effectively mine and integrate background information has become a critical open problem in rotated object detection.

To address this challenge, we propose a novel rotated detection framework that incorporates background feature fusion and joint loss optimization. Based on a standard two-stage rotated object detection framework, we further propose two essential components to enhance detection accuracy. One is the Dual-Level Rotated-Horizontal Feature Fusion Module. Specifically, given the rotated proposals generated by the Region Proposal Network (RPN), we compute the ratio between the area of each rotated Region of Interest (RoI) and its corresponding horizontal bounding box as a quantized representation of object orientation. This orientation descriptor is then fed into a multi-layer perceptron (MLP) to produce conditional fusion weights, which are used to adaptively integrate rotated features and horizontal features. We observe that rotated features mainly capture object-aligned geometric information, while horizontal features preserve richer background context. Their adaptive fusion leads to more discriminative representations, especially for densely packed or arbitrarily oriented objects. The other is the Joint Loss Optimization Based on Prediction Difference and Entropy Constraints. Compared with conventional classification losses, our joint loss incorporates both prediction difference and entropy-based constraints, which reduces uncertainty in challenging samples.

To summarize, our contributions are as follows:

We propose a novel framework, BIF-RCNN, which explicitly incorporates background context cues into the overall detection architecture.
We introduce a Dual-Level Rotated-Horizontal Feature Fusion Module (DFM) that explicitly couples the features of an oriented proposal with that of its tightest horizontal bounding box, allowing background context outside the rotated region to be distilled into the object representation and yielding a richer, context-aware embedding that markedly boosts localization and classification accuracy in cluttered scenes.
We formulate a joint optimization loss that couples prediction discrepancy with an entropy constraint to disentangle the highly similar features of rotated instances. Built upon standard cross-entropy, the loss penalizes both the mutual information between ambiguous predictions and the entropy of individual logits, driving the model to yield low-uncertainty decisions on hard, rotation-sensitive samples.

2. Related Work

2.1. Feature Extraction Design for Rotated Object Detection

Rotated object detection is a core challenge in fields such as remote sensing image analysis and autonomous driving. The key to using this feature lies in extracting features that can accurately characterize the target’s orientation and geometric shape. Due to arbitrary target orientations and complex backgrounds, the features used by traditional horizontal detectors often suffer from information loss or confusion, leading to suboptimal detection performance for dense, inclined targets. Therefore, feature extraction design tailored for rotational characteristics has become a research focus in this field, aiming to endow the features themselves with stronger orientation discriminability and geometric sensitivity.

To enhance the orientation awareness of features, existing research primarily unfolds along two paths: first, improving the basic feature extraction architecture, and second, designing dedicated feature enhancement or interaction modules. Regarding architectural improvements, ASL-OOD [20] enhances multi-scale representation and global context modeling capabilities of features by integrating Swin Transformer. In terms of feature enhancement, SA3Det [21] designs a pixel-level self-attention module to preserve critical spatial relationships of small targets, while RO2-DETR [22] introduces a rotation-equivariant attention module that explicitly models object orientation and filters orientation-related target information from complex backgrounds. These methods primarily focus on mining more effective target representations or suppressing background noise through global attention mechanisms.

To address issues of significant target scale variation and feature misalignment, multi-scale feature fusion and adaptive design have become another major research direction. For example, AMFEF-DETR [23] designs an adaptive backbone network and an intra-layer feature interaction module to dynamically adapt to targets of different scales. Similarly, MSRO-Net [24] employs a CNN-Transformer hybrid architecture for collaborative feature extraction and achieves refined feature aggregation through a coordinate-aware pyramid. The core idea of these works is to generate features more robust to scale and shape variations by improving the information flow and combination mechanisms within the network.

Despite significant progress made by the aforementioned methods, they share a common limitation: their feature extraction and enhancement processes primarily focus on the region inside the target bounding box or the hierarchical features of the network itself, while generally neglecting effective contextual information outside the target. For instance, attention mechanisms aim to suppress the background but fail to systematically utilize discriminative information within it; multi-scale fusion optimizes the combination of internal features but does not treat external features as complementary resources for fusion. This insufficient utilization of “effective external features” results in target representations that remain inadequate and non-robust in extremely complex or occluded scenes. Consequently, how to break through the existing framework to actively and structurally leverage the environmental information surrounding the target for representation enhancement has become a neglected yet critical research direction.

The innovation of this work precisely addresses this common shortcoming. To tackle the problem that existing methods have not fully utilized a wealth of effective external features, we design a Dual-Level Rotated-Horizontal Feature Fusion Module. This module systematically leverages the external background information of the target by integrating features from the rotated bounding box and its enclosing horizontal bounding box, thereby enhancing the overall representation of the target in complex backgrounds and improving detection performance. This design paradigm shifts away from solely optimizing internal features, offering a new perspective for feature extraction in rotated object detection.

2.2. Loss Function Design for Rotated Object Detection

The design of loss functions is central to achieving high-precision localization and classification in rotated object detectors. The geometric complexity of rotated bounding boxes, particularly the introduction of the angle parameter, poses challenges such as boundary discontinuity and misalignment with the Rotated IoU evaluation metric for traditional horizontal detection losses (e.g., Smooth L1). To drive models to learn the rotational geometry of targets more accurately, recent research (2024–2025) has focused on designing more precise, efficient, and rotation-aware regression loss functions. The evolution is primarily evident in two directions: further refinement of Gaussian distribution-based representations and the development of novel, computationally efficient approximations of IoU.

In the refinement of Gaussian distribution representations, research has shifted from general optimization to specialized designs targeting specific geometric characteristics. The KLD loss [25], which models rotated boxes as Gaussian distributions and uses Kullback–Leibler divergence for regression, offers the key advantage of dynamically adjusting the gradient weight of the angle parameter based on the aspect ratio of the target. This property significantly enhances detection accuracy for objects with large aspect ratios. Furthermore, to address the difficulty of accurately representing near-square objects with standard Gaussian distributions, research has proposed an anisotropic Gaussian representation combined with an improved Bhattacharyya Distance as the loss [26], specifically optimizing bounding box fitting accuracy for such targets.

On the other hand, to directly approximate the ultimate evaluation metric of Rotated IoU and seek a balance between computational complexity and accuracy, a series of novel, differentiable IoU-approximation losses has been proposed. For instance, the FPDIoU loss [27] constructs a comprehensive geometric metric by jointly considering the distances between four vertices, the center points, and the rotation angle, aiming for efficient and high-precision regression. The Ellipse IoU loss [28] offers a lighter-weight approach by approximating the true IoU through calculating the IoU of the rotated box’s inscribed ellipse, ensuring differentiability while simplifying computation.

Despite significant progress in improving the geometric precision of rotated box regression, the aforementioned methods share a common limitation: these loss functions primarily focus on regressing the geometric parameters of bounding boxes, without explicitly modeling and optimizing the prediction uncertainty generated by the model in the classification task, especially when facing rotated objects with highly similar features. Existing work improves localization quality indirectly by refining regression losses, but lacks a mechanism to directly guide the model to reduce the entropy of classification confidence for ambiguous samples to learn more discriminative feature representations. Consequently, detector classification performance remains challenged in complex scenarios where targets appear similar due to viewpoint, occlusion, or intra-class variation. The innovation of this work is that it precisely addresses this common shortcoming. We summarize the limitations of these existing techniques in Table 1. As shown, most approaches overlook the potential of external background context or lack explicit uncertainty modeling, which motivates the design of our proposed BIF-RCNN.

To tackle the problem that existing loss functions do not directly optimize the model’s classification prediction uncertainty, we design a joint optimization loss function based on prediction discrepancy and entropy constraint. By introducing an entropy loss term on top of the standard cross-entropy loss, which works collaboratively with the maximum classification loss, this function aims to actively compel the model to reduce predictive ambiguity for complex samples with feature similarity, thereby enhancing the model’s discriminative power and robustness directly at the level of the optimization objective.

3. Method

In this section, we introduce the DFM and the joint optimization loss function based on PDE Loss. The overall pipeline is depicted in Figure 3, providing a clear overview of how these components are integrated into our detection framework.

3.1. Dual-Level Rotated-Horizontal Feature Fusion Module (DFM)

In rotated object detection, two-stage detection algorithms typically outperform one-stage approaches. This advantage stems from the fact that two-stage methods first generate candidate RoIs using a RPN, followed by region-wise feature extraction via RoI Pooling or RoI Align. This process enables more accurate feature representation of the candidate regions, thereby improving both classification and regression performance.

To leverage this advantage, we design DFM specifically for two-stage rotated object detection frameworks. An illustration of the DFM is shown in Figure 3a.

First, we observe that the background surrounding a rotated bounding box may contain useful contextual information that can benefit detection. To leverage this, we extract the horizontal features of the rotated object by aligning the features of its minimum enclosing horizontal bounding box using a horizontal region alignment strategy.

However, the non-overlapping regions between the rotated box and its enclosing horizontal box may also introduce background noise. As illustrated in Figure 4, when target objects are densely and obliquely arranged, the internal area of the horizontal bounding box tends to be significantly larger than that of the rotated box, which may result in the inclusion of irrelevant background content. Such noise can adversely affect classification and regression performance. To address this issue, we apply convolutional operations to the horizontally aligned features to filter out background noise and retain only the most relevant information.

Figure 3. The overview of BIF-RCNN. Rotated Object Detection via Background-Aware Feature Fusion and joint loss optimization. Given the candidate regions produced by the RPN, we employ (a) DFM to adaptively fuse rotated and horizontal features, thereby generating more discriminative refined feature representations. (b) For the classification objective, we design a joint loss function that jointly enforces prediction difference and entropy-based regularization, which strengthens the regularization effect of the classification branch.

Figure 4. Diagram of densely and obliquely arranged targets. In dense and oblique scenarios, horizontal bounding boxes tend to include excessive background compared to rotated boxes. The green and yellow boxes represent different bounding box types: the green box highlights the rotated bounding box, and the yellow box highlights the horizontal bounding box.

Moreover, we observe that the degree of background information introduced varies with the rotation angle of the object. As illustrated in Figure 5, for objects of the same size, those rotated closer to 45 degrees tend to include more background content within their enclosing horizontal bounding boxes, whereas those with smaller rotation angles introduce less background.

Since the amount of introduced background information can be intuitively reflected by the ratio of the rotated bounding box area to its enclosing horizontal box area, we design a feature weighting mechanism based on this area ratio. Specifically, the horizontally aligned features are reweighted according to the proportion of the rotated area, enabling the model to adaptively adjust the influence of external background information.

As shown in Figure 6, the condition weight (CW) is generated based on the area ratio between the rotated bounding box and its enclosing horizontal bounding box. Specifically, the ratio is passed through a linear layer followed by a Sigmoid activation function. The output of this operation has the same dimensionality as the number of channels in the feature map.

We adopt the Linear + Sigmoid structure to allow the network to learn an optimal utilization strategy for the area ratio through training, rather than relying on manually designed priors, which may introduce biases and adversely affect detection performance.

After computing the condition weight, we perform weighted fusion of the horizontally aligned external features and the rotated features according to the following equation:

F e a t s_{f u s e d} = F e a t s_{o r i e n t e d} + Condition Weight \cdot F e a t s_{e x t e r n a l} .

(1)

3.2. PDE Loss

The features surrounding a rotated candidate box may implicitly contain high-value information that is beneficial for object prediction. From a theoretical perspective, this assumption is most beneficial for cases where target objects are difficult to classify using only their intrinsic features.

To address the classification difficulty caused by high feature similarity among rotated objects, we propose a novel classification loss—Prediction Difference and Entropy-Constrained Loss (PDE Loss). As shown in Figure 3b, PDE consists of cross-entropy loss, entropy regularization, and prediction difference constraint. It is formulated as follows:

L_{c l s} = λ_{c e} \cdot L_{c e} + λ_{p d e} \cdot (L_{d i f f} + L_{e n t r o p y}),

(2)

where

L_{c e}

is the traditional cross-entropy loss,

L_{d i f f}

is the prediction difference loss, and

L_{e n t r o p y}

is the entropy constraint loss.

λ_{c e}

and

λ_{p r e d i c t d i f f - e}

are hyperparameters that control the weighting of the cross-entropy loss, prediction difference loss, and entropy constraint loss, with values of 0.8 and 0.2, respectively. This loss is based on the traditional cross-entropy loss

L_{c e}

used for classification. The formula for calculating cross-entropy loss is given as follows:

L_{c e} = - \frac{1}{N} \sum_{i = 1}^{N} \sum_{c = 1}^{c} y_{i, c} l o g (p_{i, c}),

(3)

where N is the number of samples involved in the classification loss calculation, C is the number of classes the model needs to predict,

y_{i, c}

is the true label of the i-th sample for class c (0 or 1),

p_{i, c}

is the predicted probability of the i-th sample for class c, and

l o g

denotes the logarithmic operation. We combine the traditional cross-entropy loss with the prediction difference loss

L_{d i f f}

and entropy constraint loss

L_{e n t r o p y}

proposed in this work. During training, this combined loss allows the model to focus more on the samples that are difficult to classify, thereby improving the model’s detection performance and robustness. The following sections will introduce the prediction difference loss and entropy constraint loss in detail.

3.2.1. Prediction Difference

During the classification stage, the model generates a probability for each category, indicating the likelihood that the target belongs to that class. Naturally, for rotated objects from different classes but with highly similar intrinsic features, the features extracted after multiple downsampling and convolution operations may not be sufficiently distinguishable. As a result, the predicted probabilities for these classes become very close, making the model prone to misclassification.

To address this issue, we propose a prediction difference loss that measures the interpolated margin between the highest predicted probability and the second-highest probability for each sample. This simple and intuitive design encourages the network to make more decisive classification predictions. Specifically, maximizing this probability difference suppresses classes with similar prediction scores, preventing the network from making ambiguous decisions and thereby enhancing classification discriminability. By introducing this loss, the model effectively reduces uncertainty in its predictions.

The prediction difference loss is defined as follows:

L_{d i f f} = - \frac{1}{N} \sum_{i = 1}^{N} (m a x (p_{i}) - s e c o n d_m a x (p_{i})),

(4)

where

p_{i}

represents the predicted probability distribution of the i-th sample, and

m a x (p_{i})

and

s e c o n d_m a x (p_{i})

denote the highest and second-highest predicted probabilities, respectively.

3.2.2. Entropy Constraint

Similar to the cross-entropy loss, the entropy constraint loss is also a traditional loss function, and its formulation is given as follows:

L_{e n t r o p y} = - \frac{1}{N} \sum_{i = 1}^{N} \sum_{c = 1}^{C} p_{i, c} log (p_{i, c}) .

(5)

Although the entropy constraint loss and cross-entropy loss share similar names and mathematical forms, they differ fundamentally in their underlying purposes. The entropy constraint loss is primarily used to measure the uncertainty of the prediction distribution. In classification tasks, it encourages the model to produce more confident predictions by minimizing the entropy of the predicted probability distribution, thereby reducing ambiguity and improving stability and robustness.

In contrast, the cross-entropy loss measures the discrepancy between the predicted distribution and the ground-truth label distribution. In classification problems, it optimizes model accuracy by maximizing the predicted probability of the correct class. In multi-class classification tasks, cross-entropy drives the model to assign the highest probability to the true class, thus improving the correctness of predictions.

In summary, while the entropy constraint loss emphasizes constraining the model’s predictive behavior and enhancing its determinacy, the cross-entropy loss focuses on directly improving classification accuracy. Therefore, in this work, we incorporate the entropy constraint loss alongside the cross-entropy loss, allowing it to work collaboratively with the prediction difference loss to reduce uncertainty in challenging samples.

3.3. Training Loss

To simultaneously achieve precise object localization and discriminative classification, we formulate the training of BIF-RCNN as a multi-task learning problem. The overall objective function L is defined as the weighted sum of the classification loss

L_{c l s}

and the regression loss

L_{r e g}

. The total loss is formulated as follows:

L = L_{c l s} + λ_{r e g} \cdot L_{r e g},

(6)

where

λ_{r e g}

is a hyperparameter used to balance the regression task weight.

3.3.1. Classification Loss

As detailed in Section 3.2, standard cross-entropy loss is often insufficient for distinguishing rotated objects with high inter-class similarity and complex backgrounds. To address this, we employ the proposed Prediction Difference and Entropy-Constrained (PDE) Loss as the classification objective. This loss jointly optimizes the model by minimizing the prediction uncertainty and maximizing the margin between ambiguous classes. The formulation is given by:

\begin{matrix} L_{c l s} & = λ_{c e} \cdot L_{c e} + λ_{p d e} \cdot (L_{d i f f} + L_{e n t r o p y}) \\ = - λ_{c e} \frac{1}{N} \sum_{i = 1}^{N} log (p_{i, y_{i}}) \\ - λ_{p d e} \frac{1}{N} \sum_{i = 1}^{N} (max (p_{i}) - second_\max (p_{i})) \\ - λ_{p d e} \frac{1}{N} \sum_{i = 1}^{N} \sum_{c = 1}^{C} p_{i, c} log (p_{i, c}), \end{matrix}

(7)

where N represents the number of samples, C denotes the number of categories, and

p_{i}

is the predicted probability distribution for the i-th sample. The hyperparameters

λ_{c e}

and

λ_{p d e}

control the trade-off between the standard supervision and the uncertainty regularization. Based on our experimental observations (see Section 4.3.2), we set

λ_{c e} = 0.8

and

λ_{p d e} = 0.2

.

3.3.2. Regression Loss

Following the baseline Oriented R-CNN, we adopt the parameterized offset regression mechanism. The regression branch outputs offsets for the five-parameter tuple

t = (x, y, w, h, θ)

, representing the center coordinates, width, height, and rotation angle, respectively. The regression loss is calculated using the Smooth

L_{1}

function:

L_{r e g} = \frac{1}{N_{r e g}} \sum_{i = 1}^{N_{r e g}} \sum_{k \in {x, y, w, h, θ}} {smooth}_{L 1} (Δ_{i, k}),

(8)

where

N_{r e g}

is the number of positive samples, and

Δ_{i, k}

represents the difference between the predicted offset and the ground-truth target for the k-th geometric parameter. Specifically for the angle parameter

θ

, normalization or modulation is typically applied to handle angular periodicity (e.g., ensuring the loss is invariant to the period of

π

). The Smooth

L_{1}

function is defined as:

{smooth}_{L 1} (u) = \{\begin{matrix} 0.5 u^{2}, & if | u | < 1 \\ | u | - 0.5, & otherwise . \end{matrix}

(9)

By jointly optimizing

L_{c l s}

and

L_{r e g}

, the network is guided to learn discriminative feature representations enhanced by the DFM module, while simultaneously achieving robust geometric alignment for rotated objects.

4. Experiment

4.1. Setting

4.1.1. Dataset and Implementation Details

We conduct all experiments on the DOTA benchmark [29] which provides a large-scale and challenging dataset. It contains 2806 high-resolution images (ranging from 800 × 800 to 4000 × 4000 pixels), with a total of 188,282 annotated object instances across 15 object categories: Plane (PL), Baseball Diamond (BD), Bridge (BR), Ground Track Field (GTF), Small Vehicle (SV), Large Vehicle (LV), Ship (SH), Tennis Court (TC), Basketball Court (BC), Storage Tank (ST), Soccer Ball Field (SBF), Roundabout (RA), Harbor (HA), Swimming Pool (SP), and Helicopter (HC). To quantitatively evaluate the detection performance, we adopt the standard Mean Average Precision (mAP) as the primary metric. In all experimental tables, the numerical values represent the Average Precision (AP) for each specific category, and the last column reports the mean value (mAP) across all categories. Following the standard DOTA evaluation protocol [29], the AP is calculated with an Intersection over Union (IoU) threshold of 0.5 (consistent with PASCAL VOC 07 metric [30]).

For the DOTA dataset, the training, validation, and testing sets are divided following the official split ratio of 1/2, 1/6, and 1/3, respectively. We utilize the image tiling tool provided by MMRotate to preprocess the dataset. Specifically, the original large-scale aerial images are cropped into multiple 1024 × 1024 patches with a stride of 200 pixels, making them suitable for input into the network during training and validation.

During inference, the detection results from all cropped patches originating from the same original image are first merged and deduplicated. The final detection outputs are then submitted to the official DOTA evaluation server for performance assessment.

4.1.2. Baseline

Oriented R-CNN [9] is adopted as the baseline method to validate the effectiveness of the proposed approach. Oriented R-CNN is a two-stage rotated object detection model based on anchor boxes. It extends the traditional R-CNN framework with several modifications to better handle oriented objects in aerial imagery and other rotation-sensitive scenarios.

Specifically, in the feature extraction stage, Oriented R-CNN utilizes Oriented RoI Align, which performs feature alignment according to the orientation of the object. This allows the network to better capture the shape and direction of rotated instances, thereby improving detection accuracy. Furthermore, to address the periodicity issue in angle prediction (e.g., 0° and 180° may represent the same orientation but result in large losses during training), Oriented R-CNN introduces a vertex offset mechanism. This design enables the model to more stably predict object shapes and further enhances detection precision.

For the backbone network, the baseline model adopts the combination of ResNet-50 [31] and Feature Pyramid Network (FPN) [32].

4.2. Main Results

This study compares the proposed rotated object detection model—which integrates background information fusion and joint loss optimization—with mainstream single-stage [13,33,34,35,36,37,38,39,40,41,42,43] and two-stage [10,44,45,46] rotated object detectors. The comparative results are presented in Table 2.

Our BIF-RCNN demonstrates superior overall performance relative to existing state-of-the-art rotated detection algorithms. Notably, it achieves higher precision in approximately 33% of object categories, including BD, GTF, SV, and HC. For instance, the model attains a maximum per-category AP of 56.30%, outperforming Oriented R-CNN’s 52.28% by a margin of +4.02% AP.

4.3. Ablation Study

Given the limited exploration of external sample points surrounding targets in the existing literature, we systematically evaluated multiple approaches to achieve our design objectives within constrained computational resources. We experimented with three distinct methodologies, each of which is described in detail below.

4.3.1. Ablation on Training Strategies

Additional Horizontal Detection Head. The main idea of this method is to introduce an independent horizontal detection head that is trained using features extracted from the enclosing horizontal boxes of the rotated objects, and outputs a horizontal classification score. At the same time, a rotated detection head is trained on the features of the rotated bounding boxes to produce a rotated classification score.

These two classification scores are then fused using a predefined strategy to obtain the final classification result. The specific training strategies and their corresponding experimental results are presented in Table 3 and Table 4, respectively.

According to Table 4, Strategy-2 performs best compared to other strategies, achieving varying degrees of improvement in 26% of the categories, with the highest per-category performance increase reaching +8.41% AP. However, similarly, there are also significant performance declines in SBF and HA categories, which precisely validates the hypothesis proposed in this paper: the feature cross-fusion method may have category-specific applicability. When category features are not sufficiently distinct, background information may play a crucial auxiliary role. However, for certain categories, this strategy might introduce more background noise, and its computational complexity is higher than that of the Baseline. Therefore, this paper explores other possible methods.

Spatial Self-Attention + Multi-Head Attention + Angular Encoding Fusion. The spatial self-attention mechanism is a mechanism that enhances the model’s focus on important spatial locations. It offers significant advantages in image processing and computer vision tasks in particular. By calculating the correlations between each spatial position, this mechanism can adaptively assign different attention weights to different regions, thereby improving the model’s attention to key areas and its feature extraction capabilities. It is well-suited for tasks involving complex backgrounds or requiring the capture of global contextual information. On the other hand, the multi-head attention mechanism is an extension of the self-attention mechanism. It maps input features into multiple subspaces, allowing the model to understand and model the data from different perspectives. By integrating information from multiple attention heads, it can construct more precise feature representations. In rotated object detection, a single attention head may only focus on certain local features, while the multi-head attention mechanism expands the model’s perceptual capacity by processing multiple feature subspaces in parallel. This enables the model not only to effectively capture the features of rotated objects but also to integrate background information and improve detection performance in complex scenes. Therefore, we attempt a feature fusion approach that combines spatial attention and multi-head attention, integrating rotated features with horizontal features. The specific experimental results are shown in the Table 5.

According to Table 5, the overall performance is optimal when heads = 4. Although its performance is slightly lower (by 0.4 AP) than the training method, which involves separate training followed by collaboration as shown in Table 4, this approach offers one-step training compared to the method of adding a horizontal head, making it more concise and stable. Furthermore, when compared with the results in Table 4, it is evident that, regardless of the fusion strategy adopted, different categories exhibit varying sensitivities to horizontal features. Specifically, as shown in Table 6, categories such as BD, SP, and HC show significant performance improvement after incorporating horizontal features, while LV, BC, SBF, and HA experience a noticeable decline. Other categories remain largely unaffected. This observation once again confirms that different categories have varying sensitivities to surrounding features.

In addition, considering that different heads of the model may have varying effects on feature fusion and the rotation angles of bounding boxes could influence the model, we introduced dynamically learnable head weights and angular encoding based on the aforementioned attention mechanism. Experiments were conducted with heads = 4 and heads = 8, and the results are shown in Table 7. By comparing these results with those in Table 4 and Table 5, no significant performance change was observed. We hypothesize that this may be due to the inherent difficulty in training the dynamically learnable head weights, and angular encoding may not provide sufficient auxiliary support when training is unstable.

Dynamic Weighting Based on Rotated Area Ratio. The aforementioned experiments, whether involving the addition of new detection heads or attention mechanisms, have validated the hypothesis that horizontal features can be beneficial for detection. However, the final experimental results indicate a performance decline of over 1% mAP compared to the baseline model. We attribute this to the excessive number of inherently highly distinguishable targets in the DOTA dataset, where overly complex fusion methods introduced substantial background noise and increased computational costs, leading to unstable model training. Therefore, to reduce computational costs and account for the impact of the rotated area ratio, we ultimately designed a more concise dynamic weighted fusion method named DFM based on the rotated area ratio. The experimental results of DFM are presented in Table 8.

From the experimental results, it is evident that DFM, with its simple yet effective rotated area embedding and weighting approach, achieves performance on par with the baseline while significantly improving results for specific categories (e.g., achieving the largest performance gain in the HC category: 57.19% vs. 52.28%, +4.91% AP). Compared to earlier methods, such as adding rotated-horizontal heads or employing attention mechanisms, DFM enhances the extraction of effective background information without significantly compromising overall performance, thereby improving detection accuracy for categories sensitive to background details.

4.3.2. Ablation on the Joint Optimization Loss

To address the challenge of classifying feature-similar samples, we propose a Joint Optimization Loss based on Prediction Discrepancy and Entropy Constraint. This loss function works in synergy with the previously designed dual-level rotation-horizontal feature fusion module, tackling the difficulty of detecting feature-similar samples from both the feature fusion and classification loss optimization perspectives.

We first conducted experiments to tune the hyperparameters of the joint optimization loss function, exploring multiple combinations. The detailed results are presented in Table 9. As shown, the best performance was achieved when

λ_{c e} = 0.8

and

λ_{p d e} = 0.2

. Therefore, we adopt this setting—

λ_{c e} = 0.8

and

λ_{p d e} = 0.2

—as the final configuration in all subsequent experiments.

4.3.3. Component-Wise Abaltion

To separately analyze the experimental effects of the proposed dual-level rotation-horizontal feature fusion module and the joint loss function based on prediction discrepancy and entropy constraint, an ablation study was conducted on the DOTA dataset. The experimental results are presented in Table 10.

Impact of the dual-level rotation-horizontal feature fusion module (DFM) The DFM primarily enhances the model’s ability to utilize contextual information by incorporating features from the horizontal enclosing boxes of rotated proposals. When integrated into the Oriented R-CNN framework (i.e., Oriented R-CNN + DFM), the overall mean Average Precision (mAP) slightly decreases by 0.21%, but significant improvements are observed in certain categories. For instance, the AP for category HC increases from 52.28% to 57.19% (+4.91% AP), and for GTF from 70.86% to 72.14% (+1.28% AP).

These categories typically suffer from limited feature representation within the rotated bounding boxes alone. The DFM mitigates this issue by providing richer contextual features through the external horizontal boxes. However, performance slightly declines in some other categories, likely because their backgrounds are more complex, and the features from the enclosing horizontal boxes may introduce irrelevant noise in certain cases. Therefore, while applying DFM selectively can yield clear benefits for specific categories, its indiscriminate use across all categories may impair precision in those more sensitive to background interference.

Impact of the Joint Prediction Difference and Entropy Constraint Loss (PDE) The PDE loss guides the model’s classification optimization through a combined supervision of prediction discrepancy loss, entropy constraint loss, and cross-entropy loss, thereby increasing the model’s confidence in its predictions. When the PDE loss is incorporated into the Oriented R-CNN framework (i.e., Oriented R-CNN + PDE), the overall mAP decreases from 75.87% to 75.02%, a drop of 0.85% mAP.

This performance decline may be attributed to the overemphasis on inter-class discrimination without introducing additional feature information to guide the learning process. As a result, the feature representation for certain categories could be adversely affected, leading to a slight reduction in overall detection accuracy.

Combined Impact of DFM and PDE When both the DFM module and the PDE loss are incorporated into the Oriented R-CNN framework (i.e., Oriented R-CNN + DFM + PDE), the model achieves an mAP of 75.82%, which is comparable to the baseline. More importantly, the combined approach leads to more balanced performance across multiple categories and delivers notable improvements in certain specific classes. For example, compared to the baseline, performance improves in HC (56.30% vs. 52.28%, +4.02% AP), GTF (72.64% vs. 70.86%, +1.78% AP), and BD (83.03% vs. 82.12%, +0.91% AP).

Compared to using DFM or PDE individually, their combination partially compensates for the limitations of each component, resulting in more stable overall detection performance. In certain categories, DFM provides richer contextual information, while PDE reduces classification uncertainty through constrained loss terms. This synergy enhances detection accuracy, demonstrating that feature fusion and loss function optimization are complementary. Together, they contribute to improved model robustness and higher detection accuracy for feature-similar samples.

4.4. Visualization of Detection Results

The detection results of the proposed model on the DOTA dataset are visualized in Figure 7. The figure demonstrates that the model can accurately and completely identify objects across multiple categories such as baseball fields, ships, and bridges, indicating robust performance in rotated object detection.

4.5. Effectiveness and Limitations

To provide a critical evaluation relative to existing methods, we analyze the effectiveness and inherent trade-offs of the proposed BIF-RCNN framework.

Effectiveness via Contextual Verification. The core strength of BIF-RCNN lies in its ability to resolve semantic ambiguities through context, a capability often lacking in standard detectors. For instance, the baseline method Oriented R-CNN [9] focuses strictly on geometric alignment, treating the background within horizontal proposals as interference to be suppressed. While efficient, this approach fails to distinguish targets with identical internal textures (e.g., bridges vs. highways). Similarly, advanced feature-based methods like ReDet [16] enhance internal representations via rotation-equivariant networks but still overlook external environmental cues. In contrast, our BIF-RCNN strategically re-integrates this “background noise” via the DFM module to verify the object category. This advantage is quantitatively supported by the significant AP improvements in hard categories (e.g., +4.02% for HC), demonstrating that our context-aware approach outperforms pure geometric or internal-feature-based methods in complex scenes.

Limitations. This performance gain comes with specific constraints. First, regarding Context Dependency, unlike methods that rely solely on object geometry, our approach assumes that the surrounding environment carries discriminative information. In scenarios with uniform or non-informative backgrounds, the auxiliary gain from DFM may be limited. Second, in terms of Computational Cost, the dual-branch architecture for processing horizontal features introduces additional FLOPs. Compared to lightweight single-stage detectors, BIF-RCNN prioritizes detection precision over inference speed, which represents a trade-off between semantic robustness and real-time efficiency.

Figure 7. Partial visualization results on the DOTA dataset.

5. Conclusions

In this paper, we identified and addressed a critical limitation in existing rotated object detectors: the inadvertent loss of environmental context due to strict geometric alignment. To resolve this, we proposed BIF-RCNN, a novel framework that synergizes background context with internal object features via the dual-level rotation-horizontal feature fusion module (DFM) and explicitly reduces classification uncertainty with the Prediction Difference and Entropy-Constrained Loss (PDE Loss). Experimental results on the DOTA benchmark validate this approach, confirming that appropriately mining background information serves as a powerful auxiliary signal. Notably, our method achieved a substantial improvement of 4.02% AP for the Helicopter category, effectively distinguishing targets with high appearance similarity.

Our method uses a simple area ratio to decide how important the background is. In the future, we plan to replace this with a smarter design, like Cross-Attention, to help the model genuinely “understand” the connection between the object and its background, rather than just relying on geometric rules.

Author Contributions

Conceptualization, J.Z., X.X. and S.W.; methodology, J.Z., P.Z. (Pengfei Zhang) and S.S.; software, S.S.; validation, S.S., H.Z. and X.B.; formal analysis, Y.S.; investigation, S.S., H.Z. and K.X.; resources, P.Z. (Ping Zong); data curation, P.Z. (Ping Zong) and G.Z.; writing—original draft preparation, S.S., H.Z. and X.B.; writing—review and editing, G.Z., Z.O. and M.S.; visualization, S.S., H.Z.; supervision, Z.O., M.S. and Y.Z.; project administration, J.Z. and Z.O.; funding acquisition, J.Z. and P.Z. (Pengfei Zhang). All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Science and Technology Project of State Grid Hebei Information and Telecommunication Branch grant number kj2024-018. The APC was funded by the Science and Technology Project of State Grid Hebei Information and Telecommunication Branch.

Data Availability Statement

There are no restrictions on the sharing of relevant data in this study.

Acknowledgments

During the preparation of this manuscript, the authors used Gemini 3 solely for language refinement, including grammar, phrasing, and text editing. The AI tool did not participate in the research design, data analysis, experimental development, scientific interpretation, or the generation of technical content. All AI-assisted text was fully reviewed, verified, and revised by the authors, who take full responsibility for the final scientific content. All authors have agreed to this acknowledgment.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

DFM	Dual-Level Rotation-Horizontal Feature Fusion Module
PDE Loss	Prediction Difference and Entropy-Constrained Loss
BBox	Bounding Box
RPN	Region Proposal Network
MLP	Multi-Layer Perceptron
IoU	Intersection over Union
RoIs	Regions of Interest
CW	Condition Weight
FPN	Feature Pyramid Network

Appendix A

More Visualizations

In Figure A1 and Figure A2, we show more visualization results on the DOTA dataset. As illustrated by the selected examples, our proposed BIF-RCNN demonstrates strong detection capabilities across a wide range of object categories. In particular, the model accurately identifies rotated objects such as small and large vehicles, tennis courts, and airplanes, even in cluttered or complex scenes.

Figure A1. Visualizations of detection results on the DOTA Dataset. Detection comparison on the DOTA dataset between the ground-truth (top) and our proposed BIF-RCNN (bottom). The top row in each pair shows the original image with ground-truth annotations, while the bottom row presents the predictions of our model. BIF-RCNN successfully detects various objects such as small/large vehicles, tennis courts, roundabouts, and airplanes with high precision.

Figure A2. Visualizations of detection results on the DOTA Dataset. We present additional detection examples for Tennis Court, Roundabout, and Plane categories. As illustrated in the bottom row, BIF-RCNN accurately localizes targets with arbitrary orientations and varying scales. Notably, the model maintains high classification confidence even in densely arranged scenarios (left column) and for objects with complex backgrounds (right column), validating the effectiveness of the proposed background fusion and joint loss strategy.

References

Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1137–1149. [Google Scholar] [CrossRef]
Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; IEEE: New York, NY, USA, 2016; pp. 779–788. [Google Scholar]
Zhang, G.; Ou, Z.; Xue, K.; Sun, J.; Zhu, Y.; Yao, S.; Shen, Y.; Song, M. DGFSD: Bridging the Gap between Dense and Sparse for Fully Sparse 3D Object Detection. In Proceedings of the 33rd ACM International Conference on Multimedia, Dublin, Ireland,27–31 October 2025; Association for Computing Machinery: New York, NY, USA, 2025; pp. 4669–4678. [Google Scholar]
Zhang, G.; Song, Z.; Liu, L.; Ou, Z. FGU3R: Fine-Grained Fusion via Unified 3D Representation for Multimodal 3D Object Detection. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Hyderabad, India, 6–11 April 2025; IEEE: New York, NY, USA, 2025; pp. 1–5. [Google Scholar]
Luo, G.; Sun, J.; Jin, L.; Zhou, Y.; Xu, Q.; Fu, R.; Sun, X.; Ji, R. Domain incremental learning for object detection. Pattern Recognit. 2026, 170, 111882. [Google Scholar] [CrossRef]
Sapkota, R.; Karkee, M. Object detection with multimodal large vision-language models: An in-depth review. Inf. Fusion 2026, 126, 103575. [Google Scholar] [CrossRef]
Liu, C.; Ma, X.; Yang, X.; Zhang, Y.; Dong, Y. COMO: Cross-mamba interaction and offset-guided fusion for multimodal object detection. Inf. Fusion 2026, 125, 103414. [Google Scholar] [CrossRef]
Zahid, F.; Rajput, S.; Ali, S.S.A.; Aromoye, I.A. Challenges and Innovations in 3D Object Recognition: The Integration of LiDAR and Camera Sensors for Autonomous Applications. Transp. Res. Procedia 2025, 84, 618–624. [Google Scholar] [CrossRef]
Xie, X.; Cheng, G.; Wang, J.; Yao, X.; Han, J. Oriented R-CNN for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; IEEE: New York, NY, USA, 2021; pp. 3520–3529. [Google Scholar]
Ding, J.; Xue, N.; Long, Y.; Xia, G.S.; Lu, Q. Learning RoI transformer for oriented object detection in aerial images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; IEEE: New York, NY, USA, 2019; pp. 2849–2858. [Google Scholar]
Xu, Y.; Fu, M.; Wang, Q.; Wang, Y.; Chen, K.; Xia, G.S.; Bai, X. Gliding vertex on the horizontal bounding box for multi-oriented object detection. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 43, 1452–1459. [Google Scholar] [CrossRef]
Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; IEEE: New York, NY, USA, 2017; pp. 2980–2988. [Google Scholar]
Han, J.; Ding, J.; Li, J.; Xia, G.S. Align deep features for oriented object detection. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5602511. [Google Scholar] [CrossRef]
Zeng, Y.; Chen, Y.; Yang, X.; Li, Q.; Yan, J. ARS-DETR: Aspect ratio-sensitive detection transformer for aerial oriented object detection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5610315. [Google Scholar] [CrossRef]
Li, Y.; Hou, Q.; Zheng, Z.; Cheng, M.M.; Yang, J.; Li, X. Large selective kernel network for remote sensing object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; IEEE: New York, NY, USA, 2023; pp. 16794–16805. [Google Scholar]
Han, J.; Ding, J.; Xue, N.; Xia, G.S. Redet: A rotation-equivariant detector for aerial object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; IEEE: New York, NY, USA, 2021; pp. 2786–2795. [Google Scholar]
Su, W.; Jing, D. DDL R-CNN: Dynamic direction learning R-CNN for rotated object detection. Algorithms 2025, 18, 21. [Google Scholar] [CrossRef]
Ming, Q.; Zhou, Z.; Miao, L.; Zhang, H.; Li, L. Dynamic anchor learning for arbitrary-oriented object detection. arXiv 2020, arXiv:2012.04150. [Google Scholar] [CrossRef]
Ou, Z.; Chen, Z.; Shen, S.; Fan, L.; Yao, S.; Song, M.; Hui, P. Free3Net: Gliding Free, Orientation Free, and Anchor Free Network for Oriented Object Detection. IEEE Trans. Multimed. 2022, 25, 7089–7100. [Google Scholar] [CrossRef]
Wang, K.; Liu, J.; Lin, Y.; Wang, T.; Zhang, Z.; Qi, W.; Han, X.; Wen, R. ASL-OOD: Hierarchical Contextual Feature Fusion with Angle-Sensitive Loss for Oriented Object Detection. Comput. Mater. Contin. 2025, 82, 1879. [Google Scholar] [CrossRef]
Wang, W.; Cai, Y.; Luo, Z.; Liu, W.; Wang, T.; Li, Z. SA3Det: Detecting Rotated Objects via Pixel-Level Attention and Adaptive Labels Assignment. Remote Sens. 2024, 16, 2496. [Google Scholar] [CrossRef]
Dang, M.; Liu, G.; Kong, A.W.K.; Zheng, Z.; Luo, N.; Pan, R. RO2-DETR: Rotation-equivariant oriented object detection transformer with 1D rotated convolution kernel. ISPRS J. Photogramm. Remote Sens. 2025, 228, 166–178. [Google Scholar] [CrossRef]
Wang, S.; Jiang, H.; Yang, J.; Ma, X.; Chen, J. AMFEF-DETR: An End-to-End Adaptive Multi-Scale Feature Extraction and Fusion Object Detection Network Based on UAV Aerial Images. Drones 2024, 8, 523. [Google Scholar] [CrossRef]
Li, S.; Yan, F.; Liu, Y.; Shen, Y.; Liu, L.; Wang, K. A multi-scale rotated ship targets detection network for remote sensing images in complex scenarios. Sci. Rep. 2025, 15, 2510. [Google Scholar] [CrossRef] [PubMed]
Zhou, W.; Liu, X.; Zheng, Y.; Zhang, D.; Xiang, H. AFPN Based YOLOX for Rotation Object Detection in Remote Sensing Image. In Proceedings of the 2024 China Automation Congress (CAC), Qingdao, China, 1–3 November 2024; IEEE: New York, NY, USA, 2024; pp. 5841–5846. [Google Scholar] [CrossRef]
Thai, C.; Trang, M.X.; Ninh, H.; Ly, H.H.; Le, A.S. Enhancing rotated object detection via anisotropic Gaussian bounding box and Bhattacharyya distance. Neurocomputing 2025, 623, 129432. [Google Scholar] [CrossRef]
Ma, S.; Xu, Y. FPDIoU Loss: A loss function for efficient bounding box regression of rotated object detection. Image Vis. Comput. 2025, 154, 105381. [Google Scholar] [CrossRef]
Li, W.; Shang, R.; Ju, Z.; Feng, J.; Xu, S.; Zhang, W. Ellipse IoU Loss: Better Learning for Rotated Bounding Box Regression. IEEE Geosci. Remote Sens. Lett. 2024, 21, 6001705. [Google Scholar] [CrossRef]
Xia, G.S.; Bai, X.; Ding, J.; Zhu, Z.; Belongie, S.; Luo, J.; Datcu, M.; Pelillo, M.; Zhang, L. DOTA: A large-scale dataset for object detection in aerial images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; IEEE: New York, NY, USA, 2018; pp. 3974–3983. [Google Scholar]
Everingham, M.; Van Gool, L.; Williams, C.K.; Winn, J.; Zisserman, A. The pascal visual object classes (voc) challenge. Int. J. Comput. Vis. 2010, 88, 303–338. [Google Scholar] [CrossRef]
He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; IEEE: New York, NY, USA, 2016; pp. 770–778. [Google Scholar]
Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; IEEE: New York, NY, USA, 2017; pp. 2117–2125. [Google Scholar]
Yi, J.; Wu, P.; Liu, B.; Huang, Q.; Qu, H.; Metaxas, D. Oriented object detection in aerial images with box boundary-aware vectors. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Virtual, 5–9 January 2021; IEEE: New York, NY, USA, 2021; pp. 2150–2159. [Google Scholar]
Yang, X.; Liu, Q.; Yan, J.; Li, A.; Zhang, Z.; Yu, G. R3det: Refined single-stage detector with feature refinement for rotating object. arXiv 2019, arXiv:1908.05612. [Google Scholar] [CrossRef]
Lin, Y.; Feng, P.; Guan, J.; Wang, W.; Chambers, J. IENet: Interacting embranchment one stage anchor free detector for orientation aerial object detection. arXiv 2019, arXiv:1912.00969. [Google Scholar]
Yang, X.; Zhou, Y.; Zhang, G.; Yang, J.; Wang, W.; Yan, J.; Zhang, X.; Tian, Q. The KFIoU loss for rotated object detection. arXiv 2022, arXiv:2201.12558. [Google Scholar]
Chen, Z.; Chen, K.; Lin, W.; See, J.; Yu, H.; Ke, Y.; Yang, C. Piou loss: Towards accurate oriented object detection in complex environments. In Proceedings of the European Conference on Computer Vision (ECCV); Springer: Cham, Switzerland, 2020; pp. 195–211. [Google Scholar]
Pan, X.; Ren, Y.; Sheng, K.; Dong, W.; Yuan, H.; Guo, X.; Ma, C.; Xu, C. Dynamic refinement network for oriented and densely packed object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; IEEE: New York, NY, USA, 2020; pp. 11207–11216. [Google Scholar]
Qian, W.; Yang, X.; Peng, S.; Yan, J.; Guo, Y. Learning modulated loss for rotated object detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 19–21 May 2021; AAAI Press: Palo Alto, CA, USA, 2021; Volume 35, pp. 2458–2466. [Google Scholar]
Ming, Q.; Miao, L.; Zhou, Z.; Dong, Y. CFC-Net: A critical feature capturing network for arbitrary-oriented object detection in remote-sensing images. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5605814. [Google Scholar] [CrossRef]
Qian, W.; Yang, X.; Peng, S.; Zhang, X.; Yan, J. RSDet++: Point-based modulated loss for more accurate rotated object detection. IEEE Trans. Circuits Syst. Video Technol. 2022, 32, 7869–7879. [Google Scholar] [CrossRef]
Guo, Z.; Liu, C.; Zhang, X.; Jiao, J.; Ji, X.; Ye, Q. Beyond bounding-box: Convex-hull feature adaptation for oriented and densely packed object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; IEEE: New York, NY, USA, 2021; pp. 8792–8801. [Google Scholar]
Xie, X.; Cheng, G.; Rao, C.; Lang, C.; Han, J. Oriented object detection via contextual dependence mining and penalty-incentive allocation. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5618010. [Google Scholar] [CrossRef]
Jiang, Y.; Zhu, X.; Wang, X.; Yang, S.; Li, W.; Wang, H.; Fu, P.; Luo, Z. R2CNN: Rotational region CNN for orientation robust scene text detection. arXiv 2017, arXiv:1706.09579. [Google Scholar] [CrossRef]
Yang, X.; Yang, J.; Yan, J.; Zhang, Y.; Zhang, T.; Guo, Z.; Sun, X.; Fu, K. Scrdet: Towards more robust detection for small, cluttered and rotated objects. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; IEEE: New York, NY, USA, 2019; pp. 8232–8241. [Google Scholar]
Cheng, G.; Wang, J.; Li, K.; Xie, X.; Lang, C.; Yao, Y.; Han, J. Anchor-free oriented proposal generator for object detection. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5625411. [Google Scholar] [CrossRef]

Figure 1. An illustration of our motivation. Left: Due to many objects having similar foreground features, e.g., the shape and pixel semantics of the bridge (top) and highway (bottom) are remarkably similar, false positives can easily occur. Right: Incorporating background features of the objects helps to distinguish their similar semantic information. (The red and yellow outlines are used to highlight the bridge and highway, respectively.)

Figure 2. IOU between detection bounding box and ground-truth.

Figure 5. The impact of rotation angle on introducing background information. Horizontal bounding boxes introduce more background when object orientations are close to 45°, and considerably less when they approach 0° or 90°. The red and green boxes represent different bounding box types: the red box highlights the horizontal bounding box, and the green box highlights the rotated bounding box.

Figure 6. Flowchart of condition weight generation. Conditional weight generation based on the area ratio between rotated and horizontal bounding boxes.

Table 1. Summary of limitations in existing rotated object detection techniques.

Category	Representative Methods	Key Mechanism	Main Limitations
Feature Extraction Design	ASL-OOD [20]	Swin Transformer + Context Fusion	Focus predominantly on internal object features or suppressing background noise, neglecting the potential discriminative value of external background context for confusing targets.
	SA3Det [21]	Pixel-level Self-Attention
	RO2-DETR [22]	Rotation-Equivariant Attention
	AMFEF-DETR [23]	Multi-scale Feature Interaction
Loss Function Design	KLD Loss [25]	Kullback–Leibler Divergence	Prioritize geometric alignment precision (e.g., IoU or Gaussian distribution), but lack explicit modeling of classification uncertainty for hard-to-distinguish samples.
	Bhattacharyya [26]	Anisotropic Gaussian Distribution
	FPDIOU Loss [27]	Point-based IoU Approximation
	Ellipse IoU [28]	Inscribed Ellipse IoU

Table 2. Comparison with state-of-the-art methods on the DOTA dataset. The values indicate the Average Precision (AP) in percentages (%), and mAP denotes the mean Average Precision. The abbreviations for categories are as follows: Plane (PL), Baseball Diamond (BD), Bridge (BR), Ground Track Field (GTF), Small Vehicle (SV), Large Vehicle (LV), Ship (SH), Tennis Court (TC), Basketball Court (BC), Storage Tank (ST), Soccer Ball Field (SBF), Roundabout (RA), Harbor (HA), Swimming Pool (SP), and Helicopter (HC). The bolded numbers do indeed represent the maximum values for each column.

Methods	PL	BD	BR	GTF	SV	LV	SH	TC	BC	ST	SBF	RA	HA	SP	HC	mAP
one-stage
IENet	80.20	64.54	39.82	32.07	49.71	65.01	52.58	81.45	44.66	78.51	46.54	56.73	64.40	64.24	36.75	57.10
KFIoU	88.83	77.51	47.79	74.28	71.27	62.72	74.75	90.72	82.34	81.61	58.44	64.23	64.39	67.87	44.07	70.05
PIoU	80.90	69.70	24.10	60.20	38.30	64.40	64.80	90.90	77.20	70.40	46.50	37.10	57.10	61.90	64.00	60.50
DRN	88.91	80.22	43.52	63.35	73.48	70.69	84.94	90.14	83.85	84.11	50.12	58.41	67.62	68.60	52.50	70.70
BBAVectors	88.35	79.96	50.69	62.18	78.43	78.98	87.94	90.85	83.58	84.35	54.13	60.24	65.22	64.28	55.70	72.32
RSDet	89.80	82.90	48.60	65.20	69.50	70.10	70.20	90.50	85.60	83.40	62.50	63.90	65.60	67.20	68.00	72.20
CFC-Net	89.08	80.41	52.41	70.02	76.28	78.11	87.21	90.89	84.47	85.64	60.51	61.52	67.82	68.02	50.09	73.50
R³Det	88.76	83.09	50.91	67.27	76.23	80.39	86.72	90.78	84.68	83.24	61.98	61.35	66.91	70.63	53.94	73.79
S²A-Net	89.11	82.84	48.37	71.11	78.11	78.39	87.25	90.83	84.90	85.64	60.36	62.60	65.26	69.13	57.94	74.12
RSDet++	86.80	82.70	54.60	71.70	76.60	71.20	83.50	87.40	83.40	85.30	72.40	62.90	70.90	72.30	70.40	75.40
CFA	88.07	74.57	49.25	74.43	79.02	74.14	86.76	90.87	80.39	86.03	48.49	58.89	64.38	66.87	22.51	69.63
DFDet	88.92	79.25	48.40	70.00	80.22	78.85	87.21	90.90	83.13	83.98	60.07	66.49	68.27	76.78	58.11	74.71
two-stage
R2CNN	80.94	65.67	35.34	67.44	59.92	50.91	55.81	90.67	66.92	72.39	55.06	52.23	55.14	53.35	48.22	60.60
RoI-Trans	88.64	78.52	43.44	75.92	68.81	73.60	83.59	90.74	77.27	81.46	58.39	53.54	62.83	58.93	47.67	69.56
SCRDet	89.41	78.83	50.02	65.59	69.96	57.63	72.26	90.73	81.41	84.39	52.76	63.62	62.01	67.62	61.16	69.83
AOPG	89.27	83.49	52.50	69.97	73.51	82.31	87.95	90.89	87.64	84.71	60.01	66.12	74.19	68.30	57.80	75.24
BIF-RCNN (Ours)	89.28	83.03	53.37	72.64	79.01	82.07	87.97	90.89	87.18	84.91	62.03	66.37	73.84	68.48	56.30	75.82

Table 3. Descriptions of training strategies for the additional horizontal head.

Strategy ID	Description
1	Jointly train both the horizontal and rotated detection heads, along with the dynamic weighting parameters, for 12 epochs.
2	Train the horizontal and rotated heads separately for 12 epochs each. During the training of one head, the parameters of the other head are frozen. After both heads are trained, freeze their parameters and train only the dynamic weighting parameters for 4 additional epochs. The dynamic fusion is only activated when the score difference between the top two predicted categories of the rotated head is less than 0.05.
3	Train the horizontal and rotated heads separately for 12 epochs each, with the parameters of the other head frozen during each phase. Then, jointly fine-tune both heads along with the dynamic weighting parameters for an additional 4 epochs.

Table 4. Comparison results of different strategies. The values indicate AP (%), and category abbreviations follow the conventions defined in Table 2. The bolded numbers do indeed represent the maximum values for each column.

Methods	PL	BD	BR	GTF	SV	LV	SH	TC	BC	ST	SBF	RA	HA	SP	HC	mAP
Baseline (heads = 0)	89.46	82.12	54.78	70.86	78.93	83.00	88.20	90.90	87.50	84.68	63.97	67.69	74.94	68.84	52.28	75.87
Strategy-1	88.95	74.04	45.41	67.12	73.40	77.79	87.37	90.86	83.14	83.76	45.99	57.63	65.52	62.73	41.61	69.69
Strategy-2	88.35	82.98	51.90	71.12	77.61	77.58	87.72	90.90	85.99	83.51	59.13	63.20	66.78	71.47	60.69	74.66
Strategy-3	89.39	82.66	52.03	70.60	77.37	77.38	87.72	90.90	85.90	83.62	59.02	61.30	66.63	71.02	60.65	74.41

Table 5. Experiments with different numbers of heads under the combination of space and multi-head attention. The values indicate AP (%), and category abbreviations follow the conventions defined in Table 2. The bolded numbers do indeed represent the maximum values for each column.

Num of Heads	PL	BD	BR	GTF	SV	LV	SH	TC	BC	ST	SBF	RA	HA	SP	HC	mAP
Baseline (heads = 0)	89.46	82.12	54.78	70.86	78.93	83.00	88.20	90.90	87.50	84.68	63.97	67.69	74.94	68.84	52.28	75.87
heads = 1	89.29	75.82	50.55	68.61	77.83	77.46	87.50	90.89	85.97	84.84	60.21	62.42	67.17	68.27	50.37	73.15
heads = 2	89.51	75.95	49.63	65.35	78.29	77.78	87.47	90.90	84.19	83.60	56.26	62.65	66.71	70.08	56.11	72.97
heads = 4	89.30	81.58	52.63	70.93	77.87	77.79	87.67	90.89	84.94	84.75	60.69	65.36	67.34	66.19	55.94	74.26
heads = 8	89.33	83.24	52.81	70.37	78.57	77.84	87.71	90.89	83.77	84.20	59.91	66.83	67.39	68.09	48.40	73.96

Table 6. Sensitivity of different categories to horizontal feature fusion.

Accuracy Change	Specific Categories
Significant Increase (>+1% mAP)	BD, SP, HC
No Significant Change (within ±1% mAP)	PL, BR, GTF, SV, SH, TC, ST, RA
Significant Decrease (< $- 1$ % mAP)	LV, BC, SBF, HA

Table 7. Experiments on attention mechanisms with learnable head weights and angular encoding. The values indicate AP (%), and category abbreviations follow the conventions defined in Table 2. The bolded numbers do indeed represent the maximum values for each column.

Num of Heads	PL	BD	BR	GTF	SV	LV	SH	TC	BC	ST	SBF	RA	HA	SP	HC	mAP
Baseline (epoch = 12)	89.46	82.12	54.78	70.86	78.93	83.00	88.20	90.90	87.50	84.68	63.97	67.69	74.94	68.84	52.28	75.87
heads = 4 (epoch = 12)	89.40	80.81	52.20	70.51	78.19	77.30	87.51	90.90	84.96	85.06	59.17	63.39	67.32	66.41	52.11	73.68
heads = 4 (epoch = 16)	89.33	80.81	52.92	70.01	78.04	77.60	87.60	90.87	85.62	85.07	61.55	66.62	66.71	67.36	52.52	74.23
heads = 8 (epoch = 12)	89.37	75.63	51.95	71.96	77.66	81.34	87.59	90.88	85.82	84.33	57.87	68.91	67.21	68.16	53.57	74.15
heads = 8 (epoch = 16)	89.27	80.74	53.01	71.61	77.71	77.46	87.50	90.86	85.91	84.49	61.43	66.79	67.49	68.25	53.98	74.43

Table 8. Experimental results of DFM. The values indicate AP (%), and category abbreviations follow the conventions defined in Table 2. The bolded numbers do indeed represent the maximum values for each column.

Methods	PL	BD	BR	GTF	SV	LV	SH	TC	BC	ST	SBF	RA	HA	SP	HC	mAP
Baseline	89.46	82.12	54.78	70.86	78.93	83.00	88.20	90.90	87.50	84.68	63.97	67.69	74.94	68.84	52.28	75.87
Baseline + DFM	89.27	82.30	53.24	72.14	78.67	82.62	87.94	90.90	85.89	84.57	62.04	64.73	74.56	68.87	57.19	75.66

Table 9. Comparative results with mainstream rotated object detection models. The bolded numbers do indeed represent the maximum values for each column.

$λ_{ce}$	$λ_{pde}$	mAP
0.6	0.4	75.25
0.7	0.3	75.69
0.8	0.2	75.82
	0.4	75.79
	0.6	75.11
	0.8	70.90

Table 10. Results of ablation study. The values indicate AP (%), and category abbreviations follow the conventions defined in Table 2. The bolded numbers do indeed represent the maximum values for each column.

Methods	PL	BD	BR	GTF	SV	LV	SH	TC	BC	ST	SBF	RA	HA	SP	HC	mAP
Baseline	89.46	82.12	54.78	70.86	78.93	83.00	88.20	90.90	87.50	84.68	63.97	67.69	74.94	68.84	52.28	75.87
Baseline + DFM	89.27	82.30	53.24	72.14	78.67	82.62	87.94	90.90	85.89	84.57	62.04	64.73	74.56	68.87	57.19	75.66
Baseline + PDE	89.34	81.93	52.49	73.12	79.00	81.68	88.08	90.90	86.05	84.58	60.58	64.03	67.94	68.65	56.96	75.02
Baseline + DFM + PDE	89.28	83.03	53.37	72.64	79.01	82.07	87.97	90.89	87.18	84.91	62.03	66.37	73.84	68.48	56.30	75.82

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Zhao, J.; Xu, X.; Wang, S.; Zhang, P.; Shen, S.; Zeng, H.; Bu, X.; Shen, Y.; Xue, K.; Zong, P.; et al. BIF-RCNN: Fusing Background Information for Rotated Object Detection. Algorithms 2026, 19, 139. https://doi.org/10.3390/a19020139

AMA Style

Zhao J, Xu X, Wang S, Zhang P, Shen S, Zeng H, Bu X, Shen Y, Xue K, Zong P, et al. BIF-RCNN: Fusing Background Information for Rotated Object Detection. Algorithms. 2026; 19(2):139. https://doi.org/10.3390/a19020139

Chicago/Turabian Style

Zhao, Jianbin, Xing Xu, Shaoying Wang, Pengfei Zhang, Shengyi Shen, Hui Zeng, Xiangshuai Bu, Yiran Shen, Kaiwen Xue, Ping Zong, and et al. 2026. "BIF-RCNN: Fusing Background Information for Rotated Object Detection" Algorithms 19, no. 2: 139. https://doi.org/10.3390/a19020139

APA Style

Zhao, J., Xu, X., Wang, S., Zhang, P., Shen, S., Zeng, H., Bu, X., Shen, Y., Xue, K., Zong, P., Zhang, G., Ou, Z., Song, M., & Zhu, Y. (2026). BIF-RCNN: Fusing Background Information for Rotated Object Detection. Algorithms, 19(2), 139. https://doi.org/10.3390/a19020139

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

BIF-RCNN: Fusing Background Information for Rotated Object Detection

Abstract

1. Introduction

2. Related Work

2.1. Feature Extraction Design for Rotated Object Detection

2.2. Loss Function Design for Rotated Object Detection

3. Method

3.1. Dual-Level Rotated-Horizontal Feature Fusion Module (DFM)

3.2. PDE Loss

3.2.1. Prediction Difference

3.2.2. Entropy Constraint

3.3. Training Loss

3.3.1. Classification Loss

3.3.2. Regression Loss

4. Experiment

4.1. Setting

4.1.1. Dataset and Implementation Details

4.1.2. Baseline

4.2. Main Results

4.3. Ablation Study

4.3.1. Ablation on Training Strategies

4.3.2. Ablation on the Joint Optimization Loss

4.3.3. Component-Wise Abaltion

4.4. Visualization of Detection Results

4.5. Effectiveness and Limitations

5. Conclusions

Author Contributions

Funding

Data Availability Statement

Acknowledgments

Conflicts of Interest

Abbreviations

Appendix A

More Visualizations

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI