Article

MAK-BRNet: Multi-Scale Adaptive Kernel and Boundary Refinement Network for Remote Sensing Object Detection

National Innovation Institute of Defense Technology, Academy of Military Sciences, Beijing 100071, China
*
Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Appl. Sci. 2026, 16(1), 522; https://doi.org/10.3390/app16010522
Submission received: 26 November 2025 / Revised: 28 December 2025 / Accepted: 30 December 2025 / Published: 4 January 2026
(This article belongs to the Collection Space Applications)

Abstract

Oriented object detection in remote sensing images is rapidly evolving into a pivotal technique, driving advances across geospatial analytics, intelligent transportation systems, and urban infrastructure planning. However, the inherent characteristics of remote sensing objects, including complex background interference, multi-scale variations, and high-density distribution, pose critical challenges to balancing detection accuracy and computational efficiency. This paper presents an anchor-free framework that eliminates the intrinsic constraints of anchor-based detectors, specifically the positive–negative sample imbalance and the computationally expensive non-maximum suppression (NMS) process. By integrating an adaptive kernel module with a boundary refinement network, we achieve lightweight and efficient detection. Our method adaptively generates convolutional kernels tailored to multi-scale objects to extract discriminative features, while a boundary refinement network precisely captures oriented bounding boxes. Experiments were carried out on the widely recognized HRSC2016 and DOTA datasets for the oriented bounding box (OBB) task. The proposed approach achieves 90.13% mAP (VOC07 metric) on HRSC2016 with 61.60 M parameters and 158.84 GFLOPs. On the DOTA benchmark, it attains 75.84% mAP with 45.96 M parameters and 131.39 GFLOPs. Our work highlights a lightweight yet powerful architecture that effectively balances accuracy and efficiency, making it particularly suitable for resource-constrained edge platforms.

1. Introduction

In recent years, remote sensing image detection has witnessed surging application demands, with increasing requirements for both the accuracy and processing speed of oriented object detection models. Objects in remote sensing images typically exhibit complex backgrounds, dense distributions, frequent occlusions, and multi-scale variations [1,2]. Conventional horizontal bounding boxes (HBBs) often lead to misaligned and overlapping annotations between adjacent objects, while their fixed, axis-aligned orientation cannot represent actual object geometries. To overcome these limitations, oriented bounding boxes tightly aligned with object contours are adopted for object detection. The fewer background pixels within a bounding box, the easier it is for the detector to distinguish the object, while overlaps between adjacent bounding boxes are also notably diminished. Consequently, numerous researchers have focused their investigations on oriented bounding box representations.
Contemporary object detection frameworks predominantly adopt two-stage architectures [3]. This type of framework initially generates region proposals for coarse object localization, followed by a refinement-classification phase executed by task-specific subnetworks. Despite the exponential growth in AI accelerator capabilities surpassing Moore’s Law predictions, the persistent scarcity of scalable computing power continues to bottleneck the performance of object detection models. By contrast, the single-stage framework eliminates the computational overhead associated with candidate box generation by the Region Proposal Network (RPN) and with the subsequent non-maximum suppression (NMS) post-processing. For instance, Yang et al. (2019) [4] replace anchor boxes with center points to adaptively localize objects by leveraging their spatial information and semantic features from salient regions, which significantly reduces the computational complexity compared to a two-stage framework.
Currently, keypoint-based detectors, as representative single-stage frameworks [5,6,7,8], demonstrate reduced model complexity and enhanced small object detection capabilities compared to two-stage or anchor-based detectors [9,10]. These detectors localize bounding box corners and determine whether paired corners belong to the same object by measuring the distances between their embeddings, subsequently constructing object proposals from validated keypoint groups. Despite their improved detection accuracy, such methods generally suffer from high computational costs. Zhou et al. proposed CenterNet [7], an end-to-end framework that eliminates anchor boxes by detecting object centroids through keypoint heatmap regression and predicts bounding box dimensions (width w, height h) in a single forward pass. Theoretically, extending CenterNet to oriented object detection could be realized by learning an additional rotation angle θ alongside h and w. However, this extension introduces prohibitive challenges: parameterizing bounding box dimensions (h, w) within isolated rotational reference frames for arbitrarily oriented objects induces divergent optimization landscapes in joint learning architectures.
Motivated by the pressing need for lightweight yet accurate object detection in edge computing scenarios such as on-orbit satellite processing, we introduce a novel framework integrating two synergistic components: the Multi-scale Adaptive Kernel (MAK) module and the Boundary Refinement (BR) network. Our approach fundamentally diverges from conventional multi-scale fusion strategies that rely on static feature stacking. The MAK module innovatively adapts kernel configurations in response to object scale and contextual cues, thereby enabling dynamic receptive field modulation rather than fixed-scale aggregation. Concurrently, the BR network advances beyond decoupled regression paradigms by jointly optimizing boundary vector predictions and orientation consistency, effectively mitigating the misalignment issues prevalent in existing methods. This integrated design not only elevates detection precision but also sustains a compact architectural footprint, rendering it particularly suitable for deployment in computationally constrained environments.
Unlike infrared or SAR methods that depend on thermal or microwave signals, our approach leverages RGB imagery to capture geometric cues such as shape, orientation, and boundary. The proposed approach utilizes vector-based predictions within four Cartesian coordinate quadrants to resolve orientation-related detection difficulties in aerial imagery. This strategy effectively mitigates the periodicity issue caused by direct regression of bounding box angles while enhancing the robustness of geometric representation. For higher detection accuracy, we observe that the receptive field dynamically impacts multi-scale object detection in complex backgrounds, where insufficient contextual information may lead to erroneous predictions (Figure 1). To resolve this, we introduce the adaptive kernel module to enhance features and leverage feature pyramid outputs to learn boundary-aware vectors. The proposed framework reduces computational complexity while maintaining superior detection accuracy through a novel single-stage anchor-free architecture with dynamic receptive field modulation. The core technical innovations comprise the following:
  • The proposed single-stage anchor-free detection framework integrates an adaptive kernel module that dynamically adjusts the receptive field to capture contextually relevant features while ensuring high detection accuracy and efficient inference. Experiments on the DOTA and HRSC2016 datasets validate competitive performance with minimized computational resource requirements.
  • Our framework integrates a feature pyramid for multi-scale object feature extraction and employs dual prediction heads to simultaneously determine object centroids and optimize boundary-aware vector representations, enhancing localization precision in remote sensing imagery. This approach is universally applicable to arbitrarily oriented objects through local coordinate system modeling for each instance, thereby achieving decoupling from the global coordinate system.

2. Related Work

2.1. Oriented Object Detection for Remote Sensing

The aerial perspective inherent in optical remote sensing imagery induces spatial discrepancy between classification confidence and bounding box accuracy during object detection. To counteract the impacts induced by such inconsistency, researchers have proposed solutions from different perspectives. Regarding task decoupling and feature alignment, the Oriented R-CNN [11] and ROI Transformer [12] frameworks adopt a decoupled regression mechanism that independently models angular orientation and spatial positioning (size and position) parameters. Specifically, the ROI Transformer aligns the features of rotated region proposals with object orientations, while Oriented R-CNN presents an orientation-sensitive Region Proposal Network (RPN) to optimize feature extraction. Moreover, the AO2-DETR employs the global modeling capabilities of transformers and a query decoupling design to realize implicit feature alignment between classification and localization tasks [13]. From the perspective of geometric representation refinement, the Gliding Vertex framework demonstrates exceptional performance in oriented object detection through its vertex-sliding mechanism on horizontal bounding boxes [14]. By contrast, the Gaussian representation (G-Rep) [15] applies Gaussian probability distributions to model the spatial distribution of objects, thereby unifying the mathematical formulations of diverse geometric representations such as rotated bounding boxes, quadrilaterals, and point sets. Concerning object scale variation, Fu et al. [16] address scale variation through hierarchical fusion of multi-scale features, adaptively integrating local directional patterns with global contextual representations. Pertaining to dynamic optimization strategies, the Dynamic Anchor Learning Network (DAL-Net) [17] proposes Dynamic Anchor Learning (DAL), which dynamically optimizes label assignment strategies by introducing a matching degree metric to holistically evaluate anchor boxes’ localization potential. LoRA-Det [18] introduces parameter-efficient fine-tuning with low-rank adaptation for satellite onboard processing, achieving near-full fine-tuning performance with only 12.4% updated parameters. ARS-DETR [19] proposes an aspect ratio-sensitive angle classification method and rotated deformable attention to enhance high-precision detection, particularly for objects with extreme aspect ratios. These methods reduce classification confidence–localization accuracy inconsistency in oriented object detection.

2.2. Keypoint-Based Technology

Keypoint-based object detection methods achieve superior localization accuracy with reduced computational overhead relative to anchor-based approaches, benefiting from their rotation-invariant representation learning and lightweight computational architecture. Early keypoint detection primarily relied on hand-crafted feature representations. Subsequently, the Rotation-Invariant Convolutional Neural Network [20,21] mitigated the orientation sensitivity and background interference that hinder keypoint-driven methods in remote sensing scenarios by directly embedding rotation invariance into the feature learning process. To improve multi-scale object detection capability in keypoint-based frameworks, the multi-scale object proposal network [22] enhances keypoint localization accuracy through multi-scale feature integration, while achieving lightweight detection via parallel processing of multi-scale features using multi-branch convolutional kernels. Li et al. [23] streamlined the model architecture by replacing rotational convolution with a lightweight corner detection head, significantly enhancing computational efficiency through a dynamic corner pairing mechanism. To further refine the representation of bounding boxes, He et al. [24] extended keypoint-based detection to oriented objects by simplifying bounding boxes into rotated ellipses. The BBAVectors (Box Boundary-Aware Vectors) [25] adapt horizontal keypoint detection frameworks to oriented object detection by integrating angle encoding with geometry-aware feature interactions, maintaining boundary localization precision for arbitrary orientations. To achieve superior localization precision in keypoint-based frameworks, Song et al. [26] developed an end-to-end aircraft detection framework for rotated targets, utilizing a ConvNeXt backbone architecture and cyclical focal loss optimization. The method demonstrated improved detection accuracy on the DOTA-Plane benchmark dataset compared to existing approaches. These keypoint-based detectors, compared to anchor-based detectors, exhibit advantages in accuracy and speed, leading to their increasing application in oriented object detection tasks.

2.3. Receptive Field Mechanism

Current research efforts frequently neglect the critical role of domain-specific prior information embedded within optical remote sensing scenarios, where such knowledge typically manifests as contextual information. The ability of detectors to capture such information typically depends on their receptive field, which fundamentally constrains object recognition accuracy. Several works demonstrate that increasing convolutional kernel dimensions directly expands the receptive field of detectors [27,28,29,30]. However, in terms of detection performance, larger convolutional kernels are not always better. The detection performance is influenced by multiple factors, including target scale, image resolution, and hardware limitations [31]. The Conv2Former [32] empowers the model with a dynamic receptive field through convolutional modulation operations and large kernel convolution optimization, achieving Transformer-like global modeling capabilities while maintaining the computational efficiency of convolutional networks. Subsequently, the LSKNet is developed as a detector capable of acquiring a sufficient receptive field for objects, efficiently capturing contextual information of various objects. Beyond altering convolutional kernel sizes, some researchers have developed other strategies to optimize the impact of the receptive field on detectors. Song et al. [33] integrate a fully convolutional network (FCN) to generate probabilistic heatmaps of object center points, and incorporate deformable convolution with multi-scale feature fusion techniques, thereby significantly enhancing detection robustness under complex scenarios. Liu et al. [34] employ super-resolution reconstruction to enhance small object detection by addressing insufficient receptive fields and capturing detailed features of small objects.

3. Proposed Method

Figure 2 demonstrates the framework design of the proposed method. The input image first passes through an FPN for multi-scale feature extraction. Subsequently, a dynamic selection mechanism adaptively optimizes kernel selection for different targets according to the extracted multi-scale features. Finally, the integrated features are processed by the decoder to generate coordinate predictions and category classifications.

3.1. Architecture

The designed network architecture (Figure 2) builds upon a Feature Pyramid Network framework optimized for object detection tasks. The architecture comprises four principal components: backbone network, neck module, adaptive kernel module, and head branches. The backbone network employs a ResNet101 structure that processes input RGB images $I \in \mathbb{R}^{3 \times H \times W}$, where $H$ and $W$ represent image height and width. Feature extraction occurs through sequential convolutional operations, each coupled with batch normalization and ReLU activation layers. Each residual block in the backbone comprises convolutional layers with skip connections, which enable the network to learn residual mappings and facilitate the flow of information across layers. On top of the backbone network, a Feature Pyramid Network [35] is employed as the neck module to refine the upsampled feature map through a 3 × 3 convolutional layer. The adaptive kernel module processes the input feature map and enhances its representational power through a series of convolutional and attention mechanisms. The feature map $F \in \mathbb{R}^{K \times \frac{H}{s} \times \frac{W}{s}}$ (where $K = 256$) is subsequently transformed by four branches: a heatmap branch ($P \in \mathbb{R}^{C \times \frac{H}{s} \times \frac{W}{s}}$), an offset branch ($O \in \mathbb{R}^{2 \times \frac{H}{s} \times \frac{W}{s}}$), an orientation map branch ($\alpha \in \mathbb{R}^{1 \times \frac{H}{s} \times \frac{W}{s}}$), and a box parameter branch ($B \in \mathbb{R}^{10 \times \frac{H}{s} \times \frac{W}{s}}$). Here $C$ denotes the category count and $s = 4$ represents the scaling factor. Each branch employs two sequential 3 × 3 convolutional layers with an intermediate width of 256 channels.
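To make the head design concrete, the following is a minimal PyTorch sketch (not the authors' released implementation) of how a shared 256-channel feature map could branch into the four prediction heads listed above; the two-convolution structure and output channel counts follow the shapes given in the text, while names such as `BoundaryHeads` and `make_head` are illustrative.

```python
import torch
import torch.nn as nn

def make_head(in_ch: int, out_ch: int) -> nn.Sequential:
    # Two sequential 3x3 convolutions, with an intermediate width of 256 channels.
    return nn.Sequential(
        nn.Conv2d(in_ch, 256, kernel_size=3, padding=1),
        nn.ReLU(inplace=True),
        nn.Conv2d(256, out_ch, kernel_size=3, padding=1),
    )

class BoundaryHeads(nn.Module):
    """Illustrative four-branch decoder: heatmap, offset, box parameters, orientation."""
    def __init__(self, num_classes: int, in_ch: int = 256):
        super().__init__()
        self.heatmap = make_head(in_ch, num_classes)   # P: C x H/s x W/s
        self.offset = make_head(in_ch, 2)              # O: 2 x H/s x W/s
        self.box_params = make_head(in_ch, 10)         # B: 10 x H/s x W/s
        self.orientation = make_head(in_ch, 1)         # alpha: 1 x H/s x W/s

    def forward(self, feat: torch.Tensor):
        return {
            "hm": torch.sigmoid(self.heatmap(feat)),
            "off": self.offset(feat),
            "box": self.box_params(feat),
            "ori": torch.sigmoid(self.orientation(feat)),
        }

# Example: 15 DOTA categories; a 608x608 input with s = 4 gives a 152x152 map.
heads = BoundaryHeads(num_classes=15)
out = heads(torch.randn(1, 256, 152, 152))
print({k: tuple(v.shape) for k, v in out.items()})
```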

3.2. Backbone Network

During the backbone stage, we employed the classic ResNet series as well as the lightweight DecoupleNet. As outlined in Figure 2, the backbone architecture is designed as a four-stage hierarchical structure to facilitate feature extraction at varying scales. The feature dimensions at each stage are configured as $\frac{H}{4} \times \frac{W}{4} \times C$, $\frac{H}{8} \times \frac{W}{8} \times 2C$, $\frac{H}{16} \times \frac{W}{16} \times 4C$, and $\frac{H}{32} \times \frac{W}{32} \times 8C$, where $H$, $W$, and $C$ correspond to the original feature height, width, and channel number.
The backbone of DecoupleNet is meticulously designed to achieve efficient feature extraction for small-scale and multi-scale objects through the incorporation of two novel modules: the Feature Integration Downsampling (FID) module and the Multi-branch Feature Decoupling (MBFD) module. The FID module integrates the strengths of convolutional downsampling and max-pooling downsampling, leveraging group convolution, partial information interaction, and depthwise separable convolution to effectively address the issue of feature loss for small objects during traditional downsampling processes. Meanwhile, the MBFD module segments the feature maps into multiple branches, each of which is processed through distinct feature extraction techniques such as convolution, Medium-Range Lightweight Attention (MRLA), and global attention. This strategy optimizes the exploitation of feature redundancy and enhances the representation of multi-scale objects. By fusing the extracted features from these branches, DecoupleNet achieves a balanced feature representation that is both accurate and computationally efficient. The overall architecture is divided into four stages, each comprising multiple Decouple Blocks that progressively extract features at different scales. This hierarchical structure ultimately provides robust and versatile feature representations for remote sensing visual tasks.
The FPN serves as a fundamental architecture in computer vision, particularly for object detection and semantic segmentation tasks, by effectively integrating multi-scale feature representations. The FPN architecture is characterized by a top-down pathway and lateral connections. The top-down pathway performs upsampling of coarse-scale feature maps, while the lateral connections integrate these upsampled features with corresponding bottom-up features of equivalent spatial dimensions through element-wise summation. $F_b^l$, $F_t^l$, and $F_o^l$ denote the bottom-up, top-down (upsampled), and output feature maps at pyramid level $l$. The process is as follows:
$$F_t^l = U(F_t^{l+1}),$$
where $U$ is the bilinear interpolation-based upsampling operation, and
$$F_o^l = F_t^l + W \cdot F_b^l,$$
where $W$ is a learnable weight matrix (which can be an identity or a simple convolutional layer) that adjusts the contribution of the bottom-up features. The resulting map is $X \in \mathbb{R}^{C \times \frac{H}{s} \times \frac{W}{s}}$, with $C = 256$ and $s = 4$ (the scale factor, making the output four times smaller than the input).
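As a concrete illustration of the two merge equations above, the sketch below assumes that $W$ is realized as a 1 × 1 lateral convolution and that $U$ is bilinear upsampling via `F.interpolate`; the 3 × 3 refinement convolution follows Section 3.1, and the channel counts are illustrative rather than taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopDownMerge(nn.Module):
    """Single FPN merge step: F_o^l = U(F_t^{l+1}) + W * F_b^l."""
    def __init__(self, bottom_up_ch: int, fpn_ch: int = 256):
        super().__init__()
        # W is assumed here to be a 1x1 lateral convolution projecting the
        # bottom-up feature map to the common FPN width.
        self.lateral = nn.Conv2d(bottom_up_ch, fpn_ch, kernel_size=1)
        # A 3x3 convolution refines the merged map, as described in Section 3.1.
        self.refine = nn.Conv2d(fpn_ch, fpn_ch, kernel_size=3, padding=1)

    def forward(self, f_top: torch.Tensor, f_bottom: torch.Tensor) -> torch.Tensor:
        # U: bilinear upsampling of the coarser top-down feature map.
        up = F.interpolate(f_top, size=f_bottom.shape[-2:],
                           mode="bilinear", align_corners=False)
        return self.refine(up + self.lateral(f_bottom))

# Example with hypothetical channel counts: merge a stride-8 map into a stride-4 map.
merge = TopDownMerge(bottom_up_ch=64)
f_t = torch.randn(1, 256, 76, 76)    # top-down feature at stride 8
f_b = torch.randn(1, 64, 152, 152)   # bottom-up feature at stride 4
print(merge(f_t, f_b).shape)         # torch.Size([1, 256, 152, 152])
```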

3.3. Adaptive Kernel

The adaptive kernel mechanism dynamically modulates multi-scale FPN-derived features to enhance kernel selection adaptability for heterogeneous targets. These refined features are subsequently processed by the decoder to generate predictions for both object coordinates and categories. The module consists of two key components: a Large Kernel Selection (LK Selection) sub-block and a Feed-forward Network (FFN) sub-block. The LK Selection sub-block implements large kernel convolutions through a decomposition approach, sequentially applying depthwise convolutions with progressively increasing kernel sizes and dilation rates. This strategy produces multiple feature representations with distinct receptive fields. A spatial selection mechanism then performs feature weighting and fusion according to input characteristics. The relationship between the kernel size $k_i$, dilation rate $d_i$, and receptive field $RF_i$ of the $i$-th decomposed convolution is formulated as follows:
$$d_{i-1} < d_i \le RF_{i-1}, \quad d_1 = 1; \qquad k_{i-1} \le k_i,$$
$$RF_i = d_i (k_i - 1) + RF_{i-1}, \quad RF_1 = k_1.$$
To illustrate this quantitatively, consider decomposing a single large kernel convolution into a stack of smaller ones while preserving the effective receptive field (ERF). A single convolution with kernel size $k = 23$ and dilation $d = 1$ yields an ERF of $RF = 23$. Alternatively, the same ERF can be attained by cascading two smaller kernels: the first with $k_1 = 5$ and $d_1 = 1$, and the second with $k_2 = 7$ and $d_2 = 3$. The ERF after the first layer is $RF_1 = k_1 = 5$, and after the second layer it is $RF_2 = d_2 (k_2 - 1) + RF_1 = 3 \times 6 + 5 = 23$, identical to that of the original large kernel convolution.
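The small script below checks this decomposition numerically with the recursion $RF_i = d_i(k_i - 1) + RF_{i-1}$ and builds the corresponding two-layer stack as dilated depthwise convolutions; the channel count and padding values are illustrative choices, not taken from the paper.

```python
import torch
import torch.nn as nn

def effective_rf(kernels, dilations):
    # RF_i = d_i * (k_i - 1) + RF_{i-1}, with RF_1 = k_1 (since d_1 = 1).
    rf = kernels[0]
    for k, d in zip(kernels[1:], dilations[1:]):
        rf = d * (k - 1) + rf
    return rf

print(effective_rf([23], [1]))       # 23: single large kernel
print(effective_rf([5, 7], [1, 3]))  # 23: decomposition (k1=5, d1=1) + (k2=7, d2=3)

# The decomposition expressed as depthwise convolutions (channel count is illustrative);
# padding is chosen so the spatial size is preserved.
ch = 64
decomposed = nn.Sequential(
    nn.Conv2d(ch, ch, kernel_size=5, padding=2, groups=ch),
    nn.Conv2d(ch, ch, kernel_size=7, padding=9, dilation=3, groups=ch),
)
x = torch.randn(1, ch, 152, 152)
print(decomposed(x).shape)  # torch.Size([1, 64, 152, 152])
```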
The proposed adaptive kernel module effectively captures the diverse range of contexts of objects in remote sensing scenarios, demonstrating state-of-the-art performance across multiple remote sensing tasks. Notably, this design achieves significant improvements in mAP while maintaining computational efficiency, as evidenced by only marginal increases in parameter quantity.

3.4. Boundary Refinement

3.4.1. Heatmap

The heatmap branch is used to predict the presence of objects at different locations in the image. The output of this branch is $P \in \mathbb{R}^{C \times \frac{H}{s} \times \frac{W}{s}}$, where $C$ is the number of categories.
Heatmaps are commonly employed to localize specific keypoints in an input image; here they are used to detect the center points of arbitrarily oriented objects in remote sensing images. For the ground truth, given the center point $\mu = (\mu_x, \mu_y)$ of an oriented bounding box, we generate the ground truth heatmap $\tilde{P} \in \mathbb{R}^{C \times \frac{H}{s} \times \frac{W}{s}}$ by placing a 2D Gaussian $\exp\!\left(-\frac{(p_x - \mu_x)^2 + (p_y - \mu_y)^2}{2\sigma^2}\right)$ around $\mu$, where $\sigma$ adapts to the size of the bounding box. The loss function is
$$L_h = -\frac{1}{N} \sum_i \begin{cases} (1 - \hat{p}_i)^{\beta} \, p_i^{\alpha} \log(1 - p_i), & \text{if } \hat{p}_i \neq 1 \\ (1 - p_i)^{\alpha} \log(p_i), & \text{otherwise}, \end{cases}$$
where the ground truth and predicted heatmap values are denoted as $\hat{p}$ and $p$, respectively, $i$ indexes pixel locations, and $N$ indicates the total number of objects. The hyperparameters $\alpha$ and $\beta$ are determined empirically to optimize the model performance.
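For reference, the following is a hedged sketch of the ground-truth Gaussian placement and the penalty-reduced focal loss as reconstructed above; the hyperparameter values α = 2 and β = 4 are common defaults for this loss family and are assumptions here, not values reported by the authors.

```python
import torch

def gaussian_heatmap(h, w, mu_x, mu_y, sigma):
    """Place a 2D Gaussian peak at (mu_x, mu_y) on an h x w grid."""
    ys = torch.arange(h, dtype=torch.float32).view(-1, 1)
    xs = torch.arange(w, dtype=torch.float32).view(1, -1)
    return torch.exp(-((xs - mu_x) ** 2 + (ys - mu_y) ** 2) / (2 * sigma ** 2))

def heatmap_focal_loss(pred, gt, alpha=2.0, beta=4.0, eps=1e-6):
    """Penalty-reduced focal loss: exact peaks (gt == 1) vs. the Gaussian-weighted rest."""
    pos = gt.eq(1).float()
    neg = 1.0 - pos
    pos_loss = ((1 - pred) ** alpha) * torch.log(pred + eps) * pos
    neg_loss = ((1 - gt) ** beta) * (pred ** alpha) * torch.log(1 - pred + eps) * neg
    num_obj = pos.sum().clamp(min=1.0)          # N: number of object centers
    return -(pos_loss.sum() + neg_loss.sum()) / num_obj

gt = gaussian_heatmap(152, 152, mu_x=60.0, mu_y=80.0, sigma=3.0).unsqueeze(0).unsqueeze(0)
pred = torch.sigmoid(torch.randn(1, 1, 152, 152))
print(heatmap_focal_loss(pred, gt).item())
```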

3.4.2. Offset

The offset branch predicts the offset values for each object location; its output is $O \in \mathbb{R}^{2 \times \frac{H}{s} \times \frac{W}{s}}$. This offset map compensates for the quantization error between the continuous-valued center coordinates and their discrete counterparts, bridging the gap introduced by the spatial downsampling factor $s$ during heatmap generation. For a ground truth object center $\hat{c} = (\hat{c}_x, \hat{c}_y)$ in the input image space, the quantization-induced offset $o$ is computed as
$$o = \left( \frac{\hat{c}_x}{s} - \left\lfloor \frac{\hat{c}_x}{s} \right\rfloor, \; \frac{\hat{c}_y}{s} - \left\lfloor \frac{\hat{c}_y}{s} \right\rfloor \right).$$
Offset map optimization employs the smooth $L_1$ loss:
$$L_o = \frac{1}{N} \sum_{k=1}^{N} \mathrm{SmoothL1}(o_k - \hat{o}_k),$$
where $N$ is the number of objects.
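A minimal sketch of the offset target and its smooth $L_1$ objective, using the downsampling factor s = 4 stated in Section 3.1; the center coordinates below are made-up example values.

```python
import torch
import torch.nn.functional as F

def offset_target(centers: torch.Tensor, s: int = 4) -> torch.Tensor:
    """Quantization error o = c/s - floor(c/s) for each ground-truth center (x, y)."""
    scaled = centers / s
    return scaled - torch.floor(scaled)

centers = torch.tensor([[123.7, 245.2], [60.0, 81.9]])  # centers in input-image pixels
o_gt = offset_target(centers)                           # targets in [0, 1)
o_pred = torch.rand_like(o_gt)                          # predictions gathered at the peaks
loss_o = F.smooth_l1_loss(o_pred, o_gt)                 # L_o
print(o_gt, loss_o.item())
```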

3.4.3. Box Parameter

This branch predicts the box parameters for object bounding boxes. The output is $B \in \mathbb{R}^{10 \times \frac{H}{s} \times \frac{W}{s}}$.
We use the box parameter tensor $B$ to capture box boundary-aware vectors that represent the OBBs of objects. Each object's box boundary is represented by five two-dimensional vectors, resulting in ten channels in total. Given an object's center point $c = (c_x, c_y)$ and orientation angle $\theta$, in the Cartesian coordinate system centered at $c$, the boundary vectors $v_k$ are derived from the radial distance $r$ between the center and the corresponding box edge. The vector components are computed as
$$v_k = r \left( \cos\!\left(\frac{k\pi}{2} + \theta\right), \; \sin\!\left(\frac{k\pi}{2} + \theta\right) \right), \quad k = 0, \ldots, 4.$$
The ground truth of the box parameter map is generated from the bounding box ground truth, and its training loss is the smooth $L_1$ loss
$$L_b = \frac{1}{N} \sum_{k=1}^{N} \mathrm{SmoothL1}(b_k - \hat{b}_k).$$
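The snippet below sketches the boundary-vector construction for one object. Since the text describes a radial distance from the center to each corresponding box edge, a separate radius per vector is assumed here (the values are illustrative), and the five 2D vectors are flattened into the ten box-parameter channels.

```python
import math
import torch

def boundary_vectors(radii, theta: float) -> torch.Tensor:
    """v_k = r_k * (cos(k*pi/2 + theta), sin(k*pi/2 + theta)), k = 0..4.

    `radii` holds the center-to-edge distances (illustrative values below);
    the five 2D vectors fill one spatial location of the 10-channel map B.
    """
    vecs = []
    for k, r in enumerate(radii):
        ang = k * math.pi / 2 + theta
        vecs.append(torch.tensor([r * math.cos(ang), r * math.sin(ang)]))
    return torch.stack(vecs)          # shape (5, 2)

v = boundary_vectors(radii=[20.0, 35.0, 20.0, 35.0, 20.0], theta=math.radians(30))
print(v)                              # five boundary vectors
print(v.flatten().shape)              # torch.Size([10]) -> one location of B
```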

3.4.4. Orientation Map

The orientation map branch predicts the orientation type of the objects. Its output is $\alpha \in \mathbb{R}^{1 \times \frac{H}{s} \times \frac{W}{s}}$.
The proposed method employs the orientation map $\alpha$ to categorize bounding boxes into two distinct types: rotational bounding boxes (RBBs) and horizontal bounding boxes (HBBs). HBBs are defined as those whose orientation angle $\theta$ lies within a threshold $\theta_{th}$, while RBBs have $\theta$ outside this threshold. A sigmoid function maps the predicted orientation value to the range $[0, 1]$, and the ground truth of the orientation map is set to 0 for HBBs and 1 for RBBs. The orientation map is trained using the binary cross-entropy loss
$$L_\alpha = -\frac{1}{N} \sum_{k} \left[ \hat{\alpha}_k \log(\alpha_k) + (1 - \hat{\alpha}_k) \log(1 - \alpha_k) \right].$$
The total loss $L$ of the network is formulated as the weighted summation of the individual losses across all output maps, expressed as follows:
$$L = L_h + \lambda_o L_o + \lambda_b L_b + \lambda_\alpha L_\alpha,$$
where $\lambda_o$, $\lambda_b$, and $\lambda_\alpha$ are the weights for the offset loss, box parameter loss, and orientation map loss, respectively; these weights can be tuned empirically.
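A brief sketch of the weighted combination above; the λ values are placeholders to be tuned empirically, and the four scalar inputs stand in for the branch losses defined earlier.

```python
import torch
import torch.nn.functional as F

def total_loss(l_h, l_o, l_b, l_alpha, lam_o=1.0, lam_b=1.0, lam_alpha=1.0):
    """L = L_h + lam_o * L_o + lam_b * L_b + lam_alpha * L_alpha (weights are placeholders)."""
    return l_h + lam_o * l_o + lam_b * l_b + lam_alpha * l_alpha

# Illustrative scalar losses standing in for the four branch objectives.
l_h = torch.tensor(1.3)
l_o = F.smooth_l1_loss(torch.rand(8, 2), torch.rand(8, 2))
l_b = F.smooth_l1_loss(torch.rand(8, 10), torch.rand(8, 10))
l_alpha = F.binary_cross_entropy(torch.rand(8, 1), torch.randint(0, 2, (8, 1)).float())
print(total_loss(l_h, l_o, l_b, l_alpha).item())
```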
The post-processing pipeline begins with decoding the model’s predictions, which include heatmaps, offsets, and box parameters. These predictions are transformed into geometric representations of objects, with each detection described by its center point and four corner coordinates. The predicted coordinates are rescaled to the original image dimensions based on the predefined downsampling ratio and the input image size. A critical step in this pipeline is non-maximum suppression (NMS), which filters out redundant detections by suppressing overlapping boxes, retaining only those with the highest confidence scores. The procedure guarantees that every identified object is delineated by one optimal bounding box with maximum confidence. Additionally, detections below a predefined confidence threshold are discarded to further refine the results. The refined detections are organized by category, with each category’s results stored in a structured dictionary. This organization allows for efficient access and manipulation of detection results for each class. Finally, the results are formatted and written to text files, with each line representing a detected object, including its bounding box coordinates, confidence score, and image ID.
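To illustrate the first decoding step of this pipeline, the sketch below extracts local heatmap peaks with max pooling and top-k selection and rescales them to input-image pixels using s = 4; the confidence threshold is an illustrative value, and the remaining steps (corner reconstruction from boundary vectors and rotated NMS) are omitted.

```python
import torch
import torch.nn.functional as F

def decode_centers(heatmap: torch.Tensor, k: int = 100, s: int = 4):
    """Pick the top-k local peaks of a (C, H, W) heatmap as candidate centers."""
    # Keep only local maxima: a pixel survives if it equals its 3x3 neighborhood max.
    pooled = F.max_pool2d(heatmap.unsqueeze(0), kernel_size=3, stride=1, padding=1).squeeze(0)
    peaks = heatmap * (heatmap == pooled).float()

    c, h, w = peaks.shape
    scores, idx = peaks.view(-1).topk(k)
    cls = torch.div(idx, h * w, rounding_mode="floor")
    ys = torch.div(idx % (h * w), w, rounding_mode="floor")
    xs = idx % w
    # Rescale from feature-map coordinates back to input-image pixels.
    return cls, scores, xs.float() * s, ys.float() * s

hm = torch.sigmoid(torch.randn(15, 152, 152))   # e.g., 15 DOTA categories
cls, scores, xs, ys = decode_centers(hm, k=10)
keep = scores > 0.3                              # illustrative confidence threshold
print(cls[keep], xs[keep], ys[keep])
```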
Baseline method: This paper adopts BBAVectors as the baseline method, focusing on regressing the four vectors in the Cartesian coordinate system. The presented approach is evaluated against BBAVectors to highlight the benefits of incorporating feature pyramids and boundary vectors.

4. Experiments

4.1. Datasets

Several oriented object detection datasets are available in the remote sensing community, each with distinct characteristics. For instance, DIOR [36] and RSD-GOD [37] are primarily designed for horizontal bounding box detection and are widely used for general object detection. UCAS-AOD [38] focuses on aerial vehicles but is relatively small in scale. In contrast, DOTA-V1.0 [39] is widely regarded as the most comprehensive and challenging benchmark for oriented object detection, with extensive adoption in the prior literature. Similarly, HRSC2016 [40] has emerged as the predominant dataset for oriented ship detection, offering high-resolution imagery and precise annotations.
HRSC 2016 was developed by Northwestern Polytechnical University and made publicly available in 2016. It contains a total of 1680 images, among which 1061 are effectively annotated. The dataset is partitioned into 436 training images for model learning and parameter tuning, 181 validation images to optimize model performance and prevent overfitting, and 444 testing images for objective evaluation of the trained model’s performance. There are 2976 ship targets in total. The imagery exhibits spatial resolutions varying between 0.4 and 2 m, enabling clear capture of ship details. The dimensions of the images range between 300 × 300 pixels and 1500 × 900 pixels. The dataset adopts the oriented bounding box annotation format, using rotation box annotations to precisely capture ship orientation and attitude. It serves as a fundamental benchmark for ship detection algorithm development in remote sensing applications, with widespread adoption in comparative studies assessing method performance.
The DOTA-V1.0 benchmark represents a comprehensive collection for oriented object detection tasks, comprising 2806 remotely sensed images acquired through diverse sensor platforms. This dataset presents significant challenges due to substantial variations in object scales, orientations, and geometric configurations. The imagery resolution ranges between 800 × 800 and 4000 × 4000 pixels, encompassing 188,282 precisely annotated instances across 15 object categories, including Plane (PL), Soccer-Ball Field (SBF), Bridge (BR), Baseball Diamond (BD), Small Vehicle (SV), Ground Track Field (GTF), Large Vehicle (LV), Tennis Court (TC), Ship (SH), Storage Tank (ST), Basketball Court (BC), Roundabout (RA), Swimming Pool (SP), Harbor (HA), and Helicopter (HC). The dataset is partitioned into training (1411 images), validation (458 images), and testing (937 images) subsets. For improved detection accuracy, standard practice involves tiling the original imagery into 600 × 600 pixel patches with a 100-pixel overlap between adjacent patches. This yields 69,337 training/validation patches and 35,777 testing patches for performance assessment. Final detections undergo non-maximum suppression (IoU threshold = 0.3) prior to online evaluation to remove duplicate predictions. As a widely adopted benchmark, this dataset has substantially advanced oriented object detection research.
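A minimal sketch of a sliding-window tiling consistent with the 600 × 600 patches and 100-pixel overlap described above; edge handling is simplified relative to the standard DOTA development toolkit.

```python
def tile_coords(img_w: int, img_h: int, patch: int = 600, overlap: int = 100):
    """Top-left corners of overlapping patches covering the whole image."""
    stride = patch - overlap
    xs = list(range(0, max(img_w - patch, 0) + 1, stride))
    ys = list(range(0, max(img_h - patch, 0) + 1, stride))
    # Make sure the right/bottom borders are covered by a final patch.
    if xs[-1] + patch < img_w:
        xs.append(img_w - patch)
    if ys[-1] + patch < img_h:
        ys.append(img_h - patch)
    return [(x, y) for y in ys for x in xs]

# A 4000 x 4000 DOTA scene yields a grid of overlapping 600 x 600 patches.
coords = tile_coords(4000, 4000)
print(len(coords), coords[:3])
```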
Therefore, to comprehensively assess the performance of the proposed MAK-BRNet, we evaluate it on these two established benchmarks, DOTA-V1.0 and HRSC 2016, which employ oriented bounding boxes for annotation and are highly relevant for evaluating multi-scale and rotation-aware detection methods.

4.2. Implementation Details

In the implementation stage of our research, we carefully address several crucial aspects to ensure the efficacy and reliability of our approach.
During model training, the MAK-BRNet is configured with a batch size of 12 and implemented across two NVIDIA RTX 3090 Ti GPUs. For optimal convergence on the DOTA-V1.0 dataset, approximately 200 training epochs are executed to capture intricate data patterns. Given the relatively smaller size and unique features of the HRSC 2016 dataset, we train the MAK-BRNet for 100 epochs to attain superior performance. Moreover, we filter out images without targets to facilitate better network convergence. The computational efficiency of the MAK-BRNet is evaluated on an NVIDIA RTX 3090 Ti GPU, with performance benchmarks conducted using both HRSC 2016 and DOTA-V1.0 datasets to assess real-world applicability.
We execute the implementation of our method by leveraging the PyTorch 1.12.1 framework. The network backbone utilizes ImageNet pre-trained weights for initialization. This initialization strategy capitalizes on the rich feature representations from the large-scale ImageNet dataset. For the input images, we follow the baseline network protocol by training on both the train and val splits and testing on the test split. The input imagery is standardized to 608 × 608 pixels, while maintaining an output resolution of 152 × 152 pixels. This parameter configuration optimizes the trade-off between computational efficiency and detection accuracy.
In terms of data augmentation, we solely employ flipping to verify the model’s effectiveness. Random flipping enriches the diversity of the training data by incorporating mirror images of the original data.
The optimization procedure is of great significance for model training. MAK-BRNet employs the Adam optimizer, initialized with a learning rate of $1 \times 10^{-4}$. We optimize the overall loss function $L = L_h + L_o + L_b + L_\alpha$, where $L_h$ is the loss associated with the heatmap, $L_o$ is the loss for the offset, $L_b$ is the loss for the box parameters, and $L_\alpha$ is the loss for the orientation. This all-encompassing loss function takes into consideration various facets of the object detection task and steers the model towards learning the optimal parameters.
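For completeness, a toy sketch of this optimization setup (Adam with an initial learning rate of 1e-4); the model and loss below are placeholders standing in for MAK-BRNet and the composite objective.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Placeholder model standing in for MAK-BRNet.
model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.Conv2d(16, 1, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # initial LR from Section 4.2

images = torch.randn(2, 3, 608, 608)        # inputs standardized to 608 x 608
targets = torch.rand(2, 1, 608, 608)

optimizer.zero_grad()
loss = F.mse_loss(model(images), targets)   # stand-in for L = L_h + L_o + L_b + L_alpha
loss.backward()
optimizer.step()
print(loss.item())
```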
During the testing phase, we resized the data to 608 × 608. For the DOTA-V1.0 dataset, we employed two scales (0.5 and 1.0) for image cropping during inference. The use of dual scales was intended to mitigate potential boundary-level missed detections that may arise from single-scale cropping, with the results fused to ensure final detection accuracy. For evaluation metrics on DOTA-V1.0 and HRSC 2016 datasets, we utilized the mAP with a 0.5 polygon IoU threshold. The model’s computational complexity was quantified through parameter count (Params) and floating point operation (FLOP) measurements. For the post-processing non-maximum suppression (NMS), we selected 0.3 as the final hyperparameter based on the evaluation results from multiple trials with values of 0.1, 0.2, and 0.3, as assessed by the online server.

4.3. Ablation Experiments and Discussions

Ablation studies on both the DOTA-V1.0 and HRSC 2016 datasets (Table 1 and Table 2) systematically evaluate the contribution of each framework component. Our proposed MAK-BRNet includes three experimental settings: MAK-BRNet-O, MAK-BRNet-R, and MAK-BRNet-D. MAK-BRNet-O denotes the results without using the adaptive kernel module, MAK-BRNet-R represents the results with the adaptive kernel module incorporated, and MAK-BRNet-D indicates the results using a lightweight backbone. The experimental configuration facilitates systematic assessment of individual component contributions to model performance.

4.3.1. Ablation Study on DOTA-V1.0

The ablation analysis conducted on DOTA-V1.0 systematically assesses the MAK-BRNet fusion architecture’s efficacy in oriented object detection tasks. The results are presented in Table 1, where we compare different configurations of our method, including BBAVectors, MAK-BRNet-O, MAK-BRNet-R, and MAK-BRNet-D, across various backbone networks.
BBAVectors, serving as the baseline method with a ResNet-101 backbone, achieves an mAP of 72.32% at the cost of relatively high computational complexity (176.61 G FLOPs) and a moderate inference speed (39.41 FPS). The baseline method establishes a fundamental framework for oriented object detection, yet it demonstrates potential for optimization regarding computational efficiency and compactness.
MAK-BRNet-O, which integrates the FPN structure with ResNet backbones but without the adaptive kernel module, demonstrates competitive performance while reducing computational complexity. For example, MAK-BRNet-O with ResNet-101 achieves an mAP of 75.72%, an improvement of 3.40% over BBAVectors, with significantly reduced FLOPs (128.68 G) and a higher inference speed (43.14 FPS). This suggests that the FPN structure enhances feature extraction efficiency without sacrificing detection accuracy.
MAK-BRNet-R, incorporating the adaptive kernel module into the FPN structure, further improves detection accuracy with a slight increase in computational cost. Using ResNet-101, MAK-BRNet-R attains an mAP of 75.84%, representing a 3.52% improvement over BBAVectors, while maintaining reasonable computational efficiency (131.39 G FLOPs) and inference speed (43.43 FPS). The adaptive kernel module significantly improves local feature sensitivity, consequently boosting detection accuracy.
MAK-BRNet-D employs the DecoupleNet backbone to optimize the accuracy–efficiency trade-off. DecoupleNet D2 yields an mAP of 73.59% with only 81.11 G FLOPs and 9.49 M parameters, achieving 43.46 FPS. This configuration is particularly suitable for resource-constrained environments due to its reduced model complexity while maintaining competitive performance.
Based on the analysis of the remote sensing data, our proposed methods, MAK-BRNet-O and MAK-BRNet-R, consistently outperformed the baseline method (BBAVectors) across various ResNet backbones. The combined FPN and adaptive kernel architecture substantially improves detection precision while optimizing computational efficiency and accelerating inference. The MAK-BRNet-D approach incorporating DecoupleNet demonstrated an optimal balance between detection accuracy and computational efficiency, particularly advantageous for deployment in resource-limited scenarios. These experimental outcomes confirm the performance enhancement of the proposed methodology for oriented object detection on the DOTA-V1.0 benchmark, indicating substantial applicability in real-world implementations.

4.3.2. Ablation Study on HRSC 2016

As shown in Table 2, the ablation studies conducted on the HRSC 2016 dataset demonstrate the performance improvements achieved through the proposed approach, as evidenced by comparing different configurations (BBAVectors, MAK-BRNet-O, MAK-BRNet-R, and MAK-BRNet-D) across various backbone networks.
BBAVectors, the baseline approach employing a ResNet-101 backbone, yields an mAP of 88.22% at a computational cost of 176.53 GFLOPs while maintaining an inference rate of 49.87 frames per second, indicating high detection accuracy at the cost of substantial computational resources.
MAK-BRNet-O, integrating the FPN structure with ResNet backbones without the adaptive kernel module, improves both performance and computational efficiency. For example, MAK-BRNet-O with ResNet-101 achieves an mAP of 89.68%, a 1.46% increase over BBAVectors, while reducing FLOPs to 128.59 G and maintaining a reasonable inference speed of 54.14 FPS.
MAK-BRNet-R, incorporating the adaptive kernel module into the FPN structure, further enhances detection accuracy. Using ResNet-101, MAK-BRNet-R attains an mAP of 89.73%, a slight improvement over MAK-BRNet-O, with a marginal increase in computational cost (131.31 G FLOPs) and a slightly reduced inference speed (49.48 FPS), suggesting effective feature refinement for better detection performance.
MAK-BRNet-D, utilizing the lightweight DecoupleNet as the backbone, offers a practical solution for resource-constrained environments. DecoupleNet D2 achieves an mAP of 86.05% with only 81.03 G FLOPs and 9.49 M parameters, at an inference speed of 53.85 FPS. Although less accurate than MAK-BRNet-R with ResNet-101, its significantly reduced computational complexity makes it suitable for applications with limited computational resources.
Overall, our proposed methods demonstrate consistent improvements in detection accuracy and computational efficiency across various backbone architectures on the HRSC2016 dataset. The integration of FPN and adaptive kernel modules significantly enhances detection performance, while MAK-BRNet-D provides a valuable alternative for scenarios requiring efficient computation.

4.4. Comparison with the State-of-the-Art Detection Methods

MAK-BRNet was assessed and benchmarked on two distinct datasets: the DOTA-V1.0 dataset (Table 3) and the HRSC 2016 dataset (Table 4). Traditional two-stage methods based on anchor boxes require a substantial number of anchor frames, which leads to significant computational complexity and extensive hyperparameter tuning, thereby severely compromising computational speed. Anchor-free detectors, by eliminating anchor boxes in network design, can mitigate these issues. However, they generally exhibit slightly inferior performance compared to their anchor-based counterparts. Nevertheless, despite being configured as an anchor-free detector, our MAK-BRNet achieved highly competitive performance across all datasets.

4.4.1. Results on DOTA-V1.0

The performance of MAK-BRNet is evaluated and compared with leading existing approaches on the DOTA-V1.0 dataset. As shown in Table 3, our methods demonstrate superior performance in terms of mean Average Precision (mAP) and accuracy across various object categories. Our approach achieves an overall mAP of 75.84%, which ranks among the top performers of the methods compared in this study. The experimental results on the DOTA-V1.0 dataset confirm that the proposed method achieves higher detection accuracy for oriented objects across multiple categories in remote sensing imagery compared to current state-of-the-art approaches.
MAK-BRNet demonstrates superior performance compared to the baseline approach, BBAVectors [25], which attained an mAP of 72.32%. The improvement is evident across multiple categories. For example, the MAK-BRNet attains 76.78% accuracy in the Small Vehicles category, representing a significant improvement over existing methods. Similarly, in the category of Large Vehicles, our method attains an accuracy of 81.07%, further highlighting its robustness and effectiveness in handling diverse object shapes and sizes.
Moreover, our method maintains high accuracy in categories where other methods struggle, such as Basketball Court (BC) and Storage Tank (ST). Our approach achieves accuracies of 87.48% and 86.23%, respectively, indicating its ability to effectively detect and classify complex objects with varying aspect ratios and orientations.
When compared to other methods, MAK-BRNet-R shows significant improvements. For instance, LoRA-Det [18] achieves a high mAP of 75.07%, but MAK-BRNet-R surpasses it with an mAP of 75.84%. The proposed method demonstrates enhanced detection accuracy for specific object categories including Helicopter, Basketball Court, Large Vehicle, and Plane. SASM [3] has an mAP of 74.92%, slightly lower than our method. Our approach outperforms SASM in most categories, particularly in Helicopter, Large Vehicle, Baseball Diamond, Plane, and Basketball Court, demonstrating more consistent and accurate detection across different object scales and orientations. AO2-DETR [13] achieves an mAP of 72.15%, which is lower than our method. MAK-BRNet-R demonstrates notable performance enhancements in object categories including Large Vehicle, Ship, Small Vehicle, and Basketball Court, highlighting its effectiveness in handling complex object detection scenarios. Additionally, ROI Trans.+FPN [12] has an mAP of 69.56%, lower than our proposed method. MAK-BRNet-R demonstrates better performance in multiple categories, indicating more accurate detection and classification capabilities.
As illustrated in Figure 3, MAK-BRNet demonstrates sound performance on the challenging DOTA-V1.0 benchmark, which encompasses diverse categories, multi-scale, and arbitrary orientations for object detection in remote sensing imagery. Our solution consistently generates semantically accurate and spatially precise rotated bounding boxes across a diverse set of objects—including Small Vehicle (SV) in parking lots, Soccer-Ball Field (SBF) along roads, ships (SH) in waterways, Helicopter (HC) on runways, and artificial structures such as Roundabout (RA) and Tennis Court (TC). Notably, the network exhibits strong feature extraction and contextual modeling capabilities, particularly when handling small objects (e.g., vehicles and ships), densely arranged instances (e.g., clustered parking areas), and objects with extreme aspect ratios (e.g., storage tanks and bridges). These results validate the generalization ability and practical utility of our proposed architecture in complex real-world scenarios, providing a solid technical foundation for intelligent interpretation of high-resolution remote sensing imagery.

4.4.2. Results on HRSC2016

The performance evaluation of MAK-BRNet was conducted on the HRSC2016 dataset in comparison with current state-of-the-art techniques. As presented in Table 4, the method achieves an mAP of 90.13%, demonstrating enhanced detection capability for oriented objects in this dataset.
When compared to other methods, MAK-BRNet shows significant improvements. For example, RRD [47] with VGG16 as the backbone achieves an mAP of 84.30%. Our method surpasses these approaches by a substantial margin, demonstrating superior detection capabilities. ROI Trans. [12] achieves an mAP of 86.20% with ResNet-101 as the backbone, which is still lower than our proposed method. BBAVectors [25] achieves an mAP of 88.22% with ResNet-101, but our method further improves upon this result. AproNet [48] achieves an mAP of 90.00% with ResNet-101, which is very close to our method. However, our approach still manages to achieve a slightly higher mAP, indicating its effectiveness in handling the HRSC2016 dataset.
As shown in Figure 4, MAK-BRNet exhibits exceptional ship detection performance on the HRSC 2016 dataset. Our proposed methods enable accurate localization and delineation of vessels across diverse scales, arbitrary orientations, and varying densities, maintaining high recall and precise bounding box alignment even under challenging scenarios including cluttered harbor scenes, partial occlusion, and low-contrast marine settings. Remarkably, the network reliably produces highly accurate rotated bounding boxes for objects with extreme aspect ratios, including aircraft carriers and destroyers, substantially reducing background false detections commonly associated with traditional horizontal bounding box detectors. These outcomes fully attest to the effectiveness and robustness of our method for fine-grained, high-precision object detection in remote sensing applications.
MAK-BRNet achieves enhanced detection accuracy through its novel architecture and optimized combination of specialized methods for oriented targets. It employs a robust feature extraction backbone and an optimized detection pipeline, achieving high accuracy and efficiency. Moreover, the use of a lightweight DecoupleNet as the backbone results in the smallest model parameter size with minimal performance compromise.
Although MAK-BRNet does not establish new state-of-the-art records in either detection accuracy or computational efficiency in isolation, it simultaneously improves both metrics relative to the baseline, achieving a favorable balance between precision and speed.

4.5. Comparison with Baseline Methods

To highlight the advantages of our proposed FPN and feature integration architecture, we conducted a comparative study with a baseline method. Specifically, MAK-BRNet and the baseline method share the same decoder and post-processing steps, and the training process is identical, without employing any data augmentation techniques. Experimental results demonstrate performance enhancements of 1.91% and 3.52% on the HRSC 2016 and DOTA-V1.0 datasets, respectively, when compared with BBAVectors, as detailed in Table 3 and Table 4. Furthermore, when compared to the U-Net architecture, our FPN structure demonstrates superior performance for the detection of oriented objects. This is evidenced by its faster inference speed, fewer model parameters, and higher detection accuracy. The experimental outcomes demonstrate the proposed architecture’s capability in addressing oriented object detection challenges with both effectiveness and efficiency.
Our method attributes its generalizability to novel component designs: the Orientation Map module supports diverse angle definition systems, while the Adaptive Kernel module enables transferable scale adaptation through its structured weight allocation. Furthermore, the boundary representation naturally extends to polygonal instances with variable vertex counts, underscoring its versatility beyond oriented bounding boxes.
Figure 5 presents a detailed qualitative comparison of MAK-BRNet against the baseline approach across various detection scenarios. In Figure 5a, the baseline method fails to detect objects with low contrast or partial occlusion. In contrast, our method successfully identifies these objects, demonstrating its robustness in capturing a broader range of object appearances. Figure 5b illustrates the baseline method’s erroneous detections, where a small vehicle is mistakenly identified as a large vehicle. This issue likely stems from the method’s loss of detail during feature extraction and its limited ability to distinguish between different object classes. However, our method accurately differentiates between object categories, achieving more precise and reliable detection. Figure 5c shows the baseline method generating redundant detections, where multiple bounding boxes are applied to the same object and the background is incorrectly identified as different objects. Such redundancy can lead to confusion and inefficiency in subsequent analysis stages. Our method effectively mitigates this issue, thereby enhancing the clarity and utility of the detection results. Figure 5d demonstrates that in scenarios where objects are oriented at angles, the baseline method predicts the angles of non-oriented targets randomly. Our method tends to unify the orientation of such randomly angled objects. Figure 5e shows that the baseline method often assigns low confidence scores to correct detections, which can lead to the rejection of valid detections during post-processing. Our method, on the other hand, assigns higher confidence scores to correct targets, thus reducing the likelihood of false negatives and improving overall detection performance. In summary, MAK-BRNet demonstrates clear advantages over the baseline approach in these scenarios, and these improvements are crucial for enhancing the reliability and effectiveness of remote sensing applications, as shown in the figure.

4.6. Visual Analysis

MAK-BRNet’s detection results underwent qualitative visual assessment on the DOTA-V1.0 benchmark dataset. Figure 6 presents the detailed detection results. The input images cover a wide range of scenes with objects of various categories, scales, and aspect ratios. Our proposed method demonstrates remarkable effectiveness in detecting oriented objects within remote sensing scenarios. It accurately predicts oriented bounding boxes that are well aligned with the targets, even for small objects and those with large aspect ratios. The method demonstrates consistent detection accuracy across varied and challenging scenes, confirming its robustness for practical remote sensing implementations.
Figure 7 visualizes feature maps before and after the adaptive kernel module. Following its processing, the feature representations exhibit markedly tighter semantic alignment with the predicted parameters. Specifically, pre-adaptive activations corresponding to object centers (x, y) are spatially diffuse, whereas post-adaptive activations collapse into sharp peaks precisely at the target centers while suppressing background responses. Consequently, the module delivers pronounced gains for densely packed objects, where the concentrated features substantially lower both false-positive and false-negative rates. These observations confirm that the learned, object-specific kernels successfully steer the feature focus onto the targets, thereby validating the efficacy of the proposed architecture.

5. Conclusions

In summary, MAK-BRNet integrates a multi-scale feature pyramid architecture with adaptive kernel selection and boundary optimization mechanisms, achieving accurate oriented object detection in aerial imagery. The framework innovatively combines hierarchical feature extraction with boundary-sensitive regression vectors. The FPN backbone captures detailed features at multiple scales, ensuring that objects of different sizes are accurately represented. Meanwhile, the adaptive kernel module significantly improves target-background discrimination. By employing large kernel convolutions and attention mechanisms, this module refines the features extracted by the FPN backbone, effectively suppressing background objects while providing appropriate receptive fields for the detector. Finally, the boundary refinement network accurately localizes oriented object bounding boxes, with experimental validation on both DOTA-V1.0 and HRSC 2016 datasets confirming the method’s effectiveness. It achieves enhanced computational efficiency and better adaptability to different object scales and orientations compared to existing detectors, indicating significant potential for practical applications. Nevertheless, certain limitations remain: the model relies on fixed-size inputs, which may lead to boundary artifacts in gigapixel imagery, and generalization to specialized tasks such as small object detection or fine-grained categorization requires further investigation. Future research will focus on several key directions. In the context of satellite remote sensing, efforts will be directed toward model lightweighting to enable efficient onboard data processing, thereby enhancing real-time monitoring capabilities and reducing the need for extensive ground-based computation. More broadly, future directions may include optimizing the backbone–head combination, exploring new data augmentation strategies, and applying the method to more complex datasets. The developed techniques contribute to improved detection accuracy and processing speed in remote sensing applications, increasing their practical utility. The MAK-BRNet framework demonstrates significant progress in addressing the challenges of oriented object detection.

Author Contributions

Conceptualization, P.G.; methodology, G.N. and X.Y.; validation, X.W.; formal analysis, Y.L. and L.C.; investigation, X.W.; resources, P.G.; data curation, X.Y.; writing—original draft, X.Y.; writing—review and editing, G.N., P.G. and E.Y.; visualization, X.Y.; supervision, Y.L.; project administration, G.N. and X.W.; funding acquisition, Y.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China, grant number 61901504.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author. The code supporting the findings of this study is openly available in the GitHub repository MAK-BRNet at https://github.com/Sonel-YXL/MAK-BRNet (Version V1.0, accessed on 16 December 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Lu, W.; Chen, S.B.; Shu, Q.L.; Tang, J.; Luo, B. DecoupleNet: A Lightweight Backbone Network With Efficient Feature Decoupling for Remote Sensing Visual Tasks. IEEE Trans. Geosci. Remote Sens. 2024, 62, 4414613. [Google Scholar] [CrossRef]
  2. Wang, D.; Zhang, J.; Xu, M.; Liu, L.; Wang, D.; Gao, E.; Han, C.; Guo, H.; Du, B.; Tao, D.; et al. Mtp: Advancing remote sensing foundation model via multi-task pretraining. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 11632–11654. [Google Scholar] [CrossRef]
  3. Hou, L.; Lu, K.; Xue, J.; Li, Y. Shape-adaptive selection and measurement for oriented object detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Online, 22 February–1 March 2022; Volume 36, pp. 923–932. [Google Scholar]
  4. Yang, Z.; Liu, S.; Hu, H.; Wang, L.; Lin, S. Reppoints: Point set representation for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9657–9666. [Google Scholar]
  5. Zhao, Z.; Xue, Q.; He, Y.; Bai, Y.; Wei, X.; Gong, Y. Projecting points to axes: Oriented object detection via point-axis representation. In Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024; Springer: Berlin/Heidelberg, Germany, 2024; pp. 161–179. [Google Scholar]
  6. Law, H.; Deng, J. Cornernet: Detecting objects as paired keypoints. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 734–750. [Google Scholar]
  7. Zhou, X.; Zhuo, J.; Krahenbuhl, P. Bottom-up object detection by grouping extreme and center points. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 850–859. [Google Scholar]
  8. Duan, K.; Bai, S.; Xie, L.; Qi, H.; Huang, Q.; Tian, Q. Centernet: Keypoint triplets for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6569–6578. [Google Scholar]
  9. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1137–1149. [Google Scholar] [CrossRef]
  10. Yi, J.; Wu, P.; Metaxas, D.N. ASSD: Attentive single shot multibox detector. Comput. Vis. Image Underst. 2019, 189, 102827. [Google Scholar] [CrossRef]
  11. Xie, X.; Cheng, G.; Wang, J.; Yao, X.; Han, J. Oriented R-CNN for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 3520–3529. [Google Scholar]
  12. Ding, J.; Xue, N.; Long, Y.; Xia, G.S.; Lu, Q. Learning RoI transformer for oriented object detection in aerial images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 2849–2858. [Google Scholar]
  13. Dai, L.; Liu, H.; Tang, H.; Wu, Z.; Song, P. Ao2-detr: Arbitrary-oriented object detection transformer. IEEE Trans. Circuits Syst. Video Technol. 2022, 33, 2342–2356. [Google Scholar] [CrossRef]
  14. Xu, Y.; Fu, M.; Wang, Q.; Wang, Y.; Chen, K.; Xia, G.S.; Bai, X. Gliding vertex on the horizontal bounding box for multi-oriented object detection. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 43, 1452–1459. [Google Scholar] [CrossRef]
  15. Hou, L.; Lu, K.; Yang, X.; Li, Y.; Xue, J. G-rep: Gaussian representation for arbitrary-oriented object detection. Remote Sens. 2023, 15, 757. [Google Scholar] [CrossRef]
  16. Fu, K.; Chang, Z.; Zhang, Y.; Xu, G.; Zhang, K.; Sun, X. Rotation-aware and multi-scale convolutional neural network for object detection in remote sensing images. ISPRS J. Photogramm. Remote Sens. 2020, 161, 294–308. [Google Scholar] [CrossRef]
  17. Ming, Q.; Zhou, Z.; Miao, L.; Zhang, H.; Li, L. Dynamic anchor learning for arbitrary-oriented object detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtually, 2–9 February 2021; Volume 35, pp. 2355–2363. [Google Scholar]
  18. Pu, X.; Xu, F. Low-rank adaption on transformer-based oriented object detector for satellite onboard processing of remote sensing images. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5202213. [Google Scholar] [CrossRef]
  19. Zeng, Y.; Chen, Y.; Yang, X.; Li, Q.; Yan, J. ARS-DETR: Aspect ratio-sensitive detection transformer for aerial oriented object detection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5610315. [Google Scholar] [CrossRef]
  20. Cheng, G.; Zhou, P.; Han, J. Learning rotation-invariant convolutional neural networks for object detection in VHR optical remote sensing images. IEEE Trans. Geosci. Remote Sens. 2016, 54, 7405–7415. [Google Scholar] [CrossRef]
  21. Cheng, G.; Han, J. A survey on object detection in optical remote sensing images. ISPRS J. Photogramm. Remote Sens. 2016, 117, 11–28. [Google Scholar] [CrossRef]
  22. Deng, Z.; Sun, H.; Zhou, S.; Zhao, J.; Lei, L.; Zou, H. Multi-scale object detection in remote sensing imagery with convolutional neural networks. ISPRS J. Photogramm. Remote Sens. 2018, 145, 3–22. [Google Scholar] [CrossRef]
  23. Li, Y.; Mao, H.; Liu, R.; Pei, X.; Jiao, L.; Shang, R. A lightweight keypoint-based oriented object detection of remote sensing images. Remote Sens. 2021, 13, 2459. [Google Scholar] [CrossRef]
  24. He, X.; Ma, S.; He, L.; Ru, L.; Wang, C. Learning rotated inscribed ellipse for oriented object detection in remote sensing images. Remote Sens. 2021, 13, 3622. [Google Scholar] [CrossRef]
  25. Yi, J.; Wu, P.; Liu, B.; Huang, Q.; Qu, H.; Metaxas, D. Oriented object detection in aerial images with box boundary-aware vectors. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Virtual, 5–9 January 2021; pp. 2150–2159. [Google Scholar]
  26. Song, F.; Ma, R.; Lei, T.; Peng, Z. RAIH-Det: An end-to-end rotated aircraft and aircraft head detector based on ConvNeXt and cyclical focal loss in optical remote sensing images. Remote Sens. 2023, 15, 2364. [Google Scholar] [CrossRef]
  27. Hu, Y.; Jing, Z.; Su, Y.; Zhang, F.; Jing, D. ASSK-Net: Automatic Spatial Selection Kernel Network for Faster Oriented Object Detection. J. Indian Soc. Remote Sens. 2025, 53, 3935–3948. [Google Scholar] [CrossRef]
  28. Liu, Z.; Mao, H.; Wu, C.Y.; Feichtenhofer, C.; Darrell, T.; Xie, S. A convnet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 11976–11986. [Google Scholar]
  29. Ding, X.; Zhang, X.; Han, J.; Ding, G. Scaling up your kernels to 31x31: Revisiting large kernel design in CNNs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 11963–11975. [Google Scholar]
  30. Liu, S.; Chen, T.; Chen, X.; Chen, X.; Xiao, Q.; Wu, B.; Kärkkäinen, T.; Pechenizkiy, M.; Mocanu, D.; Wang, Z. More convnets in the 2020s: Scaling up kernels beyond 51x51 using sparsity. arXiv 2022, arXiv:2207.03620. [Google Scholar]
  31. Wei, X.; Li, Z.; Wang, Y. SED-YOLO based multi-scale attention for small object detection in remote sensing. Sci. Rep. 2025, 15, 3125. [Google Scholar] [CrossRef] [PubMed]
  32. Hou, Q.; Lu, C.Z.; Cheng, M.M.; Feng, J. Conv2former: A simple transformer-style convnet for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 8274–8283. [Google Scholar] [CrossRef] [PubMed]
  33. Song, Q.; Yang, F.; Yang, L.; Liu, C.; Hu, M.; Xia, L. Learning point-guided localization for detection in remote sensing images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2020, 14, 1084–1094. [Google Scholar] [CrossRef]
  34. Liu, J.; Zhang, J.; Ni, Y.; Chi, W.; Qi, Z. Small-Object Detection in Remote Sensing Images with Super Resolution Perception. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 15721–15734. [Google Scholar] [CrossRef]
  35. Liao, M.; Zou, Z.; Wan, Z.; Yao, C.; Bai, X. Real-time scene text detection with differentiable binarization and adaptive scale fusion. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 919–931. [Google Scholar] [CrossRef]
  36. Li, K.; Wan, G.; Cheng, G.; Meng, L.; Han, J. Object detection in optical remote sensing images: A survey and a new benchmark. ISPRS J. Photogramm. Remote Sens. 2020, 159, 296–307. [Google Scholar] [CrossRef]
  37. Zhuang, S.; Wang, P.; Jiang, B.; Wang, G.; Wang, C. A single shot framework with multi-scale feature fusion for geospatial object detection. Remote Sens. 2019, 11, 594. [Google Scholar] [CrossRef]
  38. Zhu, H.; Chen, X.; Dai, W.; Fu, K.; Ye, Q.; Jiao, J. Orientation robust object detection in aerial images using deep convolutional neural network. In Proceedings of the 2015 IEEE International Conference on Image Processing (ICIP), Quebec, QC, Canada, 27–30 September 2015; pp. 3735–3739. [Google Scholar]
  39. Xia, G.S.; Bai, X.; Ding, J.; Zhu, Z.; Belongie, S.; Luo, J.; Datcu, M.; Pelillo, M.; Zhang, L. DOTA: A large-scale dataset for object detection in aerial images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 3974–3983. [Google Scholar]
  40. Liu, Z.; Yuan, L.; Weng, L.; Yang, Y. A high resolution optical satellite image dataset for ship recognition and some new baselines. In Proceedings of the International Conference on Pattern Recognition Applications and Methods; SciTePress: Setúbal, Portugal, 2017; Volume 2, pp. 324–331. [Google Scholar]
  41. Yang, X.; Sun, H.; Fu, K.; Yang, J.; Sun, X.; Yan, M.; Guo, Z. Automatic ship detection in remote sensing images from google earth of complex scenes based on multiscale rotation dense feature pyramid networks. Remote Sens. 2018, 10, 132. [Google Scholar] [CrossRef]
  42. Jiang, Y.; Zhu, X.; Wang, X.; Yang, S.; Li, W.; Wang, H.; Fu, P.; Luo, Z. R2CNN: Rotational region CNN for orientation robust scene text detection. arXiv 2017, arXiv:1706.09579. [Google Scholar] [CrossRef]
  43. Yang, X.; Sun, H.; Sun, X.; Yan, M.; Guo, Z.; Fu, K. Position detection and direction prediction for arbitrary-oriented ships via multitask rotation region convolutional neural network. IEEE Access 2018, 6, 50839–50849. [Google Scholar] [CrossRef]
  44. Azimi, S.M.; Vig, E.; Bahmanyar, R.; Körner, M.; Reinartz, P. Towards multi-class object detection in unconstrained remote sensing imagery. In Proceedings of the Asian Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2018; pp. 150–165. [Google Scholar]
  45. Hu, Z.; Gao, K.; Zhang, X.; Wang, J.; Wang, H.; Yang, Z.; Li, C.; Li, W. EMO2-DETR: Efficient-matching oriented object detection with transformers. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5616814. [Google Scholar] [CrossRef]
  46. Zhang, Z.; Guo, W.; Zhu, S.; Yu, W. Toward arbitrary-oriented ship detection with rotated region proposal and discrimination networks. IEEE Geosci. Remote Sens. Lett. 2018, 15, 1745–1749. [Google Scholar] [CrossRef]
  47. Liao, M.; Zhu, Z.; Shi, B.; Xia, G.S.; Bai, X. Rotation-sensitive regression for oriented scene text detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 5909–5918. [Google Scholar]
  48. Han, J.; Ding, J.; Li, J.; Xia, G.S. Align deep features for oriented object detection. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5602511. [Google Scholar] [CrossRef]
  49. Yu, Y.; Da, F. Phase-shifting coder: Predicting accurate orientation in oriented object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 13354–13363. [Google Scholar]
Figure 1. Comparative analysis of object detection in remote sensing images using different architectures. (a) Detection results using a U-Net-based detector, where blue bounding boxes indicate correct predictions (Small Vehicle class), and red bounding boxes represent misclassified objects (predicted as Large Vehicle when ground truth is Small Vehicle). (b) Detection results using a detector based on FPN with expanded receptive fields, using blue for correct predictions and red for misclassifications.
Figure 2. Overall architecture of MAK-BRNet. It produces heatmaps, offsets, box parameters, and orientation maps through upsampling of the Feature Pyramid Network and stacking of the feature maps. An adaptive kernel is employed to endow the features with a larger receptive field. Bounding boxes are extracted from the final outputs using simple image-processing algorithms.
Figure 3. Detection results of MAK-BRNet on the test dataset of DOTA-V1.0. Rotated bounding boxes around objects denote model-generated detections.
Figure 4. Detection results of MAK-BRNet-R on the HRSC2016 dataset.
Figure 5. Comparison of detection results. (a) Missing detection. (b) Wrong detection. (c) Redundant detection. (d) Angle correction. (e) Increased confidence.
Figure 6. Detection results of MAK-BRNet-R on the DOTA-V1.0 dataset.
Figure 7. (a): Visualization of detection results on the test dataset of DOTA-V1.0. (b,c): Visualization of feature maps before and after processing through the adaptive kernel module in our method further validates its effectiveness.
Table 1. Results on the testing dataset of DOTA-V1.0.
Method | Backbone | mAP (↑) | FLOPs (↓) | #P (↓) | FPS (↑)
BBAVectors | ResNet-101 | 72.32 | 176.61 | 53.43 | 39.41
MAK-BRNet-O | ResNet-18 | 72.14 | 83.64 | 14.33 | 64.56
MAK-BRNet-O | ResNet-34 | 74.83 | 97.30 | 24.44 | 58.62
MAK-BRNet-O | ResNet-50 | 74.00 | 101.18 | 26.85 | 53.60
MAK-BRNet-O | ResNet-101 | 75.72 | 128.68 | 45.84 | 43.14
MAK-BRNet-O | ResNet-152 | 75.51 | 156.21 | 61.48 | 36.04
MAK-BRNet-R | ResNet-18 | 72.53 | 83.35 | 14.45 | 59.38
MAK-BRNet-R | ResNet-34 | 74.65 | 100.02 | 24.56 | 54.29
MAK-BRNet-R | ResNet-50 | 74.58 | 103.89 | 26.96 | 50.96
MAK-BRNet-R | ResNet-101 | 75.84 | 131.39 | 45.96 | 43.43
MAK-BRNet-R | ResNet-152 | 75.82 | 158.92 | 61.60 | 34.17
MAK-BRNet-D | DecoupleNet D0 | 69.73 | 74.94 | 4.81 | 44.38
MAK-BRNet-D | DecoupleNet D2 | 73.59 | 81.11 | 9.49 | 43.46
mAP: mean Average Precision; FLOPs: floating point operations (G); #P: parameters (M); FPS: frames per second; ↑: higher is better; ↓: lower is better.
Table 2. Results on the testing dataset of HRSC2016.
Method | Backbone | mAP (↑) | FLOPs (↓) | #P (↓) | FPS (↑)
BBAVectors | ResNet-101 | 88.22 | 176.53 | 53.43 | 49.87
MAK-BRNet-O | ResNet-18 | 79.00 | 83.56 | 14.33 | 91.27
MAK-BRNet-O | ResNet-34 | 89.10 | 97.22 | 24.44 | 83.77
MAK-BRNet-O | ResNet-50 | 88.34 | 101.09 | 26.85 | 72.03
MAK-BRNet-O | ResNet-101 | 89.68 | 128.59 | 45.83 | 54.14
MAK-BRNet-O | ResNet-152 | 89.97 | 156.13 | 61.48 | 39.71
MAK-BRNet-R | ResNet-18 | 86.02 | 86.27 | 14.45 | 82.61
MAK-BRNet-R | ResNet-34 | 88.16 | 99.93 | 24.55 | 72.53
MAK-BRNet-R | ResNet-50 | 88.70 | 103.81 | 26.96 | 65.75
MAK-BRNet-R | ResNet-101 | 89.73 | 131.31 | 45.95 | 49.48
MAK-BRNet-R | ResNet-152 | 90.13 | 158.84 | 61.60 | 37.25
MAK-BRNet-D | DecoupleNet D0 | 73.71 | 74.86 | 4.81 | 56.46
MAK-BRNet-D | DecoupleNet D2 | 86.05 | 81.03 | 9.49 | 53.85
mAP: mean Average Precision; FLOPs: floating point operations (G); #P: parameters (M); FPS: frames per second; ↑: higher is better; ↓: lower is better.
Table 3. Performance comparisons on DOTA-V1.0 dataset.
Method | mAP | PL | BD | BR | GTF | SV | LV | SH | TC | BC | ST | SBF | RA | HA | SP | HC
FR-O [39] | 54.13 | 79.42 | 77.13 | 17.70 | 64.05 | 35.30 | 38.02 | 37.16 | 89.41 | 69.64 | 59.28 | 50.30 | 52.91 | 47.89 | 47.40 | 46.30
R-DFPN [41] | 57.94 | 80.92 | 65.82 | 33.77 | 58.94 | 55.77 | 50.94 | 54.78 | 90.33 | 66.34 | 68.66 | 48.73 | 51.76 | 55.10 | 51.32 | 35.88
R2CNN [42] | 60.67 | 80.94 | 65.75 | 35.34 | 67.44 | 59.92 | 50.91 | 55.81 | 90.67 | 66.92 | 72.39 | 55.06 | 52.23 | 55.14 | 53.35 | 48.22
Yang et al. [43] | 62.29 | 81.25 | 71.41 | 36.53 | 67.44 | 61.16 | 50.91 | 56.60 | 90.67 | 68.09 | 72.39 | 55.06 | 55.60 | 62.44 | 53.35 | 51.47
ICN [44] | 68.16 | 81.36 | 74.30 | 47.70 | 70.32 | 64.89 | 67.82 | 69.98 | 90.70 | 79.06 | 78.20 | 53.64 | 62.90 | 67.02 | 64.17 | 50.23
ROI Trans. [12] | 67.74 | 88.53 | 77.91 | 37.63 | 74.08 | 66.53 | 62.97 | 66.57 | 90.50 | 79.46 | 76.75 | 59.04 | 56.73 | 62.19 | 55.56 | 55.56
ROI Trans.+FPN [12] | 69.56 | 88.64 | 78.52 | 43.44 | 75.92 | 68.81 | 73.68 | 83.59 | 90.74 | 77.27 | 81.46 | 58.39 | 53.54 | 62.83 | 58.93 | 47.67
EMO2-DETR [45] | 70.91 | 87.99 | 79.46 | 45.74 | 66.64 | 78.90 | 73.90 | 73.30 | 90.40 | 80.55 | 85.89 | 55.19 | 63.62 | 51.83 | 70.15 | 60.04
AO2-DETR [13] | 72.15 | 86.01 | 75.92 | 46.02 | 66.65 | 79.70 | 79.93 | 89.17 | 90.44 | 81.19 | 76.00 | 56.91 | 62.45 | 64.22 | 65.80 | 58.96
SASM [3] | 74.92 | 86.42 | 78.97 | 52.47 | 69.84 | 77.30 | 75.99 | 86.72 | 90.89 | 82.63 | 85.66 | 60.13 | 68.25 | 73.98 | 72.22 | 62.37
ARS-DETR [19] | 75.47 | 87.65 | 76.54 | 50.64 | 69.85 | 79.76 | 83.91 | 87.92 | 90.26 | 86.24 | 85.09 | 54.58 | 67.01 | 75.62 | 73.66 | 63.39
LoRA-Det [18] | 75.07 | 89.38 | 79.56 | 51.59 | 72.10 | 77.48 | 83.13 | 87.80 | 90.88 | 84.52 | 85.42 | 60.44 | 65.35 | 66.24 | 67.28 | 64.82
BBAVectors+r/h [25] | 72.32 | 88.35 | 79.96 | 50.69 | 62.18 | 78.43 | 78.98 | 87.94 | 90.85 | 83.58 | 84.35 | 54.13 | 60.24 | 65.22 | 64.28 | 55.70
MAK-BRNet-R (Ours) | 75.84 | 89.22 | 83.52 | 51.25 | 67.40 | 76.78 | 81.07 | 87.13 | 90.87 | 87.48 | 86.23 | 59.07 | 64.14 | 72.48 | 69.92 | 71.09
Table 4. Detection results of MAK-BRNet on the test dataset of HRSC2016.
Method | Backbone | mAP (↑)
R2PN [46] | VGG-16 | 79.60
RRD [47] | VGG-16 | 84.30
ROI Trans. [12] | ResNet-101 | 86.20
BBAVectors [25] | ResNet-101 | 88.22
AproNet [48] | ResNet-101 | 90.06
PSC [49] | DarkNet-53 | 90.00
MAK-BRNet-R (Ours) | ResNet-152 | 90.13
mAP: mean Average Precision; ↑: higher is better.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
