Article

A Rotation Target Detection Network Based on Multi-Kernel Interaction and Hierarchical Expansion

by Qi Wang 1,*, Guanghu Xu 1 and Donglin Jing 2
1 School of Computer Science, Nanjing University of Information Science and Technology, Nanjing 210044, China
2 School of Information and Electronics, Beijing Institute of Technology, Beijing 100081, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(15), 8727; https://doi.org/10.3390/app15158727
Submission received: 3 June 2025 / Revised: 4 August 2025 / Accepted: 5 August 2025 / Published: 7 August 2025

Abstract

Remote sensing targets typically exhibit gradual scale changes and diverse orientations. Most existing remote sensing detectors adapt to these differences by adding multi-level structures for feature fusion. However, this approach leads to incomplete coverage of the overall target by the extracted local features, resulting in the loss of critical directional information and an increase in computational complexity, both of which degrade the detector's performance. To address this issue, this paper proposes a Rotation Target Detection Network based on Multi-kernel Interaction and Hierarchical Expansion (MIHE-Net) as a systematic solution. Specifically, we first refine scale modeling through the Multi-kernel Context Interaction (MCI) module and Hierarchical Expansion Attention (HEA) mechanism, achieving sufficient extraction of local features and global information for targets of different scales. Additionally, the Midpoint Offset Loss Function is employed to mitigate the impact of gradual scale changes on target direction perception, enabling precise regression for targets across various scales. We conducted comparative experiments on three commonly used remote sensing target datasets (DOTA, HRSC2016, and UCAS-AOD), with mean average precision (mAP) as the core evaluation metric. The proposed method reached mAP values of 81.72%, 92.43%, and 91.86% on the three datasets, respectively, exceeding the best competing methods by 0.65%, 1.93%, and 1.87% and significantly outperforming existing one-stage and two-stage detectors. Through multi-scale feature interaction and direction-aware optimization, MIHE-Net effectively addresses the challenges posed by scale gradation and direction diversity in remote sensing target detection, providing an efficient and feasible solution for high-precision remote sensing target detection.

1. Introduction

Remote sensing target detection is a technology that automatically identifies and locates specific ground objects by analyzing remote sensing images. With advances in remote sensing detection and imaging technology, its application scenarios have continued to expand, and it plays important roles in urban planning, agricultural monitoring, military reconnaissance, and other fields. Significant progress has been made in this field with the integration of deep learning and high-resolution remote sensing data, attracting increasing attention [1,2,3,4,5,6]. While most existing methods follow the paradigm of feature extraction followed by bounding box regression/target classification, differences in the backbone networks used in the feature extraction stage account for much of the variation among these methods. The feature sharing mechanism is illustrated in Figure 1.
In terms of feature extraction, approaches rooted in convolutional neural networks (CNNs) [7,8,9,10,11,12,13] focus on multiscale feature enhancement and contextual information extraction. For example, DetectoRS [14] combines the recursive feature interaction mechanism with a multi-scale atrous convolution fusion strategy. By constructing bidirectional feedback paths, it enables high-level semantic information to be reversely injected into low-level feature maps, thereby strengthening the feature representation capability for small targets. AODet [15] distinguishes foreground and background regions from multi-level feature maps through a foreground proposal network, operating only in foreground regions to reduce redundant computations and enhance the detection capability for targets in aerial images. Emerging transformer-based methods leverage the advantages of transformers in long-range dependency modeling to optimize target feature representation. For example, Swin Transformer [16] achieves cross-window connections through a sliding window mechanism, efficiently processing large-scale remote sensing images and enhancing the network’s ability to fuse global and local information. ViTDet [17] optimizes the feature extraction capability for densely arranged small targets by introducing a pyramid vision transformer architecture and combining it with a dynamic sparse attention mechanism, which significantly reduces computational complexity while maintaining global modeling capabilities.
In terms of bounding box representation, owing to intrinsic characteristics of remote sensing targets such as multi-directional rotation and geometric elongation, directly predicting the target angle causes boundary discontinuities in the angle representation of the network's loss function [18,19,20,21,22]. Therefore, GWD [23] transforms the angle prediction problem into a regression problem. Using the Gaussian Wasserstein distance, it designs a continuously differentiable loss metric, enabling the loss function to transition smoothly at the boundary of the angle representation. CSL [24] models the periodicity of angles in classification tasks by designing circular smooth labels, enabling the network to learn the cyclic characteristics of angles. Meanwhile, it uses the classification results to guide the regression task and fine-tune the angles through cosine distance metrics, which improves the accuracy and stability of bounding box representation for rotated targets. S2A-Net [25] proposes a scale-adaptive rotated bounding box representation method which decomposes the rotated bounding box into scale factors and angle parameters. By decoupling angle and scale information, it uses a multi-branch structure for separate regression. This reduces the ambiguity of angle representation, effectively solving the discontinuous boundary problem and achieving accurate positioning of remote sensing rotated targets.
In summary, although existing methods have optimized multi-scale feature fusion capabilities to a certain extent and alleviated the problem of inaccurate angle prediction in rotated target detection, they overlook how gradual changes in target scale during the detector's feature extraction affect its ability to handle the diversity of target orientations. In detail:
(1)
Progressive Scale Variation: In the same scene, targets of different sizes (such as various ships of different sizes) exhibit continuous progressive changes; this progressive variation in scale poses significant challenges to traditional target detection algorithms, as detectors with fixed-scale designs often struggle to accurately detect targets of all sizes simultaneously. To address the issues caused by scale variation, existing methods commonly use multi-level structures such as feature pyramids for feature fusion. Although increasing the number of stacked layers can cover a wider scale range to some extent, this significantly increases computational complexity; conversely, reducing the number of layers compromises detection accuracy. Consequently, existing methodologies face substantial challenges in striking an optimal balance between detection accuracy and computational efficiency. Against this backdrop, LSKNet [26] proposes a method aimed at selectively expanding the spatial receptive field for large-scale targets to capture more scene contextual information. This is achieved by introducing large-kernel convolutions and dilated convolutions into the backbone network. However, large-kernel convolutions may introduce substantial background noise, which negatively impacts the precise detection of small targets. On the other hand, while dilated convolutions have advantages in expanding the receptive field, they may overlook fine-grained information within the receptive field, which can lead to overly sparse feature representations and affect detection accuracy.
(2)
Coupling of Scale and Orientation: In remote sensing target detection tasks, existing methods typically model scale variation or orientation variation independently, while overlooking the deep coupling relationship between the two. When the target size is large, the features extracted within the target often fail to fully cover the entire target due to the limited local receptive field of convolutional operations, resulting in the absence of critical directional information in local features. This phenomenon makes it difficult to accurately capture the target’s orientation and shape structure by relying solely on local scale features, thereby affecting the accuracy of target detection. Therefore, the increase in target scale not only exacerbates the locality problem of feature extraction but also indirectly weakens the orientation perception ability. To effectively address this challenge, it is necessary to first perform fine-grained scale modeling during feature extraction in order to ensure that local features have sufficient global perception capabilities. Subsequently, by integrating an orientation modeling mechanism, the method can fully exploit the expressive ability of target orientation features under the background of scale variation, thereby achieving collaborative perception and enhancement of scale and orientation and improving the adaptability and robustness of detectors in complex remote sensing scenarios.
To address these issues, we propose MIHE-Net. In the feature extraction stage, we introduce the MCI module and HEA module, which utilize a set of parallel deep convolutions and depth-wise separable strip convolutions to simultaneously extract local and global features of remote sensing targets. The obtained feature maps are then fed into subsequent parameter regression branches. In the parameter regression stage, we propose a midpoint offset method to represent oriented targets and design a new loss function for bounding box regression, helping to address the problem of feature coupling between target scale and orientation in existing object detection algorithms. Our proposed MIHE-Net can be regarded as an extension of existing multi-oriented target detectors that demonstrates remarkable generality and enables compatibility with various detectors.
This paper makes the following main contributions:
(1)
This paper systematically analyzes and reveals the deep coupling relationship between target scale variation and direction perception ability in remote sensing images for the first time. When the target scale is large, local feature extraction is limited by the receptive field range, making it difficult for local features to effectively express the overall direction information of the target, which affects the accuracy of target positioning and recognition. To address this phenomenon, in this paper we begin with scale modeling, then integrate direction modeling strategies to achieve collaborative perception and enhancement of scale and direction, thereby providing a new idea that can improve comprehensive scale-direction modeling capability in remote sensing target detection.
(2)
A multi-kernel contextual interaction module is constructed to effectively address the problem of gradual target scale changes. Based on existing multi-scale feature extraction methods, this paper adopts a parallel arrangement structure of non-dilated deep convolution kernels with different sizes to achieve efficient extraction of multi-scale features and modeling of local contextual information. This design not only effectively alleviates the feature sparsity problem caused by dilated convolutions but also improves the density and integrity of feature representation, thereby enhancing the detector’s modeling capability for targets of different scales in complex remote sensing scenes.
(3)
Through a collaborative optimization framework of the multi-kernel contextual interaction module and hierarchical expansion attention mechanism, the proposed method can establish an effective complementary relationship between feature extraction at different scales and global contextual modeling, thereby significantly improving the performance of remote sensing target detection.
(4)
Finally, an anchor box convergence method is proposed. By using pre-extracted fine-scale feature information to generate midpoint offsets for the enclosing rectangles of oriented anchor boxes, this offset regression method compensates for the drastic changes in the loss function caused by angle regression while allowing anchor boxes to converge compactly and accurately.

2. Related Works

2.1. Rotation Feature Extraction

Yuan et al. [27] addressed the issue that traditional convolutional neural network (CNN)-based detectors mainly extract input feature maps within square windows, which can include a large amount of irrelevant information from surrounding areas when processing elongated objects. They proposed Strip R-CNN, which effectively combines the advantages of square convolutions and strip convolutions while introducing less feature redundancy. To tackle the feature misalignment challenge in the refinement phase, R3Det [28] employs feature interpolation to re-encode the positional information of the bounding box under refinement into corresponding feature points. Subsequently, it reconstructs the entire feature map at the pixel level, thereby achieving feature reconstruction and alignment. To guide the neural network model to pay more attention to effective features and suppress background noise, Li et al. [29] designed Non-local Awareness Pyramid Attention (NP-Attention). By learning both the spatial multi-scale non-local dependency and the channel dependency, it mitigates the discrepancy between the network's attention focus and the real-world objects in images.
Ding et al. [42] proposed RoI Transformer, which transforms horizontal regions of interest (HRoIs) into rotated regions of interest (RRoIs). Subsequently, a Rotation–Position-Sensitive RoI Align (RPS-RoI-Align) module is incorporated to extract rotation-invariant regional features. Aiming at the problem of large target scale variations in remote sensing images leading to insufficient and unbalanced sampling in feature maps of different levels, Guan et al. [31] proposed an adaptive scale sampling (ADS) strategy. This strategy establishes a mapping relationship between the size of target scales and the levels of feature maps, allocating more samples from high-level feature maps to large-scale targets and the opposite for small-scale targets. This strategy achieves sufficient sampling and a more balanced distribution of samples across different scales.

2.2. Bounding Box Representation

Huang et al. [32] proposed the General Gaussian Heatmap Label Assignment method (GGHL), which employs an anchor-free target-adaptive label assignment (OLA) strategy based on 2D Gaussian heatmaps. This strategy assigns each label to Gaussian candidate positions in the feature map via a one-to-many mapping, enabling the assigned candidate objects to better align with the shape and orientation of real-world objects. R2PN [33] generates proposal regions in arbitrary directions by rotating anchor boxes to more orientations, and employs a rotated RoI pooling layer (R2oI) as the max pooling operation on arbitrary-direction regions. This ensures that all detection steps can be integrated into a unified network and enables end-to-end training. Xu et al. [34] constructed a semantic segmentation-guided RPN (sRPN) to suppress background clutter as much as possible. By introducing a semantic segmentation module, their approach obtains bounding box-based masks and semantic features; the masks guide the generation of horizontal candidate boxes, while the semantic features are fused during the ROI pooling process to enable a more accurate estimation of the bounding box. Li et al. [35] introduced a superellipse-modulated 2D Gaussian function, termed the super-Gaussian distribution, which preserves angular information by maintaining anisotropy across all aspect ratios.
Although existing rotated feature extraction methods have made some progress, they focus more on redundancy reduction, lack explicit modeling of multi-scale contextual dependencies, treat scale and orientation as independent variables, and ignore the coupling relationship between them. Thus, they fail to completely solve the key issues of gradual scale changes, scale-orientation coupling, and stable rotation regression. Based on these outstanding issues, our proposed MIHE-Net focuses more on the collaborative perception of scale and orientation, enabling progressive scale modeling combined with orientation without modifying the existing detection framework.

3. Methodology

The architecture of our proposed MIHE-Net is illustrated in Figure 2. The backbone network consists of four stages, each of which is composed of a Multi-kernel Context Interaction (MCI) module and a Hierarchical Expansion Attention (HEA) module. Subsequently, a comprehensive in-depth elaboration is provided for the proposed MCI module, HEA module, overall architecture, and Midpoint Offset Loss Function.

3.1. MCI Module

In remote sensing images, objects are generally distributed across various scales. Precise object recognition hinges not only on visual appearance cues but also on contextual information. Contemporary approaches enlarge the receptive field by integrating large-kernel convolutions and dilated convolutions into the backbone architecture. However, large-kernel convolutions inevitably introduce extraneous background noise, while dilated convolutions inherently overlook fine-grained spatial details within the expanded receptive field. To capture multi-scale texture features when objects appear at widely varying scales, we introduce the MCI module. First, local features are extracted through a small-kernel convolution. Following this, a parallel configuration of depthwise convolution layers is utilized to extract multi-scale contextual information. Finally, the locally extracted feature maps and multi-scale contextual representations are fused along the channel dimension, facilitating the aggregation of spatio-contextual semantics. In detail:
Local Feature Extraction: First, a 3 × 3 convolution is applied to the input feature map to extract fine-grained local features, laying the foundation for subsequent context modeling.
Multi-Scale Context Extraction: Second, a parallel set of depthwise convolutional layers with progressively increasing kernel sizes (5 × 5, 7 × 7, 9 × 9, 11 × 11) extracts hierarchical contextual representations. Each convolutional layer operates independently on the local features, ensuring that the module can both cover the details of small targets and capture the overall context of large targets. The formulas are as follows:
$$L_{l-1,n} = \mathrm{Conv}_{k_s \times k_s}\big(X_{l-1,n}^{(2)}\big), \quad n = 0, \ldots, N_l - 1$$
$$Z_{l-1,n}^{(m)} = \mathrm{DWConv}_{k^{(m)} \times k^{(m)}}\big(L_{l-1,n}\big), \quad m = 1, \ldots, 4$$
where $L_{l-1,n}$ represents the local features extracted by a $k_s \times k_s$ convolution, here set to $k_s = 3$, and $Z_{l-1,n}^{(m)}$ denotes the contextual features extracted by the $m$-th $k^{(m)} \times k^{(m)}$ depthwise convolution (DWConv), here set to $k^{(m)} = 2(m+1)+1$.
Feature Fusion: Lastly, a 1 × 1 convolution is used as a channel fusion mechanism to integrate features with different receptive field sizes, which produces the fused feature P l 1 , n . Through this methodology, our MCI module effectively captures comprehensive contextual dependencies while preserving the fidelity of local texture details. The formula is shown below.
$$P_{l-1,n} = \mathrm{Conv}_{1 \times 1}\Big(L_{l-1,n} + \sum_{m=1}^{4} Z_{l-1,n}^{(m)}\Big)$$
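For concreteness, the following is a minimal PyTorch sketch of the MCI computation described by the formulas above. It is an illustrative reconstruction rather than the authors' released implementation; the class name, channel handling, and padding choices are our assumptions.

```python
import torch
import torch.nn as nn

class MCIModule(nn.Module):
    """Sketch of Multi-kernel Context Interaction (see Eqs. above).

    A 3x3 convolution extracts local features; four parallel depthwise
    convolutions (5x5, 7x7, 9x9, 11x11) add multi-scale context; a 1x1
    convolution fuses the summed responses. Names are illustrative.
    """
    def __init__(self, channels: int):
        super().__init__()
        self.local = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.context = nn.ModuleList([
            nn.Conv2d(channels, channels, kernel_size=k, padding=k // 2,
                      groups=channels)              # depthwise (DWConv), no dilation
            for k in (5, 7, 9, 11)                  # k^(m) = 2(m+1)+1
        ])
        self.fuse = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        local = self.local(x)                       # L_{l-1,n}
        ctx = sum(dw(local) for dw in self.context) # sum of Z_{l-1,n}^{(m)}
        return self.fuse(local + ctx)               # P_{l-1,n}
```

Because all four depthwise branches read from the same 3 × 3 output and are summed before the 1 × 1 fusion, the receptive field grows without the feature sparsity that dilation would introduce.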

3.2. HEA Module

The MCI module is primarily used to extract multi-scale local contextual information. To better incorporate contextual information over large distances and enable global context extraction, we introduce the HEA module. The process is as follows: first, global average pooling is applied to obtain local regional features; then, two depthwise separable strip convolutions are used. Meanwhile, the size of the depthwise separable strip convolution kernel is set according to the depth of the MCI module group, ensuring that the receptive field increases as the MCI module group to which the HEA module belongs becomes deeper. This strengthens the MCI module’s ability to capture interdependencies between distant pixels while enhancing feature extraction of the core parts of target objects. In detail:
Regional Feature Compression: Global average pooling is performed on the input feature map to compress the spatial dimension into a regional feature vector, thereby capturing the overall contextual trend. We obtain the local regional feature $F_{l-1,n}^{\mathrm{pool}}$ through a $1 \times 1$ convolution after average pooling:
$$F_{l-1,n}^{\mathrm{pool}} = \mathrm{Conv}_{1 \times 1}\big(P_{avg}(X_{l-1,n}^{(2)})\big), \quad n = 0, \ldots, N_l - 1$$
where $P_{avg}$ denotes the average pooling operation.
Long-Distance Dependency Modeling: We leverage a pair of depthwise separable strip convolutions to approximate the computational effect of a standard large-kernel depthwise separable convolution:
$$F_{l-1,n}^{w} = \mathrm{DWConv}_{1 \times k_b}\big(F_{l-1,n}^{\mathrm{pool}}\big), \qquad F_{l-1,n}^{h} = \mathrm{DWConv}_{k_b \times 1}\big(F_{l-1,n}^{w}\big)$$
where $k_b$ is the size of the depthwise separable strip convolution kernel. To expand the receptive field of the HEA module as the MCI module to which the HEA module belongs grows deeper, we set $k_b = 11 + 2n$, calculating the kernel size $k_b$ as a function of the MCI module depth $n$.
Attention Weight Generation: Lastly, our HEA module generates an attention weight $A_{l-1,n}$, where the sigmoid function ensures that $A_{l-1,n}$ falls within the range $(0,1)$. This weight is further used to enhance the output of the MCI module; the formula is provided below.
$$A_{l-1,n} = \mathrm{Sigmoid}\big(\mathrm{Conv}_{1 \times 1}(F_{l-1,n}^{h})\big)$$
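Continuing the sketch above, a possible PyTorch rendering of the HEA computation is shown below. Since the exact pooling granularity is not spelled out, the local average-pooling window is an assumption, as are the class and parameter names.

```python
import torch
import torch.nn as nn

class HEAModule(nn.Module):
    """Sketch of Hierarchical Expansion Attention (see Eqs. above).

    Average pooling compresses regional context, two depthwise strip
    convolutions (1 x k_b, then k_b x 1) model long-range dependencies,
    and a sigmoid-gated 1x1 convolution yields the attention weights.
    The strip-kernel size k_b = 11 + 2n grows with the block depth n.
    """
    def __init__(self, channels: int, depth_n: int, pool_size: int = 7):
        super().__init__()
        k_b = 11 + 2 * depth_n
        # pool_size is an assumption: a local average pooling that keeps spatial size
        self.pool = nn.AvgPool2d(pool_size, stride=1, padding=pool_size // 2)
        self.squeeze = nn.Conv2d(channels, channels, kernel_size=1)
        self.strip_w = nn.Conv2d(channels, channels, kernel_size=(1, k_b),
                                 padding=(0, k_b // 2), groups=channels)
        self.strip_h = nn.Conv2d(channels, channels, kernel_size=(k_b, 1),
                                 padding=(k_b // 2, 0), groups=channels)
        self.proj = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f = self.squeeze(self.pool(x))         # F^pool: pooled + 1x1 conv
        f = self.strip_h(self.strip_w(f))      # F^w, then F^h
        return torch.sigmoid(self.proj(f))     # attention weights A in (0, 1)
```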

3.3. Overall Structure

The backbone network of MIHE-Net is structured into four sequential stages. The input and output of stage $l$ are $F_{l-1}$ and $F_l$, respectively. The input $F_{l-1}$ undergoes downsampling and is then split into two parts, $X_{l-1}^{(1)}$ and $X_{l-1}^{(2)}$, along the channel dimension, which are fed into two separate paths:
$$X_{l-1} = \mathrm{Conv}_{3 \times 3}\big(\mathrm{DS}(F_{l-1})\big), \qquad X_{l-1}^{(1)} = X_{l-1}\big[:\tfrac{1}{2}C_l, \ldots\big], \qquad X_{l-1}^{(2)} = X_{l-1}\big[\tfrac{1}{2}C_l:, \ldots\big]$$
where $\mathrm{DS}$ denotes the downsampling operation and $C_l$ denotes the number of channels. One path is a feed-forward network (FFN) that takes $X_{l-1}^{(1)}$ as input, passes it through two $3 \times 3$ convolutions, and outputs $X_l^{(1)}$. The other path consists of the MCI module and the HEA module, taking $X_{l-1}^{(2)}$ as input and producing $X_l^{(2)}$. Specifically, $X_{l-1,n}^{(2)}$ is processed by the MCI module to obtain the output feature $P_{l-1,n}$, while the HEA module generates the attention weight $A_{l-1,n}$. This attention weight $A_{l-1,n}$ is then used to enhance the feature resulting from the MCI module, yielding the enhanced feature $F_{l-1,n}^{\mathrm{attn}}$. The formula is as follows:
$$F_{l-1,n}^{\mathrm{attn}} = \big(A_{l-1,n} \odot P_{l-1,n}\big) + P_{l-1,n},$$
where $\odot$ denotes element-wise multiplication. The enhanced feature $F_{l-1,n}^{\mathrm{attn}}$ is then passed through a $1 \times 1$ convolution to produce $X_l^{(2)}$:
$$X_{l,n}^{(2)} = \mathrm{Conv}_{1 \times 1}\big(F_{l-1,n}^{\mathrm{attn}}\big).$$
The final output of the entire process is then
$$F_l = \mathrm{Conv}_{1 \times 1}\big(\mathrm{Concat}(X_l^{(1)}, X_l^{(2)})\big),$$
where $\mathrm{Concat}$ denotes the concatenation operation.
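Putting the pieces together, one stage of the backbone might look as follows in PyTorch, reusing the MCIModule and HEAModule sketches above. The downsampling layer, the activation choice, and the single-block-per-stage simplification are our assumptions for brevity.

```python
import torch
import torch.nn as nn

class MIHEStage(nn.Module):
    """Sketch of one backbone stage: channel split into an FFN path and an
    MCI + HEA path, followed by 1x1 fusion (see Eqs. above)."""
    def __init__(self, in_channels: int, out_channels: int, depth_n: int):
        super().__init__()
        half = out_channels // 2
        # DS + 3x3 conv collapsed into one strided convolution (assumption)
        self.down = nn.Conv2d(in_channels, out_channels, 3, stride=2, padding=1)
        self.ffn = nn.Sequential(                   # path 1: X^(1) -> two 3x3 convs
            nn.Conv2d(half, half, 3, padding=1),
            nn.GELU(),                              # activation is an assumption
            nn.Conv2d(half, half, 3, padding=1))
        self.mci = MCIModule(half)                  # path 2: X^(2) -> MCI + HEA
        self.hea = HEAModule(half, depth_n)
        self.proj = nn.Conv2d(half, half, kernel_size=1)
        self.fuse = nn.Conv2d(out_channels, out_channels, kernel_size=1)

    def forward(self, f_prev: torch.Tensor) -> torch.Tensor:
        x = self.down(f_prev)
        x1, x2 = torch.chunk(x, 2, dim=1)           # split along channels
        x1 = self.ffn(x1)                           # X_l^(1)
        p = self.mci(x2)                            # P_{l-1,n}
        a = self.hea(x2)                            # A_{l-1,n}
        x2 = self.proj(a * p + p)                   # F^attn -> 1x1 conv -> X_l^(2)
        return self.fuse(torch.cat([x1, x2], dim=1))  # F_l
```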

3.4. Midpoint Offset Loss Function

For rotated object detection, the rotation angles of bounding boxes are typically divided into several discrete intervals; however, this discretization introduces quantization errors. In addition, angle regression must cope with the periodicity of angles, which increases the difficulty of training. If a complete parameter regression method is adopted, a mismatch arises between the physical characteristics of angle regression and the optimization mechanism of neural networks, leading to unstable training and high computational overhead. Some existing anchor-free methods directly predict the center points and sizes of objects without presetting anchor boxes; however, these are prone to confusion about the center points of objects, and their detection effects are limited for multi-scale objects.
In order to design a bounding box representation scheme that is easy to train and can achieve accurate regression for objects of different scales, we first preset horizontal reference boxes of different scales through anchor boxes. This approach can inherit the multi-scale priors of horizontal anchor boxes, assist the network in learning the common direction patterns of objects, and reduce the ambiguity of anchor-free methods. Then, by introducing the offset of the midpoints at the top and right sides, direction information is added to the horizontal boxes to improve positioning accuracy. This method only needs to add two additional offset parameters to describe oriented objects on the basis of traditional bounding boxes, without the need to introduce complex angle parameters. Finally, a loss function between the offsets is designed to learn the parameters. Specifically:
Anchor Box Initialization: Across all hierarchical feature maps, we instantiate three horizontally oriented anchor boxes per spatial position, configured with aspect ratios of 1:2, 1:1, and 2:1. Each anchor box $a$ is represented by a four-dimensional vector $a = (a_x, a_y, a_w, a_h)$, where $(a_x, a_y)$ is the central coordinate of the anchor box and $a_w$ and $a_h$ denote its width and height, respectively.
Offset Prediction: One of the two sibling $1 \times 1$ convolutional layers is the regression branch, which outputs the offset $\delta = (\delta_x, \delta_y, \delta_w, \delta_h, \delta_\alpha, \delta_\beta)$ of the candidate box relative to the anchor box. We can obtain the oriented candidate box by decoding the regression output. The decoding process is described as follows:
$$\Delta\alpha = \delta_\alpha \cdot w, \quad \Delta\beta = \delta_\beta \cdot h, \quad w = a_w \cdot e^{\delta_w}, \quad h = a_h \cdot e^{\delta_h}, \quad x = \delta_x \cdot a_w + a_x, \quad y = \delta_y \cdot a_h + a_y,$$
where $(x, y)$ represents the central coordinates of the predicted candidate box, and $w$ and $h$ represent the width and height of the bounding rectangle of the predicted oriented candidate box. As shown in Figure 3, $\Delta\alpha$ and $\Delta\beta$ are the offsets relative to the midpoints of the upper and right sides of the bounding rectangle, respectively.
Decoding and Loss Calculation: We generate the oriented candidate box $O$ based on $(x, y, w, h, \Delta\alpha, \Delta\beta)$. Then, the coordinate set of the four vertices of each candidate box is obtained as $V = (A, B, C, D)$. The coordinates of the four vertices can be expressed as follows:
$$A = (x + \Delta\alpha,\; y - h/2), \quad B = (x + w/2,\; y + \Delta\beta), \quad C = (x - \Delta\alpha,\; y + h/2), \quad D = (x - w/2,\; y - \Delta\beta).$$
Leveraging this encoding scheme, we parameterize each oriented bounding box through joint prediction of its enclosing rectangle parameters $(x, y, w, h)$ and midpoint offsets $(\Delta\alpha, \Delta\beta)$ to enable precise regression.
Positive and negative sample definitions are specified as follows. First, a binary label $p^* \in \{0, 1\}$ is assigned to each anchor box, where 0 and 1 denote negative and positive samples, respectively. An anchor box is classified as a positive sample if either of the following conditions is met: the intersection over union (IoU) between the anchor box and any ground-truth box exceeds 0.7; or the anchor box achieves the highest IoU with a ground-truth box among all anchor boxes and this IoU value is greater than 0.3. Anchor boxes with IoU values lower than 0.3 relative to all ground-truth boxes are labeled as negative samples. Samples that do not fall into either category are treated as invalid and excluded from the training process. The loss function $L_1$ is then defined as follows:
$$L_1 = \frac{1}{N}\sum_{i=1}^{N} F_{cls}\big(p_i, p_i^*\big) + \frac{1}{N}\sum_{i=1}^{N} p_i^*\, F_{reg}\big(\delta_i, t_i^*\big)$$
where $i$ denotes the index of the anchor box and $N$ (defaulting to $N = 256$) represents the total number of samples within a mini-batch. $p_i^*$ signifies the ground-truth label for the $i$-th anchor box, $p_i$ is the output of the classification branch, quantifying the probability of the candidate box belonging to the foreground, and $t_i^*$ represents the supervised offset of the ground-truth box relative to the $i$-th anchor box, which is a parameterized six-dimensional vector $t_i^* = [t_x^*, t_y^*, t_w^*, t_h^*, t_\alpha^*, t_\beta^*]$ characterizing the displacement of the ground-truth box from the $i$-th anchor box:
$$\begin{aligned}
\delta_\alpha &= \Delta\alpha / w, & \delta_\beta &= \Delta\beta / h, & \delta_w &= \log(w / w_a), & \delta_h &= \log(h / h_a), & \delta_x &= (x - x_a)/w_a, & \delta_y &= (y - y_a)/h_a,\\
t_\alpha^* &= \Delta\alpha_g / w_g, & t_\beta^* &= \Delta\beta_g / h_g, & t_w^* &= \log(w_g / w_a), & t_h^* &= \log(h_g / h_a), & t_x^* &= (x_g - x_a)/w_a, & t_y^* &= (y_g - y_a)/h_a,
\end{aligned}$$
where $(x_g, y_g)$, $w_g$, and $h_g$ respectively define the ground-truth bounding rectangle's central coordinates, width, and height, while $\Delta\alpha_g$ and $\Delta\beta_g$ respectively quantify the ground-truth vertex offsets relative to the top-edge midpoint and right-edge midpoint.
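As a concrete illustration of the decoding step, the following NumPy sketch converts a predicted offset vector and its horizontal anchor into the four vertices of the oriented candidate box, following the equations above. The function name and signature are ours, not part of the paper's code.

```python
import numpy as np

def decode_midpoint_offsets(anchor, delta):
    """Decode (dx, dy, dw, dh, da, db) against a horizontal anchor
    (ax, ay, aw, ah) into the four vertices A, B, C, D of an oriented box,
    following the decoding and vertex equations above."""
    ax, ay, aw, ah = anchor
    dx, dy, dw, dh, da, db = delta
    w, h = aw * np.exp(dw), ah * np.exp(dh)      # enclosing rectangle size
    x, y = dx * aw + ax, dy * ah + ay            # enclosing rectangle centre
    d_alpha, d_beta = da * w, db * h             # midpoint offsets
    A = (x + d_alpha, y - h / 2)                 # top edge
    B = (x + w / 2, y + d_beta)                  # right edge
    C = (x - d_alpha, y + h / 2)                 # bottom edge
    D = (x - w / 2, y - d_beta)                  # left edge
    return np.array([A, B, C, D])

# Example: a 32x32 anchor at (100, 100) with small midpoint offsets.
print(decode_midpoint_offsets((100, 100, 32, 32), (0, 0, 0, 0, 0.1, 0.1)))
```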

4. Experiments

4.1. Datasets

The DOTA-v1.0 dataset [36] is a large-scale aerial image object detection dataset comprising 2806 images across fifteen categories: Basketball Court (BC), Roundabout (RA), Harbor (HA), Swimming Pool (SP), Helicopter (HC), Plane (PL), Baseball Diamond (BD), Bridge (BR), Ground Track Field (GTF), Soccer Field (SBF), Large Vehicle (LV), Ship (SH), Tennis Court (TC), Small Vehicle (SV), and Storage Tank (ST). Collected from diverse sensors and platforms, this dataset features images with resolutions ranging from 800 × 800 to 20,000 × 20,000 pixels, presenting objects with significant variations in scale, orientation, and shape.
The HRSC2016 dataset [37] was released by Northwestern Polytechnical University in 2016, focusing primarily on ship detection. This dataset contains 1061 images. The objects within it are categorized into three major categories and 27 subcategories, encompassing 2976 targets in total. This makes it of great research value for studying ship detection, especially detailed classification of detection tasks.
The UCAS-AOD dataset [38] is a dataset for aircraft and vehicle detection constructed from Google Earth imagery. This dataset contains 2420 images and 14,596 instances, covering two types of samples (aircraft and vehicles) along with some negative samples (background).
For these three datasets, during the preprocessing stage we normalized the image pixel values to the range [0, 1] and removed extreme noise pixels using a threshold of the mean pixel value ± 3 standard deviations. In addition, during the training phase we adopted data augmentation strategies to enhance the generalization ability of the model, specifically random horizontal/vertical flipping, i.e., randomly applying a horizontal or vertical flip to training images. In order to truly reflect the model's performance on the original data, the validation and test sets were not augmented. The dataset division methods were tailored to the characteristics of each dataset: DOTA-v1.0 was randomly divided into training, validation, and test sets in an 8:1:1 ratio; for HRSC2016, we adopted the official fixed division (436 images in the training set, 181 in the validation set, and 182 in the test set, focusing on ship detection); finally, UCAS-AOD was randomly divided in a 7:1:2 ratio.
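For reference, a minimal sketch of this preprocessing under our reading of the description above is given below; the exact clipping rule is not spelled out in the text, and bounding-box adjustment for the flips is omitted for brevity.

```python
import numpy as np

def preprocess(image: np.ndarray, train: bool = True, rng=np.random) -> np.ndarray:
    """Scale pixels to [0, 1], clip values outside mean +/- 3 std, and
    (training only) apply a random horizontal/vertical flip."""
    img = image.astype(np.float32) / 255.0                  # normalize to [0, 1]
    mean, std = img.mean(), img.std()
    img = np.clip(img, mean - 3 * std, mean + 3 * std)      # suppress extreme noise
    if train:
        if rng.rand() < 0.5:
            img = img[:, ::-1]                               # horizontal flip
        if rng.rand() < 0.5:
            img = img[::-1, :]                               # vertical flip
    return img
```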

4.2. Experimental Evaluation Metrics

In the context of remote sensing object detection, each image is likely to include objects belonging to various categories at distinct spatial locations, requiring simultaneous evaluation of a model’s classification and localization performance. Because the classification criteria in traditional image processing cannot be directly applied to object detection tasks, the mean average precision (mAP) was used to measure the quality of detectors. Specifically:
$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN},$$
where $TP$ stands for true positives, $FP$ stands for false positives, and $FN$ stands for false negatives. An excellent detector is expected to achieve very good performance in both recall and precision; therefore, it is important to comprehensively consider the two factors $P$ (precision) and $R$ (recall). A better evaluation metric can be obtained by constructing the P-R curve for each category and calculating the average value of the area under the curve. For a dataset with $N_c$ categories, the definition of mAP is as follows:
$$mAP = \frac{1}{N_c}\sum_{i=1}^{N_c} \int_0^1 P_i(R_i)\, \mathrm{d}R_i.$$
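The integral above is evaluated numerically in practice. A simplified NumPy sketch is shown below; note that the official DOTA/VOC protocols use specific interpolation rules, so this is illustrative rather than the exact benchmark code.

```python
import numpy as np

def average_precision(recall, precision):
    """Area under one class's precision-recall curve (the integral in the mAP
    definition), approximated with the trapezoidal rule over recall."""
    recall, precision = np.asarray(recall), np.asarray(precision)
    order = np.argsort(recall)
    return float(np.trapz(precision[order], recall[order]))

def mean_average_precision(per_class_pr):
    """mAP: mean of per-class APs; per_class_pr is a list of (recall, precision) arrays."""
    return float(np.mean([average_precision(r, p) for r, p in per_class_pr]))
```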
To comprehensively evaluate the detector’s performance, we introduce a multi-dimensional evaluation metric system in which computational complexity is quantified by floating-point operations (FLOPs), while the storage requirements and structural scale of the model are measured using the number of model parameters (Params).

4.3. Parameter Setting

We adopted ResNet as the backbone of MIHE-Net, with the original ResNet fully pretrained on ImageNet. The positive sample matching threshold was set to 0.5 and the detection head confidence was configured as 0.6. The proposed MIHE-Net was optimized using the Adam optimizer with a momentum of 0.9 and a weight decay coefficient of 0.0001. All models underwent 200 epochs of training, starting with an initial learning rate of 0.01, which was decayed to 0.001 after 80 epochs. A batch size of 32 was employed throughout the training process. To ensure the reproducibility of the experiments, we fixed the random seed to 42. Experiments were conducted on a server based on the PyTorch framework with four RTX 4090D GPUs. The Python version was 3.8.13 while the PyTorch version was 1.10.1, along with torchvision 0.11.2 compatible with PyTorch (for data loading and transformation), CUDA 11.3 supporting GPU-accelerated training, numpy 1.21.5 for array operations, OpenCV-python 4.5.5 for image processing, and scikit-learn 0.24.2 for calculating the evaluation metrics. RetinaNet, a simple yet effective one-stage detector, serves as the baseline; it is worth noting that introducing any module may increase computational complexity.
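To make the schedule concrete, a possible PyTorch setup matching the reported hyperparameters is sketched below. Interpreting the reported "momentum of 0.9" as Adam's first-moment coefficient (beta1) is our assumption.

```python
import torch

def build_optimizer_and_scheduler(model):
    """Adam with beta1 = 0.9, weight decay 1e-4, initial learning rate 0.01
    decayed to 0.001 after epoch 80, fixed random seed 42 (see settings above)."""
    torch.manual_seed(42)                                    # fixed random seed
    optimizer = torch.optim.Adam(model.parameters(), lr=0.01,
                                 betas=(0.9, 0.999), weight_decay=1e-4)
    # multiply the learning rate by 0.1 once, at epoch 80 (0.01 -> 0.001)
    scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                     milestones=[80], gamma=0.1)
    return optimizer, scheduler
```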
To ensure the fairness and comparability of the experimental results, all comparative models were trained and tested in exactly the same hardware and software environments. Specifically, all models adopted the same dataset splitting strategy as mentioned above, with consistent training and test sets to avoid biases caused by different data divisions. In addition, all models used the same training hyperparameters, including the number of training epochs, initial learning rate and its decay strategy, batch size, optimizer parameters (momentum and weight decay), and fixed random seed. This ensures that the observed performance differences are solely due to the inherent characteristics of the model architectures rather than to changes in experimental settings.

4.4. Ablation Experiment

We conducted relevant ablation experiments on the DOTA-v1.0, HRSC2016, and UCAS-AOD datasets to validate the performance of the proposed MIHE-Net.
First, we conducted research on the multi-scale kernel design in the MCI module using the DOTA dataset. Table 1 shows that when only smaller 3 × 3 kernels are used, the model achieves suboptimal performance. This can be attributed to its limited ability to extract texture information. To address this, a multi-scale kernel structure with kernel sizes ranging from 3 × 3 to 11 × 11 and a stride of 2 was implemented, under which the model achieved its highest performance. Subsequent experiments tested a configuration with increased kernel sizes and a stride of 4, but the performance fell short of the optimal level. Further investigations revealed that exclusive use of large kernels not only elevates computational complexity but also degrades performance, with mAP dropping by 0.49% and 0.84%, respectively. This suggests that large kernels may introduce background noise, leading to performance deterioration.
On the basis of the multi-scale kernel design in Table 1 that achieved optimal performance, we studied the number of kernels in the multi-scale kernel design. As shown in Table 2, with the increment in the number of kernels, the network’s performance exhibited a gradual upward trajectory, reaching the best effect when five kernels were used.
Next, we verified the effectiveness of the HEA module. First, its impact was examined by applying HEA with different kernel sizes, with the results shown in Table 3. Empirical evidence suggests that smaller kernels struggle to model long-range dependencies, resulting in limited performance, whereas larger kernels mitigate this issue by integrating broader contextual information. Our proposed kernel size expansion strategy, which involves increasing the stripe convolution kernel size as the network blocks deepen, achieves optimal performance.
Because our MIHE-Net is composed of four stages, we subsequently investigated the deployment location of the HEA module. Table 4 shows that deploying the HEA module at any stage can provide performance improvements.
Table 5 presents the outcomes of the baseline model incorporating our proposed MCI module, HEA module, and Midpoint Offset Loss Function on the HRSC2016 dataset. On this dataset, the baseline model yields an mAP of 89.21%, because traditional detection networks with fixed-scale designs struggle to simultaneously and accurately detect objects of various sizes. When combined with our proposed MCI module, the detector's performance improves by 2.31%. This indicates that the MCI module can perform refined scale modeling in rotated object detection tasks, fully extracting local features of objects at different scales while incorporating feature-rich contextual information to improve the network's detection performance. Adding the HEA module to the network enables the detector to extract global contextual information. Further incorporating long-range contextual information ensures that local features have sufficient global awareness, increasing the detector's performance by 1.19%. Adding the Midpoint Offset Loss Function allows the detector to fully exploit the expressive power of object orientation features in the context of scale changes. This achieves collaborative perception and enhancement of scale and orientation, improving the detector's performance by 1.62%. Together, the MCI module, HEA module, and Midpoint Offset Loss Function improve the performance of MIHE-Net by 3.22% compared to the baseline model, resulting in more accurate rotated object detection.
Analogous experimental outcomes, presented in Table 6, were observed on the UCAS-AOD dataset, with networks integrating heterogeneous modules demonstrating superior performance compared to those relying solely on a single module. In addition, the results indicate no conflicts between the proposed modules. When all proposed modules are used, better regression and classification results are achieved and the detector demonstrates optimal performance, reaching an mAP of 91.86% on the UCAS-AOD dataset.

4.5. Comparison of Results for Different Datasets

4.5.1. Evaluation on DOTA

Table 7 shows the results of our proposed MIHE-Net and other remote sensing image rotation detection methods on the DOTA dataset. For the Basketball Court (BC), Soccer Ball Field (SBF), Roundabout (RA), Swimming Pool (SP), and Helicopter (HC) categories, MIHE-Net achieved the highest detection accuracies of 88.42%, 73.16%, 71.76%, 82.46%, and 84.76%, respectively. Additionally, MIHE-Net obtained the best average result of 81.72% across all categories. Among the compared methods, LSKNet-T [26] performed the best in remote sensing rotated object detection, as it not only considers the extraction of rotational features in remote sensing targets but also selectively expands the spatial receptive field of large-sized targets to capture more scene context information, combining target features with contextual features to assist target detection. However, the use of large-kernel convolutions in LSKNet-T may introduce substantial background noise, which is unfavorable for the precise detection of small targets. Meanwhile, it ignores the impact of the features extracted by the detector on subsequent rotated bounding box regression. MIHE-Net remedies this deficiency, increasing the mAP by 0.65%.
We visualize some detection results of MIHE-Net on the DOTA dataset in Figure 4. The three images in the first row depict targets of various scales, for which MIHE-Net still achieves precise detection. The three images in the second row contain many small and dense targets; nonetheless, MIHE-Net still accurately detects objects such as cars and ships by utilizing contextual information. The first image in the third row contains targets arranged in a rotating and dense manner, while the second image shows rotating multi-directional targets and the third contains targets with both large and small aspect ratios. Despite this, the rotating bounding boxes generated by MIHE-Net remain well-aligned with the targets, ensuring that all target areas are included without covering excessive background areas.
These results indicate that capturing multi-scale texture features when objects exhibit various scales and achieving bounding box regression when objects are rotated in multiple directions are both crucial for accurately detecting rotated targets in remote sensing images. Our MCI module design can capture local and global contextual information at different scales, combining target features with contextual features while maintaining the extraction of fine-grained detail features. In addition, the Midpoint Offset Loss Function makes the network’s regression results for rotated bounding boxes more accurate.

4.5.2. Evaluation on HRSC2016

Table 8 presents the detection results of the selected comparative methods and MIHE-Net on the HRSC2016 dataset. MIHE-Net achieves an mAP of 92.43%, with Params of 13.69 MB and FLOPs of 71 G. While achieving the highest mAP, MIHE-Net also maintains a relatively low number of parameters and computational load; thus, while ensuring detection accuracy, it also has high operational efficiency and good potential for practical deployment. Figure 5 displays some of the detection results of MIHE-Net on the HRSC2016 dataset. Most of the ship targets in the HRSC2016 dataset are rotated in multiple directions, have extreme aspect ratios, and come in different sizes (the three images in the first row). This leads to target features becoming blurred and indistinguishable when the background mimics the target’s appearance (first column of the second row), the image background is highly intricate (second column of the second row), or the image exhibits pronounced light-dark fluctuations (third column of the second row). Nonetheless, MIHE-Net demonstrates strong robustness by accurately detecting ship targets and shows some anti-interference capabilities, enabling the network to more accurately estimate target scales and boundaries.

4.5.3. Evaluation on UCAS-AOD

As indicated by the detection results in Table 9, our proposed MIHE-Net outperforms all existing methods in terms of overall performance, with its mAP reaching 91.86%. The visualization results of UCAS-AOD are presented in Figure 6. In the first row, depicting airplanes of varying scales and orientations, and the second row, showcasing densely arranged small vehicles and airplanes, the detection boxes of MIHE-Net exhibit superior coverage of the targets. These outcomes further validate the efficacy of our approach. Additionally, in the third row, where some vehicles are occluded by trees and nearly devoid of texture details, our method demonstrates remarkable capability in precisely locating such targets. MIHE-Net can comprehensively extract target features to accurately capture target direction information, endowing the detector with robustness and enabling high-quality detection.

4.5.4. Failure Case Analysis

In the first image of Figure 7, MIHE-Net is only able to detect a portion of the occluded cars, resulting in missed detections. For such small-scale and severely occluded targets, the global context modeling capability of the HEA module is limited by the proportion of features in the visible area, while the multi-scale features extracted by the MCI module contain a large amount of background interference information, leading the classifier to misjudge them as background. In the second image of Figure 7, the bounding boxes output by the model are offset for targets with extreme aspect ratios. The convolution kernel size of the current MCI module (maximum 11 × 11) has limited adaptability to targets with aspect ratios exceeding 10:1, causing the features at both ends of the long axis to fail to fuse effectively. This results in “shrinking” or “stretching” deviations during bounding box regression. In the third image of Figure 7, the background of the ship is mixed with a large number of texture features similar to the ship, and the bounding boxes output by the model are offset. In the multi-scale features extracted by MIHE-Net, the high-frequency components of background noise mask the edge features of the target, making the feature anchors referenced during bounding box regression inaccurate, which ultimately leads to positioning offsets.

5. Conclusions

In the task of remote sensing object detection, it is necessary to address the inability of feature pyramid structures with fixed-scale design in the backbone network to adapt to progressive scale changes of objects, as well as the problem that relying solely on local-scale features makes it difficult for the detector to accurately capture object orientations. To solve these issues, we propose a Rotation Target Detection Network Based on Multi-kernel Interaction and Hierarchical Expansion (MIHE-Net). The proposed method extracts local features of objects through the proposed MCI module, ensures that local features have sufficient global perception capability via the proposed HEA module, and achieves precise object regression using the proposed Midpoint Offset Loss Function. Comparative experiments with other advanced detectors show that the resulting detector not only achieves excellent detection performance but also efficiently balances computational complexity and detection accuracy.
MIHE-Net demonstrates significant practical application value in remote sensing target detection, particularly due to its excellent ability to handle targets with progressive scale changes and multi-directional orientations. In military reconnaissance, it can efficiently capture the gradual scale changes of targets such as ships and aircraft (from far to near or small to large) and accurately locate moving targets with different heading angles, ensuring stable detection in complex scenarios. In agriculture, it can precisely capture plot boundaries with different orientations, such as diagonal ridges and irregular plot contours, enabling high-precision identification of crop types and plot information. However, there remains room for improvement in scenarios involving severely occluded targets and targets with extreme aspect ratios. In future research, the global context modeling capability of the HEA module could be further enhanced to capture long-range dependencies in dense scenes; alternatively, an occlusion-aware mechanism could be introduced to improve the feature integrity of partially occluded targets. Additionally, applications in synthetic aperture radar (SAR) could be explored by optimizing the feature extraction strategy of the MCI module to deal with speckle noise in SAR images.

Author Contributions

G.X. and D.J.: experimental design, data collection, data analysis, interpretation of results, drafting the original manuscript; Q.W.: critical guidance and supervision to ensure the quality and direction of the research. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article; further inquiries can be directed to the corresponding author. The links to obtain the research data are https://github.com/ming71/HRSC2016_SOTA and https://github.com/ming71/UCAS-AOD-benchmark (accessed on 3 June 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Mei, J.; Zheng, Y.B.; Cheng, M.M. D2ANet: Difference-aware attention network for multi-level change detection from satellite imagery. Comput. Vis. Media 2023, 9, 563–579. [Google Scholar]
  2. Sun, X.; Tian, Y.; Lu, W.; Wang, P.; Niu, R.; Yu, H.; Fu, K. From single-to multi-modal remote sensing imagery interpretation: A survey and taxonomy. Sci. China Inf. Sci. 2023, 66, 140301. [Google Scholar]
  3. Zaidi, S.S.A.; Ansari, M.S.; Aslam, A.; Kanwal, N.; Asghar, M.; Lee, B. A survey of modern deep learning based object detection models. Digit. Signal Process. 2022, 126, 103514. [Google Scholar]
  4. Ding, J.; Xue, N.; Xia, G.S.; Bai, X.; Yang, W.; Yang, M.Y.; Belongie, S.; Luo, J.; Datcu, M.; Pelillo, M.; et al. Object detection in aerial images: A large-scale benchmark and challenges. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 7778–7796. [Google Scholar]
  5. Zhang, J.; Zhang, R.; Xu, L.; Lu, X.; Yu, Y.; Xu, M.; Zhao, H. Fastersal: Robust and real-time single-stream architecture for rgb-d salient object detection. IEEE Trans. Multimed. 2024, 27, 2477–2488. [Google Scholar]
  6. Zhang, R.; Yang, B.; Xu, L.; Huang, Y.; Xu, X.; Zhang, Q.; Jiang, Z.; Liu, Y. A benchmark and frequency compression method for infrared few-shot object detection. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5001711. [Google Scholar]
  7. Hong, D.; Han, Z.; Yao, J.; Gao, L.; Zhang, B.; Plaza, A. SpectralFormer: Rethinking hyperspectral image classification with transformers. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5518615. [Google Scholar] [CrossRef]
  8. Chen, H.; Chu, X.; Ren, Y.; Zhao, X.; Huang, K. Pelk: Parameter-efficient large kernel convnets with peripheral convolution. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 5557–5567. [Google Scholar]
  9. Huang, J.; Yuan, X.; Lam, C.T.; Ke, W.; Huang, G. Large kernel convolution application for land cover change detection of remote sensing images. Int. J. Appl. Earth Obs. Geoinf. 2024, 132, 104077. [Google Scholar]
  10. Hou, Q.; Lu, C.Z.; Cheng, M.M.; Feng, J. Conv2former: A simple transformer-style convnet for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2024, 46, 8274–8283. [Google Scholar]
  11. Hu, Z.; Gao, K.; Zhang, X.; Wang, J.; Wang, H.; Yang, Z.; Li, C.; Li, W. EMO2-DETR: Efficient-matching oriented object detection with transformers. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5616814. [Google Scholar]
  12. Deng, C.; Jing, D.; Han, Y.; Wang, S.; Wang, H. FAR-Net: Fast anchor refining for arbitrary-oriented object detection. IEEE Geosci. Remote Sens. Lett. 2022, 19, 6505805. [Google Scholar]
  13. Deng, C.; Jing, D.; Han, Y.; Chanussot, J. Toward hierarchical adaptive alignment for aerial object detection in remote sensing images. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5615515. [Google Scholar]
  14. Qiao, S.; Chen, L.C.; Yuille, A. Detectors: Detecting objects with recursive feature pyramid and switchable atrous convolution. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 10213–10224. [Google Scholar]
  15. Wang, X.; Chen, H.; Chu, X.; Wang, P. AODet: Aerial Object Detection Using Transformers for Foreground Regions. IEEE Trans. Geosci. Remote Sens. 2024, 62, 4106711. [Google Scholar]
  16. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 10012–10022. [Google Scholar]
  17. Li, Y.; Mao, H.; Girshick, R.; He, K. Exploring plain vision transformer backbones for object detection. In Proceedings of the Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, 23–27 October 2022; pp. 280–296. [Google Scholar]
  18. Yang, X.; Yang, X.; Yang, J.; Ming, Q.; Wang, W.; Tian, Q.; Yan, J. Learning high-precision bounding box for rotated object detection via kullback-leibler divergence. Adv. Neural Inf. Process. Syst. 2021, 34, 18381–18394. [Google Scholar]
  19. Yang, X.; Zhou, Y.; Zhang, G.; Yang, J.; Wang, W.; Yan, J.; Zhang, X.; Tian, Q. The KFIoU loss for rotated object detection. arXiv 2022, arXiv:2201.12558. [Google Scholar]
  20. Pu, Y.; Wang, Y.; Xia, Z.; Han, Y.; Wang, Y.; Gan, W.; Wang, Z.; Song, S.; Huang, G. Adaptive rotated convolution for rotated object detection. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 6589–6600. [Google Scholar]
  21. Zhang, R.; Cao, Z.; Huang, Y.; Yang, S.; Xu, L.; Xu, M. Visible-infrared person re-identification with real-world label noise. IEEE Trans. Circuits Syst. Video Technol. 2025, 35, 4857–4869. [Google Scholar]
  22. Zhu, H.; Jing, D. Optimizing slender target detection in remote sensing with adaptive boundary perception. Remote Sens. 2024, 16, 2643. [Google Scholar]
  23. Yang, X.; Yan, J.; Ming, Q.; Wang, W.; Zhang, X.; Tian, Q. Rethinking rotated object detection with gaussian wasserstein distance loss. arXiv 2021, arXiv:2101.11952. [Google Scholar]
  24. Yang, X.; Yan, J. Arbitrary-oriented object detection with circular smooth label. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; pp. 677–694. [Google Scholar]
  25. Han, J.; Ding, J.; Li, J.; Xia, G.S. Align deep features for oriented object detection. IEEE Trans. Geosci. Remote Sens. 2021, 60, 1–11. [Google Scholar]
  26. Li, Y.; Hou, Q.; Zheng, Z.; Cheng, M.M.; Yang, J.; Li, X. Large selective kernel network for remote sensing object detection. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 16794–16805. [Google Scholar]
  27. Yuan, X.; Zheng, Z.; Li, Y.; Liu, X.; Liu, L.; Li, X.; Hou, Q.; Cheng, M.M. Strip R-CNN: Large Strip Convolution for Remote Sensing Object Detection. arXiv 2025, arXiv:2501.03775. [Google Scholar]
  28. Yang, X.; Yan, J.; Feng, Z.; He, T. R3det: Refined single-stage detector with feature refinement for rotating object. In Proceedings of the Thirty-Fifth AAAI Conference on Artificial Intelligence (AAAI-21), Online, 2–9 February 2021; Volume 35, pp. 3163–3171. [Google Scholar]
  29. Huang, Z.; Li, W.; Xia, X.G.; Wu, X.; Cai, Z.; Tao, R. A novel nonlocal-aware pyramid and multiscale multitask refinement detector for object detection in remote sensing images. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5601920. [Google Scholar]
  30. Han, J.; Ding, J.; Xue, N.; Xia, G.S. Redet: A rotation-equivariant detector for aerial object detection. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 2786–2795. [Google Scholar]
  31. Guan, J.; Xie, M.; Lin, Y.; He, G.; Feng, P. Earl: An elliptical distribution aided adaptive rotation label assignment for oriented object detection in remote sensing images. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5619715. [Google Scholar]
  32. Huang, Z.; Li, W.; Xia, X.G.; Tao, R. A general Gaussian heatmap label assignment for arbitrary-oriented object detection. IEEE Trans. Image Process. 2022, 31, 1895–1910. [Google Scholar]
  33. Zhang, Z.; Guo, W.; Zhu, S.; Yu, W. Toward arbitrary-oriented ship detection with rotated region proposal and discrimination networks. IEEE Geosci. Remote Sens. Lett. 2018, 15, 1745–1749. [Google Scholar]
  34. Li, C.; Xu, C.; Cui, Z.; Wang, D.; Jie, Z.; Zhang, T.; Yang, J. Learning object-wise semantic representation for detection in remote sensing imagery. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Long Beach, CA, USA, 16–20 June 2019; pp. 20–27. [Google Scholar]
  35. Li, Z.; Hou, B.; Wu, Z.; Guo, Z.; Ren, B.; Guo, X.; Jiao, L. Complete rotated localization loss based on super-Gaussian distribution for remote sensing images. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5618614. [Google Scholar]
  36. Xia, G.; Bai, X.; Ding, J.; Zhu, Z.; Belongie, S.J.; Luo, J.; Datcu, M.; Pelillo, M.; Zhang, L. DOTA: A Large-Scale Dataset for Object Detection in Aerial Images. In Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 3974–3983. [Google Scholar] [CrossRef]
  37. Liu, Z.; Yuan, L.; Weng, L.; Yang, Y. A high resolution optical satellite image dataset for ship recognition and some new baselines. In Proceedings of the 6th International Conference on Pattern Recognition Applications and Methods ICPRAM, Porto, Portugal, 25–28 January 2017; Volume 1, pp. 324–331. [Google Scholar] [CrossRef]
  38. Zhu, H.; Chen, X.; Dai, W.; Fu, K.; Ye, Q.; Jiao, J. Orientation robust object detection in aerial images using deep convolutional neural network. In Proceedings of the 2015 IEEE International Conference on Image Processing (ICIP), Quebec City, QC, Canada, 27–30 September 2015; pp. 3735–3739. [Google Scholar] [CrossRef]
  39. Ming, Q.; Zhou, Z.; Miao, L.; Zhang, H.; Li, L. Dynamic anchor learning for arbitrary-oriented object detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Online, 2–9 February 2021; Volume 35, pp. 2355–2363. [Google Scholar]
  40. Ming, Q.; Miao, L.; Zhou, Z.; Dong, Y. CFC-Net: A critical feature capturing network for arbitrary-oriented object detection in remote-sensing images. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5605814. [Google Scholar]
  41. Ming, Q.; Miao, L.; Zhou, Z.; Song, J.; Yang, X. Sparse label assignment for oriented object detection in aerial images. Remote Sens. 2021, 13, 2664. [Google Scholar]
  42. Ding, J.; Xue, N.; Long, Y.; Xia, G.S.; Lu, Q. Learning roi transformer for oriented object detection in aerial images. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019. [Google Scholar] [CrossRef]
  43. Ming, Q.; Miao, L.; Zhou, Z.; Yang, X.; Dong, Y. Optimization for Arbitrary-Oriented Object Detection via Representation Invariance Loss. IEEE Geosci. Remote Sens. Lett. 2022, 19, 8021505. [Google Scholar] [CrossRef]
  44. Xie, X.; Cheng, G.; Wang, J.; Yao, X.; Han, J. Oriented R-CNN for object detection. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 3520–3529. [Google Scholar]
  45. Ming, Q.; Miao, L.; Zhou, Z.; Song, J.; Dong, Y.; Yang, X. Task interleaving and orientation estimation for high-precision oriented object detection in aerial images. ISPRS J. Photogramm. Remote Sens. 2023, 196, 241–255. [Google Scholar]
  46. Nabati, R.; Qi, H. Rrpn: Radar region proposal network for object detection in autonomous vehicles. In Proceedings of the 2019 IEEE International Conference on Image Processing (ICIP), Taipei, Taiwan, 22–25 September 2019; pp. 3093–3097. [Google Scholar]
  47. Qian, W.; Yang, X.; Peng, S.; Guo, Y.; Yan, J. Learning Modulated Loss for Rotated Object Detection. arXiv 2019, arXiv:1911.08299. [Google Scholar]
  48. Yang, X.; Yan, J. On the Arbitrary-Oriented Object Detection: Classification based Approaches Revisited. arXiv 2020, arXiv:2003.05597. [Google Scholar]
  49. Law, H.; Deng, J. Cornernet: Detecting objects as paired keypoints. arXiv 2018, arXiv:1808.01244. [Google Scholar]
  50. Tian, Z.; Shen, C.; Chen, H.; He, T. Fcos: Fully convolutional one-stage object detection. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October 2019–2 November 2019; pp. 9627–9636. [Google Scholar]
Figure 1. Visualization of feature sharing.
Figure 2. Architecture of the proposed MIHE-Net.
Figure 3. Illustration of midpoint displacement representation.
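Figure 3 illustrates the midpoint displacement (midpoint offset) representation used for oriented box regression. As a point of reference only, the sketch below decodes a midpoint-offset parameterization in the style of Oriented R-CNN [44], in which (x, y, w, h) describe the enclosing horizontal box and (Δα, Δβ) shift the midpoints of its top and right edges; the function name and exact parameter layout are assumptions made for illustration, not a restatement of MIHE-Net's regression branch.

```python
import numpy as np

def decode_midpoint_offset(x, y, w, h, d_alpha, d_beta):
    """Illustrative decoding of a midpoint-offset box (Oriented R-CNN style):
    (x, y, w, h) is the enclosing horizontal box, d_alpha shifts the midpoint
    of its top edge along x, and d_beta shifts the midpoint of its right edge
    along y. Returns the four vertices of the oriented quadrilateral."""
    v1 = (x + d_alpha, y - h / 2)   # top-edge midpoint, shifted horizontally
    v2 = (x + w / 2, y + d_beta)    # right-edge midpoint, shifted vertically
    v3 = (x - d_alpha, y + h / 2)   # mirrors v1 about the box center
    v4 = (x - w / 2, y - d_beta)    # mirrors v2 about the box center
    return np.array([v1, v2, v3, v4], dtype=np.float32)

if __name__ == "__main__":
    # Zero offsets reduce to the edge midpoints of the axis-aligned box.
    print(decode_midpoint_offset(10.0, 10.0, 8.0, 4.0, 1.5, -0.5))
```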
Figure 4. Example detection results of MIHE-Net on the DOTA dataset.
Figure 5. Example detection results of MIHE-Net on the HRSC2016 dataset.
Figure 6. Example detection results of MIHE-Net on the UCAS-AOD dataset.
Figure 7. Failure cases caused by occlusion, extreme aspect ratios, and background clutter.
Table 1. Multi-scale kernel design.
Kernel Design | Parameters (MB) | FLOPs (G) | mAP
(3, 3, 3, 3, 3) | 12.62 | 62.40 | 79.94 ± 0.12
(3, 5, 7, 9, 11) | 13.69 | 70.20 | 81.72 ± 0.15
(3, 5, 9, 13, 17) | 14.99 | 79.57 | 81.06 ± 0.12
(11, 11, 11, 11, 11) | 15.13 | 80.61 | 80.45 ± 0.10
(15, 15, 15, 15, 15) | 17.44 | 92.45 | 80.32 ± 0.13
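As a concrete reading of the best-performing (3, 5, 7, 9, 11) configuration in Table 1, the following PyTorch sketch builds a multi-kernel context block from parallel depthwise convolutions at those five kernel sizes, fused by a 1 × 1 convolution with a residual connection. The class name, the depthwise-plus-fusion layout, and the residual term are illustrative assumptions rather than the authors' released MCI implementation.

```python
import torch
import torch.nn as nn

class MultiKernelContext(nn.Module):
    """Minimal sketch of a multi-kernel context block (kernel sizes 3/5/7/9/11).

    Hypothetical layout: parallel depthwise convolutions capture context at
    several receptive fields; a 1x1 convolution fuses the branch outputs.
    """

    def __init__(self, channels: int, kernel_sizes=(3, 5, 7, 9, 11)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, k, padding=k // 2, groups=channels)
            for k in kernel_sizes
        )
        self.fuse = nn.Conv2d(channels * len(kernel_sizes), channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Each branch sees the same input; concatenation keeps the per-scale
        # cues before the 1x1 fusion reduces them back to the input width.
        feats = [branch(x) for branch in self.branches]
        return self.fuse(torch.cat(feats, dim=1)) + x  # residual connection

if __name__ == "__main__":
    block = MultiKernelContext(channels=64)
    out = block(torch.randn(1, 64, 128, 128))
    print(out.shape)  # torch.Size([1, 64, 128, 128])
```

Changing `kernel_sizes` reproduces the other configurations in Table 1; the larger depthwise kernels account for the parameter and FLOP growth reported there.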
Table 2. Kernel number.
Kernel Number | Parameters (MB) | FLOPs (G) | mAP
2 | 12.56 | 61.95 | 78.64 ± 0.23
3 | 12.78 | 63.57 | 79.16 ± 0.18
4 | 13.13 | 66.24 | 80.07 ± 0.12
5 | 13.69 | 70.20 | 81.72 ± 0.15
6 | 14.35 | 75.26 | 81.60 ± 0.14
Table 3. Kernel size in HEA.
Kernel Design | Parameters (MB) | FLOPs (G) | mAP
(3, 3, 3) | 13.50 | 68.95 | 80.52 ± 0.14
(5, 5, 5) | 13.52 | 69.08 | 80.66 ± 0.10
(5, 7, 7) | 13.54 | 69.21 | 80.75 ± 0.25
(7, 11, 11) | 13.58 | 69.47 | 80.89 ± 0.22
Expansive | 13.69 | 70.20 | 81.72 ± 0.15
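The "Expansive" row in Table 3 outperforms every fixed-kernel variant at nearly the same cost. Below is a minimal sketch of one way such an expansive design could be realized, assuming the expansion is obtained by stacking small depthwise convolutions with growing dilation and using the result as a spatial attention map; the module name, the dilation schedule (1, 2, 3), and the sigmoid gating are illustrative choices, not the paper's exact HEA definition.

```python
import torch
import torch.nn as nn

class ExpansiveAttention(nn.Module):
    """Sketch of an 'expansive' attention block: 3x3 depthwise convolutions
    with dilations 1, 2, 3 widen the receptive field hierarchically, and a
    1x1 convolution turns the expanded context into a per-pixel gate."""

    def __init__(self, channels: int, dilations=(1, 2, 3)):
        super().__init__()
        self.expand = nn.Sequential(*[
            nn.Conv2d(channels, channels, kernel_size=3,
                      padding=d, dilation=d, groups=channels)
            for d in dilations
        ])
        self.to_gate = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gate = torch.sigmoid(self.to_gate(self.expand(x)))
        return x * gate  # reweight the input with the expanded-context gate

if __name__ == "__main__":
    out = ExpansiveAttention(64)(torch.randn(1, 64, 128, 128))
    print(out.shape)  # torch.Size([1, 64, 128, 128])
```

Stacking dilations 1, 2, and 3 yields an effective receptive field of 13 × 13 from only three 3 × 3 depthwise layers, which is why an expansive design can rival large fixed kernels at lower cost.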
Table 4. Location for implementing HEA.
Stage Applied | Parameters (MB) | FLOPs (G) | mAP
None | 12.03 | 61.72 | 80.13 ± 0.23
1 | 12.19 | 64.04 | 80.35 ± 0.25
2 | 12.31 | 65.45 | 80.48 ± 0.15
3 | 12.97 | 66.59 | 80.97 ± 0.12
All | 13.69 | 70.20 | 81.72 ± 0.15
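Table 4 varies where the attention block is inserted, and applying it after every backbone stage gives the best mAP. The snippet below is a schematic of such stage-wise attachment, assuming a simple stage list and a placeholder gate module; `attach_attention`, `SpatialGate`, and the toy backbone are hypothetical names used only to illustrate the "None / 1 / 2 / 3 / All" settings.

```python
import torch
import torch.nn as nn

class SpatialGate(nn.Module):
    """Placeholder attention: a 1x1 convolution followed by a sigmoid gate."""
    def __init__(self, channels: int):
        super().__init__()
        self.proj = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * torch.sigmoid(self.proj(x))

def attach_attention(stages, apply_to):
    """Wrap the 1-indexed stages listed in `apply_to` with a SpatialGate;
    apply_to="all" mirrors the last row of Table 4, () mirrors 'None'."""
    wrapped = []
    for idx, (stage, channels) in enumerate(stages, start=1):
        if apply_to == "all" or idx in apply_to:
            wrapped.append(nn.Sequential(stage, SpatialGate(channels)))
        else:
            wrapped.append(stage)
    return nn.Sequential(*wrapped)

if __name__ == "__main__":
    # Toy three-stage backbone: each stage halves resolution and doubles width.
    stages = [(nn.Conv2d(3, 32, 3, stride=2, padding=1), 32),
              (nn.Conv2d(32, 64, 3, stride=2, padding=1), 64),
              (nn.Conv2d(64, 128, 3, stride=2, padding=1), 128)]
    model = attach_attention(stages, apply_to="all")
    print(model(torch.randn(1, 3, 64, 64)).shape)  # torch.Size([1, 128, 8, 8])
```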
Table 5. Impact of individual components on the HRSC2016 dataset.
With MCI | With HEA | With Midpoint Offset | mAP
89.21 ± 0.17
91.52 ± 0.19
90.40 ± 0.15
90.83 ± 0.11
92.43 ± 0.19
Table 6. Impact of individual components on UCAS-AOD dataset.
With MCI | With HEA | With Midpoint Offset | mAP
89.63 ± 0.28
91.28 ± 0.25
90.96 ± 0.19
91.22 ± 0.28
91.86 ± 0.22
Table 7. Performance assessment on the DOTA dataset. Bold values indicate the best results; their differences from the other methods are statistically significant at p < 0.05.
Methods | Backbone | PL | BD | BR | GTF | SV | LV | SH | TC | BC | ST | SBF | RA | HA | SP | HC | mAP
DAL [39] | ResNet-101 | 88.68 | 76.55 | 45.08 | 66.80 | 67.00 | 76.76 | 79.74 | 90.84 | 79.54 | 78.45 | 57.71 | 62.27 | 69.05 | 73.14 | 60.11 | 71.44
CFC-Net [40] | ResNet-101 | 89.08 | 80.41 | 52.41 | 70.02 | 76.28 | 78.11 | 87.21 | 90.89 | 84.47 | 85.64 | 60.51 | 61.52 | 67.82 | 68.02 | 50.09 | 73.50
SLA [41] | ResNet-50 | 88.33 | 84.67 | 48.78 | 73.34 | 77.47 | 77.82 | 86.53 | 90.72 | 86.98 | 86.43 | 58.86 | 68.27 | 74.10 | 73.09 | 69.30 | 76.36
RoI Transformer [42] | ResNet-101 | 88.64 | 78.52 | 43.44 | 75.92 | 68.81 | 73.68 | 83.59 | 90.74 | 77.27 | 81.46 | 58.39 | 53.54 | 62.83 | 58.93 | 47.67 | 69.56
RIDet-O [43] | ResNet-101 | 88.94 | 78.45 | 46.87 | 72.63 | 77.63 | 80.68 | 88.18 | 90.55 | 81.33 | 83.61 | 64.85 | 63.72 | 73.09 | 73.13 | 56.87 | 74.70
Oriented R-CNN [44] | ResNet-50 | 89.84 | 85.43 | 61.09 | 79.82 | 78.71 | 85.35 | 88.82 | 90.88 | 86.68 | 87.73 | 72.21 | 70.80 | 82.42 | 78.18 | 74.11 | 80.87
TIOE-Det [45] | DarkNet-50 | 89.76 | 85.23 | 56.32 | 76.17 | 80.17 | 85.58 | 88.41 | 90.81 | 85.93 | 87.27 | 68.32 | 70.32 | 68.93 | 78.33 | 68.87 | 78.69
S2A-Net [25] | ResNet-101 | 89.28 | 84.11 | 56.95 | 79.21 | 80.18 | 82.93 | 89.21 | 90.86 | 84.66 | 87.61 | 71.66 | 68.23 | 78.58 | 78.20 | 65.55 | 79.15
LSKNet-T [26] | – | 89.14 | 83.20 | 60.78 | 83.50 | 80.54 | 85.87 | 88.64 | 90.83 | 88.02 | 87.31 | 71.55 | 70.74 | 78.66 | 79.81 | 78.16 | 81.07
MIHE-Net | ResNet-50 | 87.25 | 81.32 | 58.76 | 78.41 | 77.83 | 83.57 | 86.29 | 90.15 | 86.73 | 85.92 | 70.28 | 69.45 | 76.82 | 80.17 | 83.26 | 79.83
MIHE-Net | ResNet-101 | 88.62 | 83.07 | 60.57 | 81.17 | 80.06 | 85.41 | 87.76 | 90.82 | 88.42 | 87.30 | 73.16 | 71.76 | 79.12 | 82.46 | 84.76 | 81.72
Table 8. Performance evaluation on the HRSC2016 dataset.
Methods | Backbone | Parameters (MB) | FLOPs (G) | Size | mAP
RRPN [46] | ResNet-101 | 44.5 | 125 | 800 × 800 | 79.08
RoI-Transformer [42] | ResNet-101 | 55.1 | 200 | 512 × 800 | 86.20
RSDet [47] | ResNet-50 | 28.9 | 73 | 800 × 800 | 86.5
DAL [39] | ResNet-101 | 36.4 | 216 | 416 × 416 | 88.95
R3Det [28] | ResNet-101 | 41.9 | 216 | 800 × 800 | 89.26
SLA [41] | ResNet-101 | 49.0 | 130 | 768 × 768 | 89.51
CSL [48] | ResNet-50 | 37.4 | 236 | 800 × 800 | 89.62
GWD [23] | ResNet-101 | 55.2 | 121 | 800 × 800 | 89.85
TIOE-Det [45] | ResNet-101 | 48.6 | 139 | 800 × 800 | 90.16
S2A-Net [25] | ResNet-101 | 38.6 | 198 | 512 × 800 | 90.17
ReDet [30] | ResNet-101 | 31.6 | 89 | 512 × 800 | 90.46
Oriented R-CNN [44] | ResNet-101 | 41.1 | 199 | 1333 × 800 | 90.50
MIHE-Net | ResNet-101 | 13.69 | 71 | 800 × 800 | 92.43
Table 9. Performance evaluation on the UCAS-AOD dataset.
Methods | Backbone | Size | Car | Airplane | mAP
SLA [41] | ResNet-50 | 800 × 800 | 88.57 | 90.30 | 89.44
TIOE-Det [45] | ResNet-50 | 800 × 800 | 88.83 | 90.15 | 89.49
CornerNet [49] | Hourglass Network | 800 × 800 | 89.69 | 89.25 | 89.47
FCOS [50] | ResNet-50 | 800 × 800 | 89.49 | 90.05 | 89.77
RIDet-O [43] | ResNet-50 | 800 × 800 | 88.88 | 90.35 | 89.62
DAL [39] | ResNet-50 | 800 × 800 | 89.25 | 90.49 | 89.87
S2A-Net [25] | ResNet-50 | 800 × 800 | 89.56 | 90.42 | 89.99
MIHE-Net | ResNet-101 | 800 × 800 | 91.71 | 92.01 | 91.86
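The mAP values in Tables 7–9 depend on matching detections to ground truth by the overlap of rotated boxes. Purely as a reference, the sketch below computes that overlap with Shapely polygons; the benchmark datasets ship their own evaluation scripts, which remain the authoritative protocol behind the numbers above.

```python
from shapely.geometry import Polygon

def rotated_iou(corners_a, corners_b):
    """IoU between two rotated boxes given as four (x, y) corner points.

    Shown only to illustrate how detections are matched to ground truth
    before per-class AP is computed and averaged into mAP."""
    poly_a, poly_b = Polygon(corners_a), Polygon(corners_b)
    if not poly_a.is_valid or not poly_b.is_valid:
        return 0.0
    inter = poly_a.intersection(poly_b).area
    union = poly_a.area + poly_b.area - inter
    return inter / union if union > 0 else 0.0

if __name__ == "__main__":
    a = [(0, 0), (4, 0), (4, 2), (0, 2)]       # axis-aligned 4 x 2 box
    b = [(1, -1), (3, 1), (1, 3), (-1, 1)]     # square of equal area, rotated 45 degrees
    print(round(rotated_iou(a, b), 3))         # about 0.455
```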