1. Introduction
Remote sensing technology provides economical and real-time data support for global resource exploration, environmental change monitoring, and disaster emergency response. It relies on the non-contact collection of surface electromagnetic radiation, accomplished through sensors mounted on satellites, aircraft, and near-ground platforms. As high-resolution earth observation systems advance rapidly [1], the temporal, spatial, and spectral resolution of remote sensing data has steadily improved, establishing a robust data basis for the interpretation of detailed surface information [2,3]. However, the explosive growth of large-scale remote sensing data, coupled with the need for semantic comprehension in complex scenarios, poses substantial difficulties for traditional analysis methods. These methods, which rely on hand-crafted rules and shallow machine learning, struggle to meet current demands, especially in advanced visual tasks such as target recognition and feature classification, which urgently require more intelligent data processing models.
Against this backdrop, remote sensing object detection has emerged as a critical breakthrough technology. As a core technique for locating and identifying specific targets in high-resolution images, it plays a key role in domains such as national defense security, disaster risk assessment, and smart city development [4]. Even so, traditional remote sensing object detection methods face significant obstacles stemming from three main characteristics of remote sensing images: complex background interference, wide variations in target scale (for instance, the high proportion of small targets), and occlusion among densely distributed targets [5]. Consequently, traditional methods suffer from limited feature representation capability, low computational efficiency, and insufficient generalization performance.
The rapid evolution of deep learning has injected new vitality into remote sensing object detection. Compared with traditional methods, deep learning approaches exhibit better detection performance, an advantage attributed to their larger receptive fields and more precise hierarchical feature extraction [6]. Current deep learning-based remote sensing object detection algorithms can be categorized into two technical frameworks: two-stage detection and single-stage detection.
Two-stage detection algorithms, such as CSL [7], Mask R-CNN (Convolutional Neural Network) [8], and RoI-trans [9], adopt a region-proposal-plus-refined-classification architecture to achieve high-precision detection. Among classic studies, Zhong et al. [10] improved target localization accuracy using a balanced position-sensitive structure. Li et al. [11] developed a dual-branch feature fusion network with local-global feature coordination to reduce false alarms for ambiguous targets. Xu et al. [12] proposed a dynamic convolution module to adaptively model the geometric deformations of targets. CAD-Net [13] enhances the connection between target features and their corresponding scenes by learning the global and local contexts of regions of interest (ROIs), thereby strengthening the network's feature representation. ReDet [14] utilizes group convolution to generate rotation-equivariant features and then applies rotation-invariant ROI alignment to extract rotation-invariant features from them, enabling accurate detection of rotated targets. However, although the two-stage cascade architecture enhances target localization capability, its complex network structure typically increases parameter counts by a factor of 3 to 5 compared with single-stage models of comparable accuracy, severely restricting large-scale remote sensing applications.
As a result, researchers have adapted single-stage object detection frameworks for remote sensing imagery. Zhang et al. [15] designed a depthwise separable convolution module for synthetic aperture radar (SAR) ship detection, reducing the computational complexity of feature extraction. Ma et al. [16] optimized the YOLOv3 model to enable real-time processing in earthquake-damaged building detection. To address small-target omission, Li et al. [17] integrated a multi-level feature fusion unit into the SSD framework, significantly enhancing small-target detection capability.
In recent years, research on remote sensing image target detection technology has centered on two key challenges: optimizing feature representation and enhancing computational efficiency, with notable breakthroughs achieved in adapting to complex scenes and improving real-time processing capabilities.
Liu et al. [18] proposed ABNet, which builds an adaptive feature pyramid network that quantifies the channel contribution of each layer via channel attention. By integrating spatial attention to pinpoint key regions, it achieves adaptive weighted fusion of multi-layer feature maps, effectively mitigating the weak features of small targets. Building on this, RAOD [19] incorporates a non-local feature enhancement module: after uniformly resampling multi-layer pyramid features to an intermediate scale, it reinforces cross-layer feature correlations by modeling long-range dependencies, markedly enhancing the ability to distinguish dense targets. AFC-Net [20] developed a feature competition selection mechanism, ensuring that the detection of each target fully utilizes information from all feature layers.
At the level of feature expression, Rao et al. [21] further overcame limitations in feature representation by proposing a specialized levy-associative CNN model. This model achieves multi-level feature fusion via object labeling and matrix signal processing, enhances multi-scale target adaptability in end-to-end training, and illustrates the potential of cross-task migration. Additionally, Pang et al. [22] developed a hybrid algorithm based on three-frame differencing and hue-saturation-value (HSV) color space segmentation. The algorithm combines motion detection with color space analysis, integrating H-component region growing and S-component adaptive threshold segmentation to accurately capture small moving targets, offering a novel way to combine traditional methods with deep learning for dynamic detection in low-compute environments.
Driven by the dual goals of feature optimization and computational efficiency, deformable convolution has gradually emerged as a key breakthrough for challenges of geometric deformation and scale adaptation. By dynamically adjusting the sampling positions of the convolution kernel, this method overcomes the rigid geometric constraints of traditional convolution, offering more flexible feature modeling for image processing and target detection tasks. In image processing, the LF-DFnet network developed by Wang et al. [23] achieves super-resolution for light field images via an angular deformable alignment module (ADAM), fusing multi-view angular information through bidirectional feature alignment to generate detail-rich high-resolution images and validating the potential of deformable convolution for feature modeling in complex scenes.
In general target detection, Kang et al. [24] integrated MobileNetV3 with deformable convolution to build an efficient detection framework, reducing model parameters while preserving robustness to target deformation. Zhou et al. [25] developed a deformable convolution-based ResNet50 backbone for urban aerial images, integrating an attention mechanism and Soft-NMS to address occlusion and scale disparity in aerial photography. Zha et al. [26] incorporated the dynamic sampling mechanism of deformable convolution into a bidirectional feature pyramid network via grouped deformable convolution (GD-BiFPN), strengthening the capture of local details in shaded regions and showcasing the efficiency and robustness of deformable convolution in vertical domains such as electric utility monitoring. Overall, research on deformable convolution has progressed from basic feature alignment to multi-level feature fusion, with simultaneous attention to efficiency optimization and scene adaptability.
While deformable convolution greatly boosts the ability of remote sensing target detection to adapt to geometric deformation and scale changes through its dynamic sampling mechanism, current approaches still grapple with limitations stemming from gradual scale variations. In a single scene, targets of varying sizes (e.g., different types of ships) display continuous, step-by-step changes. Such progressive scale shifts present major hurdles for traditional detection algorithms: detectors designed with fixed scales often struggle to preserve symmetrical feature representation across size dimensions when simultaneously detecting multi-scale targets. Existing multi-level structures like feature pyramids seek to tackle this via hierarchical fusion, yet they inherently introduce uneven computational burdens between scale branches—deeper pyramids drive exponential increases in complexity, while shallower ones sacrifice detection precision.
The LSKNet [5] method employs large-kernel convolutions to expand receptive fields, but its directionally skewed sampling patterns disturb the rotational symmetry needed for detecting arbitrarily oriented targets. Moreover, dilated convolutions produce geometrically irregular sampling grids that fail to retain the intrinsic shape symmetry of elongated targets such as ships or aircraft. These drawbacks arise from fundamental breaks in scale-space symmetry within feature interaction mechanisms.
To overcome these challenges, we propose the Parallel Interleaved Convolutional Kernel Network (PICK-Net), which incorporates symmetry-aware multi-scale modeling through two core innovations. First, rather than using dilated or large kernels, we employ parallel depthwise kernels arranged in mirror symmetry (Figure 1) to attain balanced receptive field coverage while keeping background noise to a minimum. Second, the Global Complementary Attention Mechanism (GCAM) exploits bidirectional attention symmetry, optimizing channel and spatial features independently via orthogonal pathways; this resolves gradient competition while preserving rotational equivariance in feature responses. Our key contributions are as follows:
- (1) A systematic analysis uncovering the connection between progressive scale variation and the degradation of feature symmetry: specifically, how limited receptive fields undermine directional completeness for large targets while introducing scale-specific noise asymmetry for small targets.
- (2) A parallel multi-scale backbone network that attains computational efficiency via symmetrical kernel grouping, in which depthwise kernels of complementary sizes cooperate to capture scale-invariant features without dilation artifacts (see the sketch after this list).
- (3) A collaborative framework in which the parallel convolutions and GCAM foster a symbiotic scale-direction balance: the former guarantees spatial symmetry through equidistant sampling, while the latter preserves channel balance via decoupled attention mechanisms.
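As a preview of the backbone design in contribution (2), the following minimal sketch shows one plausible reading of parallel complementary depthwise kernels with a pointwise fusion step. The kernel sizes (3, 5, 7) and summation-based fusion are illustrative assumptions, not the exact PICK-Net configuration detailed in Section 3.

```python
# Illustrative parallel depthwise multi-kernel block (PyTorch).
# Kernel sizes (3, 5, 7) and summation fusion are assumptions for illustration.
import torch
import torch.nn as nn

class ParallelDepthwise(nn.Module):
    def __init__(self, channels: int, kernel_sizes=(3, 5, 7)):
        super().__init__()
        # One depthwise branch per kernel size; padding keeps the spatial size
        # fixed, so receptive fields of different extents stay aligned.
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, k, padding=k // 2, groups=channels)
            for k in kernel_sizes
        )
        # Pointwise conv mixes channels after the multi-scale spatial sampling.
        self.fuse = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):
        out = sum(branch(x) for branch in self.branches)
        return self.fuse(out)

x = torch.randn(2, 64, 56, 56)
print(ParallelDepthwise(64)(x).shape)  # torch.Size([2, 64, 56, 56])
```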
The rest of this paper is organized as follows: Section 2 reviews related work, Section 3 details the design of PICK-Net, Section 4 introduces the experimental settings and result analysis, and Section 5 summarizes the research findings and future prospects.
4. Experiments
The experimental results are presented on two typical public datasets containing the targeted objects: RSOD [36] and NWPU-VHR10 [37]. Evaluation metrics, datasets, implementation details, and experimental outcomes are discussed in the following subsections.
4.1. Evaluation Metrics
In the performance evaluation of detection tasks, jointly quantifying detection precision and localization accuracy is the core challenge. In this study, we adopt the average precision metric system based on Intersection over Union (IoU), which comprehensively reflects the robustness of the model in complex scenarios through multi-dimensional threshold scanning and statistical modeling. We take $AP$ and $mAP$ as evaluation benchmarks: $AP$ is defined as the area enclosed by the Precision-Recall (PR) curve and the coordinate axes, and $mAP$ is the mean of the per-category $AP$ values. $T_{IoU}$ is the IoU threshold for separating positive and negative predictions. The mathematical expressions for AP and IoU are as follows:

$$AP = \int_{0}^{1} P(R)\,\mathrm{d}R, \qquad IoU = \frac{\left| B_{p} \cap B_{gt} \right|}{\left| B_{p} \cup B_{gt} \right|}$$

where $P(R)$ represents the PR curve, $B_{p}$ is the predicted bounding box, and $B_{gt}$ is the ground-truth bounding box. When compiling detection statistics, True Positives (TP), False Positives (FP), and False Negatives (FN) are determined as follows:

TP: the number of predicted boxes that satisfy $IoU \geq T_{IoU}$ and have the correct category, characterizing valid targets correctly detected by the model.

FP: the number of predicted boxes with $IoU < T_{IoU}$, together with category-incorrect predictions and duplicate detections, reflecting the false-alarm rate.

FN: the number of ground-truth boxes not covered by any predicted box, reflecting the model's missed-detection rate.

Based on the above statistics, the precision rate ($P$) and the recall rate ($R$) are calculated as follows:

$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}$$
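For completeness, the following self-contained sketch implements the IoU and all-point AP computations exactly as defined above; it is a generic reference implementation, not the evaluation script used to produce the reported numbers.

```python
# Standard IoU and average-precision computation (NumPy); a generic sketch,
# not the exact evaluation code behind the reported results.
import numpy as np

def iou(box_a, box_b):
    """Boxes as (x1, y1, x2, y2); returns intersection over union."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def average_precision(tp_flags, num_gt):
    """tp_flags: 1/0 per detection, sorted by descending confidence."""
    tp_flags = np.asarray(tp_flags, dtype=float)
    tp = np.cumsum(tp_flags)
    fp = np.cumsum(1.0 - tp_flags)
    recall = tp / max(num_gt, 1)
    precision = tp / np.maximum(tp + fp, 1e-9)
    # Area under the PR curve with the usual monotone-precision envelope.
    p = np.concatenate(([0.0], precision, [0.0]))
    r = np.concatenate(([0.0], recall, [1.0]))
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 0.142857...
```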
4.2. Datasets
4.2.1. RSOD Dataset
The RSOD dataset is an openly accessible benchmark dataset annotated in PASCAL VOC format specifically designed for identifying typical artificial targets in RS imagery. It comprises four object categories with distinct scale and morphological variations: aircraft, playgrounds, overpasses, and storage tanks. These categories demonstrate complementary characteristics in spatial distribution, textural features, and background complexity.
Constructed through a stratified sampling strategy, the dataset features a uniform image resolution of 1024 × 1024 pixels across all categories. Annotation files include target bounding box coordinates and class labels, enabling joint research on object detection and fine-grained classification. With balanced multi-scale target distributions (1–53 instances per image) and complex backgrounds, RSOD provides a standardized testbed for evaluating model generalization in RS scenarios. Category-specific image counts and instance statistics are presented in Table 1.
4.2.2. NWPU-VHR10 Dataset
The NWPU-VHR10 dataset, released by Northwestern Polytechnical University in 2014, is the first publicly available benchmark dataset for multi-class object detection in high-resolution optical remote sensing images of complex scenes. It comprises 800 very-high-resolution (VHR) remote sensing images, including 650 images containing targets and 150 background images, with spatial resolutions of 0.5–2 m and varying image dimensions. The images were cropped from Google Earth and the Vaihingen dataset and then manually annotated by experts.
The dataset covers 10 categories of geospatial targets, including airplanes, ships, storage tanks, baseball fields, and bridges, totaling 3651 annotated instances stored as horizontal bounding boxes. Characterized by large-scale target variations, high background complexity, and class imbalance, NWPU-VHR10 has become a standard benchmark for evaluating deep learning models in RS object detection and has significantly driven algorithmic advances in multi-scale feature fusion and small object detection.
This study selects RSOD and NWPU-VHR10 as experimental datasets for the following reasons. Both are widely recognized benchmarks in remote sensing object detection: they cover multi-source scenarios such as aerial and satellite imagery, include multi-scale targets (e.g., small vehicles and buildings alongside large airport runways) and complex backgrounds (e.g., cloud occlusion and terrain interference), and provide complete annotations through official public download channels, ensuring high accessibility and facilitating reproduction and horizontal comparison of results. Datasets such as DOTA (for rotated object detection) and HRRSD (for high-resolution remote sensing scenes) offer similar research value; however, since this study focuses on verifying general object detection performance, the target categories of RSOD and NWPU-VHR10 (e.g., aircraft, ships, vehicles) better match the core needs of this research, hence their selection.

The selected datasets do have limitations. RSOD contains only four target categories (aircraft, playground, overpass, storage tank), giving relatively narrow category coverage; and although NWPU-VHR10 includes 10 target categories, the sample sizes of some categories (e.g., airplanes, vehicles) are significantly larger than those of others (e.g., ground track fields, harbors), leading to class imbalance. We addressed these limitations in the experimental design: for class imbalance, a weighted cross-entropy loss was used to balance the weights of positive/negative samples and of different categories, as sketched below; for the narrow category coverage, we will extend the research in subsequent work to multi-category datasets such as DOTA and HRRSD to further verify the generalization ability of the model.
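To illustrate the class-imbalance measure, the snippet below shows a weighted cross-entropy classification loss in PyTorch; the inverse-frequency weighting rule and the per-class counts are illustrative assumptions, since the exact weighting scheme is not specified here.

```python
# Weighted cross-entropy for imbalanced categories (PyTorch sketch).
# Inverse-frequency weights are one common choice; the counts are hypothetical.
import torch
import torch.nn as nn

class_counts = torch.tensor([1200.0, 300.0, 150.0, 90.0])  # hypothetical per-class counts
weights = class_counts.sum() / (len(class_counts) * class_counts)
criterion = nn.CrossEntropyLoss(weight=weights)

logits = torch.randn(8, 4)          # (batch, num_classes)
labels = torch.randint(0, 4, (8,))
loss = criterion(logits, labels)    # rare classes contribute larger gradients
print(loss.item())
```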
4.3. Implementation Details
To ensure fairness, all comparative and ablation experiments were conducted in the same environment: the Linux Ubuntu 16.04 operating system, the PyTorch 2.0.1 framework for training and prediction, and an NVIDIA GeForce RTX 4080Ti GPU (NVIDIA, Santa Clara, CA, USA). Considering the number of model parameters, hardware conditions, and training speed, and after verification through multiple rounds of experiments, the training strategy was set as follows: batch size 8, 300 training epochs, the Stochastic Gradient Descent (SGD) optimizer, a maximum learning rate of 0.01, a minimum learning rate of 0.001, and cosine annealing for learning rate decay.

The RSOD and NWPU-VHR10 datasets were processed as follows. For image scaling, an adaptive mechanism was employed: RSOD images were uniformly resized to 512 × 512 pixels and NWPU-VHR10 images to 640 × 640 pixels, preserving the original aspect ratio and filling blank areas with the mean pixel value to avoid target distortion. Data augmentation included geometric transformations (random horizontal flipping with probability 0.5, random vertical flipping with probability 0.3, random rotation within −15° to 15°, and random cropping with an area ratio of 0.7–1.0), pixel adjustments (random brightness changes of ±15% and contrast changes of ±20%), and noise injection (Gaussian noise with $\sigma = 0.01$ added to 10% of the training samples).
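The stated optimizer and schedule can be reproduced in a few lines of PyTorch; the sketch below is a minimal illustration using a placeholder model, with the cosine annealing bounds taken from the maximum and minimum learning rates above.

```python
# Training setup sketch matching the stated strategy: SGD, 300 epochs,
# cosine annealing from lr=0.01 down to lr=0.001 (batch size 8 in the loader).
import torch
from torch.optim.lr_scheduler import CosineAnnealingLR

model = torch.nn.Conv2d(3, 16, 3)  # placeholder; stands in for PICK-Net
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
scheduler = CosineAnnealingLR(optimizer, T_max=300, eta_min=0.001)

for epoch in range(300):
    # ... one pass over the training set (batch size 8) would go here ...
    optimizer.step()   # placeholder for the actual training step
    scheduler.step()   # decay the learning rate along the cosine curve
```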
For data partitioning, the official recommended proportions of the datasets were followed: the RSOD dataset was divided into training and test sets at a ratio of 8:2, and the NWPU-VHR10 dataset was divided into training and test sets at a ratio of 7:3, with consistent distribution of each category ensured in the partitioned datasets. For label processing, the bounding box annotations of the original datasets were uniformly converted to the COCO format for easy model parsing, and a small number of samples with ambiguous annotations in the NWPU-VHR10 dataset were manually corrected to ensure annotation accuracy.
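The label conversion mentioned above reduces, for each box, to mapping PASCAL VOC corner coordinates to the COCO x/y/width/height convention; the helper name below is hypothetical.

```python
# VOC-style (xmin, ymin, xmax, ymax) to COCO-style (x, y, width, height).
def voc_to_coco(xmin: float, ymin: float, xmax: float, ymax: float):
    return [xmin, ymin, xmax - xmin, ymax - ymin]

print(voc_to_coco(10, 20, 110, 70))  # [10, 20, 100, 50]
```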
4.4. Ablation Studies
4.4.1. Evaluation of Different Components
We performed ablation experiments on the RSOD and NWPU-VHR10 datasets to validate the performance of the proposed Parallel Interleaved Convolution (PIC) module and GCAM.
Table 2 shows the results of the baseline model on the NWPU-VHR10 and RSOD datasets with and without the proposed PIC and GCAM. On NWPU-VHR10, the baseline model achieves only 82.15% mAP, owing to the spatial redundancy and channel coupling of traditional dense convolutional features: much of the computation is spent on background regions, and the strong correlation among channel features drowns out key texture information, degrading the predictions of the network's regressor and classifier. With the proposed PIC, detector performance improves by 2.14%, indicating that covering receptive fields of different scales through dynamic sparse sampling improves the capture of multi-resolution features in remote sensing target detection. When GCAM is added to the network, the detector effectively eliminates the gradient competition induced by cross-dimensional feature coupling through its channel-spatial dual-path independent modulation, enhancing the robustness of gradient optimization under convolutional kernel deformation. Using GCAM alone improves detector performance by 1.81%, indicating that shallow feature predictions help guide the prediction of deeper feature parameters. Together, PIC and GCAM improve PICK-Net's performance by 2.75% over the baseline, achieving higher target detection accuracy.
On the RSOD dataset, we observe results consistent with our findings on NWPU-VHR10. As illustrated in Table 2, PICK-Net achieves superior performance when both proposed modules are integrated compared with configurations using only a single module. Specifically, the PIC module enables the detection network to construct a multi-level adaptive receptive field hierarchy, facilitating effective aggregation of cross-scale features in remote sensing target detection and mitigating the inaccurate parameter prediction of existing methods caused by spatial redundancy in feature representation.
Furthermore, the Global Complementary Attention Mechanism (GCAM) addresses gradient conflicts induced by cross-dimensional feature coupling through independent modulation of feature dimensions. By decoupling channel-wise importance weighting and spatial deformation confidence estimation, GCAM significantly enhances the stability of gradient propagation during deformable convolution kernel optimization. This improvement enhances the adaptability to geometric deformations of targets, effectively alleviating detection accuracy degradation caused by spatial-channel feature entanglement. Consequently, PICK-Net achieves a competitive detection performance of 92.2% mAP on the RSOD dataset, validating the synergistic effectiveness of the proposed modules.
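The decoupling described here, with channel-wise importance weighting and spatial confidence estimated along independent pathways, can be sketched as two parallel attention branches that each see the raw input. The layer layout below is our illustrative reading, not the exact GCAM definition from Section 3.

```python
# Decoupled channel / spatial attention sketch (PyTorch). The two branches are
# computed independently and recombined, so their gradients do not compete
# through a shared pathway; the exact GCAM layout may differ.
import torch
import torch.nn as nn

class DecoupledAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # Channel pathway: global context -> per-channel importance weights.
        self.channel = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )
        # Spatial pathway: per-pixel confidence map.
        self.spatial = nn.Sequential(
            nn.Conv2d(channels, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )

    def forward(self, x):
        # Independent modulation: each branch operates on the raw input x.
        return x * self.channel(x) + x * self.spatial(x)

x = torch.randn(2, 64, 32, 32)
print(DecoupledAttention(64)(x).shape)  # torch.Size([2, 64, 32, 32])
```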
4.4.2. Evaluation of Parameters Inside Modules
To verify the cascading optimization effect of the Complementary Interleaved Global Attention Module (CIGAM), we vary the number of stacked CIGAM modules and conduct experiments on the NWPU-VHR10 and RSOD datasets.
From Table 3, it is evident that as the feature decoupling units are stacked level by level, the detection accuracy of PICK-Net first increases and then decreases, reaching a peak of 83.9% with three units, 2.22% higher than the single-layer branch. This verifies that CIGAM, by decoupling the optimization paths of the feature dimensions, lets the high-purity decoupled features of preceding units guide subsequent layers toward more accurate spatial localization. However, performance begins to decay once more than three layers are stacked, for two reasons. First, the decoupled modulation process of CIGAM is sensitive to feature noise, and the independent optimization of each layer propagates error signals step by step along the decoupling path, so once one unit's prediction is wrong, subsequent units struggle to recover. Second, an overly deep decoupling stack inflates the redundancy of the decoupling parameter space, and the feature deviations some units produce during dynamic sparse sampling are difficult to correct in the subsequent modulation process. The experiments therefore adopt a 3-layer CIGAM as the optimal configuration. As shown in Table 3, the controlled experiments on the RSOD dataset lead to the same conclusion.
4.5. Comparisons with State-of-the-Art Detectors
To validate the comprehensive efficacy of this method, seven state-of-the-art detection models spanning convolutional, Transformer, and hybrid architectures are selected as benchmarks, including the lightweight EfficientNet, the YOLO family represented by YOLOv4 [38] and YOLOv8s [39], the Transformer-based DETR and its improved variant DAB-DETR, and the feature-fusion models SwinT and DetectoRS [4]. All models were tested fairly on the RSOD dataset with their official preset parameters; EfficientNet, DETR, SwinT, DetectoRS, and DAB-DETR shared one uniform input size, while the remaining models kept their own input resolution.
4.5.1. Results on RSOD
Experimental results on the RSOD dataset (Table 4) show that our method offers a significant advantage in the accuracy-efficiency balance. In detection accuracy, it reaches 92.2% mAP on RSOD, 0.26% and 0.9% higher than DAB-DETR and YOLOv8s, respectively, confirming PICK-Net's strong adaptability to complex scenes. In computational efficiency, the model is compressed to 12.5 M parameters, pushing beyond the performance boundary of existing lightweight detection models.
The visual detection results in Figure 2 show excellent performance on four types of typical remote sensing targets. For the aircraft detection in Figure 2a, the algorithm accurately distinguishes neighboring aircraft with very small spacing in a dense tarmac scene, verifying the ability of deformable convolution to capture local features.
The oil storage tank detection verifies the algorithm's adaptability to multi-scale targets. As shown in Figure 2b, the proposed multi-scale feature fusion mechanism reduces the standard deviation of the detection rate across storage tanks spanning 10–50 m in diameter.
The overpass detection illustrates the ability to perceive structures in complex contexts. As shown in Figure 2c, the proposed algorithm accurately identifies the shape features of the overpass, effectively distinguishes the girder structure from the background road texture, and correctly extracts the overpass from a similarly colored background.
Playground detection shows strong discriminative power under the shape regularity constraint. For the 400 m standard runway whose color features are highly similar to the surrounding roofs, the algorithm reduces the false alarm rate significantly.
4.5.2. Results on NWPU-VHR10
In the detection task on the NWPU-VHR10 dataset, our model leads existing technology with an mAP of 84.90%, surpassing other advanced detectors (Table 5).
The visualization results in Figure 3 further confirm the excellent performance across varying scenarios. In Figure 3a, the airplane targets are small and densely parked at the airport, yet PICK-Net still detects them accurately. In Figure 3b, for ship targets with extreme aspect ratios and different orientations, the bounding boxes generated by PICK-Net align well with the targets, covering the full target region without including too much background. In Figure 3c, the car targets share similar color and shape features with water tanks in the background, but PICK-Net produces no false detections, indicating that it can identify targets from the contextual features around them. In Figure 3d, despite the large size difference between the two target types, PICK-Net shows good robustness and accurately detects both the athletic field and the basketball court.
4.6. Robustness Verification
The detection of agricultural diseases and pests is an important extension of remote sensing technology into precision agriculture; its core challenges (small targets, complex backgrounds) closely mirror those of remote sensing target detection. For instance, satellite remote sensing can acquire large-scale crop images, and the proposed PICK-Net can perform fine-grained pest and disease detection on such images. The algorithm also proves effective in orange pest detection tasks (Figure 4). In occlusion scenarios (e.g., Figure 4a), even when leaves overlap, the algorithm still accurately locates pest-infested areas through its feature extraction and target inference mechanisms; by integrating contextual associations with deep semantic feature analysis, it demonstrates strong robustness under complex spatial relationships.
Regarding light variation, the algorithm mitigates interference from intense light, high-light regions, and low-light environments during feature extraction via illumination normalization and a multi-scale feature fusion strategy, ensuring stable capture of pest features across lighting conditions and reflecting its adaptability to complex illumination. In Figure 4f, when handling complex background interference, the algorithm extracts fine-grained features unique to pests, such as texture and color, and effectively distinguishes targets from similar background elements, avoiding misdetections caused by redundant background information and highlighting its discrimination capability in complex backgrounds. Overall, the proposed algorithm not only performs effectively in traditional remote sensing target detection tasks but also generalizes well to crop pest detection scenarios, exhibiting strong robustness.