1. Introduction
By leveraging Earth observation image data captured by sensors aboard satellite platforms, the rapid and automatic positioning and identification of targets such as aircraft and ships carry significant application value in domains like reconnaissance, early warning, and the protection of maritime rights and interests. At present, remote sensing (RS) satellite technology is continuously advancing toward maturity, and both the quantity and quality of acquired remote sensing images are steadily improving. Owing to the wide field of view, bird’s-eye perspective, and complex, rich ground feature information contained in remote sensing images, data processing and analysis of such images have garnered substantial attention in recent years.
Traditional RS detection algorithms [1,2] typically generate candidate regions based on features such as edges and gray levels. They then extract key features such as corners, gradients, and orientations using manually designed feature operators, before finally employing classifiers and regressors for refined recognition and localization. While traditional remote sensing object detection has undergone careful design at each step and achieved favorable results in specific application scenarios, it still faces numerous problems and limitations [3,4]. Firstly, the traditional approach relying on candidate region extraction can only roughly extract horizontal rectangular regions, leading to poor localization accuracy. Secondly, manually designed shallow and mid-level features mostly utilize low-level visual information, resulting in weak feature representation capabilities; these features fail to accurately convey complex high-level semantic information in images and exhibit poor generality across different categories. In summary, the performance of traditional remote sensing image detection methods is far from meeting the practical application requirements of on-orbit platforms.
Neural network-based RS detection technology effectively integrates processes such as candidate region extraction, feature learning, and classification–regression of suspected object regions [5,6]. This effectively avoids the drawbacks of cumbersome design and weak generalization inherent in manually crafted features. RS detection networks are primarily divided into two categories: two-stage [7,8,9,10] and one-stage detectors [3,6,11]. For two-stage detectors, Oriented R-CNN [12] directly generates high-quality rotated candidate boxes by designing a specialized oriented RPN network. These boxes are then precisely refined through the oriented R-CNN head, achieving accurate detection of tilted targets via a two-stage cascade optimization process. R3Det [13] gradually adjusts initial horizontal candidate boxes using a Feature Refinement Module, realizing progressive optimization from coarse localization to precise rotated boxes. To achieve faster detection speeds, RetinaNet-R [14], built on a rotation-aware feature pyramid network, directly predicts rotated bounding boxes through a single forward inference. It also employs focal loss to address sample imbalance, enabling fast and accurate detection of rotated targets. R-DFPN [15] designs a dense feature pyramid network structure that performs rotated box regression directly on multi-scale feature maps, ensuring real-time performance while achieving precise detection of rotated targets of varying scales.
Presently, research on detection architectures has shifted focus from accuracy enhancement to speed optimization [16,17,18]. Yang’s Resolution Adaptive Network (RANet) [19] consists of multiple deep subnetworks with different weights. Samples first undergo identification starting from the subnetwork with the lowest weight. If the results meet specific criteria, the process terminates; otherwise, the samples are fed into a subnetwork with a higher weight for further identification. This achieves a balance between accuracy and computational load while reducing spatial redundancy in images. BranchyNet [20] designs additional branch classifiers, allowing most test samples to exit the network early through these classifiers based on their prediction results. SkipNet [21] develops a residual network with gated units, selectively skipping unnecessary convolutional layers during inference, which significantly reduces the model’s inference time. With fixed input data, dynamic parameters can be selected to appropriately adjust network parameters, improving feature extraction performance while adding only a small amount of computational cost.
In the research field of lightweight and efficient remote sensing object detection, existing studies [6,10] have carried out targeted explorations focusing on the oriented nature, rotational property, multi-scale characteristic, and detection efficiency requirements of objects in remote sensing scenarios. LO-Det [22] focuses on the lightweight implementation of oriented object detection in remote sensing images. By designing a lightweight backbone network and a parameter-efficient oriented object prediction branch, it reduces the model’s computational complexity and parameter count while adapting to the detection needs of common oriented objects (such as aircraft and ships) in remote sensing scenarios, thus balancing detection accuracy and operational efficiency. To address the increased complexity caused by the coupling of spatial transformation and detection tasks in oriented object detection, STD-Net [9] proposes a spatial transformation decoupling strategy. This strategy decouples object pose adjustment from feature extraction and classification–regression tasks, effectively simplifying the oriented detection process, reducing redundant computations, and providing a new approach for the efficient detection of oriented remote sensing objects. Starting from model generalization and rotational robustness, ReDet introduces rotation-equivariant properties. This enables the model to adapt to arbitrary rotation angles of remote sensing objects without relying on extensive rotated data augmentation. While improving the detection accuracy of rotated objects, it also reduces the costs of data preprocessing and model training, indirectly contributing to the efficiency of the detection process. Targeting the significant multi-scale differences of remote sensing objects, LSK-Net [23] constructs a large selective kernel network. By adaptively selecting receptive fields that match the scale of objects, it achieves accurate capture of remote sensing objects of different sizes (e.g., small-scale vehicles and large-scale buildings). Meanwhile, it leverages the dynamic adjustment mechanism of selective kernels to avoid the computational redundancy caused by fixed large kernels, optimizing the model’s operational efficiency while ensuring multi-scale detection capability. In addition, some studies have focused on lightweight SAR image segmentation: Ma et al. [24] proposed a fast task-specific region merging method that integrates superpixel generation and merging into an end-to-end network, introducing a statistical dissimilarity measure and k-order connectivity to improve both the segmentation accuracy and efficiency of SAR images, while Soft-GCN [25] combines deep task-specific superpixel sampling with soft graph convolution. Both aim to improve the efficiency of SAR image segmentation and provide technical insights for efficient segmentation tasks within lightweight remote sensing detection. These studies provide diverse technical pathways for the efficient detection of remote sensing objects from the perspectives of task decoupling, feature robustness, and scale adaptability, promoting the development of this field toward the coordinated advancement of high accuracy and high efficiency.
In conclusion, existing mainstream lightweight network structures place excessive emphasis on parameter reduction, leading to adverse effects such as low computational efficiency, and they fail to fully consider the impact of feature redundancy on remote sensing object detectors [4,12,26]. Specifically, there is significant redundancy in the feature maps of deep neural networks. As shown in Figure 1, the 26 subfigures correspond to feature maps from different channels of a convolutional layer; the highlighted blue and green squares mark pairs of feature maps with high similarity, intuitively demonstrating the redundancy across channels in deep networks. We believe that such redundancy may be crucial for accurately locating densely arranged small targets in remote sensing images, since these seemingly similar feature maps mainly carry fine-grained information such as target edges and textures. Unlike previous studies, we do not discard redundant feature maps; instead, we exploit them in a simpler and more efficient manner to reduce memory access and convolutional computation. Additionally, detectors share features from the backbone, which are redundant and imprecise with respect to the key features required by specific tasks. For example, classifiers need orientation-invariant features to correctly classify the same target under different orientations, whereas regressors require orientation-sensitive features to accurately locate targets. Indiscriminately using redundant backbone features as inputs for the remote sensing classification and regression tasks not only introduces excessive computational load but also causes mutual interference between classification and regression features, which is detrimental to the accurate detection of rotated objects.
To address these issues, this paper puts forward a novel solution. Within the convolutional layers, a dynamic mixed convolution (DM-Conv) mechanism is developed: linear mappings replace part of the standard convolution, enabling the rapid generation of redundant feature maps and a reorganization of the convolution operation. A feature aggregation strategy is designed between convolutional layers, which merges features from different intermediate layers through weighted fusion to generate deep-level features, thereby reducing the number of channels in each convolutional layer. In addition, a spatial orthogonal attention (SOA) mechanism is constructed to capture the dependencies between distant pixels in both the horizontal and vertical directions, which enhances the feature representation ability of the convolutional layers. Depthwise separable convolution reduces computation by splitting spatial convolution and pointwise convolution, but it does not exploit feature redundancy and easily loses fine-grained information; DM-Conv, by generating redundant feature maps via linear mapping combined with weighted fusion across intermediate layers, reduces computation while retaining key fine-grained information such as target edges and textures. The design motivation of DM-Conv is to efficiently generate redundant feature maps through linear mapping and to reduce the number of channels via cross-layer weighted fusion, avoiding the computational redundancy of traditional convolution; SOA is designed to address the high computational cost of capturing long-range pixel dependencies with traditional attention mechanisms, simplifying the computation through feature decomposition along the horizontal and vertical directions. Experimental results show that, compared with the baseline model, the proposed method reduces the computational complexity of the remote sensing detection network while keeping the detection accuracy almost unaffected.
3. Method
This section explores optimization methods for remote sensing detection models during deployment, focusing on the design of convolutional structures in remote sensing detection networks. The aim is to enhance actual hardware inference efficiency while maintaining high detection accuracy. The proposed structure is illustrated in Figure 2. Within the convolutional layers, dynamic mixed convolution (DM-Conv) is designed, which divides the convolutional layers in depthwise separable convolution into two parts. The first part employs regular convolution, but its total number is strictly controlled. Subsequently, a series of simple linear operations are applied to the feature maps generated by the first part to produce the remaining feature maps. Finally, the feature maps from both parts are fused along the channel dimension, enabling the rapid generation and replacement of redundant feature maps and reducing memory access frequency. To prevent the weakening of the expressive power of the rapidly generated feature maps, a spatial orthogonal attention mechanism is designed on this basis. This mechanism aggregates local and remote information from both horizontal and vertical directions, allowing the convolutional structure to capture dependencies between distant pixels while simplifying computations. Widely used deep learning tools such as TF-Lite and ONNX can effectively support this strategy, making rapid inference on various embedded devices convenient.
3.1. Dynamic Mixed Convolution
In most remote sensing detection frameworks, the utilization of existing complex convolutional structures as feature extractors leads to the generation of a large number of redundant feature channels. Reducing these channels may result in the loss of key information such as edges and textures, thereby affecting detection accuracy. However, if these redundant features are not reduced, it will increase memory access costs and limit the effective deployment of the model on embedded or edge computing platforms. To address these issues, this paper designs a lightweight dynamic mixed convolution, which replaces standard convolution with a linear filter bank. While maintaining nearly lossless accuracy, it significantly enhances the generation speed of feature maps, thereby effectively reducing computational redundancy and memory access during the inference process of remote sensing detection networks. The following will provide a detailed introduction to this method.
The process by which a standard convolutional layer generates its $n$ output feature maps can be defined as

$$Y = X * W + b,$$

where $*$ represents the convolution operation, $b$ is the bias term, $X \in \mathbb{R}^{c \times h \times w}$ is the input feature map, $Y \in \mathbb{R}^{n \times h' \times w'}$ is the output feature map, and $W \in \mathbb{R}^{n \times c \times k \times k}$ denotes the convolution filters with kernel size $k \times k$. The number of parameters to be optimized in the convolution kernels is therefore determined by the dimensions of the input and output feature maps. As observed in Figure 1, the targets in the blue and green boxes are highly similar. Furthermore, computing the minimum mean squared error (MSE) between these feature maps, as shown in Table 1, yields values that are all very small, indicating a strong correlation between feature maps in deep neural networks; such redundant feature maps can be generated from a few inherent feature maps. Based on this correlation, redundant feature maps can be efficiently derived from the core (intrinsic) feature maps.
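For illustration, this kind of channel-wise similarity can be measured with a few lines of PyTorch. The sketch below is a simplified stand-in, not the exact protocol behind Table 1: the hooked layer, the random input, and the per-channel normalization are our own illustrative choices.

import torch
import torchvision

# Hypothetical sketch: pairwise MSE between feature-map channels of an early
# ResNet-18 stage. Layer choice and input are illustrative only.
model = torchvision.models.resnet18(weights=None).eval()

feats = {}
def hook(_module, _inputs, output):
    feats["maps"] = output.detach()

model.layer1.register_forward_hook(hook)   # first residual stage

with torch.no_grad():
    model(torch.randn(1, 3, 224, 224))     # stand-in for a resized RS image

fmap = feats["maps"][0]                    # (C, H, W)
fmap = (fmap - fmap.mean(dim=(1, 2), keepdim=True)) / (
    fmap.std(dim=(1, 2), keepdim=True) + 1e-6)      # per-channel normalization
c = fmap.shape[0]
flat = fmap.reshape(c, -1)

# Pairwise MSE matrix: small values indicate highly similar (redundant) channels.
mse = ((flat[:, None, :] - flat[None, :, :]) ** 2).mean(dim=-1)
i, j = torch.triu_indices(c, c, offset=1)
print("smallest pairwise MSE values:", mse[i, j].topk(5, largest=False).values)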
Therefore, there is no need to generate these redundant feature maps one by one through a large number of convolution operations. The output feature maps can instead be regarded as a small number of intrinsic feature maps plus derived maps obtained through simple transformations (as shown in Figure 3). The specific process is as follows:

$$y_{i,j} = \Phi_{i,j}\big(y'_i\big), \quad i = 1, \dots, m, \; j = 1, \dots, s,$$

where $y'_i$ is the $i$-th intrinsic feature map in the intermediate output $Y'$ (obtained by applying standard convolution with a reduced number of filters, $Y' = X * W'$), and $\Phi_{i,j}$ is the $j$-th linear operation used to generate the derived feature map $y_{i,j}$. $Y'$ contains the intrinsic features, whose number is usually smaller than that of the original output features. Cheap operations are then used to generate more similar features. The two parts of the features are connected along the channel dimension, i.e., the derived maps are concatenated with the intrinsic feature maps, thus forming a complete convolutional layer:

$$Y = \mathrm{Concat}\big(Y',\, \Phi(Y')\big),$$

where $Y$ represents the final output feature map, $Y'$ represents the original intrinsic feature maps, and $\Phi$ represents the pointwise convolution used for linear mapping. The concatenation effectively expands the dimension of the feature representation without significantly increasing the computational burden by splicing the derived feature maps with the intrinsic feature maps. This splicing strategy ensures that the network can capture and retain the richness and diversity of the input data while reducing the number of parameters and the computational cost. The application of linear transformations enables each derived feature map to be processed independently and in parallel, which reduces the computational burden. At the same time, by reducing the number of convolution kernels, the growth of memory and FLOPs (floating point operations) is effectively controlled, thereby improving overall computational efficiency.
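A minimal PyTorch sketch of this construction is given below. The class name DMConvSketch, the use of a depthwise d × d convolution as the cheap linear operation, and the default ratio s = 2 are our own illustrative assumptions rather than the exact released implementation.

import torch
import torch.nn as nn

class DMConvSketch(nn.Module):
    """Illustrative dynamic-mixed-convolution block: a few intrinsic feature
    maps from standard convolution, the rest derived by cheap depthwise
    linear mappings, then concatenated along the channel dimension."""

    def __init__(self, in_ch, out_ch, kernel_size=3, s=2, d=3):
        super().__init__()
        intrinsic = out_ch // s            # number of intrinsic maps
        derived = out_ch - intrinsic       # number of cheaply derived maps
        # Sketch assumes derived is divisible by intrinsic (true for s = 2).
        self.primary = nn.Sequential(
            nn.Conv2d(in_ch, intrinsic, kernel_size,
                      padding=kernel_size // 2, bias=False),
            nn.BatchNorm2d(intrinsic),
            nn.ReLU(inplace=True),
        )
        # Cheap linear operation: depthwise d x d convolution on intrinsic maps.
        self.cheap = nn.Sequential(
            nn.Conv2d(intrinsic, derived, d, padding=d // 2,
                      groups=intrinsic, bias=False),
            nn.BatchNorm2d(derived),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        y_int = self.primary(x)                  # intrinsic feature maps Y'
        y_der = self.cheap(y_int)                # derived (redundant) maps
        return torch.cat([y_int, y_der], dim=1)  # concat along channels

y = DMConvSketch(64, 128)(torch.randn(1, 64, 40, 40))
print(y.shape)   # torch.Size([1, 128, 40, 40])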
Computational complexity: theoretical analysis.
Figure 3 illustrates the basic working principle of the dynamic mixed convolution proposed in this paper. It only requires applying standard convolution to a portion of the input channels for spatial feature extraction, without affecting the remaining channels. For contiguous or regular memory access, the first or last contiguous sequence of $c_p$ channels is used as the computational representative of the feature map. In the case where the input and output feature maps have the same number of channels $c$, the FLOPs of this partial convolution are only

$$h \times w \times k^2 \times c_p^2.$$

With a compression ratio of $r = c_p / c = 1/4$, the FLOPs of DM-Conv are only 1/16 of those of standard convolution. In addition, the memory access amount of DM-Conv is

$$h \times w \times 2c_p + k^2 \times c_p^2 \approx h \times w \times 2c_p,$$

which, as can be seen from the above formula, is only 1/4 of that of standard convolution. At this point, the remaining $c - c_p$ channels are left unchanged. In order to fully and effectively utilize the information from all channels, this paper further appends a pointwise convolution (PWConv) to the proposed DM-Conv. Compared with standard convolution that slides along a fixed direction, this combination pays more attention to the central position. The overall FLOPs of DM-Conv can then be calculated as

$$h \times w \times \big(k^2 \times c_p^2 + c^2\big).$$
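As a worked example of the 1/16 and 1/4 claims for the partial-convolution part alone, under the assumptions above (equal input and output channel count $c$, $k \times k$ kernels, an $h \times w$ feature map, and $r = c_p/c = 1/4$; the concrete numbers are illustrative only):

$$
\begin{aligned}
\text{FLOPs}_{\text{std}} &= h \cdot w \cdot k^{2} \cdot c^{2} = 40 \cdot 40 \cdot 9 \cdot 256^{2} \approx 9.4\times10^{8},\\
\text{FLOPs}_{\text{partial}} &= h \cdot w \cdot k^{2} \cdot c_p^{2} = 40 \cdot 40 \cdot 9 \cdot 64^{2} \approx 5.9\times10^{7} = \tfrac{1}{16}\,\text{FLOPs}_{\text{std}},\\
\text{Mem}_{\text{partial}} &\approx h \cdot w \cdot 2c_p = \tfrac{1}{4}\,\big(h \cdot w \cdot 2c\big).
\end{aligned}
$$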
Although directly performing the aforementioned feature mapping can significantly reduce computational cost, it inevitably weakens the spatial representation ability of the resulting features. The relationships between spatial pixels are crucial for accurate recognition, yet half of the features capture spatial information only through the cheap operations, and the remaining features are generated solely through 1 × 1 pointwise convolution without interacting with other pixels. This weak ability to capture spatial information may hinder further performance improvement. To address this, we introduce a self-attention mechanism on top of this lightweight structure to effectively model and supplement spatial correlations over a large range.
3.2. Spatial Orthogonal Attention Mechanism
Generally, existing attention mechanisms take a given input feature $Z \in \mathbb{R}^{H \times W \times C}$ and employ a fully connected layer to generate a set of attention feature maps. The specific operational process is as follows:

$$a_{h,w} = \sum_{h'=1}^{H} \sum_{w'=1}^{W} F_{(h,w),(h',w')} \odot z_{h',w'},$$

where $\odot$ denotes element-wise multiplication, $F$ is the learnable weight matrix of the layer, $z_{h',w'}$ is the feature value at spatial position $(h',w')$ in the input feature map $Z$, and $a_{h,w}$ is the attention weight at position $(h,w)$. Generally, feature maps in CNNs are usually low rank, and there is no need to densely connect all inputs and outputs at different spatial positions. Therefore, this paper decomposes the fully connected layer into two orthogonal layers and aggregates features along the horizontal and vertical directions, respectively. It can be expressed as:

$$a'_{h,w} = \sum_{w'=1}^{W} F^{H}_{w,w'} \odot z_{h,w'}, \qquad a_{h,w} = \sum_{h'=1}^{H} F^{V}_{h,h'} \odot a'_{h',w},$$

where $F^{H}$ and $F^{V}$ are the learnable horizontal and vertical attention weights, respectively, and $a'_{h,w}$ is the intermediate attention weight after horizontal aggregation. The aforementioned equations capture long-range dependencies in both directions. In this paper, we refer to this operation as decoupled fully connected attention. Due to the decoupling of the horizontal and vertical transformations, the computational complexity of the attention module is reduced to $\mathcal{O}\big(HW(H+W)\big)$, compared with $\mathcal{O}\big(H^2W^2\big)$ for the dense fully connected attention. The specific process is shown in Figure 4.
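One simple way to realize the two directional aggregations is with depthwise 1 × k_W and k_H × 1 convolutions, as in the minimal PyTorch sketch below. The kernel lengths and the class name are illustrative assumptions; a full-length fully connected variant would use k_W = W and k_H = H.

import torch
import torch.nn as nn

class DecoupledFCAttention(nn.Module):
    """Sketch of decoupled (spatial-orthogonal) attention: long-range context
    is aggregated first along the horizontal direction, then along the
    vertical one, instead of one dense (H*W) x (H*W) mapping."""

    def __init__(self, channels, k_h=7, k_w=7):
        super().__init__()
        # Horizontal aggregation: depthwise 1 x k_w convolution.
        self.horizontal = nn.Conv2d(channels, channels, (1, k_w),
                                    padding=(0, k_w // 2), groups=channels,
                                    bias=False)
        # Vertical aggregation: depthwise k_h x 1 convolution.
        self.vertical = nn.Conv2d(channels, channels, (k_h, 1),
                                  padding=(k_h // 2, 0), groups=channels,
                                  bias=False)

    def forward(self, z):
        a_mid = self.horizontal(z)   # intermediate weights a' (horizontal pass)
        return self.vertical(a_mid)  # final attention logits A (vertical pass)

a = DecoupledFCAttention(64)(torch.randn(1, 64, 40, 40))
print(a.shape)   # torch.Size([1, 64, 40, 40])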
The input feature map of the entire convolutional layer is fed into two branches. One branch generates the intrinsic and derived (redundant) feature maps, while the other employs the orthogonal self-attention mechanism to produce an attention map $A$. The final output of the convolutional layer is the product of the outputs of the two branches, that is,

$$O = \sigma(A) \odot V,$$

where $\sigma$ is the sigmoid activation function scaling the attention map to $(0, 1)$, $V$ is the feature map output by the convolution branch, and $O$ is the final output feature map after fusing the attention weights. The output thus encompasses both the features from the original convolution module and the spatial information from the orthogonal attention module. The computation of each attention value incorporates a wide range of pixel blocks, allowing the output features to integrate information from these pixel blocks.
The floating point operations (FLOPs) of a neural network are an important metric for evaluating the computational complexity of object detection networks. For a standard convolutional layer with an input feature map of size $H \times W$ and $C_{in}$ channels, $C_{out}$ output channels, and kernel size $k \times k$, the FLOPs are defined as

$$\mathrm{FLOPs} = H \times W \times C_{out} \times \big(C_{in} k^2 + (C_{in} k^2 - 1) + 1\big) = 2\, H W\, C_{out}\, C_{in}\, k^2,$$

where $C_{in} k^2$ represents the number of multiplication operations per output value, $C_{in} k^2 - 1$ represents the number of addition operations, $W$ and $H$ denote the width and height of the feature map, respectively, and the final $+1$ accounts for the bias term in the convolutional layer. Directly computing the orthogonal attention in parallel with the other modules at full resolution would introduce additional computational cost. Therefore, this paper reduces the feature size through horizontal and vertical downsampling, so that all operations within the orthogonal attention are performed on smaller features. Typically, the width and height are each reduced to half of their original length, resulting in a 75% reduction in FLOPs for this branch. The generated attention map is then upsampled to its original size to match the feature map from the other branch. Average pooling and bilinear interpolation are employed for downsampling and upsampling, respectively, and the downsampled features are passed through the activation function to accelerate actual inference.
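Putting the two branches together, the following sketch is an illustrative PyTorch reading of the layout described above: a plain 3 × 3 convolution stands in for the DM-Conv branch, the attention path runs on a 2× downsampled input, and the outputs are fused as O = σ(A) ⊙ V. Kernel sizes and the class name are assumptions for illustration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SOABlock(nn.Module):
    """Sketch of the two-branch layout: a convolution branch producing the
    feature map V, and an attention branch (horizontal + vertical depthwise
    aggregation) computed on half-resolution input, upsampled and fused as
    O = sigmoid(A) * V."""

    def __init__(self, in_ch, out_ch, k=7):
        super().__init__()
        # Stand-in for the DM-Conv branch of Section 3.1.
        self.conv_branch = nn.Conv2d(in_ch, out_ch, 3, padding=1, bias=False)
        self.attn_branch = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 1, bias=False),               # channel match
            nn.Conv2d(out_ch, out_ch, (1, k), padding=(0, k // 2),
                      groups=out_ch, bias=False),                  # horizontal pass
            nn.Conv2d(out_ch, out_ch, (k, 1), padding=(k // 2, 0),
                      groups=out_ch, bias=False),                  # vertical pass
        )

    def forward(self, x):
        v = self.conv_branch(x)                                    # feature branch output V
        # Attention on half-resolution features (~75% fewer FLOPs in this branch).
        a = self.attn_branch(F.avg_pool2d(x, kernel_size=2, stride=2))
        a = F.interpolate(a, size=v.shape[-2:], mode="bilinear",
                          align_corners=False)                     # restore spatial size
        return torch.sigmoid(a) * v                                # O = sigma(A) * V

o = SOABlock(64, 128)(torch.randn(1, 64, 40, 40))
print(o.shape)   # torch.Size([1, 128, 40, 40])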
4. Experiment
4.1. Parameter Settings
To validate the proposed method, this study conducts experiments on three public remote sensing object detection datasets, as detailed below: (1) The DOTA-v1.0 dataset comprises a large volume of high-resolution remote sensing images spanning diverse scenes. It includes 2806 images with resolutions ranging from 800 × 800 to 4000 × 4000 pixels, annotated with 15 distinct object categories, each labeled using rotated bounding boxes. (2) The HRSC2016 dataset, a publicly accessible remote sensing ship image dataset released by Northwestern Polytechnical University in 2016, is derived from six major ports in Google Earth. It features rich geographical environments and ship activity patterns, with image spatial resolutions ranging from 0.4 to 2 m, sufficient to clearly capture fine details of ships and thereby support fine-grained detection tasks. The dataset contains 1680 images in total, encompassing 2976 ship targets, with image sizes varying from 300 × 300 to 1500 × 900 pixels to accommodate diverse research needs and adaptability testing of different algorithms. (3) The UCMerced_LandUse dataset, released by the University of California, Merced (UC Merced), is primarily utilized for remote sensing image analysis, scene classification, and land use research in computer vision. It includes high-resolution remote sensing images across 21 categories, totaling 2100 images (100 per category). These categories cover common land use types such as agriculture, forests, water bodies, and urban buildings. The relatively high spatial resolution of the images enables clear identification of detailed features in various land use types, facilitating accurate recognition of small targets and complex structures within them.
In this research, ResNet serves as the baseline model. All networks are trained using the Adam optimizer. For the DOTA-v1.0, HRSC2016, and UCMerced-LandUse datasets, training runs for 300 epochs with a batch size of 16 and a momentum of 0.9. All images are resized to 800 × 800 pixels. The initial learning rate is set to 0.001 and is reduced by a factor of 10 at 50% and 75% of the total training epochs. In all experiments, the scale factor for all channels is initialized to 0.5. For data augmentation, we adopt random horizontal flipping (probability 0.5), random vertical flipping (probability 0.3), random cropping (crop ratio 0.7–1.0), and color jitter (brightness ±10%, contrast ±10%). All experiments are run on two NVIDIA RTX 3090 GPUs (24 GB memory each) and an Intel Core i7-12700K CPU.
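For reference, a minimal PyTorch sketch of the optimizer, learning-rate schedule, and image-level augmentations described above is given below. The model is a placeholder, the Adam betas shown are the library defaults (β₁ = 0.9 read as the stated momentum), and the box-aware parts of the detection augmentation pipeline are omitted.

import torch
from torch.optim import Adam
from torch.optim.lr_scheduler import MultiStepLR
from torchvision import transforms

# Placeholder model; in practice this is the detection network with the
# DM-Conv / SOA backbone described in Section 3.
model = torch.nn.Conv2d(3, 16, 3, padding=1)

epochs = 300
optimizer = Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))
# Drop the learning rate by 10x at 50% and 75% of training.
scheduler = MultiStepLR(optimizer,
                        milestones=[int(0.5 * epochs), int(0.75 * epochs)],
                        gamma=0.1)

# Image-level augmentations only; rotated-box-aware cropping/flipping
# requires detection-specific transforms and is omitted here.
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomVerticalFlip(p=0.3),
    transforms.ColorJitter(brightness=0.1, contrast=0.1),
])

for epoch in range(epochs):
    # ... one training epoch over 800 x 800 inputs with batch size 16 ...
    optimizer.step()          # stands in for the per-iteration updates
    scheduler.step()          # per-epoch learning-rate decay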
4.2. Analysis of Ablation Experiments
The proposed dynamic mixed convolution has two hyperparameters: the ratio s used for generating the derived feature maps, and the kernel size d × d (i.e., the size of the depthwise convolution filter) used for the linear operations. This section tests the impact of these two parameters. First, with s = 2 fixed, d is adjusted within {1, 3, 5, 7}, and the results on the UCMerced-LandUse dataset are listed in Table 2. It can be observed that the dynamic mixed convolution module with d = 3 outperforms smaller or larger kernels. This is because a 1 × 1 kernel cannot introduce spatial information on the feature map, while larger kernels (e.g., d = 5 or d = 7) lead to overfitting and increased computational cost. Therefore, d = 3 is chosen in the following experiments to achieve better detection accuracy and timeliness.
Next, d = 3 is fixed, s is set to different values, and the network is trained and tested on the same dataset. The results are shown in Table 3. As discussed in Section 3.1, the hyperparameter s is directly related to the consumption of computational resources: as can be observed in Equations (4) and (5), a larger value of s leads to greater compression of computational resources and a higher speed-up ratio.
As s increases, the computational cost is significantly reduced while the accuracy gradually decreases; in particular, s = 2 offers a good balance of accuracy, computational resources, and time consumption.
2. Visualization of intermediate features
This section first visualizes the intermediate features of the residual network ResNet18 in order to observe the similarities between the original feature maps. In Figure 5, the first residual block of ResNet18 is used to extract features, with the input being the original image and the output being the output feature map of the residual block. The feature data in Figure 1 come from the output of the 3rd convolutional layer of ResNet18. The input images are 50 randomly selected 800 × 800 remote sensing images from the DOTA dataset, resized to 224 × 224 before being fed into the network. Feature visualization is performed with Matplotlib 3.10.0, presenting the feature-map intensity distributions as heatmaps. The blue and green highlighted areas in the figure are high-similarity feature regions screened by the minimum mean squared error (MSE), computed on the feature-map pixel value matrices (the specific values correspond to the MSE data for k = 3 in Table 1). It can be seen that the feature maps within boxes of the same color are highly similar. For example, in the feature map of the airplane in the upper left corner, the target features in the red box have similar contours, textures, and brightness. For complex backgrounds such as tennis courts in urban scenes (lower left corner of Figure 5), a large number of similar feature maps can also be found, such as the areas in the red, blue, and orange boxes. For docks with high aspect ratios and oil tanks with circular geometric appearances, similar output feature maps also appear. Moreover, the MSE values of each pair of feature maps within boxes of different colors are shown in Table 4. All MSE values in this table are extremely small, indicating a strong correlation between feature maps in deep neural networks; these redundant feature maps can be generated from several inherent feature maps. Similar phenomena hold for most images.
This section also visualizes the intermediate feature maps generated by the proposed dynamic mixed convolution, as shown in
Figure 6. On the left is the feature map generated by standard convolution, while on the right is the one generated by linear mapping. Although the feature map on the right originates from the mapping of convolution features from the previous layer, these mapped feature maps exhibit significant differences from each other. This implies that the features generated by dynamic mixed convolution are flexible enough to provide a rich and diverse set of features (as shown in
Figure 7) for specific classification and regression tasks while reducing computational complexity.
To intuitively compare the effectiveness of this method with classic image classification and recognition methods, this section uses confusion matrices (as shown in Figure 8) for visual representation. Each row of the confusion matrix corresponds to all samples predicted to belong to that class, and the diagonal indicates the number of correctly predicted samples. The denser the distribution of predicted values on the diagonal, the better the model performance; the confusion matrix also makes it easy to identify which categories the model tends to misclassify. As evident from Figure 8, the confusion matrix of the proposed method exhibits a higher density on the diagonal, indicating good classification performance. Table 5 also reports the improvements obtained by rebuilding classic backbone structures such as ResNet and VGG with dynamic mixed convolution. It can be observed that, while keeping the recognition accuracy nearly lossless, the computational complexity is reduced by approximately a factor of two.
4.3. Comparative Experiment
To verify the effective application of the proposed lightweight backbone structure in rotational object detection methods, this section compares the method proposed in this chapter with classic remote sensing object detection methods in recent years on the oriented bounding box (OBB) detection task of the DOTA dataset. The results are shown in
Table 6. These comparison methods include two major categories: two-stage detectors and one-stage detectors. As can be seen from the table, the method proposed in this chapter achieves a detection performance of 79.37% on the mAP metric, achieving the best detection results among all compared algorithms. For small objects, such as small vehicles (SV) and storage tanks (ST), the method achieves mAP values of 81.54% and 87.31%, respectively. For large-scale objects, such as swimming pools (SP), baseball fields (BD), and ports (HA), the AP values of the method are all greater than 75%. In addition, the method achieves good detection performance (as shown in
Figure 9) for two types of objects with variable orientations, sparse wide-neighborhoods, and multi-neighborhood aggregations: airplanes (PL) and ships (SH). The above results demonstrate that the method proposed in this chapter has certain adaptability to scale and rotational orientation.
Compared with two-stage remote sensing detection algorithms, the algorithm presented in this chapter demonstrates certain advantages in terms of computational complexity and detection accuracy. Among them, RRPN, SCR-Det, and related methods all take Faster R-CNN as the baseline and add rotated boxes to detect rotated targets. RRPN does not use a feature pyramid network, making it difficult to adapt to remote sensing targets with large scale differences and resulting in relatively low detection performance. SCR-Det achieves an mAP of 72.61% on the DOTA dataset, the best among the two-stage algorithms. Compared with SCR-Det, the computational complexity of the method presented in this chapter is only half as large, while its mAP is 6.76% higher. Among single-stage algorithms, the method presented in this chapter also shows clear advantages. Specifically, methods such as R3Det, GWD, RSDet, CSL, and SASM have introduced improvements in rotated feature extraction, rotated box representation, and positive/negative sample assignment strategies, enabling them to detect remote sensing targets with large scale differences and multiple rotation directions. Among these single-stage detection algorithms, the most advanced, SASM, achieves an mAP of 79.17% on the DOTA dataset. The detection accuracy of the method presented in this chapter is close to it, while its computational complexity is only 53.6% of that of SASM. These results demonstrate that the method presented in this chapter achieves an effective balance between detection accuracy and speed.
2. Comparison experiment on the HRSC-2016 dataset
HRSC-2016 contains a large number of ship targets with arbitrary orientations and significant scale differences. To evaluate the performance of the method proposed in this chapter on this dataset, this section compares it with classic remote sensing target detection methods from recent years. The quantitative comparison results are shown in Table 7. When using dynamic mixed convolution as the backbone network, the method proposed in this chapter achieves a mean average precision (mAP) of 90.3%, with detection performance comparable to advanced remote sensing detection networks such as ReDet and O-RCNN, while its computational complexity is only 50% of that of these detection networks. In addition, building on the previous chapter, the backbone structure of HAA-Net is modified using the proposed dynamic mixed convolution, resulting in a 27% reduction in computational complexity while maintaining similar detection accuracy. As can be seen from the visualization results in Figure 10, the method proposed in this chapter retains good detection capability for ship targets with pose and scale variations.
3. Comparison with well-known lightweight detectors
We further compare the proposed method with the aforementioned well-known lightweight detectors on the DOTA dataset, reporting mAP, Params, and FLOPs (as shown in Table 8). The results show that the proposed method achieves an mAP of 79.37%, higher than YOLOv4-tiny (68.23%), EfficientDet-D0 (72.5%), and a GhostNet-based detector (75.1%). Meanwhile, the proposed method requires 84 G FLOPs, lower than YOLOv4-tiny (116 G), EfficientDet-D0 (107 G), and the GhostNet-based detector (112 G), and its parameter count is only 23.0 M, significantly lower than those of the three detectors (28.5 M, 31.2 M, and 29.8 M, respectively).
On the CPU (Intel Core i7-12700K), the proposed method runs at 18.2 FPS on the DOTA dataset with an input size of 800 × 800 (shown in Table 8), faster than YOLOv4-tiny (15.6), EfficientDet-D0 (12.3), and the GhostNet-based detector (14.8). On the edge device (NVIDIA Jetson Xavier NX), as shown in Table 9, the proposed method reaches 10.5 FPS, also higher than the three aforementioned detectors (8.9, 7.2, and 9.1, respectively). These results confirm the real-time capability of the proposed method.
5. Conclusions
The large amount of wide-field-of-view image data in remote sensing brings high computational complexity to deep neural networks, making it difficult to meet the demand for precise and fast detection in on-orbit applications. Therefore, this paper proposes a convolutional-structure lightweighting algorithm that ensures a high-quality and compact representation by designing diverse hybrid operators to simplify the convolutional network, achieving a significant reduction in network computational complexity while maintaining nearly lossless precision. Specifically, dynamic mixed convolution is designed within the convolutional layer, where linear mapping replaces part of the standard convolution to generate redundant feature maps and reorganize the convolution operation; a feature aggregation method is designed between convolutional layers to generate deep features by weighted mixing of features from different intermediate layers; in addition, a spatial orthogonal attention mechanism is designed to capture the dependencies between distant pixels in both the horizontal and vertical directions, enhancing the feature expression ability of the convolutional layer.
However, the method proposed in this paper still has some shortcomings in the following scenarios: first, in the task of extremely small object detection, the linear mapping operation of DM-Conv may lead to the loss of some fine-grained features, which in turn has a certain impact on the detection accuracy of such objects; second, when processing ultra-high-resolution remote sensing images, the horizontal and vertical downsampling operations adopted by SOA would cause a small amount of spatial information loss, and at the same time lead to a certain increase in computational overhead; and finally, the robustness of the proposed method in complex rain and cloud scenes has not been fully verified, and this issue needs to be further optimized and solved in subsequent studies by combining meteorological correction data.