Article

GMG-LDefmamba-YOLO: An Improved YOLOv11 Algorithm Based on Gear-Shaped Convolution and a Linear-Deformable Mamba Model for Small Object Detection in UAV Images

1 School of Computer Science, Hubei University of Technology, Wuhan 430068, China
2 Key Laboratory of Green Intelligent Computing Network in Hubei Province, Wuhan 430068, China
3 School of Computer Science and Artificial Intelligence, Wuhan University of Technology, Wuhan 430070, China
* Author to whom correspondence should be addressed.
Sensors 2025, 25(22), 6856; https://doi.org/10.3390/s25226856
Submission received: 9 September 2025 / Revised: 21 October 2025 / Accepted: 27 October 2025 / Published: 10 November 2025
(This article belongs to the Section Sensing and Imaging)

Abstract

Object detection plays a crucial role in remote sensing and UAV imaging, but in scenes with dense, multi-scale small targets it must balance speed and accuracy, and it is susceptible to noise from weather conditions, lighting changes, and occluded, cluttered backgrounds. In recent years, Mamba-based methods have become a research focus in object detection, showing great potential for capturing long-range dependencies with linear complexity, yet they lack deep customization for remote sensing targets. We therefore propose GMG-LDefmamba-YOLO, which contains two core modules. The Gaussian mask gear convolution module forms a gear-shaped receptive field through an improved convolutional splicing scheme to enhance small-target feature extraction and uses a Gaussian mask mechanism to dynamically modulate feature weights and suppress complex background interference. The linear deformable Mamba module integrates linear deformable sampling, a spatial state dual (SS2D) model, and a residual gated MLP, combining flexible local feature capture with efficient global dependency modeling, dynamically adapting to target scale changes and spatial distributions while reducing computational cost. Experiments on the DOTA-v1.0, VEDAI, and USOD datasets show that the model reaches an mAP50 of 70.91%, 77.94%, and 90.28%, respectively, outperforming the baseline and mainstream methods while remaining lightweight, providing efficient technical support for remote sensing monitoring, UAV inspection, and related applications.

1. Introduction

In recent years, object detection in remote sensing images (RSIs) [1] and unmanned aerial vehicle (UAV) imagery [2] has been applied to agricultural monitoring [3], environmental monitoring [4], infrastructure inspection [5], and other fields, providing efficient data support for acquiring and analyzing wide-area geographic information. However, target detection in remote sensing and UAV images still faces many challenges, as shown in Figure 1. On the one hand, remote sensing targets are usually dense and occupy a very small proportion of pixels (some even less than 16 × 16 pixels); their feature information is sparse, their localization depends heavily on central features, and their feature representation is blurred and extremely susceptible to interference from complex backgrounds [6]. On the other hand, the scale of remote sensing targets varies drastically, and target morphology differs significantly across distances; combined with factors such as changing lighting conditions and target occlusion [7], this further increases the difficulty of accurate detection. In particular, the balance between small-target detection accuracy and algorithm efficiency urgently needs to be addressed.
In the field of object detection, deep learning-based algorithms have become mainstream and are generally divided by workflow into two-stage and single-stage detectors. Two-stage algorithms [8], such as the R-CNN series [9], first generate region proposals and then extract features and classify these regions. Although this approach achieves higher detection accuracy, it is computationally expensive and slow, making it unsuitable for scenarios that require rapid response. Faster R-CNN [10] significantly improves detection speed by introducing a region proposal network, but it still requires classification and bounding box regression for each candidate region, which limits its performance in real-time applications. In contrast, SSD [11] pioneered single-stage detection, followed by methods such as RetinaNet [12] and YOLO [13]. Beyond these two families, vision transformers such as DETR [14] and Swin Transformer [15] process images by dividing them into contiguous patches and applying self-attention mechanisms for object detection. While vision transformers can simplify the overall architecture, their high computational requirements make them unsuitable for deployment on resource-constrained UAV equipment. Single-stage detectors, by contrast, directly predict object classes and bounding boxes on the entire image without generating region proposals, resulting in faster detection speeds and suitability for real-time applications and UAV platform integration. Notably, the YOLO series has garnered attention for its fast detection speed and balanced performance, maintaining accuracy while ensuring speed. However, single-stage algorithms generally lag slightly behind two-stage algorithms in accuracy, especially for small object detection.
In recent years, more work has begun to rethink how to design improved CNNs to achieve higher accuracy and faster speed. An increasing number of studies adopt hybrid architectures to improve models and reduce complexity, such as MobileViT [16], EdgeViT [17], and EfficientFormer [18]. The state space model (SSM) Mamba [19], equipped with a selective scanning mechanism, has shown superior performance in long-range interaction with linear computational complexity. These advantages enable it to effectively address the computational inefficiency of transformers in long-sequence spatial modeling. LS-MambaNet [20], built on an improved Mamba scanning mechanism, and the hybrid architecture Mamba-YOLO [21] have made excellent progress in this direction. At the same time, other innovations in convolution have emerged, such as improved convolutional splicing schemes, pinwheel-shaped convolution [22], and the deformable convolution DCNv4 [23], which have achieved outstanding results in object detection. However, these methods are still insufficiently adapted to the characteristics of small targets in remote sensing, and they show clear limitations in detection speed and in their suitability for edge devices such as UAVs. Moreover, although improved convolutional splicing strengthens directional sensitivity, it lacks the ability to handle edge features and is susceptible to the complex background interference surrounding remote sensing targets. Traditional deformable convolution can adapt to the multi-scale morphological changes of remote sensing targets, but its parameter growth pattern makes the computational cost too high, and performance bottlenecks easily occur on edge devices with limited computing power. Therefore, it is crucial to explore how to combine these advanced methods with the YOLO architecture.
To solve the above challenges, this paper proposes a target detection method based on the YOLOv11 framework, GMG-LDefmamba-YOLO, specifically designed for remote sensing small object detection. The main contributions of this study are as follows:
  • The Gaussian mask gear convolution module (GMGblock) is proposed. It extracts direction-sensitive features through mirror-symmetric, eight-way padded convolutional kernel branches, splices them into a symmetric gear-shaped receptive field, and combines dynamic modulation by a Gaussian mask mechanism to strengthen the extraction of central features of small targets and effectively suppress complex background interference.
  • The linear deformable Mamba block (LDefMambablock) is designed. It integrates the linear deformable sampling LDblock, the spatial state dual model SS2D, and a residual gated MLP component, combining the flexible local feature capture of linear deformable convolution with the efficient global dependency modeling of the Mamba architecture. The module dynamically adapts to the spatial distribution and scale changes of targets, while its gating mechanism optimizes computing efficiency without sacrificing accuracy, adapting to the computing power constraints of edge devices.
  • The proposed GMG-LDefmamba-YOLO architecture achieves state-of-the-art detection performance for remote sensing small targets, demonstrating excellent detection accuracy and inference efficiency on the DOTA-v1.0, VEDAI, and USOD datasets.
Notably, our work differs from the existing Mamba-YOLO in two critical ways: (1) we adopt YOLOv11 as the baseline (instead of YOLOv8) and explore SSM's compatibility with its newly added modules (SPPF, C2PSA) and its core C3k2 module, addressing the fact that most YOLO improvements focus on older versions (v5/v8); (2) we customize the method for remote sensing small target detection, whereas Mamba-YOLO targets general object detection. These differences lay the foundation for our methodological innovations and further verify the compatibility of Mamba with the YOLO architecture. In our experiments we also compare against newer YOLO versions, including YOLOv12 and YOLOv13, and the experimental results demonstrate the SOTA performance of our YOLOv11-based design. The source code will be available at https://github.com/acaneyoru/GMG-LDefmamba-YOLO (accessed on 2 July 2025).

2. Related Works

2.1. YOLOv11

As shown in Figure 2, the YOLOv11 [24] architecture consists of three key components: the backbone, neck, and head. The YOLOv11 backbone retains the modular hierarchical design but replaces the original C2f module with a more efficient C3k2 unit and adds C2PSA, a convolutional module with spatial attention [25], to enhance the detection of small and occluded targets. YOLOv11 also retains fast spatial pyramid pooling (SPPF), an upgraded version of traditional SPP that applies sequential pooling to achieve strong feature representation [26]. The neck of YOLOv11 adopts a path aggregation network (PANet) structure [27] to enable cross-scale feature interaction through upsampling and concatenation operations. The neck integrates features from different levels of the backbone, retaining the spatial details of low-level features while fusing the semantic information of high-level features, providing rich feature support for small object detection. The detection head continues the efficient single-stage design of the YOLO series, using multi-scale detection branches (corresponding to small, medium, and large targets) jointly optimized with box_loss, cls_loss, and dfl_loss [28]. Through the collaborative design of the backbone, neck, and head, YOLOv11 achieves a balance between detection accuracy and inference efficiency while remaining lightweight.

2.2. State Space Models

In recent years, structured state space sequence models (SSMs) such as Mamba [29] have become powerful methods for long-sequence modeling. These models offer linear complexity with respect to input length and can efficiently model global information. Compared with traditional self-attention mechanisms, SSMs reduce computational complexity from quadratic to linear by compressing the hidden state, allowing each element in a one-dimensional sequence (e.g., a text sequence) to interact with previously scanned samples. While SSMs were originally designed for natural language processing (NLP) tasks, they have also shown great potential in computer vision. Vision Mamba [30] proposed a pure visual backbone based on selective SSM, marking the first introduction of Mamba into the visual domain. VMamba [31] introduced a cross-scan module that enables the model to selectively scan 2D images, enhancing visual processing capability and demonstrating superiority in image classification. LocalMamba [32] focuses on window scanning strategies for visual state space models, optimizing the capture of local dependencies and introducing dynamic scanning to search for the optimal option at different layers. Although these Mamba-based models perform well in vision tasks, the scanning strategies in SSM introduce additional computational cost and feature redundancy, which become a performance bottleneck for remote sensing target recognition. In addition, although Mamba strengthens connections across the spatial distribution of targets, its local feature extraction ability is limited, especially for small targets. At the same time, as an end-to-end detector, we want the improved Mamba model based on the YOLO architecture to be trained directly on the target dataset and put into use, without pre-training on large-scale datasets.

2.3. Deformable Convolution

Traditional convolutional networks with fixed receptive fields struggle to adapt to the dynamic characteristics of aerial scenes [33], so lightweight and adaptive feature extraction methods are needed to balance small-target sensitivity and computational efficiency. Dai et al. [34] developed the deformable convolutional network (DCN) with a dynamically adjustable offset mechanism, which significantly enhanced the spatial adaptability of convolutional kernels. However, the multi-layer deformable structure greatly increases the model parameters, exacerbating the consumption of computational resources during training. Du et al. [35] proposed a sparse convolution scheme that applies channel pruning to the detection head, reducing computational complexity while maintaining detection accuracy; however, this approach compromises feature discriminability in complex scenes. Wang et al. [36] constructed an elastic receptive field model using deformable convolution as the basic operator, breaking traditional geometric constraints, but a mechanism linking receptive fields to the spatial distribution of small targets has not been established. Qi et al. [37] developed DSCNet with a serpentine convolutional structure, which improved segmentation accuracy; however, its topological feature extraction mechanism is susceptible to irregular-shape interference in general object detection scenarios, and the stability of its dynamic kernel adjustment needs optimization. The recent linear deformable convolution LDConv [38] is considered very effective for small-target feature extraction; Wu et al. [39] combined it with the YOLO architecture and proposed the AAPW-YOLO network for remote sensing target detection. However, this method does not perform well in dense object detection.

3. Method

3.1. Overview of GMG-LDefmamba-YOLO

As shown in Figure 3a, our GMG-LDefmamba-YOLO framework follows the overall YOLO hierarchy and consists of three parts: backbone downsampling, neck feature fusion, and head detection. We replace the key small-target extraction layers with two modules, GMGblock and LDefMambablock, and improve the neck feature pyramid structure to enhance small-target feature fusion. Figure 3b shows the GMGblock structure, which mainly improves the convolutional splicing scheme: it extracts direction-sensitive features through two mirror-symmetric, four-way padded convolutional kernel groups and applies weighted fusion with a Gaussian convolutional kernel whose weights are heaviest at the center and decrease outward, strengthening the extraction of small-target features. Figure 3c outlines the core structure of the LDefMambablock, whose most critical components are the SS2D module, responsible for global feature extraction, and the LDblock module, responsible for local feature extraction at the downsampling layer. The module discards redundant parameters through the Mamba gating mechanism to reduce computational cost.
In order not to disturb basic feature extraction, we retain the original convolutional structure of the shallow network and keep the Conv and C3k2 module design consistent with the foundation model at layers 0–2, ensuring that preliminary information is preserved. In the middle feature extraction stage (the P3/8 small-target extraction layer), the improved LDefMambablock is introduced to strengthen key small-target features, and GMGblock replaces the C3k2-based feature splicing of the improved feature pyramid at the key layers of the neck.

3.2. Gaussian Mask Gear Convolution GMGblock

In recent years, remote sensing small object detection methods based on convolutional neural networks (CNNs) have achieved excellent performance. However, these methods often employ standard convolution and ignore the spatial characteristics of the pixel distribution of small remote sensing targets. In addition, small targets usually occupy a very low proportion of pixels and carry sparse feature information, resulting in weak feature expression and a strong dependence on central features for localization and detection. To address this, we propose the GMGblock module, shown in Figure 3b. The module uses multi-branch convolution with asymmetric padding to extract horizontal and vertical feature information from the image and fuses the results through dynamic weighting with a Gaussian mask.
First, the input feature map is split into eight branches, grouped into two sets of four. Each set undergoes mirror-symmetric four-way expansion convolutions and is concatenated in the orders "up, left, bottom, right" and "left, up, right, bottom", respectively, forming mirror-symmetric, pinwheel-shaped receptive fields. The two groups are then concatenated to obtain a fully symmetric gear-shaped receptive field whose weights decrease from the center outward. Finally, dynamically weighted multi-scale feature fusion is carried out with the Gaussian mask mechanism to further strengthen the central features. This design makes full use of the input feature map: the multi-branch convolution and improved splicing scheme strengthen central feature extraction without introducing redundant information or extra computation, and the dynamic Gaussian weighting matches the central features of small targets to the greatest extent.
Specifically, the GMGblock module works as follows. First, the input feature map $X^{h_1, w_1, c_1}$ with $c_1$ input channels is divided into eight direction-sensitive branches, each with $c_1/8$ channels. Four of the branches are asymmetrically padded in the directions up, left, bottom, and right, and the other four in the mirrored directions left, up, right, and bottom. This process is described as follows:
Vertical Upper Branch (Top):
Apply asymmetric padding $P(1,0,0,3)$ to the input (left 1 pixel, right 0 pixels, top 0 pixels, bottom 3 pixels) to obtain the padded tensor $X_{P(1,0,0,3)}^{h_1+3,\,w_1+1,\,c_1}$. A 1 × 3 horizontal convolutional kernel $W_1^{1,3,c}$ is applied, followed by batch normalization (BN) and SiLU activation.
Vertical Bottom Branch (Bottom):
Apply asymmetric padding $P(0,3,0,1)$ to the input (left 0 pixels, right 3 pixels, top 0 pixels, bottom 1 pixel) to obtain the padded tensor $X_{P(0,3,0,1)}^{h_1+1,\,w_1+3,\,c_1}$. A 3 × 1 vertical convolutional kernel $W_2^{3,1,c}$ is applied, followed by BN and SiLU activation.
Horizontal Left Branch (Left):
Apply asymmetric padding $P(0,1,3,0)$ to the input (left 0 pixels, right 1 pixel, top 3 pixels, bottom 0 pixels) to obtain the padded tensor $X_{P(0,1,3,0)}^{h_1+3,\,w_1+1,\,c_1}$. A 1 × 3 horizontal convolutional kernel $W_3^{1,3,c}$ is applied, followed by BN and SiLU activation.
Horizontal Right Branch (Right):
Apply asymmetric padding $P(3,0,1,0)$ to the input (left 3 pixels, right 0 pixels, top 1 pixel, bottom 0 pixels) to obtain the padded tensor $X_{P(3,0,1,0)}^{h_1+1,\,w_1+3,\,c_1}$. A 3 × 1 vertical convolutional kernel $W_4^{3,1,c}$ is applied, followed by BN and SiLU activation. The outputs of the four branches are computed as follows:
$$X_1^{h,w,c} = \mathrm{SiLU}\big(\mathrm{BN}\big(X_{P(1,0,0,3)}^{h_1+3,\,w_1+1,\,c_1} \ast W_1^{1,3,c}\big)\big)$$
$$X_2^{h,w,c} = \mathrm{SiLU}\big(\mathrm{BN}\big(X_{P(0,3,0,1)}^{h_1+1,\,w_1+3,\,c_1} \ast W_2^{3,1,c}\big)\big)$$
$$X_3^{h,w,c} = \mathrm{SiLU}\big(\mathrm{BN}\big(X_{P(0,1,3,0)}^{h_1+3,\,w_1+1,\,c_1} \ast W_3^{1,3,c}\big)\big)$$
$$X_4^{h,w,c} = \mathrm{SiLU}\big(\mathrm{BN}\big(X_{P(3,0,1,0)}^{h_1+1,\,w_1+3,\,c_1} \ast W_4^{3,1,c}\big)\big)$$
where $X_i \in \mathbb{R}^{h \times w \times c}$ denotes the feature map obtained from each branch, $\ast$ denotes convolution, and the stride $s$ determines the output size, satisfying $h = \lfloor (h_1 + p_h - k_h)/s \rfloor + 1$ and $w = \lfloor (w_1 + p_w - k_w)/s \rfloor + 1$, where $p_h, p_w$ are the padding height and width and $k_h, k_w$ are the kernel height and width. The output channel number of each branch satisfies $c = c_1/8$.
$$X_a = \mathrm{concat}(X_1, X_2, X_3, X_4)$$
In this way, the four $c_1/8$-channel branches of each group are spliced into a $c_1/2$-channel feature map ($X_a$ for the first group and $X_b$ for the mirrored group). Owing to the asymmetric padding and staggered splicing above, each group has a mirror-symmetric, pinwheel-shaped receptive field. The two groups are then spliced:
$$X' = \mathrm{concat}(X_a, X_b)$$
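The following minimal PyTorch sketch illustrates this eight-branch splicing. The stride of 2 and the specific pad/kernel pairing are assumptions chosen so that every branch produces the same output size and the channel-wise concatenation is valid; the released code may pair them differently, and the module names here are illustrative.
```python
import torch
import torch.nn as nn


class ConvBNSiLU(nn.Module):
    """Conv -> BN -> SiLU, with no implicit padding (padding is applied explicitly)."""
    def __init__(self, c_in, c_out, k, s):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, padding=0, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))


class GearConvSketch(nn.Module):
    """Eight direction-sensitive branches spliced into a gear-shaped receptive field."""
    def __init__(self, c1, c2, k=3, s=2):
        super().__init__()
        # padding tuples are (left, right, top, bottom); group B mirrors group A
        pads_a = [(k, 0, 1, 0), (0, k, 0, 1), (0, 1, k, 0), (1, 0, 0, k)]
        pads_b = [(0, k, 1, 0), (k, 0, 0, 1), (1, 0, k, 0), (0, 1, 0, k)]
        self.pads = nn.ModuleList(nn.ZeroPad2d(p) for p in pads_a + pads_b)
        # the first two pads of each group feed 1xk kernels, the last two feed kx1 kernels
        kernels = [(1, k), (1, k), (k, 1), (k, 1)] * 2
        self.branches = nn.ModuleList(ConvBNSiLU(c1, c2 // 8, ks, s) for ks in kernels)

    def forward(self, x):
        return torch.cat([b(p(x)) for p, b in zip(self.pads, self.branches)], dim=1)


if __name__ == "__main__":
    out = GearConvSketch(64, 64)(torch.randn(1, 64, 128, 128))
    print(out.shape)  # torch.Size([1, 64, 65, 65])
```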
After this second concatenation, the number of channels is restored to $c = c_1$, and the receptive field is fully symmetric, forming a gear shape whose weights decrease from the center outward. We then introduce the Gaussian mask enhancement mechanism to optimize the spliced feature map. Specifically, the Gaussian mask is generated by a preset Gaussian kernel function whose values follow a Gaussian distribution centered on the feature map and decay exponentially toward the periphery, strengthening the feature weights of the central region and suppressing edge noise. The Gaussian kernel is generated as follows:
$$K(i,j) = \frac{1}{2\pi\sigma^2}\exp\!\left(-\frac{(i-c)^2 + (j-c)^2}{2\sigma^2}\right)$$
where $c$ is the coordinate of the kernel center and $\sigma$ is the standard deviation of the Gaussian function, which controls the rate of weight decay. The Gaussian kernel acts on the spliced feature map $X'$ through a convolutional layer, batch normalization (BN), and SiLU activation to output the enhanced Gaussian feature map.
We generate the kernel matrix by expanding Equation (7) into a K × K tensor. For example, when K = 3 and sigma = 1.0, the generated kernel matrix is:
$$\begin{bmatrix} 0.0585 & 0.1247 & 0.0585 \\ 0.1247 & 0.2653 & 0.1247 \\ 0.0585 & 0.1247 & 0.0585 \end{bmatrix}$$
This matrix is then repeated to match the input/output channel number of gaussian_conv (i.e., c2 channels), ensuring each feature channel receives consistent Gaussian weighting. In the forward pass, the pre-initialized Gaussian kernel is applied to the concatenated eight-branch feature map, essentially an element-wise multiplication between the kernel and local feature patches followed by summation. This operation emphasizes pixels with high kernel weights (near target centers) and weakens background pixels.
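A minimal sketch of this kernel construction and channel-wise application is shown below; the normalization convention (and hence the exact printed values) and the function names are assumptions, not the released implementation.
```python
import torch
import torch.nn as nn


def gaussian_kernel(k: int = 3, sigma: float = 1.0) -> torch.Tensor:
    """Build a k x k Gaussian kernel centered on the middle pixel, summing to 1."""
    c = (k - 1) / 2.0
    idx = torch.arange(k, dtype=torch.float32)
    i, j = torch.meshgrid(idx, idx, indexing="ij")
    kern = torch.exp(-((i - c) ** 2 + (j - c) ** 2) / (2 * sigma ** 2))
    return kern / kern.sum()


def make_gaussian_conv(channels: int, k: int = 3, sigma: float = 1.0) -> nn.Conv2d:
    """Fixed depthwise convolution that applies the same Gaussian weighting to every channel."""
    conv = nn.Conv2d(channels, channels, k, padding=k // 2, groups=channels, bias=False)
    conv.weight.data.copy_(gaussian_kernel(k, sigma).repeat(channels, 1, 1, 1))
    conv.weight.requires_grad_(False)  # statically defined, not learned
    return conv
```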
In order to balance the contribution of the original splicing features and the Gaussian-enhanced features, a dynamic fusion weight mechanism is designed: the fusion weights are calculated by the learnable parameters and the Softmax function, and the two types of features are weighted and summed:
$$\mathrm{fused\_feat} = \omega_0 \cdot \mathrm{concat\_feat} + \omega_1 \cdot \mathrm{gaussian\_feat}$$
where $\omega_0$ and $\omega_1$ are normalized weights ($\omega_0 + \omega_1 = 1$). The model automatically adjusts their values according to the task, flexibly strengthening the response of key regions while retaining the diversity of the multi-branch features.
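The dynamic fusion described above can be sketched as follows, with two learnable logits normalized by Softmax so the weights always sum to 1 (names are illustrative).
```python
import torch
import torch.nn as nn


class WeightedFusion(nn.Module):
    """Learnable, Softmax-normalized fusion of the spliced and Gaussian-enhanced features."""
    def __init__(self):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(2))

    def forward(self, concat_feat, gaussian_feat):
        w = torch.softmax(self.logits, dim=0)  # w[0] + w[1] == 1
        return w[0] * concat_feat + w[1] * gaussian_feat
```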
Finally, channel integration of the fused features produces the final feature map with output channel number $c = c_1$. This operation not only compresses the feature dimension further but also enhances cross-channel information interaction, preserving the strong response of central small-object features and the integrity of the spatial context.
In summary, the GMGblock captures rich direction-sensitive features through multi-branch asymmetric convolution and, combined with the dynamic enhancement of the Gaussian mask, accurately adapts to the characteristics of small remote sensing targets, namely strong dependence on central features and sparse spatial distribution, significantly improving the discriminability of small-target features while suppressing background noise. In addition, by replacing the basic convolution module in the bottleneck for both the C3k2 = False and C3k2 = True configurations, the improvement can be applied to the C3k2 structure of the neck, optimizing feature extraction, multi-scale feature fusion, and feature splicing in the pyramid.

3.3. Linear Deformable Mamba Block LDefMambablock

In visual remote sensing, target size and shape change with viewing distance, and complex backgrounds such as buildings, clouds, or vegetation further obscure targets. Mamba can model long-range feature dependencies, which is essential for understanding the global context in remote sensing images. However, existing SSM-based models often employ scanning strategies to ensure connectivity between different regions of the image; this significantly increases feature redundancy within the SSM, resulting in high computational costs when processing high-resolution remote sensing images and becoming a performance bottleneck. Meanwhile, in remote sensing detection scenarios, traditional convolution struggles to adapt to target scale changes and complex background interference because of its fixed sampling shape and cannot accurately focus on effective feature regions. Traditional deformable convolution is limited by a parameter growth pattern of quadratic complexity: as the number of sampling points increases, the computational overhead rises significantly, making it difficult to balance performance and efficiency under lightweight requirements. Traditional feature fusion modules mostly use fixed weights or simple splicing strategies and lack the ability to dynamically adapt features at different levels and scales, which easily causes redundancy or loss of feature information.
To address these problems, we propose the efficient linear deformable Mamba module, LDefMambablock, shown in Figure 3c. It integrates the SSM-based SS2D spatial state dual model to capture global dependencies and an improved linear deformable convolution to extract accurate local features at variable scales, and it performs post-processing with a gated MLP to complement channel-wise nonlinear interactions.
Specifically, the structure can be decoupled into three independent functional components: the linear deformable sampling LDblock, the spatial state dual model SS2D, and the residual gated Res-MLP block.

3.3.1. LDblock

The working principle of the LDblock module is shown in Figure 3c, which is divided into three steps: initial sampling coordinate generation, dynamic offset adjustment and coordinate correction, and bilinear interpolation and feature aggregation.
The first step is initial sampling coordinate generation: for an arbitrary number of sampling points $N$ (e.g., 3, 5, 7), the "regular grid + irregular expansion" strategy is used to generate the initial sampling coordinates $P_n$. A regular grid framework is built from the rounded square root of $N$ ($\mathrm{base\_int} = \mathrm{round}(\sqrt{N})$), and the remainder ($\mathrm{mod\_number} = N \bmod \mathrm{base\_int}$) is expanded irregularly, ensuring that the distribution of sampling points is both stable and flexible. Mathematically, the initial coordinate generation follows:
$$\mathrm{base\_int} = \mathrm{round}(\sqrt{N}), \qquad \mathrm{row\_number} = \left\lceil \frac{N}{\mathrm{base\_int}} \right\rceil, \qquad \mathrm{mod\_number} = N \bmod \mathrm{base\_int}$$
Here, $\mathrm{round}(\sqrt{N})$ gives the integer closest to $\sqrt{N}$ and forms the base of the regular grid; the ceiling operation $\lceil N/\mathrm{base\_int} \rceil$ determines the number of rows needed to accommodate all sampling points, ensuring none are omitted; and $N \bmod \mathrm{base\_int}$ gives the remainder of $N$ divided by $\mathrm{base\_int}$, which guides the irregular expansion of the sampling points.
The regular grid part generates the coordinates $(p_{n_x}, p_{n_y})$ through torch.meshgrid, the irregular expansion part supplements the sampling points corresponding to the remainder, and the two are finally spliced into a complete initial coordinate matrix $P_n \in \mathbb{R}^{2N \times 1 \times 1}$.
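The coordinate scheme can be sketched as follows; the rounding conventions and names are assumptions for illustration and may differ from the released LDblock code.
```python
import math
import torch


def initial_offsets(n: int) -> torch.Tensor:
    """Regular-grid + irregular-expansion initial coordinates for n sampling points."""
    base_int = round(math.sqrt(n))      # side length of the regular grid
    row_number = n // base_int          # full rows in the regular grid
    mod_number = n % base_int           # leftover points form one partial extra row
    gy, gx = torch.meshgrid(torch.arange(row_number),
                            torch.arange(base_int), indexing="ij")
    xs, ys = gx.flatten(), gy.flatten()
    if mod_number > 0:                  # irregular expansion
        xs = torch.cat([xs, torch.arange(mod_number)])
        ys = torch.cat([ys, torch.full((mod_number,), row_number)])
    # stack x- and y-coordinates into the (1, 2N, 1, 1) layout used as P_n
    return torch.cat([xs, ys]).float().view(1, 2 * n, 1, 1)


print(initial_offsets(5).flatten().tolist())  # a 2x2 regular grid plus one extra point
```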
The second step is dynamic offset adjustment and coordinate correction. To adapt to changes in target shape, the LDblock predicts sampling-point offsets $\Delta P \in \mathbb{R}^{B \times 2N \times H \times W}$ through an independent offset branch p_conv and fuses them with the initial coordinates to generate the dynamic sampling positions $P = P_0 + P_n + \Delta P$, where $P_0$ is the datum coordinate grid determined by the input feature map size and stride. Offset learning adopts a gradient scaling strategy to avoid feature instability caused by excessive offsets early in training. The corrected coordinates are clipped to the range of the feature map to ensure sampling validity:
$$P_{clip} = \big(\mathrm{clamp}(P^{(y)}, 0, H-1),\ \mathrm{clamp}(P^{(x)}, 0, W-1)\big)$$
Writing the gradient scaling factor explicitly, the dynamic sampling position $\hat{P}_n$ is calculated as
$$\hat{P}_n = P_n + \alpha\,\Delta P, \qquad \hat{P}_n \leftarrow \big(\mathrm{clip}(\hat{P}_n^{(y)}, 0, H-1),\ \mathrm{clip}(\hat{P}_n^{(x)}, 0, W-1)\big)$$
where $\alpha = 0.1$ is the gradient scaling factor that prevents feature instability caused by excessive offsets early in training, and clip restricts the coordinates to the valid range of the feature map ($H$ and $W$ are its height and width).
The third step is bilinear interpolation and feature aggregation. Based on the dynamic coordinates $P_{clip}$, bilinear interpolation is used to resample features: the feature value at each sampling point $(x, y)$ is the weighted sum of its four surrounding integer-coordinate pixels:
$$x_{\mathrm{offset}} = \sum_{q \in \{q_{lt},\, q_{rb},\, q_{lb},\, q_{rt}\}} w_q\, x_q$$
where $q$ denotes the integer coordinates and $w_q$ is the interpolation weight, computed from the distance between the sampling point and the integer coordinate.
Subsequently, bilinear interpolation is performed to aggregate features. For the corrected sampling coordinate $\hat{P}_n = (\hat{x}_n, \hat{y}_n)$, the feature value $F(\hat{P}_n)$ is obtained as follows:
$$F(\hat{P}_n) = \sum_{q \in \{q_{11},\, q_{12},\, q_{21},\, q_{22}\}} w_q\, F(q)$$
where $q$ denotes the four integer coordinates surrounding $\hat{P}_n$, and the interpolation weight $w_q$ is calculated as follows:
$$w_q = \big(1 - |\hat{x}_n - x_q|\big) \times \big(1 - |\hat{y}_n - y_q|\big)$$
$w_q$ ensures that pixels closer to $\hat{P}_n$ have a greater impact on the interpolated feature value, facilitating continuous feature sampling for small targets.
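The bilinear resampling above can be sketched as follows; tensor layouts and function names are illustrative assumptions.
```python
import torch


def bilinear_sample(feat: torch.Tensor, px: torch.Tensor, py: torch.Tensor) -> torch.Tensor:
    """feat: (B, C, H, W); px, py: (B, N, H, W) fractional sampling coordinates."""
    B, C, H, W = feat.shape
    px, py = px.clamp(0, W - 1), py.clamp(0, H - 1)
    x0, y0 = px.floor().long(), py.floor().long()                  # top-left neighbour
    x1, y1 = (x0 + 1).clamp(max=W - 1), (y0 + 1).clamp(max=H - 1)

    def gather(ix, iy):
        # index each (b, c) plane of the flattened feature map at (iy, ix)
        idx = (iy * W + ix).flatten(1)                             # (B, N*H*W)
        out = feat.flatten(2).gather(2, idx.unsqueeze(1).expand(-1, C, -1))
        return out.view(B, C, *ix.shape[1:])                       # (B, C, N, H, W)

    wx1, wy1 = px - x0.float(), py - y0.float()                    # fractional parts
    wx0, wy0 = 1 - wx1, 1 - wy1
    return (gather(x0, y0) * (wx0 * wy0).unsqueeze(1)
            + gather(x1, y0) * (wx1 * wy0).unsqueeze(1)
            + gather(x0, y1) * (wx0 * wy1).unsqueeze(1)
            + gather(x1, y1) * (wx1 * wy1).unsqueeze(1))
```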
In summary, LDblock acts as the local feature extraction part of the LDefMambablock module through the co-design of linear parameter growth, arbitrary sampling shape, and dynamic offset adjustment.

3.3.2. SS2D Module

The working principle of the SS2D module is shown in Figure 3c, which is divided into four steps: selective scanning, direction sequence alignment, symmetrical weighted fusion, and spatial reconstruction.
The first step is selective scanning. The selective-scanning mechanism is the core component through which SS2D realizes global dependency modeling: it converts 2D visual features into a linear-complexity sequence modeling problem via three stages of four-way symmetric scanning, state-space transformation, and feature aggregation. The core idea is to transform the spatial structure of the image into sequential relationships through directional decomposition, breaking the quadratic complexity bottleneck of traditional self-attention by exploiting the linear-time characteristics of the state space model (SSM) while retaining the spatial context of visual features.
For the input feature map $x \in \mathbb{R}^{B \times C \times H \times W}$, the SS2D mechanism first expands it into sequences along four symmetric directions (top-down, bottom-up, left-right, and right-left) through a cross-scan operation. The scanning process can be decomposed as follows:
Top-Down: expand row by row along the row dimension from top to bottom, generating the sequence $x_s^{td} \in \mathbb{R}^{B \times C \times (H \times W)}$; Bottom-Up: reverse scan along the row dimension to generate the reverse sequence $x_s^{bu}$; Left-Right: expand column by column along the column dimension from left to right, generating the sequence $x_s^{lr}$; Right-Left: reverse scan along the column dimension to generate the reverse sequence $x_s^{rl}$.
The four directional sequences are spliced into $x_s = [x_s^{td}; x_s^{bu}; x_s^{lr}; x_s^{rl}] \in \mathbb{R}^{B \times 4 \times C \times (H \times W)}$. Through directional symmetry, this operation covers the entire image and provides a multi-perspective sequence representation for subsequent global dependency modeling. Mathematically, a scan in a single direction can be expressed as follows:
$$x_s^{dir} = \mathrm{Scan}(x, dir) = \mathrm{flatten}\big(\mathrm{transpose}(x, dir)\big)$$
where $dir \in \{td, bu, lr, rl\}$, flatten is the flattening operation applied to the feature map, and transpose adjusts the dimension order according to the direction to realize row-by-row (or column-by-column) expansion.
For a single direction (taking top-down as an example), the scanning process can be expressed as follows:
$$S^{dir} = \mathrm{flatten}\big(\mathrm{permute}(X, (0, 2, 1, 3))\big)$$
where $x \in \mathbb{R}^{B \times C \times H \times W}$ is the input feature map ($B$ is the batch size, $C$ the channel number, $H$ the height, and $W$ the width). The permutation $(0, 2, 1, 3)$ reorders the dimensions from $B \times C \times H \times W$ to $B \times H \times C \times W$, placing the height (row) dimension before the channels so that, after flattening, features are traversed row by row from top to bottom, realizing the desired scanning direction. In this way, dimension reordering enables row-wise top-down scanning, and flatten converts the 2D feature map into a 1D sequence $S^{dir} \in \mathbb{R}^{B \times C \times (H \times W)}$.
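A minimal sketch of this four-direction cross-scan is given below; here the reverse scans are implemented as simple flips of the forward sequences, which is one common convention and may differ in detail from the released SS2D code.
```python
import torch


def cross_scan(x: torch.Tensor) -> torch.Tensor:
    """x: (B, C, H, W) -> (B, 4, C, H*W) stacked directional sequences."""
    td = x.flatten(2)                  # row-major: top-down scan
    lr = x.transpose(2, 3).flatten(2)  # column-major: left-right scan
    bu = td.flip(-1)                   # reversed row-major: bottom-up scan
    rl = lr.flip(-1)                   # reversed column-major: right-left scan
    return torch.stack([td, bu, lr, rl], dim=1)
```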
SS2D uses the zero-order hold ZOH method to discretize the continuous state space model and adapt it to sequence processing. The state transfer equation for a continuous system is as follows:
$$h'(t) = A\,h(t) + B\,x(t), \qquad y(t) = C\,h(t)$$
where $A \in \mathbb{R}^{N \times N}$ is the state transition matrix, $B \in \mathbb{R}^{N \times 1}$ is the input weight matrix, and $C \in \mathbb{R}^{1 \times N}$ is the output observation matrix.
The sequence in each direction is transformed by the state space model, and the state transition matrix and the input matrix are discretized by zero-order hold ZOH to satisfy the following:
$$\bar{A} = \exp(\Delta A), \qquad \bar{B} = (\Delta A)^{-1}\big(\exp(\Delta A) - I\big)\cdot \Delta B$$
where $\bar{A}$ and $\bar{B}$ are the discretized state transition matrix and input matrix, $\Delta$ is the timescale parameter, and $I$ is the identity matrix. This recursive process handles the four directional sequences channel by channel, generating hidden state sequences that contain global dependencies.
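For a diagonal state matrix (the form commonly used in Mamba-style SSMs), the zero-order-hold discretization above reduces to the element-wise computation sketched below; this is an illustration, not the exact kernel of the released code.
```python
import torch


def zoh_discretize(A: torch.Tensor, B: torch.Tensor, delta: torch.Tensor):
    """A, B: (D, N) per-channel diagonal state parameters; delta: (D, 1) timescales."""
    dA = delta * A                             # elementwise Delta * A
    A_bar = torch.exp(dA)                      # exp(Delta A)
    B_bar = (A_bar - 1.0) / dA * (delta * B)   # (Delta A)^{-1} (exp(Delta A) - I) * Delta B
    return A_bar, B_bar
```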
The second step is direction alignment: the inverse directional transformation is applied to the four output sequences $y^{td}, y^{bu}, y^{lr}, y^{rl} \in \mathbb{R}^{B \times C \times L}$ to restore the spatial position correspondence.
Top-Down: the scan order is row-first (rows from 1 to $H$, columns from 1 to $W$ within each row). Bottom-Up: the scan order is reversed by row (rows from $H$ to 1, columns from 1 to $W$ within each row). Left-Right: the scan order is column-first (columns from 1 to $W$, rows from 1 to $H$ within each column). Right-Left: the scan order is reversed by column (columns from $W$ to 1, rows from 1 to $H$ within each column). The mapping between the sequence index $t$ and the spatial coordinates $(i, j)$, in the order top-down, bottom-up, left-right, right-left, is
$$t = (i-1)W + (j-1), \quad t = (H-i)W + (j-1), \quad t = (j-1)H + (i-1), \quad t = (W-j)H + (i-1), \qquad i \in [1, H],\ j \in [1, W]$$
The third is symmetric weighted fusion: weighted fusion is used in both vertical and horizontal directions to eliminate directional ambiguity and enhance symmetrical regional response. The complementarity of the vertical direction (row scan) can suppress the row direction noise, and the complementarity of the horizontal direction (column scan) can suppress the column direction noise, and the vertical and horizontal fusion results are dynamically balanced by learnable parameters α . The fusion formula is as follows:
$$Y_{vert} = \tfrac{1}{2}\big(Y^{td} + Y^{bu}\big), \qquad Y_{horiz} = \tfrac{1}{2}\big(Y^{lr} + Y^{rl}\big), \qquad Y_{merge} = \alpha\, Y_{vert} + (1 - \alpha)\, Y_{horiz}$$
where $Y_{vert}$ is the vertical weighted fusion, $Y_{horiz}$ is the horizontal weighted fusion, and $\alpha \in (0, 1)$ is a learnable parameter initialized to 0.5 and optimized by backpropagation.
Finally, the fourth step is spatial reconstruction: the fused sequence is reshaped back to $\mathbb{R}^{B \times C \times H \times W}$ to complete the spatial mapping of the global features. Mathematically, the merge operation can be expressed as follows:
$$y = \mathrm{CrossMerge}\big(\{y^{dir}\}\big) = \mathrm{reshape}\Big(\tfrac{1}{2}\big(y^{td} + y^{bu}\big) + \tfrac{1}{2}\big(y^{lr} + y^{rl}\big)\Big)$$
The mathematical essence of this operation is a linear compression of the spatial dimension: for each channel, the value at each pixel of the target size $(H', W')$ is the mean of the corresponding region of the original feature map:
$$Y_{out}[b, c, i, j] = \frac{1}{hw}\sum_{p=0}^{h-1}\sum_{q=0}^{w-1} Y_{merge}\Big[b,\, c,\, \Big\lfloor \tfrac{iH}{H'} \Big\rfloor + p,\, \Big\lfloor \tfrac{jW}{W'} \Big\rfloor + q\Big]$$
where $b$ is the batch index, $c$ the channel index, $i$ and $j$ the spatial coordinates on the output feature map, $H$ and $W$ the spatial size of the fused sequence, $H'$ and $W'$ the target size of the output feature map after spatial reconstruction, and $h = H/H'$ and $w = W/W'$ the corresponding pooling window sizes.
The process suppresses noise through directional symmetry while retaining global dependencies, so that the output feature map y not only contains local feature information, but also captures long-distance spatial correlations.
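The alignment, weighted fusion, and spatial reconstruction described above can be sketched as follows, matching the scanning convention of the cross-scan sketch earlier; the names and the clamping of alpha are illustrative assumptions.
```python
import torch
import torch.nn as nn


class CrossMergeSketch(nn.Module):
    """Re-align the four directional outputs, fuse them with a learnable alpha, reshape to 2D."""
    def __init__(self):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(0.5))   # initialized to 0.5

    def forward(self, y_td, y_bu, y_lr, y_rl, H: int, W: int):
        """Each input: (B, C, H*W) in its own scanning order."""
        B, C, _ = y_td.shape
        y_bu, y_rl = y_bu.flip(-1), y_rl.flip(-1)      # undo the reversed scans
        # map the column-major sequences back to row-major order
        y_lr = y_lr.view(B, C, W, H).transpose(2, 3).reshape(B, C, H * W)
        y_rl = y_rl.view(B, C, W, H).transpose(2, 3).reshape(B, C, H * W)
        y_vert = 0.5 * (y_td + y_bu)                   # vertical (row-scan) pair
        y_horiz = 0.5 * (y_lr + y_rl)                  # horizontal (column-scan) pair
        a = self.alpha.clamp(0, 1)                     # keep alpha in a valid range
        return (a * y_vert + (1 - a) * y_horiz).view(B, C, H, W)
```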

3.3.3. Res-MLP Block

The working principle of the residual gated MLP module is shown in Figure 3c. First, a linear transformation along the channel dimension is applied to the input feature map, projecting it into a temporary high-dimensional feature space. Suppose the input feature map is $X \in \mathbb{R}^{B \times C \times H \times W}$ (where $B$ is the batch size, $C$ the number of channels, and $H$ and $W$ the height and width). The channel dimension is expanded through a fully connected layer to obtain the feature map $Y_1 \in \mathbb{R}^{B \times C_{mid} \times H \times W}$, where $C_{mid} = C \times r$ ($r$ is the expansion ratio, e.g., $r = 2$). This operation provides more feature combination possibilities for the subsequent nonlinear transformation and explores complex dependencies between channels.
Next, $Y_1$ undergoes a nonlinear transformation through the Hardswish activation function, which is expressed as follows:
$$\mathrm{Hardswish}(x) = x \cdot \max\big(0, \min\big(1, (x+3)/6\big)\big)$$
The Hardswish activation has significant advantages for remote sensing small target detection: in scenes where the proportion of small-target pixels is extremely low and feature values are generally small, the piecewise-linear behavior of Hardswish in the interval $x \in (-3, 3)$ effectively amplifies weak feature signals (such as small-target edges and local textures) and avoids the feature loss caused by activation saturation.
After that, a fully connected layer projects the feature map from the high-dimensional intermediate space back to the original number of channels, yielding $Y_3 \in \mathbb{R}^{B \times C \times H \times W}$. $Y_3$ is then added element-wise to the original input $X$ to form a residual connection. This step alleviates gradient vanishing, allowing the model to propagate information better through deep structures while retaining the key features of the original input; the final output feature map is $Y_{out} = X + Y_3$.
The gating mechanism plays an important role in this process. In the improved residual gated MLP module, a gating unit is introduced before the residual connection. It splits $Y_3$ into two parts through a linear transformation: one part generates a gating signal $G \in \mathbb{R}^{B \times C \times H \times W}$ through an activation function (each element of $G$ lies between 0 and 1 and controls how much information passes), and the other part is multiplied element-wise by $G$ and then added to $X$. In this way, the gating signal lets the model adaptively determine how much information from the original input $X$ to retain and how much new information to introduce from the MLP-processed features $Y_3$, further improving its ability to select and fuse features at different levels and to express complex data patterns. This is especially valuable in remote sensing and small object detection tasks, which require accurate capture of subtle features and are sensitive to background noise: the module focuses better on target features, suppresses useless background information, and contributes to a lightweight design.
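A compact sketch of this residual gated MLP is given below; the use of 1 × 1 convolutions as the channel-wise fully connected layers, the sigmoid gate, and the exact split of the projected features are assumptions for illustration.
```python
import torch
import torch.nn as nn


class ResGatedMLP(nn.Module):
    """Channel expansion -> Hardswish -> projection to value/gate halves -> gated residual."""
    def __init__(self, c: int, r: int = 2):
        super().__init__()
        self.up = nn.Conv2d(c, c * r, 1)        # 1x1 conv acting as a per-pixel FC layer
        self.act = nn.Hardswish()
        self.down = nn.Conv2d(c * r, 2 * c, 1)  # produces the value and gate halves

    def forward(self, x):
        y = self.down(self.act(self.up(x)))
        value, gate = y.chunk(2, dim=1)         # split channels into two parts
        return x + value * torch.sigmoid(gate)  # gate controls how much new information passes
```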

4. Experiments

This section describes extensive experiments conducted to evaluate the effectiveness and performance of the model in remote sensing target detection. First, the datasets used in the experiments are briefly introduced. Next, the experimental setup and evaluation metrics are explained. Finally, the results of the ablation studies and comparative experiments are provided, followed by an analysis of the observed phenomena and trends.

4.1. Dataset

The DOTA-v1.0 dataset [40] is a large-scale object detection dataset for optical remote sensing imagery comprising 2806 images with a total of 188,282 object instances annotated with oriented bounding boxes. Image sizes range from 800 × 800 to 4000 × 4000 pixels. The objects are divided into 15 categories: planes, baseball diamonds, bridges, ground track fields, small vehicles, large vehicles, ships, tennis courts, basketball courts, storage tanks, soccer ball fields, roundabouts, harbors, swimming pools, and helicopters. Each image contains an average of about 67 annotated object instances.
The Vehicle Detection in Aerial Imagery (VEDAI) dataset [41] is a comprehensive dataset specifically tailored for vehicle detection in aerial imagery. It contains 1553 images with a total of 3575 object instances divided into nine categories: cars, trucks, boats, tractors, campers, pickups, airplanes, vans, and others. The dataset offers images in 512 × 512 and 1024 × 1024 resolutions. Object dimensions are primarily concentrated in the range of 12 × 12–24 × 24 pixels, and each resolution includes co-registered visible and infrared images, catering to diverse image analysis needs. Each image contains an average of roughly 2.3 objects.
USOD is a dataset based on UNICORN 2008 [42] with detailed manual annotations of small and medium-sized vehicles. The training set has 3000 images at sizes of 416 × 416 and 640 × 640. Object dimensions are mainly concentrated in the range of 8 × 8–16 × 16 pixels, and each image contains an average of 14.5 objects.
The details of DOTA-v1.0, VEDAI, and USOD are shown in Table 1.
The datasets used in this study are publicly available from their official repositories:
DOTA-v1.0 at https://captain-whu.github.io/DOTA/dataset.html (accessed on 2 July 2025) under the CC BY-NC-SA 4.0 License.
VEDAI can be accessed at https://downloads.greyc.fr/vedai/ (accessed on 2 July 2025) under a custom non-commercial license for academic research.
USOD can be accessed at https://github.com/AFRL-RY/data-unicorn-2008 (accessed on 2 July 2025) under the MIT License.

4.2. Implementation Details and Evaluation Metrics

In our experiments, we evaluate the directional object detection model using three datasets: DOTA-v1.0, VEDAI, and USOD. To ensure fair comparisons, we adhere to the dataset processing methods used in other mainstream studies [43] while strictly avoiding train/val overlap to eliminate data leakage risks. For DOTA-v1.0, we first followed its official split [40] to retain the original training set (1411 images) and validation set (458 images) without any cross-subset mixing. We then performed multi-scale cropping in a subset-exclusive manner: we scaled the original images of the training set at three ratios (0.5, 1.0, 1.5) and cropped each scaled image into 1024 × 1024 patches, and independently applied the same scaling and cropping rules to the validation set images, with no data interaction between the two subsets. After cropping, we added a spatial overlap check: we calculated the intersection-over-union (IoU) of spatial coordinates (relative to the original image) between each training patch and validation patch, discarding and re-cropping patches with IoU > 0.1 to avoid meaningful feature overlap. We also tracked the source ID of each original image for all cropped patches, confirming no validation-set parent IDs appeared in the training patch pool and vice versa. Through these measures, we finally obtained 15,749 training patches and 5297 validation patches, ensuring the integrity of small target features (target pixel size mainly concentrated in 20 × 20–300 × 300 pixels) while maintaining strict dataset separation.
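The overlap check can be sketched as follows: patch extents are mapped back to original-image coordinates (undoing the scale factor) and the IoU between two patches is computed there; the data layout and helper names are illustrative, not the actual preprocessing script.
```python
def to_original_coords(x, y, size, scale):
    """Map a patch cropped at (x, y) with side `size` from a scaled image back to original coords."""
    return (x / scale, y / scale, (x + size) / scale, (y + size) / scale)


def patch_iou(a, b):
    """a, b: (x1, y1, x2, y2) boxes in original-image coordinates."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)


# a training patch would be discarded and re-cropped if its IoU with a validation patch exceeds 0.1
train_patch = to_original_coords(0, 0, 1024, 0.5)
val_patch = to_original_coords(512, 512, 1024, 0.5)
print(patch_iou(train_patch, val_patch) > 0.1)
```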
For the VEDAI dataset, all category annotations are retained to preserve the original data distribution, in which the target pixel size is mainly concentrated in the range of 12 × 12–24 × 24 pixels and small targets appear in their natural distribution against complex backgrounds. A total of 1224 training images and 309 validation images were obtained.
For the USOD dataset, the training and validation images are organized into two sizes, 416 × 416 and 640 × 640, and the original distribution of small targets (8 × 8–16 × 16 pixels) is maintained. A total of 2100 training images and 900 validation images were obtained.
The experiments were conducted on Windows 11 in a deep learning environment with CUDA 12.6 and the PyTorch 2.6.0 framework, trained on an NVIDIA RTX 4060 GPU. The YOLOv11 algorithm scales with network width and depth, yielding multiple variants: YOLOv11n, YOLOv11s, YOLOv11m, YOLOv11l, and YOLOv11x. YOLOv11n is selected as the baseline and starting point for the network improvements.
To compare with SOTA results and ensure the comparability and validity of the experimental results, we unified the training duration for both the main experiments and the ablation studies at 300 epochs. Mosaic augmentation is used to enrich the input images, the initial learning rate is set to 0.01, and the SGD optimizer is used to optimize the total loss function. The training batch size is set to eight and the momentum to 0.937, and box_loss (bounding box loss), cls_loss (classification loss), and dfl_loss (distribution focal loss) are used as the YOLOv11 loss functions. To mitigate overfitting, early stopping was employed during training. For evaluation, our experiments use standard metrics such as precision (P), recall (R), average precision (AP), per-image inference speed, and model memory footprint.
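Assuming the standard Ultralytics training interface, the configuration described above corresponds roughly to the following call; the model and dataset YAML names are placeholders.
```python
from ultralytics import YOLO

model = YOLO("gmg-ldefmamba-yolo.yaml")   # hypothetical custom model definition
model.train(
    data="dota-v1.0.yaml",                # dataset config file (placeholder)
    epochs=300, batch=8,
    optimizer="SGD", lr0=0.01, momentum=0.937,
    mosaic=1.0,                           # Mosaic augmentation enabled
    patience=50,                          # early stopping patience (value assumed)
)
```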
The formula for calculating precision is as follows:
$$P = \frac{N_{TP}}{N_{TP} + N_{FP}}$$
The formula for calculating the recall rate is as follows:
$$R = \frac{N_{TP}}{N_{TP} + N_{FN}}$$
The formula for calculating average precision is as follows:
$$AP = \int_0^1 P(R)\, dR$$
where P(R) is the precision–recall curve formula.
Our experimental evaluation metrics include mean average precision (mAP), model parameter count (parameters), and model computational complexity (FLOPs).
mAP is a comprehensive metric obtained by averaging the AP values, where each AP is computed by integrating the area under the precision–recall curve of a category. Therefore, mAP can be calculated as follows:
$$mAP = \frac{1}{N}\sum_{i=1}^{N} AP_i$$
where N is the number of categories.
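The metric definitions above can be sketched as follows, using a simple step integral of the precision–recall curve for AP and averaging AP over the N categories for mAP; detection toolkits typically use interpolated variants of this integral.
```python
import numpy as np


def average_precision(tp: np.ndarray, fp: np.ndarray, n_gt: int) -> float:
    """tp, fp: per-detection 0/1 flags sorted by descending confidence; n_gt: ground truths."""
    tp_cum, fp_cum = np.cumsum(tp), np.cumsum(fp)
    recall = tp_cum / max(n_gt, 1)
    precision = tp_cum / np.maximum(tp_cum + fp_cum, 1e-9)
    # area under the precision-recall curve (step integral)
    return float(np.sum(np.diff(recall, prepend=0.0) * precision))


def mean_average_precision(ap_per_class) -> float:
    """mAP: mean of the per-category AP values."""
    return float(np.mean(ap_per_class))
```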
The parameters and FLOPs are calculated by providing the same batch of images to the model. Parameters refers to the total number of learnable parameters in a model, while FLOPs measures computational complexity, representing the number of floating-point operations required to process a given input.

4.3. Ablation Study

Our proposed GMG-LDefmamba-YOLO consists of two key components: the GMGblock and the LDefMambablock. To verify the effectiveness of these components and ensure the rigor and generalizability of experimental results, we conducted ablation experiments with YOLOv11 [24] as the baseline model and simultaneously selected the above three representative remote sensing datasets for testing. All ablation experiments followed a unified protocol (300 training epochs, SGD optimizer with momentum 0.937, batch size 8).

4.3.1. Analysis of the GMGblock on Branch Design and Gaussian Mask

As shown in Table 2, we studied the kernel design of the branch splicing strategy in GMGblock. C/4*4 is a four-branch asymmetric receptive field (half of the gear). C/4*8/2 uses C/4 channels per branch, so the eight-branch splice yields 2C channels, which are then compressed to 1C by a 1 × 1 convolution. C/16*8*2 uses C/16 channels per branch, so the eight-branch splice yields C/2 channels, which are restored to 1C by a 1 × 1 convolution. C/8*8 is the eight-branch symmetric convolutional splice described in the text, forming the gear-shaped receptive field. Figure 4 shows the design of the different GMGblock branch configurations, where 'a' corresponds to C/4*4, 'b' to C/8*8, 'c' to C/4*8/2, and 'd' to C/16*8*2.
The results show that eight-branch convolution enhances symmetry and achieves better accuracy than four-branch convolution. Among the configurations, the C/8 × 8 branching strategy attains the best performance: on USOD it reaches 90.35% mAP50 with 68 GFLOPs and 16.88 M parameters; on DOTA-v1.0 it obtains 70.91% mAP50 with 79 GFLOPs and 18.59 M parameters; and on VEDAI it achieves 77.94% mAP50 with 77 GFLOPs and 17.94 M parameters. The C/16 × 8 × 2 strategy has the lowest computational cost (e.g., 76 GFLOPs on USOD) but poor accuracy (89.77% mAP50 on USOD). The C/4 × 8/2 strategy performs well in neither speed nor accuracy (e.g., 80 GFLOPs and 89.98% mAP50 on USOD), which we attribute to the increased computational load of the larger channel count and to the loss of some features during the 1 × 1 convolutional compression. These findings suggest that the C/8 × 8 branching strategy achieves the best trade-off between speed and precision, maximizing mAP50 while keeping FLOPs at a reasonable level. It effectively splices the symmetric gear-shaped receptive field while avoiding excessive computational overhead, making it an optimal choice for balancing small-target detection accuracy and inference efficiency across multiple datasets. We also note that the FLOPs on USOD are significantly smaller than on VEDAI: USOD images are smaller (416 × 416 and 640 × 640 in the training set), only a single class is detected, and the targets are smaller (mainly 8 × 8–16 × 16 pixels), whereas VEDAI images have resolutions of 512 × 512 and 1024 × 1024, the targets are relatively larger, and the model must detect multiple target classes simultaneously; these dataset characteristics alone account for the significant difference in computational volume.
In addition, it is crucial to isolate the effect of the Gaussian mask. As shown in Table 3, we compared two configurations of the GMGblock on the datasets: the full module and a variant without the Gaussian mask (GMGblock w/o Gaussian mask), which retains the C/8*8 branch design (verified as optimal in the previous experiments) but removes the Gaussian mask operation, so that feature maps are processed only by the gear-shaped convolutional branches without the spatial attention weighting of the Gaussian mask.
The experiments show that equipping the module with the Gaussian kernel improves accuracy while incurring only a minimal speed overhead. We attribute this to the fact that the Gaussian kernel is statically defined and applied as a direct weighting; such weight-overlay processing typically adds only linear complexity and does not burden inference speed. This confirms the suitability of the Gaussian mask block for this improvement.

4.3.2. Analysis of the GMGblock on Kernel Sizes

GMGblock uses a horizontal convolutional kernel with a kernel size of 1 × k and a vertical convolutional kernel with a kernel size of k × 1 in an eight-branch expanded convolution, which directly affects the range of the receptive field and the ability to capture small target features. In order to verify the influence of kernel size on remote sensing small target detection, we took these datasets as the test object, fixed the branching strategy as the optimal C/8 × 8, only adjusted the convolutional kernel size (k = 3, 5, 7, 11), and compared the model parameters (Paras), computational power (FLOPs), and detection accuracy (mAP50) under different configurations, and the results are shown in Table 4.
The results show that the convolutional kernel size of 3 (1 × 3 and 3 × 1) has the best detection performance for remote sensing small targets in GMGblock. On USOD, it achieves 90.35% mAP50 with 68 GFLOPs and 16.88M parameters; on DOTA-v1.0, 70.91% mAP50 with 79 GFLOPs and 18.59 M parameters; and on VEDAI, 77.94% mAP50 with 77 GFLOPs and 17.94 M parameters. Its receptive field (about 40–60 pixels) matches the size of remote sensing small targets. Although increasing the kernel size can cover a wider range, it introduces redundant background information, increases the computational burden, and reduces the discriminability of small target features. When the convolutional kernel size is 5, redundant background is introduced, and mAP50 drops to 89.45% (USOD), 69.89% (DOTA-v1.0), and 76.85% (VEDAI). When the kernel size is 7, the receptive field is too large, leading to feature dilution, with mAP50 at 89.98% (USOD), 70.32% (DOTA-v1.0), and 77.00% (VEDAI). When the kernel size is 11, the coverage is excessively wide, causing spatial sparsification, and mAP50 is only 87.62% (USOD), 69.40% (DOTA-v1.0), and 76.13% (VEDAI), while computational volume surges (e.g., 83 GFLOPs and 26.44 M parameters on USOD). This verifies the rationality of GMGblock’s adaptation to the characteristics of small remote sensing targets and conforms to the design logic of “heavy central weight”.

4.3.3. Analysis of the LDefMambablock on Necessity of Each Component

As mentioned earlier, LDefMambablock can be decoupled into three independent functional components: the linear deformable sampling LDblock, the spatial state dual model SS2D, and the residual gated MLP module, where the MLP works together with the SS2D structure as the conditioning branch of the Mamba gating mechanism. In this section, we examine each module of LDefMambablock independently; before verifying the necessity of the individual components, we downsample using conventional convolution to assess the impact of each component on accuracy. The USOD dataset is used as the test object. As shown in Table 5, we replaced the relevant modules of the original YOLOv11 with LDefMambablock; a, b, c, and d denote different component configurations of LDefMambablock, and we study these configurations to verify the effectiveness of each component. A '√' in the table indicates that the component is integrated, while '×' indicates that it is replaced by standard convolution. The schematic diagrams of the alternative structures are shown in Figure 5.
From the results, when LDefMambablock does not integrate any components (LDblock, SS2D, and MLP are all replaced by standard convolution, corresponding to structure "a"), the model achieves 88.87% mAP50 with a speed of 3.5 ms and memory usage of 1257 MB. When only LDblock is integrated (structure "b"), mAP50 increases to 89.59% (speed: 3.6 ms, memory: 1289 MB), indicating the effectiveness of linear deformable sampling for local feature capture. When SS2D and MLP are integrated (structure "c"), mAP50 is 89.83% (speed: 3.9 ms, memory: 1365 MB), demonstrating the effectiveness of global dependency modeling combined with the gating regulation mechanism for background feature discrimination. When all three components are integrated (structure "d"), mAP50 reaches its highest value of 90.35% (speed: 4.2 ms, memory: 1418 MB), significantly higher than the other combinations. This verifies the necessity of the synergy among LDblock's local features, SS2D's global dependency, and the MLP's gating regulation, whose complementary mechanisms effectively improve remote sensing small target detection, while the increases in speed and memory remain within a reasonable range for the performance gain.
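A minimal sketch of how the ablation variants in Table 5 can be assembled is given below: each slot of the block holds either the real component or a standard-convolution replacement. The ResidualGatedMLP design and the local → global → gated-mix ordering are assumptions made for illustration; LDblock and SS2D are represented by convolutional stand-ins because their internals are defined elsewhere in the paper.

```python
import torch
import torch.nn as nn

def conv_stub(ch: int) -> nn.Module:
    """Standard 3x3 convolution used when a component is ablated ('×' in Table 5)."""
    return nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch), nn.SiLU())

class ResidualGatedMLP(nn.Module):
    """Hypothetical residual gated MLP: a channel MLP whose output is modulated
    by a sigmoid gate and added back to the input (an assumption, not the
    authors' exact design)."""
    def __init__(self, ch: int, expansion: int = 2):
        super().__init__()
        hidden = ch * expansion
        self.fc = nn.Sequential(nn.Conv2d(ch, hidden, 1), nn.SiLU(), nn.Conv2d(hidden, ch, 1))
        self.gate = nn.Sequential(nn.Conv2d(ch, ch, 1), nn.Sigmoid())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.fc(x) * self.gate(x)

class LDefMambaVariant(nn.Module):
    """Builds the ablation structures of Table 5: each slot holds either the
    real component (passed in by the caller) or a standard-conv replacement."""
    def __init__(self, ch: int, ldblock=None, ss2d=None, mlp=None):
        super().__init__()
        self.local = ldblock if ldblock is not None else conv_stub(ch)  # LDblock slot
        self.globl = ss2d if ss2d is not None else conv_stub(ch)        # SS2D slot
        self.mix = mlp if mlp is not None else conv_stub(ch)            # gated-MLP slot

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.mix(self.globl(self.local(x)))

# Structure (a): all slots ablated; structure (d) would plug in the real
# LDblock, SS2D, and gated-MLP modules.
variant_a = LDefMambaVariant(64)
variant_d = LDefMambaVariant(64, mlp=ResidualGatedMLP(64))
print(variant_a(torch.randn(1, 64, 40, 40)).shape)  # torch.Size([1, 64, 40, 40])
```

This mirrors how the ablation keeps the surrounding YOLOv11 architecture fixed and only swaps the internal slots, so that accuracy differences in Table 5 can be attributed to the individual components.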

4.4. Comparisons with Previous Methods

In order to further test the effectiveness of GMG-LDefmamba-YOLO, it was compared with the original YOLOv11 algorithm and other mainstream detection methods, including DSSD [44], RefineDet [45], YOLOv3 [46], YOLOv5 [47], YOLOv8 [48], YOLOv11 [24], YOLOv12 [49], YOLOv13 [50], Gold-YOLO [51], BGF-YOLO [52], FFCA-YOLO [53], DINO [54], DNTR [14], RFLA [55], Mamba-YOLO [21], and FBRT-YOLO [56]. DSSD and RefineDet are classic and efficient object detectors: RefineDet is a hybrid structure that combines the advantages of one-stage and two-stage detection, while DSSD is an enhanced version of SSD designed to improve small object detection accuracy. YOLOv3, YOLOv5, YOLOv8, YOLOv11, YOLOv12, and YOLOv13 are cornerstone algorithms of the YOLO family, while Gold-YOLO, BGF-YOLO, and FFCA-YOLO represent advances built on the YOLO framework: Gold-YOLO and BGF-YOLO enhance the network's multi-scale feature fusion capabilities, and FFCA-YOLO improves the detection of small objects. DINO and DNTR are transformer-based detectors adapted for remote sensing object detection, and RFLA is a label assignment method tailored for the same purpose; the RFLA variant used in our comparison is built on Cascade R-CNN. DNTR, RFLA, TPH-YOLOv5, and FFCA-YOLO are all recently published remote sensing object detection algorithms. In addition, Mamba-YOLO and FBRT-YOLO are the latest remote sensing YOLO detectors released in 2025: Mamba-YOLO is the first algorithm to combine Mamba with YOLO and achieve state-of-the-art results, and FBRT-YOLO is a recent further exploration of remote sensing small target detection.
DOTA-v1.0 experimental results: We compare GMG-LDefmamba-YOLO with the above mainstream detection methods on the DOTA-v1.0 dataset; the reported DOTA results are obtained by submitting the predictions to the official evaluation server. As shown in Table 6, GMG-LDefmamba-YOLO achieves state-of-the-art performance: its mAP50 reaches 70.91%, which is 0.58 percentage points higher than Mamba-YOLO, 0.33 percentage points higher than FBRT-YOLO, 2.48 percentage points higher than YOLOv12, and 0.86 percentage points higher than YOLOv13; it also achieves an excellent mAP50:95 of 50.39%, outperforming Mamba-YOLO by 0.63 percentage points, FBRT-YOLO by 0.46 percentage points, YOLOv12 by 2.61 percentage points, and YOLOv13 by 0.76 percentage points. This demonstrates the effectiveness and efficiency of the proposed GMG-LDefmamba-YOLO model, which shows high recognition ability and stable localization performance for complex objects in remote sensing images. Figure 6 and Figure 7 show a line chart of detection accuracy by category and a bar chart of overall detection accuracy, respectively. Figure 8 shows the confusion matrix generated by the model on the DOTA-v1.0 test set, where the rows represent the true category and the columns the predicted category; the diagonal elements from top left to bottom right indicate how accurately the model classifies each category.
Experimental results on VEDAI: Targets in VEDAI occupy very few pixels and span diverse categories, which challenges the robustness and generalization of the model. We evaluated GMG-LDefmamba-YOLO against the state-of-the-art methods listed in Table 7. Both the mAP50 and the FLOPs of GMG-LDefmamba-YOLO are the best, and its parameter count (17.94 M) remains close to that of the YOLOv11 baseline. This shows that combining the gating mechanism with Mamba has only a modest impact on the computational cost of the YOLO hierarchy while greatly improving detection performance and speed. Figure 9 shows bar charts comparing the mAP50, parameter, and FLOPs metrics on the VEDAI dataset.
Experimental results on USOD: The USOD dataset contains a single instance type with very small targets and a large average number of targets per image, which tests the model's small target detection ability, dense target discrimination, and detection efficiency. Table 8 compares precision, recall, mAP50, speed, and memory against the state-of-the-art methods on the USOD dataset. GMG-LDefmamba-YOLO outperforms the previous methods, confirming the excellent small object detection capability of our improved model.
Additionally, breaking down detection accuracy by object scale is critical for validating the effectiveness of our method—especially for remote sensing small target detection, where performance differences across scales directly reflect the model’s ability to address the core challenge of “small target feature sparsity”. To solidify the claims, we conducted scale-specific accuracy analysis as shown in Table 9. The minimum bounding box (MBB) area of the target is defined as the smallest axis-aligned rectangle that can completely enclose the target area in the remote sensing image. Aligned with remote sensing target detection standards (following the DOTA dataset’s scale division and industry common practices), we divided object scales into four categories based on the minimum bounding box (MBB) area of targets in the test datasets (DOTA-v1.0, VEDAI, USOD): small targets (MBB area < 32 × 32 pixels, e.g., small vehicles, streetlights in DOTA-v1.0; 8 × 8–16 × 16 pixel targets in USOD), medium targets (32 × 32 ≤ MBB area < 96 × 96 pixels, e.g., ships, airplanes in DOTA-v1.0; 20 × 20–24 × 24 pixel targets in VEDAI), large targets (96 × 96 ≤ MBB area < 256 × 256 pixels, e.g., buildings, bridges in DOTA-v1.0), and extra-large targets (MBB area ≥ 256 × 256 pixels, e.g., large-scale roads, industrial facilities in DOTA-v1.0). The results demonstrate that our method effectively improves detection accuracy for objects of different scales, proving that this improvement, achieved by combining ‘global and local’ features, enhances object detection capabilities in the field of remote sensing.
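As a reference for reproducing this breakdown, the short helper below assigns a target to one of the four scale buckets from its minimum-bounding-box width and height, using exactly the area thresholds listed above; the function name and interface are our own and are only a sketch of the bucketing rule.

```python
def scale_bucket(w: float, h: float) -> str:
    """Classify a target by its minimum-bounding-box (MBB) area, using the
    thresholds of Table 9: 32x32, 96x96, and 256x256 pixels."""
    area = w * h
    if area < 32 * 32:
        return "small"
    elif area < 96 * 96:
        return "medium"
    elif area < 256 * 256:
        return "large"
    return "extra-large"

# e.g. a 20 x 20 vehicle in VEDAI falls in the 'small' bucket,
# while a 120 x 90 building falls in the 'large' bucket.
print(scale_bucket(20, 20))   # small
print(scale_bucket(120, 90))  # large
```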
Overall, the experimental results show that the proposed GMG-LDefmamba-YOLO method not only achieves excellent performance on the DOTA-v1.0 dataset but also shows excellent detection performance on the VEDAI and USOD datasets, especially in small-target-dense scenarios. GMG-LDefmamba-YOLO significantly improves the detection accuracy of remote sensing targets, mainly for the following reasons: (1) The proposed GMGblock module constructs a gear-like receptive field with heavy central weight and decreasing peripheral weight through mirror-symmetrical four-way padding convolution and a Gaussian mask mechanism, which matches the characteristics of small targets (reliance on central features and sparse spatial distribution) and enhances their discriminative features. (2) The designed LDefMambablock module integrates the LDblock and SS2D components: LDblock captures local features flexibly through linear deformable sampling and adapts to target scale changes; SS2D models global dependence through four-way symmetric scanning and state-space transformation while maintaining linear computational complexity, effectively balancing detection accuracy and computational efficiency. (3) The improved neck feature pyramid structure replaces the traditional convolutional module with GMGblock, which strengthens multi-scale feature fusion and in particular improves the transfer and aggregation of small target features in deep networks.
Detection visualization results on the DOTA-v1.0, VEDAI, and USOD datasets are shown in Figure 10 and Figure 11.
To comprehensively understand the limitations of our proposed model and address potential robustness concerns, we conducted a detailed failure analysis. We specifically focused on identifying worst-case scenarios, including high-confidence false positives and challenging missed detections (Figure 12).
As the results show, our model faces challenges in specific scenarios. For example, white vehicles in VEDAI scenes and black vehicles in USOD scenes are prone to missed detections—this is directly tied to their high similarity to surrounding backgrounds: white vehicles blend with white road shoulders (low color contrast), while black vehicles match the dark texture of farmland (consistent grayscale distribution). Additionally, false positives occur when the model misidentifies background textures (e.g., grid-like farmland ridges in USOD, light-colored road markings in VEDAI) that share structural patterns with target features (e.g., small vehicle outlines). These situations suggest that while our model performs well overall, further refinement could involve enhancing the feature discrimination ability, perhaps by incorporating more context-aware mechanisms to better distinguish targets from similar background elements, so as to boost its performance in such edge cases.

5. Conclusions

In this paper, GMG-LDefmamba-YOLO, an improved model based on the YOLOv11 framework, is proposed to address the core challenges of remote sensing image target recognition, such as dense distribution, feature sparsity, drastic scale changes, and complex background interference. It contains two core modules, GMGblock and LDefMambablock, which refine the traditional YOLO framework from the perspectives of convolutional splicing, deformable convolutional sampling, the selective scanning mechanism, global modeling with the spatial state dual model, and lightweight residual gating.
GMGblock forms a gear-like symmetric receptive field through mirror-symmetrical eight-way padding convolution branches and combines a Gaussian mask that dynamically modulates feature weights, enhancing the extraction of small target center features and suppressing complex background interference. LDefMambablock integrates the linear deformable sampling block (LDblock), the spatial state dual model (SS2D), and a residual gated MLP, combining flexible local feature capture with efficient global dependence modeling, dynamically adapting to target scale changes and spatial distribution, and reducing the parameter and computation cost for a lightweight design. Experiments on the DOTA-v1.0, VEDAI, and USOD datasets show that GMG-LDefmamba-YOLO reaches 70.91%, 77.94%, and 90.28% mAP50, respectively, improving on the baseline YOLOv11 and mainstream methods (e.g., Mamba-YOLO and FBRT-YOLO) while maintaining lightweight characteristics (17.94 M parameters and 77 GFLOPs on VEDAI), lower than the computational cost of several comparison models. The comparative experiments show that the model achieves a better balance between detection accuracy and parameter efficiency. The visualization results further confirm that GMG-LDefmamba-YOLO significantly improves detection confidence for dense small targets, complex backgrounds, and multi-scale scenarios while reducing computational complexity and computing power requirements, providing efficient and reliable technical support for image analysis in remote sensing monitoring, UAV inspection, and other fields.
Although GMG-LDefmamba-YOLO shows superior detection performance in this study, its improvement on large targets is limited, and there is still room to optimize the scanning strategy and to further lighten the model for faster recognition. Future work can explore combining this method with large-kernel convolution, wavelet transforms, and other techniques, and can design dynamically adjustable scanning paths according to target size or local image complexity to improve generalization across target sizes in remote sensing detection. Integrating multi-modal features (such as infrared and visible light images) and noise suppression techniques could also be explored to improve model performance in more extreme, low signal-to-noise scenarios.

Author Contributions

Conceptualization, Y.Y. and L.Y.; methodology, Y.Y. and J.W.; validation, Y.Y., J.W. and J.L.; writing—original draft preparation, Y.Y., L.Y. and X.T.; writing—review and editing, J.L.; visualization, X.T.; supervision, J.W.; project administration, L.Y.; funding acquisition, L.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Natural Science Foundation of China under Grant (62472149) and the Hubei Province Students Innovation and Entrepreneurship Training Program under Grant (20250100154).

Data Availability Statement

The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Li, W.; Chen, Y.; Hu, K.; Zhu, J. Oriented reppoints for aerial object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 1829–1838. [Google Scholar]
  2. Bouguettaya, A.; Zarzour, H.; Kechida, A.; Taberkit, A.M. Vehicle detection from UAV imagery with deep learning: A review. IEEE Trans. Neural Netw. Learn. Syst. 2021, 33, 6047–6067. [Google Scholar] [CrossRef]
  3. Zheng, Z.; Yuan, J.; Yao, W.; Kwan, P.; Yao, H.; Liu, Q.; Guo, L. Fusion of UAV-acquired visible images and multispectral data by applying machine-learning methods in crop classification. Agronomy 2024, 14, 2670. [Google Scholar] [CrossRef]
  4. Monteiro, J.G.; Jiménez, J.L.; Gizzi, F.; Přikryl, P.; Lefcheck, J.S.; Santos, R.S.; Canning-Clode, J. Novel Approach to Enhance Coastal Habitat and Biotope Mapping with Drone Aerial Imagery Analysis. Sci. Rep. 2021, 11, 574. [Google Scholar] [CrossRef] [PubMed]
  5. Kyrkou, C.; Theocharides, T. EmergencyNet: Efficient Aerial Image Classification for Drone-Based Emergency Monitoring Using Atrous Convolutional Feature Fusion. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2020, 13, 1687–1699. [Google Scholar] [CrossRef]
  6. Zong, H.; Pu, H.; Zhang, H.; Wang, X.; Zhong, Z.; Jiao, Z. Small Object Detection in UAV Image Based on Slicing Aided Module. In Proceedings of the 2022 IEEE 4th International Conference on Power, Intelligent Computing and Systems (ICPICS), Shenyang, China, 29–31 July 2022. [Google Scholar]
  7. Zhao, X.; Xia, Y.; Zhang, W.; Zheng, C.; Zhang, Z. YOLO-ViT-Based Method for Unmanned Aerial Vehicle Infrared Vehicle Target Detection. Remote Sens. 2023, 15, 3778. [Google Scholar] [CrossRef]
  8. Chen, J.; Wang, G.; Luo, L.; Gong, W.; Cheng, Z. Building Area Estimation in Drone Aerial Images Based on Mask R-CNN. IEEE Geosci. Remote Sens. Lett. 2021, 18, 891–894. [Google Scholar] [CrossRef]
  9. Hmidani, O.; Ismaili Alaoui, E.M. A Comprehensive Survey of the R-CNN Family for Object Detection. In Proceedings of the 2022 5th International Conference on Advanced Communication Technologies and Networking (CommNet), Marrakech, Morocco, 12–14 December 2022; pp. 1–6. [Google Scholar]
  10. Xu, J.; Ren, H.; Cai, S.; Zhang, X. An Improved Faster R-CNN Algorithm for Assisted Detection of Lung Nodules. Comput. Biol. Med. 2023, 153, 106470. [Google Scholar] [CrossRef]
  11. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. Ssd: Single shot multibox detector. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Springer: Berlin/Heidelberg, Germany, 2016; pp. 21–37. [Google Scholar]
  12. Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
  13. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  14. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 213–229. [Google Scholar]
  15. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 10012–10022. [Google Scholar]
  16. Mehta, S.; Rastegari, M. MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer. arXiv 2021, arXiv:2110.02178. [Google Scholar]
  17. Chen, Z.; Zhong, F.; Luo, Q.; Zhang, X.; Zheng, Y. EdgeViT: Efficient visual modeling for edge computing. In Proceedings of the International Conference on Wireless Algorithms, Systems, and Applications, Nanjing, China, 25–27 June 2021; Springer: Berlin/Heidelberg, Germany, 2021; pp. 393–405. [Google Scholar]
  18. Li, Y.; Hu, J.; Wen, Y.; Evangelidis, G.; Salahi, K.; Wang, Y.; Tulyakov, S.; Ren, J. Rethinking Vision Transformers for MobileNet Size and Speed. In Proceedings of the IEEE International Conference on Computer Vision, Paris, France, 2–3 October 2023. [Google Scholar]
  19. Gu, A.; Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. arXiv 2023, arXiv:2312.00752. [Google Scholar] [CrossRef]
  20. Yan, L.; He, Z.; Zhang, Z.; Xie, G. LS-MambaNet: Integrating Large Strip Convolution and Mamba Network for Remote Sensing Object Detection. Remote Sens. 2025, 17, 1721. [Google Scholar] [CrossRef]
  21. Wang, Z.; Li, C.; Xu, H.; Zhu, X.; Li, H. Mamba YOLO: A Simple Baseline for Object Detection with State Space Model. Proc. AAAI Conf. Artif. Intell. 2025, 39, 8205–8213. [Google Scholar] [CrossRef]
  22. Yang, J.; Liu, S.; Wu, J.; Su, X.; Hai, N.; Huang, X. Pinwheel-shaped Convolution and Scale-based Dynamic Loss for Infrared Small Target Detection. Proc. AAAI Conf. Artif. Intell. 2025, 39, 9202–9210. [Google Scholar] [CrossRef]
  23. Xiong, Y.; Li, Z.; Chen, Y.; Wang, F.; Zhu, X.; Luo, J.; Dai, J. Efficient deformable convnets: Rethinking dynamic and sparse operator for vision applications. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 5652–5661. [Google Scholar]
  24. Khanam, R.; Hussain, M. Yolov11: An overview of the key architectural enhancements. arXiv 2024, arXiv:2410.17725. [Google Scholar] [CrossRef]
  25. Lin, C.; Hu, X.; Zhan, Y.; Hao, X. MobileNetV2 with Spatial Attention module for traffic congestion recognition in surveillance images. Expert Syst. Appl. 2024, 255, 14. [Google Scholar] [CrossRef]
  26. He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 1904–1916. [Google Scholar] [CrossRef]
  27. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path aggregation network for instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8759–8768. [Google Scholar]
  28. Li, X.; Wang, W.; Wu, L.; Chen, S.; Hu, X.; Li, J.; Tang, J.; Yang, J. Generalized focal loss: Learning qualified and distributed bounding boxes for dense object detection. In Advances in Neural Information Processing Systems 33 (NeurIPS 2020); Neural Information Processing Systems Foundation, Inc. (NeurIPS): La Jolla, CA, USA, 2020. [Google Scholar]
  29. Gao, F.; Jin, X.; Zhou, X.; Dong, J.; Du, Q. MSFMamba: Multi-Scale Feature Fusion State Space Model for Multi-Source Remote Sensing Image Classification. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5504116. [Google Scholar]
  30. Zhu, L.; Liao, B.; Zhang, Q.; Wang, X.; Liu, W.; Wang, X. Vision mamba: Efficient visual representation learning with bidirectional state space model. arXiv 2024, arXiv:2401.09417. [Google Scholar]
  31. Liu, Y.; Tian, Y.; Zhao, Y.; Yu, H.; Xie, L.; Wang, Y.; Ye, Q.; Jiao, J.; Liu, Y. VMamba: Visual State Space Model. arXiv 2024, arXiv:2401.10166. [Google Scholar]
  32. Huang, T.; Pei, X.; You, S.; Wang, F.; Qian, C.; Xu, C. LocalMamba: Visual State Space Model with Windowed Selective Scan. arXiv 2024, arXiv:2403.09338. [Google Scholar] [CrossRef]
  33. Saif, A.F.M.S.; Prabuwono, A.S.; Mahayuddin, Z.R. Moment Feature Based Fast Feature Extraction Algorithm for Moving Object Detection Using Aerial Images. PLoS ONE 2015, 10, e0126212. [Google Scholar] [CrossRef] [PubMed]
  34. Dai, J.; Qi, H.; Xiong, Y.; Li, Y.; Zhang, G.; Hu, H.; Wei, Y. Deformable Convolutional Networks. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 764–773. [Google Scholar]
  35. Du, B.; Huang, Y.; Chen, J.; Huang, D. Adaptive Sparse Convolutional Networks with Global Context Enhancement for Faster Object Detection on Drone Images. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023; pp. 13435–13444. [Google Scholar]
  36. Wang, W.; Dai, J.; Chen, Z.; Huang, Z.; Li, Z.; Zhu, X.; Hu, X.; Lu, T.; Lu, L.; Li, H.; et al. InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 14408–14419. [Google Scholar]
  37. Qi, Y.; He, Y.; Qi, X.; Zhang, Y.; Yang, G. Dynamic Snake Convolution Based on Topological Geometric Constraints for Tubular Structure Segmentation. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 4–6 October 2023; pp. 6047–6056. [Google Scholar]
  38. Zhang, X.; Song, Y.; Song, T.; Yang, D.; Ye, Y.; Zhou, J.; Zhang, L. LDConv: Linear deformable convolution for improving convolutional neural networks. Image Vis. Comput. 2024, 149, 105190. [Google Scholar] [CrossRef]
  39. Wu, Y.; Mu, X.; Shi, H.; Hou, M. An object detection model AAPW-YOLO for UAV remote sensing images based on adaptive convolution and reconstructed feature fusion. Sci. Rep. 2025, 15, 16214. [Google Scholar] [CrossRef] [PubMed]
  40. Xia, G.S.; Bai, X.; Ding, J.; Zhu, Z.; Belongie, S.; Luo, J.; Datcu, M.; Pelillo, M.; Zhang, L. DOTA: A large-scale dataset for object detection in aerial images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 3974–3983. [Google Scholar]
  41. Razakarivony, S.; Jurie, F. Vehicle detection in aerial imagery: A small target detection benchmark. J. Vis. Commun. Image Represent. 2016, 34, 187–203. [Google Scholar] [CrossRef]
  42. Colin, L. Unified Coincident Optical and Radar for Recognition (UNICORN) 2008 Dataset. 2019. Available online: https://github.com/AFRL-RY/data-unicorn-2008 (accessed on 5 July 2025).
  43. Gao, T.; Liu, Z.; Zhang, J.; Wu, G.; Chen, T. A task-balanced multiscale adaptive fusion network for object detection in remote sensing images. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–15. [Google Scholar] [CrossRef]
  44. Fu, C.Y.; Liu, W.; Ranga, A.; Tyagi, A.; Berg, A.C. Dssd: Deconvolutional single shot detector. arXiv 2017, arXiv:1701.06659. [Google Scholar] [CrossRef]
  45. Zhang, S.; Wen, L.; Bian, X.; Lei, Z.; Li, S.Z. Single-shot refinement neural network for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4203–4212. [Google Scholar]
  46. Farhadi, A.; Redmon, J. YOLOv3: An Incremental Improvement. In Proceedings of the Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018. [Google Scholar]
  47. Zhu, X.; Lyu, S.; Wang, X.; Zhao, Q. TPH-YOLOv5: Improved YOLOv5 based on transformer prediction head for object detection on drone-captured scenarios. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 2778–2788. [Google Scholar]
  48. Varghese, R.; Sambath, M. Yolov8: A novel object detection algorithm with enhanced performance and robustness. In Proceedings of the 2024 International Conference on Advances in Data Engineering and Intelligent Computing Systems (ADICS), Chennai, India, 18–19 April 2024; pp. 1–6. [Google Scholar]
  49. Tian, Y.; Ye, Q.; Doermann, D. Yolov12: Attention-centric real-time object detectors. arXiv 2025, arXiv:2502.12524. [Google Scholar]
  50. Lei, M.; Li, S.; Wu, Y.; Hu, H.; Zhou, Y.; Zheng, X.; Gao, Y. YOLOv13: Real-Time Object Detection with Hypergraph-Enhanced Adaptive Visual Perception. arXiv 2025, arXiv:2506.17733. [Google Scholar]
  51. Wang, C.; He, W.; Nie, Y.; Guo, J.; Liu, C.; Wang, Y.; Han, K. Gold-YOLO: Efficient object detector via gather-and-distribute mechanism. Adv. Neural Inf. Process. Syst. 2023, 36, 51094–51112. [Google Scholar]
  52. Kang, M.; Ting, C.M.; Ting, F.F.; Phan, R.C.W. Bgf-yolo: Enhanced yolov8 with multiscale attentional feature fusion for brain tumor detection. In Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, Daejeon, Republic of Korea, 23–27 September 2024; Springer Nature: Cham, Switzerland, 2024; pp. 35–45. [Google Scholar]
  53. Zhang, Y.; Ye, M.; Zhu, G.; Liu, Y.; Guo, P.; Yan, J. FFCA-YOLO for Small Object Detection in Remote Sensing Images. IEEE Trans. Geosci. Remote. Sens. 2024, 62, 1–15. [Google Scholar] [CrossRef]
  54. Caron, M.; Touvron, H.; Misra, I.; Jégou, H.; Mairal, J.; Bojanowski, P.; Joulin, A. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 9650–9660. [Google Scholar]
  55. Xu, C.; Wang, J.; Yang, W.; Yu, H.; Yu, L.; Xia, G.S. RFLA: Gaussian receptive field based label assignment for tiny object detection. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer Nature: Cham, Switzerland, 2022; pp. 526–543. [Google Scholar]
  56. Xiao, Y.; Xu, T.; Xin, Y.; Li, J. FBRT-YOLO: Faster and Better for Real-Time Aerial Image Detection. Proc. AAAI Conf. Artif. Intell. 2025, 39, 8673–8681. [Google Scholar] [CrossRef]
Figure 1. Challenges faced by remote sensing target detection. (a) Dense small targets, (b) complex backgrounds, (c) blurred features, and (d) dim lighting conditions.
Figure 2. Structure of YOLOv11.
Figure 3. Overview of the method.
Figure 4. Schematic diagram of different branches of the GMGblock module.
Figure 5. Schematic diagram of the integration of different components. Among them, (a) is the base convolution, (b) integrates LDblock on the basis of (a), (c) integrates SS2D and MLP on the basis of (a), and (d) is our LDefMambablock structure in this paper.
Figure 6. Average detection accuracy of the DOTA-V1.0 dataset. Each point represents the accuracy of the comparison models in a given category, where the horizontal axis represents the different models and the vertical axis represents the mAP50 value for each category.
Figure 7. Average detection accuracy across all categories in the DOTA-V1.0 dataset. GMG-LDefmamba-YOLO is represented by red bars and achieves the highest detection accuracy of 70.91% on this dataset.
Figure 8. Confusion matrix for the DOTA-V1.0 dataset.
Figure 9. Bar charts comparing mAP50, parameters, and FLOPs metrics on the VEDAI dataset. The results of our method are prominently colored.
Figure 10. Partial visualization of the results of the method on the DOTA-v1.0 dataset.
Figure 11. Partial visualization of the results of the method on the VEDAI and USOD datasets.
Figure 12. False positives and missed detections on the DOTA-v1.0, VEDAI, and USOD datasets. The green borders indicate ground-truth labels, the yellow borders indicate targets correctly recognized by the model, and the red borders indicate false positives and missed detections.
Table 1. Summary of datasets.

| Dataset Name | Object Quantity | Object Distribution | Object Size | Category Quantity |
|---|---|---|---|---|
| DOTA-v1.0 | 263,427 | 3 per image | 20 × 20–300 × 300 | 15 |
| VEDAI | 3575 | 3 per image | 12 × 12–24 × 24 | 9 |
| USOD | 50,298 | 14.5 per image | 8 × 8–16 × 16 | 1 |
Table 2. Effects of branch design strategies on each dataset: Paras, FLOPs, and mAP50.

| Dataset | Branch Design | Paras | FLOPs (G) | mAP50 |
|---|---|---|---|---|
| USOD | C/4*4 | 17.15 M | 73 | 89.60 |
| USOD | C/8*8 | 16.88 M | 68 | 90.35 |
| USOD | C/4*8/2 | 19.29 M | 80 | 89.98 |
| USOD | C/16*8*2 | 18.34 M | 76 | 89.77 |
| DOTA-v1.0 | C/4*4 | 19.01 M | 81 | 70.07 |
| DOTA-v1.0 | C/8*8 | 18.59 M | 79 | 70.91 |
| DOTA-v1.0 | C/4*8/2 | 20.97 M | 89 | 69.98 |
| DOTA-v1.0 | C/16*8*2 | 20.14 M | 84 | 70.35 |
| VEDAI | C/4*4 | 18.83 M | 79 | 75.65 |
| VEDAI | C/8*8 | 17.94 M | 77 | 77.94 |
| VEDAI | C/4*8/2 | 19.50 M | 85 | 75.49 |
| VEDAI | C/16*8*2 | 18.99 M | 81 | 76.33 |
Table 3. Effects of isolating the Gaussian mask alone on each dataset: Paras, FLOPs, and mAP50.

| Dataset | Gaussian Mask | Paras | FLOPs (G) | mAP50 |
|---|---|---|---|---|
| USOD | configure | 16.88 M | 68 | 90.35 |
| USOD | unconfigure | 16.81 M | 67 | 89.82 |
| DOTA-v1.0 | configure | 18.59 M | 79 | 70.91 |
| DOTA-v1.0 | unconfigure | 18.53 M | 78 | 70.15 |
| VEDAI | configure | 17.94 M | 77 | 77.94 |
| VEDAI | unconfigure | 17.90 M | 76 | 76.88 |
Table 4. Effects of different convolutional kernel sizes on each dataset: Paras, FLOPs, and mAP50.

| Dataset | Kernel Size (1 × k, k × 1) | Paras | FLOPs (G) | mAP50 |
|---|---|---|---|---|
| USOD | 3 | 16.88 M | 68 | 90.35 |
| USOD | 5 | 17.52 M | 70 | 89.45 |
| USOD | 7 | 20.15 M | 75 | 89.98 |
| USOD | 11 | 26.44 M | 83 | 87.62 |
| DOTA-v1.0 | 3 | 18.59 M | 79 | 70.91 |
| DOTA-v1.0 | 5 | 20.68 M | 83 | 69.89 |
| DOTA-v1.0 | 7 | 25.42 M | 88 | 70.32 |
| DOTA-v1.0 | 11 | 30.07 M | 96 | 69.40 |
| VEDAI | 3 | 17.94 M | 77 | 77.94 |
| VEDAI | 5 | 19.33 M | 80 | 76.85 |
| VEDAI | 7 | 23.82 M | 85 | 77.00 |
| VEDAI | 11 | 27.37 M | 91 | 76.13 |
Table 5. Impact of different component integration on the USOD dataset mAP50.

| Structure | mAP50 | Speed (ms) | Memory (MB) | LDblock | SS2D | MLP |
|---|---|---|---|---|---|---|
| (a) | 88.87 | 3.5 | 1257 | × | × | × |
| (b) | 89.59 | 3.6 | 1289 | √ | × | × |
| (c) | 89.83 | 3.9 | 1365 | × | √ | √ |
| (d) | 90.35 | 4.2 | 1418 | √ | √ | √ |
Table 6. Comparison of results on DOTA-v1.0. The results highlighted in red and blue represent the best and second-best performance in each column, respectively.

| Method | PL | BD | BR | GTF | SV | LV | SH | TC | BC | ST | SBF | RA | HA | SP | HC | mAP50 | mAP50:95 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DSSD | 75.12 | 71.00 | 43.11 | 61.11 | 60.37 | 64.17 | 73.70 | 76.29 | 72.20 | 71.84 | 52.39 | 49.16 | 63.26 | 59.86 | 50.30 | 62.93 | 45.32 |
| RefineDet | 75.21 | 68.27 | 41.10 | 59.69 | 65.17 | 69.79 | 73.83 | 76.56 | 72.27 | 70.63 | 49.37 | 56.06 | 60.01 | 57.62 | 51.14 | 63.11 | 46.54 |
| YOLOv3 | 67.67 | 64.25 | 36.68 | 65.72 | 59.99 | 64.17 | 69.45 | 70.28 | 65.68 | 66.91 | 48.67 | 50.02 | 57.98 | 60.35 | 58.61 | 60.43 | 44.21 |
| YOLOv5 | 80.13 | 77.12 | 48.74 | 66.69 | 65.94 | 69.74 | 79.26 | 81.90 | 77.85 | 77.44 | 57.97 | 54.83 | 68.83 | 65.48 | 55.81 | 68.52 | 47.58 |
| Gold-YOLO | 80.54 | 73.25 | 45.91 | 61.99 | 70.01 | 74.10 | 79.39 | 82.07 | 78.65 | 75.86 | 55.06 | 58.84 | 66.09 | 59.92 | 43.42 | 67.01 | 47.45 |
| BGF-YOLO | 72.67 | 66.53 | 37.83 | 57.87 | 61.98 | 67.86 | 71.91 | 74.78 | 71.65 | 69.62 | 45.62 | 44.18 | 59.84 | 51.93 | 47.15 | 68.17 | 47.81 |
| YOLOv8 | 76.05 | 72.66 | 44.96 | 74.14 | 68.19 | 72.50 | 77.76 | 78.53 | 73.93 | 75.24 | 56.90 | 58.24 | 66.26 | 68.65 | 66.92 | 68.73 | 47.99 |
| FFCA-YOLO | 80.43 | 78.37 | 52.75 | 65.41 | 73.43 | 77.94 | 78.53 | 81.49 | 78.00 | 80.17 | 60.20 | 57.32 | 62.21 | 73.16 | 67.33 | 70.12 | 49.88 |
| DINO | 79.70 | 75.46 | 47.67 | 65.62 | 64.85 | 68.67 | 78.19 | 80.82 | 76.81 | 76.34 | 56.88 | 53.76 | 67.76 | 64.41 | 55.04 | 67.47 | 47.69 |
| DNTR | 80.21 | 76.83 | 50.07 | 65.51 | 70.55 | 73.21 | 81.49 | 80.28 | 78.39 | 77.11 | 53.56 | 62.22 | 69.28 | 63.37 | 55.61 | 69.18 | 48.17 |
| RFLA | 73.98 | 71.65 | 48.08 | 67.44 | 67.09 | 71.79 | 74.69 | 77.15 | 72.73 | 73.55 | 54.28 | 56.16 | 69.42 | 67.61 | 64.29 | 67.39 | 47.32 |
| YOLOv11 | 78.38 | 73.36 | 51.17 | 71.46 | 69.06 | 74.94 | 77.62 | 79.77 | 76.26 | 77.25 | 61.75 | 60.82 | 67.83 | 63.82 | 64.14 | 69.84 | 48.57 |
| YOLOv12 | 75.94 | 75.08 | 53.42 | 65.99 | 70.21 | 72.92 | 72.10 | 77.33 | 73.28 | 69.91 | 60.86 | 58.62 | 66.10 | 68.29 | 66.33 | 68.43 | 47.78 |
| YOLOv13 | 72.11 | 76.13 | 51.87 | 73.86 | 69.67 | 75.28 | 76.88 | 80.29 | 75.07 | 77.31 | 60.18 | 61.09 | 66.38 | 70.25 | 64.42 | 70.05 | 49.63 |
| Mamba-YOLO | 76.98 | 74.51 | 51.02 | 70.44 | 70.05 | 74.64 | 77.53 | 80.10 | 75.67 | 76.51 | 57.27 | 59.16 | 73.36 | 70.53 | 67.24 | 70.33 | 49.76 |
| FBRT-YOLO | 80.69 | 77.55 | 53.11 | 74.69 | 70.23 | 76.94 | 79.42 | 79.27 | 78.29 | 61.07 | 63.37 | 69.40 | 68.26 | 59.35 | 67.01 | 70.58 | 49.93 |
| GMG-LDefmamba-YOLO | 81.54 | 78.37 | 53.91 | 75.47 | 70.81 | 76.83 | 79.79 | 82.25 | 80.60 | 79.28 | 54.77 | 55.93 | 69.94 | 66.70 | 57.47 | 70.91 | 50.39 |
Table 7. Comparison of results on VEDAI. The results highlighted in red and blue represent the best and second-best performance in each column, respectively.

| Method | mAP50 (%) | Parameters (M) | FLOPs (G) |
|---|---|---|---|
| YOLOv5 | 71.93 | 23.75 | 117 |
| Gold-YOLO | 75.02 | 28.76 | 122 |
| BGF-YOLO | 76.47 | 32.53 | 145 |
| YOLOv8 | 74.39 | 28.14 | 125 |
| FFCA-YOLO | 75.18 | 11.20 | 82 |
| YOLOv11 | 75.44 | 15.98 | 84 |
| YOLOv12 | 74.39 | 20.46 | 108 |
| YOLOv13 | 75.17 | 16.98 | 93 |
| Mamba-YOLO [21] | 76.98 | 41.81 | 151 |
| FBRT-YOLO [56] | 77.52 | 19.33 | 88 |
| GMG-LDefmamba-YOLO | 77.94 | 17.94 | 77 |
Table 8. Comparison of results on USOD. The results highlighted in red and blue represent the best and second-best performance in each column, respectively.

| Method | Precision | Recall | mAP50 | Speed (ms) | Memory (MB) |
|---|---|---|---|---|---|
| YOLOv5 | 85.80 | 79.41 | 83.95 | 5.6 | 1861 |
| Gold-YOLO | 88.77 | 82.10 | 87.95 | 5.2 | 1589 |
| BGF-YOLO | 88.18 | 82.56 | 88.24 | 5.3 | 1751 |
| YOLOv8 | 87.65 | 82.01 | 87.40 | 4.8 | 1668 |
| FFCA-YOLO | 90.04 | 83.98 | 89.80 | 4.6 | 1655 |
| YOLOv11 | 89.50 | 83.25 | 89.06 | 4.3 | 1446 |
| YOLOv12 | 88.77 | 82.01 | 88.52 | 4.7 | 1714 |
| YOLOv13 | 89.38 | 82.53 | 88.89 | 4.5 | 1533 |
| Mamba-YOLO | 90.09 | 83.96 | 89.60 | 4.6 | 1729 |
| FBRT-YOLO | 90.33 | 83.95 | 89.83 | 4.4 | 1509 |
| GMG-LDefmamba-YOLO | 90.35 | 84.03 | 90.28 | 4.2 | 1418 |
Table 9. mAP50 breakdown by object scale on datasets. The results highlighted in red and blue represent the best and second-best performance in each column, respectively.

| Model | Small Targets (mAP50, %) | Medium Targets (mAP50, %) | Large Targets (mAP50, %) | Extra-Large Targets (mAP50, %) |
|---|---|---|---|---|
| YOLOv11 | 61.29 | 77.08 | 88.36 | 93.01 |
| Mamba-YOLO | 63.17 | 75.33 | 88.67 | 94.13 |
| YOLOv12 | 58.23 | 74.15 | 87.94 | 93.42 |
| YOLOv13 | 59.05 | 73.82 | 89.51 | 92.76 |
| GMG-LDefmamba-YOLO | 72.58 | 81.34 | 91.29 | 95.05 |