HSD-DETR: An Efficient Hybrid Scale Dynamic Network for Small Object Detection in Remote Sensing Images

Xu, Jinyu; Liu, Wenwei; Tian, Runze; Wang, Chengyou; Zhang, Yuanbo

doi:10.3390/rs18101577

Open AccessArticle

HSD-DETR: An Efficient Hybrid Scale Dynamic Network for Small Object Detection in Remote Sensing Images

by

Jinyu Xu

¹

,

Wenwei Liu

¹,

Runze Tian

¹,

Chengyou Wang

^1,2,3,*

and

Yuanbo Zhang

¹

School of Airspace Science and Engineering, Shandong University, Weihai 264209, China

²

Shandong Key Laboratory of Intelligent Electronic Packaging Testing and Application, Shandong University, Weihai 264209, China

³

Shandong University–Weihai Research Institute of Industrial Technology, Weihai 264209, China

^*

Author to whom correspondence should be addressed.

Remote Sens. 2026, 18(10), 1577; https://doi.org/10.3390/rs18101577

Submission received: 23 March 2026 / Revised: 4 May 2026 / Accepted: 9 May 2026 / Published: 14 May 2026

(This article belongs to the Section Remote Sensing Image Processing)

Download

Browse Figures

Versions Notes

Highlights

What are the main findings?

We propose the hybrid scale dynamic detection transformer (HSD-DETR), a lightweight and highly efficient end-to-end object detector explicitly optimized for accurate and efficient small object detection in high-resolution remote sensing imagery.
To overcome the severe bottlenecks faced by lightweight models in detecting small objects within complex backgrounds, HSD-DETR integrates the HCSSNet backbone (combining local convolution and global linear scanning), the SPDMixer for pixel-level lossless downsampling, the DSAIFI module for dynamic noise suppression, and the RF-MPDIoU loss function for precise localization of dense targets.

What are the implications of the main findings?

The end-to-end architecture eliminates complex post-processing, and its lightweight design drastically reduces computational overhead (parameters and GFLOPs), providing a highly efficient algorithmic solution for scenarios with strict computational budgets.
This research establishes that the novel hybrid fusion of convolutions and state space models (SSMs) provides a more parameter-efficient paradigm for capturing global-local context, effectively mitigating the structural limitations of CNNs and SSMs in complex remote sensing scenarios.

Abstract

Balancing small object detection performance with model lightweighting remains a critical challenge in the remote sensing domain. To address the massive computational and parameter overhead of existing algorithms, we propose the hybrid scale dynamic detection transformer (HSD-DETR). This lightweight detector incorporates four core innovations to effectively enhance feature extraction for small objects. First, to reduce costs without compromising performance, we design a hybrid convolution and selective scanning fusion (HCSS-Fusion) module to reconstruct the backbone, combining local convolution with global linear scanning. Second, to preserve fine-grained information, we introduce a space-to-depth mixer (SPDMixer) to achieve pixel-level lossless downsampling. Third, to mitigate background interference and enhance small object representation, we develop a dynamic sparse adaptive intra-scale feature interaction (DSAIFI) module, employing a gating mechanism to dynamically select informative spatial tokens. Finally, to improve the localization precision for small objects, we propose the rational-focal minimum point distance intersection over union (RF-MPDIoU) loss, utilizing a non-linear mapping to dynamically modulate sample weights. Experimental results on public benchmarks confirmed that, compared to mainstream models, HSD-DETR achieves highly competitive accuracy while significantly reducing parameter scale and theoretical computational complexity. Ultimately, this research provides a lightweight and robust algorithmic solution for the field of remote sensing object detection.

Keywords:

remote sensing; small object detection; lightweight architecture; feature fusion; state space models (SSMs); detection transformer; Wasserstein distance

1. Introduction

With the continuous advancement of Earth observation technologies, high-resolution remote sensing (RS) imagery has been widely applied across diverse domains, such as urban planning, environmental monitoring, agriculture management, military reconnaissance, and disaster management [1,2]. Distinguished by their wide field of view, rich spectral information, and short revisit cycles, RS images serve as a vital medium for perceiving complex dynamic scenes [3]. Therefore, achieving accurate and efficient object detection in such environments is crucial for the effectiveness of these downstream applications.

Unlike natural scene images, RS imagery presents unique challenges, including vast spatial coverage, intricate backgrounds, drastic scale variations, and dense distributions of small objects [4]. Particularly, critical targets like aerial vehicles and automobiles often occupy minimal spatial extents (typically smaller than

32 \times 32

pixels) and are frequently submerged in overwhelming environmental noise [5]. These inherent characteristics impose stringent demands on both the feature extraction capabilities and computational efficiency of detection models.

Deep learning has fundamentally reshaped object detection, evolving from traditional convolutional neural networks (CNNs) [6] to advanced Transformer-based architectures such as the detection transformer (DETR) [7] and real-time detection transformer (RT-DETR) [8]. Unlike state-of-the-art CNN-based detectors (e.g., the You Only Look Once (YOLO) [9] series) that struggle with dense objects due to their reliance on heuristic non-maximum suppression (NMS) [10], RT-DETR offers a more elegant end-to-end paradigm. By leveraging self-attention mechanisms, it naturally excels at capturing global contextual dependencies, which is indispensable for interpreting large-scale and complex RS scenes. Unlike traditional DETR models that suffer from prohibitive computational overhead and sluggish convergence, RT-DETR successfully bridges the gap between high accuracy and operational efficiency. This makes it an ideal foundational architecture for demanding RS applications.

However, while RT-DETR establishes a robust benchmark, its direct adaptation to small object detection in RS imagery remains constrained by excessive parameters and computational costs. Furthermore, it faces four critical bottlenecks: (1) standard backbones struggle to balance feature extraction accuracy and model lightweighting, particularly in capturing long-range spatial dependencies while maintaining local details; (2) aggressive downsampling causes irreversible loss of fine-grained information for small objects; (3) indiscriminate global modeling introduces severe background noise interference; and (4) standard loss functions suffer from gradient vanishing and sub-optimal sample weighting when handling dense, small objects.

To address the challenges outlined above, we propose the hybrid scale dynamic detection transformer (HSD-DETR), a lightweight and efficient detector optimized for remote sensing imagery. Building upon the RT-DETR architecture, we establish a novel hybrid backbone as the foundation of our network to improve sensitivity to small objects while mitigating computational redundancy. To seamlessly complement this foundational architecture across the entire detection pipeline, we further introduce specialized optimizations in the downsampling stage, encoder, and loss function.

The main contributions of this article are summarized as follows:

(1): We propose the hybrid convolution and selective scanning fusion (HCSS-Fusion) module as the foundational building block to construct the lightweight HCSSNet backbone. By effectively combining local convolution and global linear scanning, it significantly enhances small object feature extraction in complex backgrounds.
(2): To preserve critical fine-grained information during the initial downsampling stage, we propose a space-to-depth mixer (SPDMixer) for feature embedding, utilizing pixel-level lossless reorganization.
(3): To dynamically filter background noise and focus computational resources on salient small object features, we design a dynamic sparse adaptive intra-scale feature interaction (DSAIFI) module with a gating mechanism.
(4): To significantly improve localization precision for dense, small objects, we design a rational-focal minimum point distance intersection over union (RF-MPDIoU) loss function with a non-linear mapping mechanism.

The rest of this paper is organized as follows. Section 2 starts with a brief review of related work. The proposed HSD-DETR architecture, including the HCSS-Fusion module, SPDMixer, DSAIFI module, and the optimized RF-MPDIoU regression loss function, is detailed in Section 3. Experimental results are presented in Section 4. Comprehensive discussions are provided in Section 5. Conclusions and remarks on possible future work are given finally in Section 6.

2. Related Work

2.1. Deep Learning for Remote Sensing Object Detection

With the rapid availability of large-scale annotated datasets and the advancement of deep learning, object detection in RS images has shifted from traditional manually engineered features to CNNs and, more recently, Transformer-based architectures [11]. One-stage detectors, particularly the YOLO series, have been widely adopted for real-time applications due to their optimal balance between inference speed and accuracy. However, CNN-based methods are inherently limited by their local receptive fields, which restricts their capability to capture the long-range contextual dependencies essential for interpreting complex, large-scale remote sensing scenes.

To overcome these limitations, Transformer-based detectors, such as DETR and its variants [12,13,14], have emerged as the dominant paradigm for global context modeling. These models utilize self-attention mechanisms to formulate global relationships, effectively enhancing the understanding of complex scenes. Notably, RT-DETR establishes a new benchmark by employing a hybrid encoder architecture that skillfully balances global modeling capabilities with real-time inference efficiency, demonstrating significant advantages in capturing long-range dependencies over traditional CNNs.

2.2. State Space Models and Lightweight Feature Extraction

Balancing feature extraction capability with computational efficiency remains a critical challenge in the remote sensing domain. For instance, RT-DETR’s native backbones (e.g., ResNet [15]) are tailored for natural images and lack the adaptive receptive field mechanisms required for RS scenes with large variations. To effectively handle such diverse scale variations in remote sensing, dynamic network architectures such as large selective kernel network (LSKNet) [16] and deformable attention transformers [17] have been proposed to dynamically adjust kernel sizes and expand the receptive field, thereby enhancing the capture of contextual information. Nevertheless, LSKNet remains constrained by the inherent locality of convolution operations and lacks efficient global linear modeling capabilities.

In recent years, state space models (SSMs), such as Mamba [18], have garnered significant attention as emerging architectures capable of modeling long sequences with linear complexity. Offering a compelling alternative to the quadratic computational complexity inherent to standard Transformers, SSMs have recently been successfully introduced into the field of computer vision by pioneering works like Vision Mamba (Vim) [19] and VMamba [20]. Recently, models like Mamba-YOLO [21] have successfully adapted SSMs for object detection, demonstrating their vast potential in vision tasks. However, existing backbones relying purely on CNNs or Transformers often struggle to simultaneously capture local fine-grained details and global semantic information without incurring excessive parameter costs. Inspired by these developments, hybrid architectures and lightweight adaptations like MobileMamba [22] have become a promising direction. Seamlessly integrating the local feature extraction advantages of convolution with the efficient global linear scanning capabilities of 2D selective scan (SS2D) offers a promising solution. This hybrid approach significantly enhances the ability to capture small objects in complex backgrounds, while maintaining a highly efficient parameter footprint.

2.3. Optimization for Small Object Detection

Detecting small objects (typically smaller than

32 \times 32

pixels) in RS images presents unique challenges. Researchers have made significant progress in mitigating information loss during downsampling, enhancing feature representation, and optimizing regression localization.

Addressing the issue of information loss caused by downsampling, researchers have proposed advanced feature pyramid variants [23] and enhanced path aggregation networks [24] to enhance semantic expression by fusing features from different architectural levels. Furthermore, techniques such as invertible downsampling [25] and space-to-depth convolution (SPD-Conv) [26] have been developed to preserve features by reorganizing spatial information into channel dimensions, effectively alleviating the impact of resolution degradation. Despite these advancements, most mainstream backbones still rely on aggressive large-stride downsampling in their initial extraction stages. Consequently, for small objects occupying only a few pixels, critical fine-grained features are often irreversibly lost before they can even reach these downstream fusion modules.

From context-expanded local modules (e.g., dilated neighborhood attention [27]) to linear-complexity global aggregators (e.g., bi-level routing attention [28] and agent attention [29]), these methods successfully endow models with the ability to capture long-range contextual dependencies. Despite this, standard global modeling often introduces significant background noise interference. In RS images where small objects are extremely sparse, performing indiscriminate dense attention calculations on all pixels not only incurs computational redundancy but also causes the weak features of small objects to be overwhelmed by dominant background signals. To mitigate these issues, recent studies have explored memory-efficient macro designs like the single-head vision transformer [30] and cross-paradigm representation alignment [31] to reduce computational bottlenecks and enhance robust feature filtering.

Finally, regarding the optimization of regression localization, bounding box loss functions have evolved from basic coordinate-level regressions to geometry-aware intersection over union (IoU) metrics (e.g., generalized IoU (GIoU [32]), distance IoU (DIoU) [33] and complete IoU (CIoU) [33]), significantly improving detection performance. Models employing standard IoU-based loss functions are highly sensitive to the positional deviations of small objects. When the bounding box overlap is minimal, these models often suffer from gradient vanishing and severe convergence difficulties. To address this, researchers have proposed specialized metrics such as normalized Wasserstein distance (NWD) [34], inner IoU (Inner-IoU) [35], and minimum point distance IoU (MPDIoU) [36]. By modeling bounding boxes as Gaussian distributions, introducing auxiliary borders of different scales, or utilizing corner distance metrics, these approaches effectively optimize bounding box regression. They demonstrate clear superiority in small object detection, providing strong support for achieving a balance between high accuracy and efficiency. Inspired by these advancements, and to further resolve gradient vanishing issues and significantly improve localization precision for dense, small objects, we design an RF-MPDIoU loss function with a non-linear mapping mechanism.

3. Materials and Methods

3.1. Overview of HSD-DETR Model

As illustrated in Figure 1, HSD-DETR is designed as a cohesive end-to-end framework. This architecture is centered on the HCSS-Fusion module and builds the HCSSNet backbone. The system first employs SPDMixer to maintain initial information fidelity. Building upon this foundation, HCSSNet performs deep feature extraction using the HCSS-Fusion engine. Subsequently, DSAIFI purifies the feature space by mitigating background interference. Finally, RF-MPDIoU provides precise geometric supervision. This progressive pipeline ensures that fragile small object signals are preserved and accurately localized.

3.2. Hybrid Convolution and Selective Scanning Fusion Module

Existing lightweight backbones often struggle with complex remote sensing backgrounds, failing to achieve discriminative feature representation under stringent computational constraints. To address this, as illustrated in Figure 2, we propose the HCSS-Fusion module to enhance feature capture capabilities for small objects. In optical remote sensing, micro-targets are frequently submerged in vast backgrounds. This environment poses a challenge for lightweight architectures relying purely on CNNs or pure SSMs.

CNNs offer excellent local inductive biases for capturing sharp geometries. However, their localized receptive fields restrict broad contextual understanding. In contrast, SSMs achieve a global receptive field efficiently. Yet, applying 1D sequential scanning to 2D vision data can disrupt spatial proximity. During long-range state integration, the fragile features of small objects—often comprising just a few pixels—are susceptible to being washed out or over-smoothed.

This challenge provides an opportunity for mitigation through frequency-domain complementarity. Theoretically, because its compact, localized kernel inherently captures rapid spatial variations such as edges and fine textures, pure convolution acts as a high-frequency structural filter

H_{C N N}

. Functioning as a localized spatial anchor, its high-frequency response

H_{C N N} (i, j)

at spatial coordinates

(i, j)

for an input

X {\in R}^{C \times H \times W}

is expressed as:

H_{C N N} (i, j) = \sum_{p, q \in Ω} W_{c o n v}^{p, q} \cdot X_{i + p, j + q}

(1)

where

Ω

represents the local neighborhood (receptive field), and (

p, q

) are the relative spatial offsets within this neighborhood.

W_{c o n v}^{p, q}

refers to the learnable convolution kernel weights, and

X_{i + p, j + q}

is the input feature value at the offset position.

Simultaneously, the 2D selective scan (SS2D) mechanism functions as a low-frequency contextual filter

H_{S S M}

. It executes a global linear attention mechanism to aggregate dependencies while maintaining the current spatial resolution. Its contextual output

H_{S S M}^{(j)}

along the

j - t h

feature slice is formulated as:

H_{S S M}^{(j)} \approx (Q ⊙ w^{(j)}) h_{a}^{(j)} + [(Q ⊙ w^{(j)}) {(\frac{K}{w^{(j)}})}^{T} ⊙ M_{c a u s a l}] V^{(j)}

(2)

where

Q

,

K

, and

V

are query, key, and value matrices projected from the input.

h_{a}^{(j)}

represents the previous hidden state vector, and

M_{c a u s a l}

is the causal mask matrix ensuring sequential scanning logic. The symbol

⊙

denotes the element-wise product. The distance-aware decay tensor

w^{(j)}

enables the network to aggregate low-frequency global contexts by weighting long-range dependencies.

Leveraging this theoretical synergy, the HCSS-Fusion module captures information through a parallel structure. The local context branch employs a

3 \times 3

depthwise convolution to preserve fragile geometric priors. This compact kernel focuses on high-frequency textures and prevents spatial over-smoothing. Parallel to this, the global context branch introduces SS2D to capture long-range dependencies. Since small objects rely heavily on surrounding cues (e.g., vehicles on roads), SS2D provides efficient global guidance. To ensure robust representation power, we set the state dimension to

d_{s t a t e} = 8

and the expansion ratio to

r_{s s m} = 1.5

. The output features of the two branches,

U_{l o c a l}

and

U_{g l o b a l}

, are calculated in Equation (3):

U_{l o c a l} = {C o n v}_{3 \times 3} (X), U_{g l o b a l} = S S 2 D (X)

(3)

To achieve adaptive complementarity between local details and global context, we design the Hybrid Scale Selection Mechanism. First, we fuse the extracted local and global features via element-wise addition to obtain a unified intermediate representation

U

, which efficiently aggregates both local structural details and long-range dependencies. This process is denoted in Equation (4):

U = U_{l o c a l} + U_{g l o b a l}

(4)

To explicitly model the spatial saliency across this fused representation, we apply average pooling and max pooling along the channel dimension of

U

, generating two 2D spatial descriptors

D_{a v g}

and

D_{m a x}

. After concatenating these spatial descriptors, a large-kernel

7 \times 7

convolution layer is applied for spatial information interaction, followed by a Sigmoid activation function to generate the hybrid spatial selection mask

M {\in R}^{2 \times H \times W}

in Equation (5):

M = σ ({C o n v}_{7 \times 7} ([D_{a v g}; D_{m a x}]))

(5)

where

[\cdot; \cdot]

denotes concatenation. The two channels of the mask

M_{l o c}

and

M_{g l o}

(corresponding to the first and second channels of

M

), serve as the spatial attention weights for the projected local and global features, respectively. Finally, we compute the weighted sum of the original branch features using the generated spatial masks to obtain the final module output

Y

in Equation (6):

Y = U_{l o c a l} ⊙ M_{l o c} + U_{g l o b a l} ⊙ M_{g l o}

(6)

where

⊙

denotes element-wise multiplication.

Through this process, HCSS-Fusion is able to adaptively reinforce the feature saliency of small targets while maintaining computational efficiency, thereby significantly improving the model’s perception and capture capabilities for small targets.

3.3. Space-to-Depth Mixer Module

To prevent irreversible information loss, we propose SPDMixer as the foundational stem of our architecture. By replacing conventional strided convolutions, SPDMixer ensures that the dense pixel-level semantics required for subsequent modeling are fully preserved through lossless space-to-depth transformation. As illustrated in Figure 3, SPDMixer integrates lossless Space-to-Depth (SPD) transformation with re-parameterized convolutions to perform lossless feature compression and reconstruction while preserving pixel-level information during downsampling.

In the feature embedding stage, SPDMixer discards traditional pooling or large-kernel convolutions. Instead, it introduces an SPD layer to perform lossless downsampling on the input image

X {\in R}^{C \times H \times W}

. This process rearranges spatial neighborhood pixel information into the channel dimension by slicing the input along spatial dimensions using odd and even indices. Specifically, four sub-feature maps are generated and concatenated along the channel dimension to yield the intermediate feature

X_{s p d}

in Equation (7):

X_{s p d} = Concat (X_{0 : : 2,0 : : 2}, X_{1 : : 2,0 : : 2}, X_{0 : : 2,1 : : 2}, X_{1 : : 2,1 : : 2}) \in R^{4 C \times \frac{H}{2} \times \frac{W}{2}}

(7)

where

: :

denotes the array slicing operator with an omitted stop index, indicating that the extraction continues to the end of the dimension at the specified step size. To efficiently encode these spatially folded features, we employ a Re-Parameterized Encoder composed of cascaded MobileOne Blocks to enhance the reconstruction capability of local features. Specifically, the spatially folded input is first processed by a stride-1 MobileOne block to project it to the embedding dimension and reconstruct local correlations. Subsequently, a stride-2 depthwise MobileOne block performs secondary downsampling. The process concludes with a final stride-1 MobileOne block, functioning as a pointwise convolution, to achieve thorough channel fusion and output the encoded feature representation.

This design effectively achieves rigorous information preservation during downsampling. By strictly retaining raw pixel data via SPD and reconstructing semantic features via efficient re-parameterized convolutions, SPDMixer ensures acute perception of small objects from the network’s stem. Consequently, this architectural design significantly enhances detection performance in complex scenarios.

3.4. Dynamic Sparse Adaptive Intra-Scale Feature Interaction

While HCSS-Fusion establishes a rich feature space, we integrate DSAIFI as an obligatory feature purifier within the framework. Through adaptive gated sparse attention, it suppresses redundant signals and focuses computational resources exclusively on the salient features of micro-targets. The core of this module lies in our innovatively designed adaptive gated sparse attention (AGSA) mechanism, which replaces traditional dense attention calculations, achieving adaptive feature selection and interaction.

As illustrated in Figure 4, AGSA is designed to enhance feature representational capability through intelligent feature selection. To focus on key features while preserving original context information, AGSA employs a partial channel processing strategy. Given the input features

X {\in R}^{C \times H \times W}

, they are split along the channel dimension into two parts: “active features”

X_{a c t i v e}

for attention interaction (accounting for 25% of channels) and “passive features”

X_{p a s s i v e}

for information retention (accounting for 75% of channels).

The passive features

X_{p a s s i v e}

are transmitted directly to the output via an identity mapping, which not only effectively simplifies the processing of non-salient regions but also preserves the background context of the image. The active features

X_{a c t i v e}

are first normalized via a layer normalization (LN) and then mapped through convolutions to generate the query (

Q

), key (

K

), and value (

V

) vectors required for the attention mechanism.

Concurrently, a lightweight gating network processes the global statistical information of the input feature

X

. This network adaptively predicts a dynamic factor, which reflects the sparsity of targets in the current image. We denote this dynamic factor as

γ

, which is calculated by Equation (8):

γ = Mean (σ ({Conv}_{1 \times 1} (ReLU ({Conv}_{1 \times 1} (X)))))

(8)

where

σ

denotes the Sigmoid activation function. To define the operation scale, the total number of spatial tokens

N

is first calculated as follows:

N = H \times W

(9)

where

H

and

W

represent the height and width of the features, respectively. Based on this, the number of semantic tokens to retain, denoted as

K_{dynamic}

, is formulated by Equation (10):

K_{dynamic} = ⌊ N \times γ ⌋

(10)

where

⌊ \cdot ⌋

denotes the floor operation to ensure an integer value.

Based on

K_{dynamic}

, the model executes a Top-

k

sparse selection strategy. First, the dense attention correlation map

A_{dense} \in R^{N \times N}

is calculated using Equation (11):

A_{dense} = \frac{{Q K}^{T}}{\sqrt{d_{k}}}

(11)

where

d_{k}

is the channel dimension of

Q

and

K

. To retain only the Top-

k

correlation weights with the strongest responses, we extract the indices of the largest

K_{dynamic}

values along the sequence dimension of

A_{dense}

to formulate a binary mask

M

, which is defined by Equation (12):

M_{i, j} = \{\begin{array}{l} 1, & i f j \in Ω_{i} \\ 0, & o t h e r w i s e \end{array}

(12)

where

i, j

\in

[1, N]

denote the token indices, and

Ω_{i}

denotes the index set of the largest

K_{dynamic}

values in the

i

-th sequence

A_{dense}^{(i)}

. The remaining irrelevant background connections are forcibly masked out by replacing their attention scores with negative infinity, yielding

A_{masked}

, which is defined by Equation (13):

A_{m a s k e d}^{(i, j)} = \{\begin{array}{l} A_{d e n s e}^{(i, j)}, & i f M_{i, j} = 1 \\ - \infty, & i f M_{i, j} = 0 \end{array}

(13)

This dynamic sparse mechanism ensures that the model automatically optimizes the distribution of attention according to the complexity of the image content, precisely focusing attention on salient regions containing potential targets. The masked attention is then passed through a softmax layer and multiplied by

V

to obtain the aggregated sparse attention features, denoted as

A_{sparse}

, which is calculated using Equation (14):

A_{sparse} = S o f t m a x {(A_{masked})}^{T} \cdot V

(14)

Finally, the features aggregated via sparse attention

A_{sparse}

are concatenated with the original passive features

X_{p a s s i v e}

along the channel dimension. They are then fused through an output projection layer—comprising a sigmoid linear unit (SiLU) activation and a convolutional layer—to restore the original dimensions and yield the output

X_{out}

, which is formulated by Equation (15):

X_{out} = Proj (Concat (A_{sparse}, X_{passive}))

(15)

where

Proj (\cdot)

denotes the output projection operation.

Building upon the robust feature extraction capabilities of AGSA, we construct the complete DSAIFI module. As shown in Figure 4, input features first enter the AGSA module for global sparse context modeling, utilizing the aforementioned mechanism to effectively filter out background noise. Subsequently, the processed features undergo fusion through a dropout layer and a residual connection to prevent gradient vanishing, followed by layer normalization. To further enhance non-linear representation capability and integrate channel information, the fused features are fed into a feed-forward network (FFN). Unlike traditional fully connected layer designs, we employ a convolutional FFN consisting of two

1 \times 1

convolutional layers, a Gaussian error linear unit (GELU) activation function, and two dropout layers. This design is more conducive to preserving the spatial locality of RS images during feature interaction. Finally, after a second residual connection with dropout and a final layer normalization, DSAIFI outputs the enhanced feature representation. By integrating inner sparse selection with outer residual enhancement, DSAIFI effectively harmonizes dynamic noise suppression and the preservation of spatial locality for sparse targets.

3.5. Optimization of Loss Function

Finally, to translate the purified semantic representations into precise coordinates, we propose RF-MPDIoU as the definitive supervisory mechanism. Through the geometric alignment mechanism, RF-MPDIoU effectively bridges the feature extraction achievements of the preceding components, ensuring that the accuracy gains of the architecture in small object perception are maximized during the supervisory phase.

Specifically, pure geometric overlap metrics struggle to provide stable supervision signals due to the extreme sensitivity of small object positions on discrete pixel grids. As a baseline for the bounding box regression task, the standard IoU, denoted as

I

, is first calculated as shown in Equation (16):

I = \frac{Area (B_{p} \cap B_{g})}{Area (B_{p} \cup B_{g})}

(16)

where

B_{g}

is the ground truth box, and

B_{p}

is the predicted box.

(x_{1}^{p}, y_{1}^{p})

and

(x_{2}^{p}, y_{2}^{p})

denote the top-left and bottom-right coordinates of the predicted box, respectively. Similarly,

(x_{1}^{g}, y_{1}^{g})

and

(x_{2}^{g}, y_{2}^{g})

represent those of the ground truth box.

Furthermore, in the context of small object detection, the severe imbalance between samples during training often leads to gradients being dominated by a vast number of simple samples. This phenomenon suppresses the effective optimization of the sparse and challenging features of small objects. To mitigate this, we introduce a focal mapping mechanism to reweight the

I

by mapping it into a specified interval. The mapped IoU,

I_{m}

, can be formulated as Equation (17):

I_{m} = m a x (0, m i n (1, \frac{I - d}{u - d}))

(17)

where the parameters

d

and

u

are hyperparameters representing the lower and upper bounds of the truncation interval, which are empirically set to 0 and 0.95, respectively, to ensure stable gradient propagation. Subsequently, to smoothly adapt to the optimization difficulty and dynamically adjust the penalization, a rational function transformation is applied to

I_{m}

. The transformed metric,

I_{RF}

, can be expressed as Equation (18):

I_{RF} = \frac{(α + 1) \cdot I_{m}}{α \cdot I_{m} + 1}

(18)

where the parameter

α

is empirically set to 1.0 to adjust the smoothness of the gradient distribution.

To maintain strong geometric constraints on the actual boundaries, a corner distance penalty is integrated. The squared Euclidean distances between the corresponding corners,

d_{1}^{2}

and

d_{2}^{2}

, are defined in Equation (19):

d_{1}^{2} = {(x_{1}^{p} - x_{1}^{g})}^{2} + {(y_{1}^{p} - y_{1}^{g})}^{2}, d_{2}^{2} = {(x_{2}^{p} - x_{2}^{g})}^{2} + {(y_{2}^{p} - y_{2}^{g})}^{2}

(19)

To ensure scale invariance, these distances are normalized by the width

W

and height

H

of the input image. Therefore, the final RF-MPDIoU metric, denoted as

M_{R F}

, is given by Equation (20):

M_{R F} = I_{R F} - \frac{d_{1}^{2}}{W^{2} + H^{2}} - \frac{d_{2}^{2}}{W^{2} + H^{2}}

(20)

Finally, the corresponding bounding box regression loss,

L_{RF}

, is obtained by Equation (21):

L_{RF} = 1 - M_{RF}

(21)

4. Experiment

4.1. Datasets

To rigorously evaluate the proposed model’s efficacy in small object detection tasks, we employed the VisDrone2019 dataset [37], a prominent large-scale benchmark specifically tailored for low-altitude unmanned aerial vehicle (UAV) computer vision tasks. Distinguished by its high density of small targets, this dataset provides a comprehensive evaluation suite encompassing scenarios of varying complexity.

The dataset consists of 10,209 static images (split into 6471 for training, 548 for validation, 1610 for testing, and 1580 for test-challenge) acquired across diverse urban and rural settings. It includes 10 standard object categories: pedestrian, people, car, van, bus, truck, motor, bicycle, awning-tricycle, and tricycle. Notably, the dataset contains well over 300,000 annotated bounding boxes. This vast volume of instance-level data provides robust supervisory signals, critical for validating HSD-DETR’s capability to handle dense distributions of small objects.

As depicted in Figure 5, VisDrone2019 poses multifaceted challenges for object detection. First, the dataset is defined by drastic variations in object scale and a prevalence of minute targets. Figure 5b,d quantitatively demonstrate that the vast majority of target objects possess normalized widths and heights of less than 0.1 (frequently smaller than

32 \times 32

pixels). Furthermore, these objects exhibit diverse aspect ratios, which severely complicate feature extraction and bounding box regression. Second, as revealed by the spatial distribution heatmap in Figure 5c, object centers are predominantly concentrated in the lower-middle regions of the frame. This spatial bias reflects typical UAV downward-looking flight trajectories and requires models to possess robust spatial perception capabilities.

Furthermore, UAV flight dynamics introduce frequent occlusions and motion blur, severely degrading target feature representations. Finally, Figure 5a illustrates a severe long-tailed class imbalance that further exacerbates the difficulty. The prolific “car” (144,866 instances) and “pedestrian” (79,335 instances) classes contrast sharply with sparse samples like “awning-tricycle” (3246 instances) and “bus” (5926 instances). This extreme discrepancy demands a model with robust optimization strategies to maintain high performance on underrepresented classes.

4.2. Implementation Details

The experiments were conducted on a system running Ubuntu 20.04 LTS, equipped with an Intel Xeon Gold 5218R CPU (Intel Corporation, Santa Clara, CA, USA) and two NVIDIA GeForce RTX 3090 GPUs (24 GB) (NVIDIA Corporation, Santa Clara, CA, USA). The implementation was developed using Python 3.10, and model training was performed with PyTorch 2.2.2 based on the Ultralytics 8.0.201 framework. The hardware and software configurations for training and testing are summarized in Table 1.

The experimental settings were optimized for remote sensing tasks: The batch size was set to 16 with an input image size of

640 \times 640

pixels to accommodate GPU memory constraints. Mosaic augmentation (probability 1.0) was applied to mitigate the impact of complex backgrounds and enhance robustness. The model was trained from scratch for 200 epochs in 32-bit floating point (FP32) mode to ensure full convergence, coupled with an early stopping mechanism (patience of 30 epochs) to prevent overfitting. Additionally, a 3-epoch warmup phase was utilized to stabilize the initial training process. The detailed training parameters are summarized in Table 2.

4.3. Evaluation Metrics

To comprehensively evaluate the HSD-DETR model, we employed nine metrics covering detection accuracy and computational efficiency: precision (

P

), recall (

R

), average precision (AP), average precision for small objects (

{A P}_{s}

), average precision for medium objects (

{A P}_{m}

), mean average precision (mAP), parameters (Params), giga floating-point operations (GFLOPs) and frames per second (FPS). In this paper, bold values in the tables indicate the best performance for each evaluation metric. For detection accuracy metrics, the bold values represent the largest values, as higher values correspond to better accuracy. In contrast, for computational efficiency metrics, the bold values represent the smallest values, indicating a more lightweight and computationally efficient model.

4.3.1. Detection Accuracy Metrics

Precision (

P

) represents the proportion of targets that are correctly predicted in all detected targets. By defining true positives (

n_{TP}

) as the correct predictions of the model and false positives (

n_{FP}

) as the incorrect predictions of background as targets, precision (

P

) can be calculated using Equation (22):

P = \frac{n_{TP}}{n_{TP} + n_{FP}}

(22)

Recall (

R

) represents the proportion of actual targets that are correctly predicted by the model. It can be calculated using Equation (23):

R = \frac{n_{TP}}{n_{TP} + n_{FN}}

(23)

where false negatives (

n_{FN}

) represent the actual targets that are missed by the model.

Average precision (AP) represents the area under the precision-recall curve, while the mean average precision (mAP) is the average of the AP values across all classes. Both metrics are mathematically defined in Equation (24) and (25):

A P = \int_{0}^{1} P (R) d R

(24)

m A P = \frac{1}{N} \sum_{i = 1}^{N} {A P}_{i}

(25)

where

N

is the total number of classes, and

{A P}_{i}

is the average precision for the

i

-th class. To comprehensively evaluate the capability of the proposed model for small and medium object detection in complex scenarios, this study adheres to the COCO evaluation protocol and reports

{A P}_{s}

and

{A P}_{m}

.

{A P}_{s}

is precisely defined as the detection accuracy for objects with an area not exceeding

32 \times 32

pixels, while

{A P}_{m}

evaluates the detection accuracy for objects with an area between

32 \times 32

and

96 \times 96

pixels.

4.3.2. Computational Efficiency Metrics

Params refer to the total number of learnable weights in the model. It measures the spatial complexity and storage requirements, where a smaller number implies a more compact model suitable for constrained devices.

GFLOPs refers to the number of floating-point operations (in billions) required for a single forward inference, serving as an indicator of the model’s time complexity and computational cost.

FPS is a metric that measures the real-time processing capability of a model, defined as the number of images the model can process in one second. It can be calculated using Equation (26):

f \begin{matrix} = \frac{1}{t_{i n f}} \end{matrix}

(26)

where

t_{i n f}

represents the average inference time required to process a single image.

4.4. Comparison with Previous Methods

To comprehensively validate the detection performance and practical utility of the proposed HSD-DETR, this section compares our improved model with several state-of-the-art object detection methods. To ensure an objective assessment, all baseline methods were executed following their official implementations under consistent experimental settings on the same dataset. The selected baselines are categorized into two primary paradigms:

(1): YOLO-series models, including YOLOv8m [38], YOLO11m [39], YOLO26m [40], and Mamba YOLO-B [21];
(2): Transformer and DETR-based approaches, encompassing Fine-grained Distribution Refinement (D-FINE) variants (D-FINE-N, D-FINE-S) [41], DETR with Improved Matching (DEIM) variants (DEIM-D-FINE-N, DEIM-D-FINE-S) [42], RT-DETRv2-R18 [43], and RT-DETR-R18 [8].

Table 3 summarizes the comparative detection performance of various models on the VisDrone2019 dataset. The proposed HSD-DETR achieves an mAP50 of 46.6%, an mAP50-95 of 28.2%, and an

{A P}_{s}

of 18.8%. When compared with the baseline RT-DETR-R18, our model demonstrates substantial performance gains, specifically increasing mAP50 and

{A P}_{s}

by 1.3 percentage points each. Furthermore, HSD-DETR maintains a clear competitive edge over other mainstream detectors. For instance, it surpasses the lightweight D-FINE-S and DEIM-D-FINE-S models by 7.2 and 6.2 percentage points in mAP50, respectively. Even when evaluated against the advanced RT-DETRv2-R18, our approach still achieves a competitive performance gain of 0.7% in mAP50. Comparisons with recent YOLO variants, such as YOLO11m and YOLO26m, further validate these improvements, with HSD-DETR achieving mAP50 gains of 3.1% and 2.0%, respectively.

Beyond accuracy, HSD-DETR preserves a favorable balance between detection quality and computational efficiency. As detailed in Table 3, our model utilizes only 13.7 M parameters and 41.0 GFLOPs, requiring significantly fewer resources than most contemporary networks. Specifically, compared to the RT-DETR-R18 baseline, HSD-DETR reduces the computational overhead by 28.1% (from 57.0 to 41.0 GFLOPs) and the parameter count by 31.2% (from 19.9 M to 13.7 M). This efficiency is similarly evident when compared to Mamba YOLO-B, where our model achieves a 37.2% reduction in parameters and a 17.3% decrease in GFLOPs. Furthermore, while YOLO26m and YOLO11m require higher computational costs, our method decreases parameter volume by over 31% and lowers computational demands by approximately 39%. By leveraging the HCSSNet backbone, the proposed architecture effectively minimizes redundancy without compromising feature representation. Notably, although the HSD-DETR significantly lowers GFLOPs, its FPS (146) is marginally lower than that of the RT-DETR-R18 baseline (174). This observation underscores that theoretical computational complexity does not always translate directly into inference throughput. This phenomenon is primarily attributed to the current hardware support and optimization levels for SSMs. Highly optimized standard convolutions benefit from mature hardware-level parallelism. In contrast, the non-contiguous memory access patterns of state-space scanning introduce additional overhead in practical execution, even though they require fewer FLOPs.

To further assess detection capabilities across various object scales and categories, Table 4 provides a class-wise mAP50 comparison. HSD-DETR demonstrates robust generalization, particularly in challenging remote sensing scenarios. Directly compared to the RT-DETR-R18 baseline, our model achieves marked improvements in categories frequently subject to occlusion or complex backgrounds, such as bus (+5.2%), truck (+4.3%), and tricycle (+3.8%). Moreover, HSD-DETR attains the highest absolute accuracy across all evaluated networks for the people (47.3%), tricycle (35.3%), and bus (65.1%) categories. Despite the significant reduction in computational complexity, the model remains highly proficient in identifying small and dense targets, as evidenced by competitive results in the motor (58.3%), pedestrian (51.8%), and car (84.2%) categories. These findings suggest that the proposed model possesses strong hierarchical feature representation and background interference suppression capabilities, leading to more precise localization of small-scale objects in complex remote sensing environments.

4.5. Ablation Experiment

To systematically evaluate the effectiveness of the proposed components in the HSD-DETR architecture, comprehensive ablation studies are conducted on the VisDrone2019 dataset. The evaluation comprises three primary phases: first, a comparative analysis of different backbone networks to validate the efficiency and theoretical advantages of the proposed hybrid lightweight HCSSNet; second, a comparison of different IoU loss functions to verify the localization precision of the proposed RF-MPDIoU; and third, a comprehensive ablation study to assess both the individual contributions and the synergistic impact of all proposed functional modules.

4.5.1. Comparison of Different Backbones

Feature extraction is a critical but challenging step for small-scale object detection in remote sensing imagery. To validate the efficacy of the proposed hybrid architecture, its performance is compared against several mainstream CNNs and SSMs, including ResNet-18 [15], LSKNet-S [16], MambaOut [44], and MobileMamba [22]. The quantitative results are summarized in Table 5.

As shown in Table 5, CNNs and SSMs both exhibit distinct structural limitations when applied to remote sensing small objects. The standard ResNet-18 backbone provides a strong baseline with an mAP50 of 45.3% and an

{A P}_{s}

of 17.5%, but incurs substantial computational overhead (19.9 M parameters and 57.0 GFLOPs). Replacing it with LSKNet-S increases the computational cost (60.9 GFLOPs) while paradoxically degrading detection accuracy (mAP50 drops to 43.5%), indicating that simply enlarging convolutional kernels struggles to efficiently model global context without introducing severe background noise. Furthermore, while the Mamba-based lightweight models (MambaOut and MobileMamba) successfully alleviate the computational burden, this reduction comes at the expense of severe performance degradation, with their mAP50 scores dropping to 43.0% and 42.9%, respectively. Specifically, the reliance of SSMs on flattening 2D images into 1D sequences inadvertently shatters the 2D Euclidean spatial proximity, causing fatal structural disruption to the already fragile spatial representations of small objects.

In contrast, the proposed hybrid architecture, HCSSNet, achieves a superior trade-off between detection accuracy and computational efficiency. It aggressively reduces the network size to merely 13.9 M parameters and 40.4 GFLOPs, representing the most lightweight configuration among all evaluated backbones. Despite this massive reduction in complexity, HCSSNet attains highly competitive overall performance across different IoU thresholds. Specifically, it achieves an mAP50 of 45.0% and an mAP50-95 of 27.1%, outperforming all other lightweight alternatives (which score between 25.7% and 26.0%) under strict localization criteria. More importantly, it yields improved accuracy for small objects, raising the specific

{A P}_{s}

metric to 17.8%, demonstrating highly competitive performance among the evaluated backbones. These results suggest theoretical advantages of the hybrid paradigm: by coupling the SS2D mechanism with a parallel convolution branch, HCSSNet utilizes convolutions as spatial anchors to preserve high-frequency local geometric details, while employing SS2D to efficiently aggregate low-frequency global contextual information. This orthogonal complementarity mathematically helps suppress complex remote sensing backgrounds while preserving the fine-grained features of micro-targets, mitigating certain inherent bottlenecks of CNNs and SSMs.

4.5.2. Comparison of Different IoU Loss Functions

Bounding box regression for small objects in remote sensing imagery is exceptionally challenging due to the extreme sensitivity of their positions on discrete pixel grids and the severe imbalance between simple and challenging samples during training. To verify the superiority of the proposed RF-MPDIoU loss in addressing these issues, we conducted a comparative experiment using the RT-DETR-R18 baseline under identical experimental settings. We compared RF-MPDIoU with standard geometric overlap metrics (CIoU, DIoU, GIoU), as well as specific loss functions tailored for small object detection, including NWD, MPDIoU, and Focal MPDIoU. The quantitative results are summarized in Table 6.

As demonstrated in Table 6, the proposed RF-MPDIoU achieves the highest performance across all key metrics, reaching an mAP50 of 45.8% and an

{A P}_{s}

of 18.0%. Notably, NWD, which is widely utilized for small targets, yields the lowest performance (44.7% mAP50 and 17.1%

{A P}_{s}

) in this scenario. While NWD employs Gaussian distribution modeling to mitigate the sensitivity of discrete pixel grids, this abstract representation inevitably compromises the strict geometric alignment of bounding boxes. Specifically, RF-MPDIoU incorporates a corner distance penalty to maintain strong geometric constraints on the actual boundaries, thereby providing a more precise and stable localization signal for small targets in cluttered backgrounds.

Crucially, progressive performance gains are clearly observed among the original MPDIoU (45.5% mAP50), Focal MPDIoU (45.6% mAP50), and RF-MPDIoU (45.8% mAP50). While the original MPDIoU effectively utilizes corner distance minimization to constrain boundaries, it struggles with the severe sample imbalance, where gradients are dominated by a vast number of simple samples, suppressing the effective optimization of sparse and challenging small objects. Incorporating a focal mapping mechanism (Focal MPDIoU) partially mitigates this issue by mapping IoU values into a truncated interval. RF-MPDIoU goes a step further by applying a non-linear rational function transformation. This synergistic combination of boundary-preserving corner distances and rational-focal gradient regulation is the key to our model’s success. RF-MPDIoU employs a rational-focal regulated corner distance constraint to significantly enhance small object localization accuracy.

4.5.3. Progressive Ablation Study of Proposed Modules

A detailed ablation study is conducted to evaluate both the individual and synergistic influence of the proposed architectural improvements in HSD-DETR. As summarized in Table 7, the evaluation first examines the independent contribution of each component when integrated into the standard RT-DETR-R18 baseline, followed by a step-by-step progressive integration.

The independent analysis of the four proposed modules demonstrates their respective effectiveness. Integrating the lightweight HCSSNet backbone drastically reduces the network size (−6.0 M parameters and −16.6 GFLOPs) while successfully improving the

{A P}_{s}

to 17.8%. When evaluated independently on the baseline, the SPDMixer, DSAIFI, and RF-MPDIoU modules all demonstrate consistent performance improvements, each yielding mAP50 gains between 0.3% and 0.5%. Notably, the RF-MPDIoU loss shows particular efficacy for small targets, independently enhancing

{A P}_{s}

by 0.5%.

The progressive integration of these modules further underscores their synergistic benefits. Starting from the efficient HCSSNet-based foundation, incorporating the SPDMixer module yields a notable improvement in mAP50 to 45.5% and

{A P}_{s}

to 18.2%. This validates that the pixel-level lossless downsampling mechanism successfully preserves critical fine-grained information that is typically discarded by traditional large-stride convolutions. Next, integrating the DSAIFI module produces a significant increase in mAP50 to 46.2% (a 0.7 percentage point gain over the previous step) while further compressing the overall parameter count to 13.7 M. This emphasizes the effectiveness of dynamically selecting informative spatial tokens for sparse interaction, which filters out background noise and reduces parameter redundancy. Finally, the addition of the RF-MPDIoU regression loss function establishes the complete HSD-DETR architecture, elevating mAP50 to 46.6% and the

{A P}_{s}

metric to 18.8%. The substantial boost in precision is accompanied by a surge in recall (

R

) to 44.6%. This indicates that the dynamic gradient regulation and boundary constraints in RF-MPDIoU effectively tighten bounding box regression. Consequently, this leads to more precise localization in complex environments.

The step-by-step improvements in the ablation studies validate the conceptual soundness and practical utility of each proposed module. The final architecture demonstrates superior detection performance while maintaining a minimal computational footprint.

4.6. Generalization Test

To further evaluate the robustness, generalizability, and specific module contributions of the proposed method across diverse remote sensing environments, we conducted comprehensive extended experiments. These comprise extended ablation studies to verify individual components and generalization comparative analyses.

We utilized three prominent remote sensing datasets: the Remote Sensing Object Detection (RSOD) [45], the Satellite Imagery Multi-vehicle Dataset (SIMD) [46], and the Northwestern Polytechnical University Very-High-Resolution 10-class (NWPU VHR-10) [47]. RSOD contains 976 images and 6950 instances across four categories (see Figure S2 in Supplementary Materials), challenging the model’s adaptability to varied backgrounds. SIMD features 5000 high-resolution images encompassing 15 vehicle categories (see Figure S3 in Supplementary Materials), characterized by densely distributed, small objects. NWPU VHR-10 comprises 650 optical images with 3651 instances across 10 categories (see Figure S1 in Supplementary Materials). All datasets were randomly partitioned into training, validation, and testing sets using a strict 7:2:1 ratio. This diverse selection comprehensively assesses the model’s performance across various core challenges in remote sensing small object detection. For consistency, the primary experimental hyperparameters and training protocols were strictly applied across all datasets.

4.6.1. Extended Ablation Experiments

To rigorously validate the individual and synergistic contributions of the proposed modules (HCSSNet, SPDMixer, DSAIFI, and RF-MPDIoU), an extended ablation study was conducted on the RSOD dataset, as summarized in Table 8.

The baseline RT-DETR-R18 initially achieves a mAP50 of 94.3% and a small object average precision

{A P}_{s}

of 37.8%. The independent integration of each module yields consistent performance gains. For instance, incorporating the RF-MPDIoU module alone increases mAP50 to 95.3% and

{A P}_{s}

to 38.6%, highlighting its effectiveness in precise bounding box regression. Furthermore, the joint application of multiple modules strengthens the detection capability. Integrating HCSSNet, SPDMixer, and DSAIFI raises the mAP50 to 96.5% and

{A P}_{s}

to 40.1%. Finally, the complete HSD-DETR architecture achieves optimal performance with an mAP50 of 97.3% and an

{A P}_{s}

of 40.5%. This represents an improvement of 3.0 percentage points in mAP50 and 2.7 percentage points in

{A P}_{s}

over the baseline, demonstrating strong synergy among the modules in feature extraction, cross-scale fusion, and small object localization.

4.6.2. Extended Generalization Comparative Experiments

To assess generalizability, comparative analyses between the baseline RT-DETR-R18 and HSD-DETR were conducted on the RSOD, SIMD, and NWPU VHR-10 datasets. The quantitative results are presented in Table 9.

As detailed in Table 9, the proposed HSD-DETR comprehensively outperforms the baseline across all three datasets, securing substantial gains in precision (

P

), recall (

R

), mAP50, and mAP50-95. Notably, on the RSOD dataset, HSD-DETR achieves improvements of 3.0 and 3.2 percentage points in mAP50 and mAP50-95, respectively, alongside simultaneous gains in precision (from 93.9% to 95.3%) and recall (from 90.3% to 94.7%). In the SIMD, which contains densely distributed small objects, the method exhibits robust performance, delivering a solid 2.8 percentage point increase in mAP50 (from 75.7% to 78.5%). Similarly, on the NWPU VHR-10 dataset, HSD-DETR elevates mAP50 to 91.5% and achieves a 2.7 percentage point leap in precision (from 85.7% to 88.4%). These improvements confirm the model’s robustness and verify its effectiveness in reinforcing small object feature representations.

4.7. Visual Analysis

To visually evaluate the detection effectiveness of our proposed HSD-DETR across various complex RS scenarios, Figure 6 presents a comparative visualization of bounding boxes against the baseline RT-DETR-R18. In congested city intersections with dense target distributions, the baseline outputs are fragmented and overlapping boxes. In contrast, HSD-DETR demonstrates improved detection completeness by notably reducing redundant predictions and yielding higher confidence scores for densely packed vehicles. Similarly, on sports courts with scattered, minute targets, the baseline exhibits frequent missed detections. Our model, however, detects these isolated targets more effectively with increased confidence scores while maintaining a clear background. Furthermore, in an unevenly illuminated night street, the baseline tends to miss targets located in shadowed regions. Conversely, HSD-DETR demonstrates a higher recall rate for pedestrians with more precise bounding box contours.

To further observe the feature-level activation differences between the models, we visualize model heatmaps using gradient-weighted class activation mapping (Grad-CAM) [48]. Figure 7 compares our HSD-DETR with the RT-DETR-R18 baseline across typical VisDrone validation scenarios. In these visualizations, warmer colors indicate higher activation. Analyzing these diverse scenes reveals clear observational differences in how the models allocate attention. Specifically, for distant targets as illustrated in Figure 7a, HSD-DETR maintains robust and continuous activations along the entire road depth, providing more accurate coverage for remote vehicles compared to the baseline’s diminishing attention. In the sports court scenario, our model generates pinpoint activation spots precisely aligned with individual minute targets, rather than the dispersed activation areas produced by the baseline. Furthermore, in cluttered scenes like the street market in Figure 7c, HSD-DETR yields distinct activations concentrated predominantly on true targets, with significantly reduced background activation. Finally, as evidenced in Figure 7d, the model isolates individual objects with compact, focused activations instead of merging them into loose, overlapping regions for densely clustered instances.

5. Discussion

Combined experimental results, ranging from backbone comparisons to cross-dataset evaluations, confirm that HSD-DETR successfully achieves high detection accuracy with low computational overhead for remote sensing imagery.

A key contribution of this research is the development of a novel and highly efficient hybrid SSM–convolution fusion method. As shown in Table 5, HCSSNet achieves a superior accuracy-efficiency trade-off compared to mainstream backbones, suggesting an enhanced capacity to extract discriminative features from cluttered scenes. By mitigating the structural limitations of CNNs and SSMs, this hybrid approach reliably captures the fine-grained cues of small targets while maintaining a low parameter count.

Ablation studies (Table 7) confirm the complementary roles of the proposed modules, with SPDMixer, DSAIFI, and RF-MPDIoU yielding progressive mAP improvements. These gains align with the intended theoretical mechanisms. Specifically, SPDMixer preserves spatial details during downsampling, DSAIFI filters background noise via sparse feature interaction, and RF-MPDIoU tightens bounding box regression. Furthermore, Grad-CAM visualizations (Figure 7) provide intuitive support. Compared to the baseline’s diffuse heatmaps, HSD-DETR demonstrates highly concentrated responses on small objects, consistent with the quantitative enhancements in feature discriminability.

Compared with representative mainstream detectors, HSD-DETR secures highly competitive detection capabilities. It outperforms mainstream YOLO and DETR variants across diverse categories, particularly in heavily occluded or small-target classes (e.g., tricycle and motor). Crucially, this accuracy is achieved without brute-force parameter scaling, utilizing only 13.7 M parameters and 41.0 GFLOPs. This favorable trade-off stems from the synergistic design of the hybrid backbone and dynamic sparse attention. Furthermore, extended evaluations on datasets like RSOD, SIMD, and NWPU VHR-10 confirm that this efficiency translates into strong generalizability across varied remote sensing environments.

In summary, by achieving competitive accuracy with a minimal computational footprint, HSD-DETR establishes an efficient paradigm for Earth observation, demonstrating that our proposed hybrid SSM–convolution fusion architecture achieves an effective unification of high detection accuracy and low computational overhead. The seamless integration of these high-fidelity and sparse-aware mechanisms establishes a scalable foundation for next-generation efficient remote sensing systems.

6. Conclusions

This paper proposes HSD-DETR, a lightweight object detection framework based on the RT-DETR architecture, designed to address the accuracy and efficiency bottlenecks of detecting dense, small objects in complex remote sensing imagery. To overcome the inherent defects of standard architectures, we construct an end-to-end framework integrating feature preservation, efficient extraction, background purification, and refined localization. At the heart of this framework lies the HCSS-Fusion module, which forms the lightweight HCSSNet backbone. It efficiently extracts features by coupling local perception with global selective scanning. Building upon this core, the framework integrates three complementary components. First, the SPDMixer performs pixel-level lossless reorganization to prevent fine-grained detail loss during downsampling. Second, the DSAIFI module utilizes dynamic sparse attention in the encoder to suppress redundant background noise. Finally, the RF-MPDIoU loss function employs a non-linear mapping mechanism to optimize sample weighting, achieving precise geometric alignment for minute targets.

Experimental results on VisDrone2019 and other remote sensing datasets demonstrate that HSD-DETR outperforms the baseline RT-DETR-R18 and other mainstream models. It significantly improves mAP while drastically reducing parameter count and GFLOPs, validating the effectiveness of our lightweight design in balancing performance and computational complexity.

Despite these promising results, HSD-DETR still exhibits certain limitations under extreme conditions:

(1): Generalization bottlenecks under distribution shifts: In complex real-world scenarios, small targets are extremely sparse and often submerged in massive spatial redundancy. Under severe cross-domain distribution shifts (e.g., adverse weather, illumination variations, or sensor differences), the model’s ability to precisely decouple these minute features from backgrounds degrades. This limitation can occasionally cause missed detections or false positives in highly clustered or low-visibility regions.
(2): Edge deployment overhead: Although the theoretical complexity is low, the HCSS-Fusion module relies on a hybrid architecture combining local convolutions and SS2D. Deploying this custom structure on standard edge devices may encounter hardware-level operator compatibility issues, requiring further low-level optimization to fully realize theoretical inference acceleration.

Future work will address these challenges through targeted optimization. First, we plan to explore advanced domain adaptation strategies and leverage large-scale remote sensing foundation models. Fine-tuning with robust spatial-semantic priors will enhance generalization and robustness against cross-domain shifts. Second, to overcome the massive spatial redundancy inherent in remote sensing imagery, we aim to introduce dynamic inference mechanisms, such as adaptive token routing or dynamic spatial sparsification. These approaches will intelligently bypass redundant background patches, allocating computational resources strictly to salient target regions. Finally, we will focus on developing hardware-friendly operator fusion schemes to bridge the gap between theoretical efficiency and practical edge deployments.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/rs18101577/s1, Figure S1: Data distribution and statistics of the NWPU VHR-10 dataset; Figure S2: Data distribution and statistics of the RSOD dataset; Figure S3: Data distribution and statistics of the SIMD. For all figures, (a) Category distribution; (b) Bounding box size and shape distribution; (c) Spatial distribution of object centers; (d) Distribution of object width and height.

Author Contributions

Conceptualization, J.X. and C.W.; methodology, J.X.; software, J.X.; validation, W.L., Y.Z. and R.T.; formal analysis, J.X.; investigation, J.X.; resources, C.W.; data curation, W.L. and Y.Z.; writing—original draft preparation, J.X.; writing—review and editing, J.X. and C.W.; visualization, J.X. and R.T.; supervision, C.W.; project administration, C.W.; funding acquisition, C.W. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Key Research and Development Program of China under Grant 2023YFC3321601, and in part by the Scientific Research Project of Shandong University–Weihai Research Institute of Industrial Technology under Grant 0006202210020011.

Data Availability Statement

The HSD-DETR code is available at https://github.com/JinyuXu-SDU/HSD-DETR (accessed on 8 May 2026). Publicly available datasets were analyzed in this study. These data can be found at the following links: VisDrone Dataset (https://github.com/VisDrone/VisDrone-Dataset, accessed on 20 January 2026), RSOD Dataset (https://github.com/RSIA-LIESMARS-WHU/RSOD-Dataset-, accessed on 26 January 2026), SIMD (https://aistudio.baidu.com/datasetdetail/144979, accessed on 26 January 2026), and NWPU VHR-10 Dataset (https://opendatalab.org.cn/OpenDataLab/NWPU_VHR-10/tree/main, accessed on 26 January 2026).

Acknowledgments

During the preparation of this manuscript, the author(s) used Gemini 3 Flash for the purposes of text translation and language polishing. The authors have reviewed and edited the output and take full responsibility for the content of this publication.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Abbreviations

The following abbreviations are used in this manuscript:

AGSA	adaptive gated sparse attention
AP	average precision
${A P}_{s}$	average precision for small objects
${A P}_{m}$	average precision for medium objects
CIoU	complete intersection over union
CNN	convolutional neural network
COCO	common objects in context
DEIM	detection transformer with improved matching
DETR	detection transformer
D-FINE	fine-grained distribution refinement
DIoU	distance intersection over union
DSAIFI	dynamic sparse adaptive intra-scale feature interaction
FFN	feed-forward network
FN	false negative
FP	false positive
FP32	32-bit floating point
FPS	frames per second
GELU	Gaussian error linear unit
GFLOPs	giga floating-point operations
GIoU	generalized intersection over union
Grad-CAM	gradient-weighted class activation mapping
HCSS	hybrid convolution and selective scanning
HCSS-Fusion	hybrid convolution and selective scanning fusion
HCSSNet	hybrid convolution and selective scanning network
HSD-DETR	hybrid scale dynamic detection transformer
Inner-IoU	inner intersection over union
IoU	intersection over union
LN	layer normalization
LSK	large selective kernel
LSKNet	large selective kernel network
mAP	mean average precision
MLP	multi-layer perceptron
MPDIoU	minimum point distance intersection over union
NMS	non-maximum suppression
NWD	normalized Wasserstein distance
NWPU	Northwestern Polytechnical University
Params	parameters
RF-MPDIoU	rational-focal minimum point distance intersection over union
RS	remote sensing
RSOD	remote sensing object detection
RT-DETR	real-time detection transformer
SIMD	satellite imagery multi-vehicle dataset
SiLU	sigmoid linear unit
SNR	signal-to-noise ratio
SPD-Conv	space-to-depth convolution
SPDMixer	space-to-depth mixer
SS2D	2D selective scan
SSMs	state space models
TP	true positive
UAV	unmanned aerial vehicle
VHR-10	very-high-resolution 10-class dataset
Vim	vision Mamba
YOLO	you only look once

References

Yuan, Q.; Shen, H.; Li, T.; Li, Z.; Li, S.; Jiang, Y.; Xu, H.; Tan, W.; Yang, Q.; Wang, J.; et al. Deep learning in environmental remote sensing: Achievements and challenges. Remote Sens. Environ. 2020, 241, 111716. [Google Scholar] [CrossRef]
Zhang, X.; Zhang, T.; Wang, G.; Zhu, P.; Tang, X.; Jia, X.; Jiao, L. Remote sensing object detection meets deep learning: A metareview of challenges and advances. IEEE Geosci. Remote Sens. Mag. 2023, 11, 8–44. [Google Scholar] [CrossRef]
Yu, C.; Yang, H.; Ma, L.; Yang, J.; Jin, Y.; Zhang, W.; Wang, K.; Zhao, Q. Deep learning-based change detection in remote sensing: A comprehensive review. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 24415–24437. [Google Scholar] [CrossRef]
Gui, S.; Song, S.; Qin, R.; Tang, Y. Remote sensing object detection in the deep learning era—A review. Remote Sens. 2024, 16, 327. [Google Scholar] [CrossRef]
Cheng, G.; Yuan, X.; Yao, X.; Yan, K.; Zeng, Q.; Xie, X.; Han, J. Towards large-scale small object detection: Survey and benchmarks. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 13467–13488. [Google Scholar] [CrossRef]
Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. Commun. ACM 2017, 60, 84–90. [Google Scholar] [CrossRef]
Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the 16th European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020; pp. 213–229. [Google Scholar]
Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. DETRs beat YOLOs on real-time object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 16965–16974. [Google Scholar]
Wang, C.-Y.; Liao, H.-Y.M. YOLOv1 to YOLOv10: The fastest and most accurate real-time object detection systems. APSIPA Trans. Signal Inf. Process. 2024, 13, 1–38. [Google Scholar] [CrossRef]
Chu, X.; Zheng, A.; Zhang, X.; Sun, J. Detection in crowded scenes: One proposal, multiple predictions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 12211–12220. [Google Scholar]
Aleissaee, A.A.; Kumar, A.; Anwer, R.M.; Khan, S.; Cholakkal, H.; Xia, G.-S.; Khan, F.S. Transformers in remote sensing: A survey. Remote Sens. 2023, 15, 1860. [Google Scholar] [CrossRef]
Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable DETR: Deformable transformers for end-to-end object detection. arXiv 2020, arXiv:2010.04159. [Google Scholar]
Liu, S.; Li, F.; Zhang, H.; Yang, X.; Qi, X.; Su, H.; Zhu, J.; Zhang, L. DAB-DETR: Dynamic anchor boxes are better queries for DETR. In Proceedings of the 10th International Conference on Learning Representations (ICLR), Virtual, Online, 25–29 April 2022. [Google Scholar]
Meng, D.; Chen, X.; Fan, Z.; Zeng, G.; Li, H.; Yuan, Y.; Sun, L.; Wang, J. Conditional DETR for fast training convergence. In Proceedings of the 18th IEEE/CVF International Conference on Computer Vision (ICCV), Virtual, Online, 11–17 October 2021; pp. 3631–3640. [Google Scholar]
He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
Li, Y.; Hou, Q.; Zheng, Z.; Cheng, M.M.; Yang, J.; Li, X. Large selective kernel network for remote sensing object detection. In Proceedings of the 19th IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 2–6 October 2023; pp. 16748–16759. [Google Scholar]
Xia, Z.; Pan, X.; Song, S.; Li, L.E.; Huang, G. Vision transformer with deformable attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 19–24 June 2022; pp. 4784–4793. [Google Scholar]
Gu, A.; Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. arXiv 2023, arXiv:2312.00752. [Google Scholar] [CrossRef]
Zhu, L.; Liao, B.; Zhang, Q.; Wang, X.; Liu, W.; Wang, X. Vision Mamba: Efficient visual representation learning with bidirectional state space model. In Proceedings of the 41st International Conference on Machine Learning (ICML), Vienna, Austria, 21–27 July 2024; pp. 62429–62442. [Google Scholar]
Liu, Y.; Tian, Y.; Zhao, Y.; Yu, H.; Xie, L.; Wang, Y.; Ye, Q.; Jiao, J.; Liu, Y. VMamba: Visual state space model. In Proceedings of the 38th Conference on Neural Information Processing Systems (NeurIPS), Vancouver, BC, Canada, 10–15 December 2024; pp. 103031–103063. [Google Scholar]
Wang, Z.; Li, C.; Xu, H.; Zhu, X.; Li, H. Mamba YOLO: A simple baseline for object detection with state space model. In Proceedings of the 39th AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 25 February–4 March 2025; pp. 8205–8213. [Google Scholar]
He, H.; Zhang, J.; Cai, Y.; Chen, H.; Hu, X.; Gan, Z.; Wang, Y.; Wang, C.; Wu, Y.; Xie, L. MobileMamba: Lightweight multi-receptive visual Mamba network. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 10–17 June 2025; pp. 4497–4507. [Google Scholar]
Yang, G.; Lei, J.; Tian, H.; Feng, Z.; Liang, R. Asymptotic feature pyramid network for labeling pixels and regions. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 7820–7829. [Google Scholar] [CrossRef]
Li, C.; Li, L.; Jiang, H.; Weng, K.; Geng, Y.; Li, L.; Ke, Z.; Li, Q.; Cheng, M.; Nie, W.; et al. YOLOv6: A single-stage object detection framework for industrial applications. arXiv 2022, arXiv:2209.02976. [Google Scholar] [CrossRef]
Li, Q.; Shen, L.; Guo, S.; Lai, Z. WaveCNet: Wavelet integrated CNNs to suppress aliasing effect for noise-robust image classification. IEEE Trans. Image Process. 2021, 30, 7074–7089. [Google Scholar] [CrossRef]
Sunkara, R.; Luo, T. No more strided convolutions or pooling: A new CNN building block for low-resolution images and small objects. In Proceedings of the 22nd Joint European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD), Turin, Italy, 18–22 September 2023; pp. 443–459. [Google Scholar]
Hassani, A.; Shi, H. Dilated neighborhood attention transformer. arXiv 2022, arXiv:2209.15001. [Google Scholar]
Zhu, L.; Wang, X.; Ke, Z.; Zhang, W.; Lau, R. BiFormer: Vision transformer with bi-level routing attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 10323–10333. [Google Scholar]
Han, D.; Ye, T.; Han, Y.; Xia, Z.; Pan, S.; Wan, P.; Song, S.; Huang, G. Agent attention: On the integration of softmax and linear attention. In Proceedings of the 18th European Conference on Computer Vision (ECCV), Milan, Italy, 29 September–4 October 2024; pp. 124–140. [Google Scholar]
Yun, S.; Ro, Y. SHViT: Single-head vision transformer with memory efficient macro design. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 5756–5767. [Google Scholar]
Zou, S.; Zou, Y.; Li, J.; Gao, G.; Qi, G.-J. Cross paradigm representation and alignment transformer for image deraining. In Proceedings of the 33rd ACM International Conference on Multimedia (ACM MM), Dublin, Ireland, 27–31 October 2025; pp. 8448–8457. [Google Scholar]
Rezatofighi, H.; Tsoi, N.; Gwak, J.; Sadeghian, A.; Reid, I.; Savarese, S. Generalized intersection over union: A metric and a loss for bounding box regression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 658–666. [Google Scholar]
Zheng, Z.; Wang, P.; Liu, W.; Li, J.; Ye, R.; Ren, D. Distance-IoU loss: Faster and better learning for bounding box regression. In Proceedings of the 34th AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; pp. 12993–13000. [Google Scholar]
Wang, J.; Xu, C.; Yang, W.; Yu, L. A normalized Gaussian Wasserstein distance for tiny object detection. arXiv 2021, arXiv:2110.13389. [Google Scholar]
Zhang, H.; Xu, C.; Zhang, S. Inner-IoU: More effective intersection over union loss with auxiliary bounding box. arXiv 2023, arXiv:2311.02877. [Google Scholar]
Ma, S.; Xu, Y. MPDIoU: A loss for efficient and accurate bounding box regression. arXiv 2023, arXiv:2307.07662. [Google Scholar] [CrossRef]
Zhu, P.; Wen, L.; Du, D.; Bian, X.; Fan, H.; Hu, Q.; Ling, H. Detection and tracking meet drones challenge. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 7380–7399. [Google Scholar] [CrossRef]
Jocher, G.; Chaurasia, A.; Qiu, J. Ultralytics YOLOv8. Available online: https://github.com/ultralytics/ultralytics (accessed on 20 January 2026).
Jocher, G.; Qiu, J. Ultralytics YOLO11. Available online: https://github.com/ultralytics/ultralytics (accessed on 20 January 2026).
Jocher, G.; Qiu, J. Ultralytics YOLO26. Available online: https://github.com/ultralytics/ultralytics (accessed on 20 January 2026).
Peng, Y.; Li, H.; Wu, P.; Zhang, Y.; Sun, X.; Wu, F. D-FINE: Redefine regression task of DETRs as fine-grained distribution refinement. In Proceedings of the 13th International Conference on Learning Representations (ICLR), Singapore, 24–28 April 2025; pp. 58148–58164. [Google Scholar]
Huang, S.; Lu, Z.; Cun, X.; Yu, Y.; Zhou, X.; Shen, X. DEIM: DETR with improved matching for fast convergence. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 11–15 June 2025; pp. 15162–15171. [Google Scholar]
Lv, W.; Zhao, Y.; Chang, Q.; Huang, K.; Wang, G.; Liu, Y. RT-DETRv2: Improved baseline with bag-of-freebies for real-time detection transformer. arXiv 2024, arXiv:2407.17140. [Google Scholar]
Yu, W.; Wang, X. MambaOut: Do we really need Mamba for vision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 11–15 June 2025; pp. 4484–4496. [Google Scholar]
Long, Y.; Gong, Y.; Xiao, Z.; Liu, Q. Accurate object localization in remote sensing images based on convolutional neural networks. IEEE Trans. Geosci. Remote Sens. 2017, 55, 2486–2498. [Google Scholar] [CrossRef]
Haroon, M.; Shahzad, M.; Fraz, M.M. Multisized object detection using spaceborne optical imagery. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2020, 13, 3032–3046. [Google Scholar] [CrossRef]
Cheng, G.; Han, J.; Zhou, P.; Guo, L. Multi-class geospatial object detection and geographic image classification based on collection of part detectors. ISPRS J. Photogramm. Remote Sens. 2014, 98, 119–132. [Google Scholar] [CrossRef]
Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In Proceedings of the 16th IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 618–626. [Google Scholar]

Figure 1. Overview of the proposed HSD-DETR architecture. (a) The macroscopic network structure comprising the Backbone, Neck, and Head components for multi-scale feature extraction and detection. Details of the (b) HCSS Layer and the (c) Attention module are illustrated on the right.

Figure 2. Architecture of the HCSS-Fusion module.

Figure 3. Structure of the SPDMixer module.

Figure 4. Overall architecture of the proposed DSAIFI module and the detailed structure of the AGSA mechanism.

Figure 5. Visualization of the VisDrone2019 training dataset. (a) Category distribution. (b) Bounding box size and shape distribution. (c) Spatial distribution of object centers. (d) Distribution of object width and height.

Figure 6. Qualitative comparison of detection results between RT-DETR-R18 (Baseline) and our HSD-DETR across diverse scenarios: (a) dense intersection, (b) sports court, and (c) night street.

Figure 7. Visual comparison of feature heatmaps and detection results among different models in complex remote sensing scenarios: (a) urban road, (b) sports courts, (c) street market, and (d) city intersection.

Table 1. Experimental environment configuration.

Environment	Configuration
OS	Ubuntu 20.04 LTS
CPU	Intel Xeon Gold 5218R (20-core)
GPU	NVIDIA GeForce RTX 3090 (24 GB)
Language	Python 3.10
Framework	PyTorch 2.2.2, Ultralytics 8.0.201

Table 2. Training parameter settings.

Parameter	Value
Optimizer	AdamW
Initial Learning Rate	$1 \times 10^{- 2}$
Momentum	0.9
Weight Decay	$1 \times 10^{- 4}$

Table 3. Comparison of the proposed HSD-DETR and other mainstream object detection models on the VisDrone2019 dataset.

Model	mAP50 (%)	mAP50–95 (%)	${A P}_{s}$ (%)	${A P}_{m}$ (%)	Params (M)	GFLOPs	FPS
YOLOv8m [38]	42.1	25.7	14.5	32.6	25.8	78.7	181
YOLO11m [39]	43.5	26.0	15.0	33.3	20.0	67.7	180
YOLO26m [40]	44.6	26.3	15.7	34.3	20.4	67.9	179
Mamba YOLO-B [21]	42.5	25.7	14.5	32.5	21.8	49.6	108
D-FINE-N [41]	30.5	16.9	10.1	24.5	3.7	7.1	378
D-FINE-S [41]	39.4	23.2	15.6	32.3	10.2	24.9	239
DEIM-D-FINE-N [42]	30.4	16.9	10.9	24.6	3.7	7.1	377
DEIM-D-FINE-S [42]	40.4	23.9	16.6	32.8	10.2	24.9	238
RT-DETRv2-R18 [43]	45.9	27.9	18.5	36.5	19.9	59.9	169
RT-DETR-R18 [8]	45.3	27.6	17.5	35.9	19.9	57.0	174
HSD-DETR	46.6	28.2	18.8	37.0	13.7	41.0	146

Table 4. Class-wise detection performance (mAP50) comparison of different models on the VisDrone2019 dataset. The object categories include people (peo), pedestrian (ped), tricycle (tri), awning-tricycle (awi), and others.

Model	All	Ped	Peo	Bicycle	Car	Van	Truck	Tri	Awi	Bus	Motor
YOLOv8m [38]	42.1	45.2	34.9	16.6	80.8	47.8	39.8	31.7	17.0	60.4	47.2
YOLO11m [39]	43.5	45.8	34.9	17.4	81.4	49.7	42.1	32.8	18.8	63.6	48.2
YOLO26m [40]	44.6	53.0	39.7	16.7	83.1	46.9	41.3	32.2	18.0	63.0	52.2
Mamba YOLO-B [21]	42.5	46.2	35.4	15.4	81.3	48.2	40.7	31.0	17.0	62.0	48.3
D-FINE-N [41]	30.5	30.3	30.3	8.3	73.2	34.5	20.4	19.6	9.0	38.8	40.8
D-FINE-S [41]	39.4	42.6	39.5	14.9	80.0	43.8	29.4	26.9	16.3	50.2	50.8
DEIM-D-FINE-N [42]	30.4	30.2	31.2	7.7	73.2	34.2	18.6	20.0	10.1	37.6	40.9
DEIM-D-FINE-S [42]	40.4	41.6	40.8	14.8	80.2	44.2	29.9	28.1	18.7	53.8	51.6
RT-DETRv2-R18 [43]	45.9	52.9	45.4	21.9	84.2	51.0	35.2	33.3	18.5	57.7	58.8
RT-DETR-R18 [8]	45.3	53.2	47.1	18.8	85.1	49.1	34.7	31.5	15.3	59.9	58.0
HSD-DETR	46.6	51.8	47.3	21.2	84.2	48.1	39.0	35.3	16.3	65.1	58.3

Table 5. Performance comparison of different backbones on the VisDrone2019 dataset.

Backbone	mAP50 (%)	mAP50–95 (%)	${A P}_{s}$ (%)	${A P}_{m}$ (%)	Params (M)	GFLOPs
ResNet-18 [15]	45.3	27.6	17.5	35.9	19.9	57.0
LSKNet-S [16]	43.5	26.0	16.4	35.0	20.7	60.9
MambaOut [44]	43.0	25.7	16.1	34.8	17.7	47.7
MobileMamba [22]	42.9	25.9	15.3	34.2	17.1	43.5
HCSSNet	45.0	27.1	17.8	35.5	13.9	40.4

Table 6. Performance comparison of different loss functions using RT-DETR-R18 on the VisDrone2019 dataset.

Model	Loss Function	P (%)	R (%)	mAP50 (%)	mAP50–95 (%)	${A P}_{s}$ (%)
RT-DETR-R18	CIoU [33]	59.3	42.6	45.0	27.3	17.2
	DIoU [33]	59.5	42.8	45.1	27.4	17.4
	GIoU [32]	59.8	43.6	45.3	27.6	17.5
	NWD [34]	59.1	42.1	44.7	27.1	17.1
	MPDIoU [36]	59.7	43.3	45.5	27.7	17.7
	Focal MPDIoU	60.1	43.5	45.6	27.8	17.8
	RF-MPDIoU	59.8	43.9	45.8	27.9	18.0

Table 7. Ablation study of the proposed modules in HSD-DETR on the VisDrone2019 dataset.

Model	HCSSNet	SPDMixer	DSAIFI	RF-MPDIoU	P (%)	R (%)	mAP50 (%)	${A P}_{s}$ (%)	Params (M)	GFLOPs
RT-DETR-R18					59.8	43.6	45.3	17.5	19.9	57.0
Ours	√				59.2	42.8	45.0	17.8	13.9	40.4
		√			60.1	43.8	45.6	17.9	19.9	52.8
			√		60.3	43.8	45.8	17.8	19.7	57.1
				√	59.8	43.9	45.8	18.0	19.9	57.0
	√	√			59.6	43.5	45.5	18.2	13.9	40.9
	√		√		59.9	43.7	45.6	18.1	13.7	40.5
	√			√	59.2	43.5	45.5	18.3	13.9	40.4
	√	√	√		60.3	44.2	46.2	18.4	13.7	41.0
		√	√	√	60.5	44.2	46.3	18.5	19.7	52.9
	√	√	√	√	60.3	44.6	46.6	18.8	13.7	41.0

Table 8. Extended ablation study of the proposed modules on the RSOD dataset.

Model	HCSSNet	SPDMixer	DSAIFI	RF-MPDIoU	mAP50 (%)	${A P}_{s}$ (%)
RT-DETR-R18					94.3	37.8
Ours	√				94.9	38.3
		√			95.1	38.2
			√		94.8	38.3
				√	95.3	38.6
	√	√			95.6	39.1
	√		√		95.4	38.8
	√			√	95.9	39.5
	√	√	√		96.5	40.1
		√	√	√	96.1	39.8
	√	√	√	√	97.3	40.5

Table 9. Comparative results on RSOD, SIMD, and NWPU VHR-10 datasets.

Dataset	Model	P (%)	R (%)	mAP50 (%)	mAP50–95 (%)	Params (M)	GFLOPs
RSOD	RT-DETR-R18 [8]	93.9	90.3	94.3	61.1	19.9	57.0
RSOD	HSD-DETR	95.3	94.7	97.3	64.3	13.7	41.0
SIMD	RT-DETR-R18 [8]	74.0	74.8	75.7	59.4	19.9	57.0
SIMD	HSD-DETR	77.5	78.8	78.5	62.1	13.7	41.0
NWPU VHR-10	RT-DETR-R18 [8]	85.7	85.1	89.9	60.0	19.9	57.0
NWPU VHR-10	HSD-DETR	88.4	87.5	91.5	62.8	13.7	41.0

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Xu, J.; Liu, W.; Tian, R.; Wang, C.; Zhang, Y. HSD-DETR: An Efficient Hybrid Scale Dynamic Network for Small Object Detection in Remote Sensing Images. Remote Sens. 2026, 18, 1577. https://doi.org/10.3390/rs18101577

AMA Style

Xu J, Liu W, Tian R, Wang C, Zhang Y. HSD-DETR: An Efficient Hybrid Scale Dynamic Network for Small Object Detection in Remote Sensing Images. Remote Sensing. 2026; 18(10):1577. https://doi.org/10.3390/rs18101577

Chicago/Turabian Style

Xu, Jinyu, Wenwei Liu, Runze Tian, Chengyou Wang, and Yuanbo Zhang. 2026. "HSD-DETR: An Efficient Hybrid Scale Dynamic Network for Small Object Detection in Remote Sensing Images" Remote Sensing 18, no. 10: 1577. https://doi.org/10.3390/rs18101577

APA Style

Xu, J., Liu, W., Tian, R., Wang, C., & Zhang, Y. (2026). HSD-DETR: An Efficient Hybrid Scale Dynamic Network for Small Object Detection in Remote Sensing Images. Remote Sensing, 18(10), 1577. https://doi.org/10.3390/rs18101577

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

HSD-DETR: An Efficient Hybrid Scale Dynamic Network for Small Object Detection in Remote Sensing Images

Highlights

Abstract

1. Introduction

2. Related Work

2.1. Deep Learning for Remote Sensing Object Detection

2.2. State Space Models and Lightweight Feature Extraction

2.3. Optimization for Small Object Detection

3. Materials and Methods

3.1. Overview of HSD-DETR Model

3.2. Hybrid Convolution and Selective Scanning Fusion Module

3.3. Space-to-Depth Mixer Module

3.4. Dynamic Sparse Adaptive Intra-Scale Feature Interaction

3.5. Optimization of Loss Function

4. Experiment

4.1. Datasets

4.2. Implementation Details

4.3. Evaluation Metrics

4.3.1. Detection Accuracy Metrics

4.3.2. Computational Efficiency Metrics

4.4. Comparison with Previous Methods

4.5. Ablation Experiment

4.5.1. Comparison of Different Backbones

4.5.2. Comparison of Different IoU Loss Functions

4.5.3. Progressive Ablation Study of Proposed Modules

4.6. Generalization Test

4.6.1. Extended Ablation Experiments

4.6.2. Extended Generalization Comparative Experiments

4.7. Visual Analysis

5. Discussion

6. Conclusions

Supplementary Materials

Author Contributions

Funding

Data Availability Statement

Acknowledgments

Conflicts of Interest

Abbreviations

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI