Article

RSNet: Compact-Align Detection Head Embedded Lightweight Network for Small Object Detection in Remote Sensing

1 Computer Science and Technology, School of Computer Science and Engineering, Northeastern University, Shenyang 110819, China
2 Key Laboratory of Intelligent Computing in Medical Image, Ministry of Education, Northeastern University, Shenyang 110819, China
3 Department of Process Equipment and Control Engineering, School of Mechanical Engineering and Automation, Northeastern University, Shenyang 110819, China
* Author to whom correspondence should be addressed.
Remote Sens. 2025, 17(12), 1965; https://doi.org/10.3390/rs17121965
Submission received: 11 April 2025 / Revised: 2 June 2025 / Accepted: 4 June 2025 / Published: 6 June 2025
(This article belongs to the Special Issue Efficient Object Detection Based on Remote Sensing Images)

Abstract

Detecting small objects in high-resolution remote sensing images presents persistent challenges due to their limited pixel coverage, complex backgrounds, and dense spatial distribution. These difficulties are further exacerbated by the increasing resolution and volume of remote sensing data, which impose stringent demands on both detection accuracy and computational efficiency and call for lightweight yet robust detection frameworks tailored to small object detection under resource-constrained conditions. To meet these demands, we propose RSNet, a lightweight and highly efficient detection framework designed specifically for small object detection in high-resolution remote sensing images. At its core, RSNet features the Compact-Align Detection Head (CADH), which enhances scale-adaptive localization and improves detection sensitivity for densely distributed small-scale targets. To preserve features during spatial downsampling, we introduce the Adaptive Downsampling module (ADown), which balances computational efficiency with semantic retention. RSNet also integrates GSConv to enable efficient multi-scale feature fusion while minimizing resource consumption. In addition, we adopt a K-fold cross-validation strategy to enhance the stability and credibility of model evaluation under spatially heterogeneous remote sensing data conditions. To evaluate RSNet, we conduct extensive experiments on two widely recognized remote sensing benchmarks, DOTA and NWPU VHR-10. The results show that RSNet achieves mean Average Precision (mAP) scores of 76.4% and 92.1%, respectively, surpassing existing state-of-the-art models in remote sensing object detection. These findings confirm RSNet’s ability to balance high detection accuracy with computational efficiency, making it well suited for real-time applications on resource-constrained platforms such as satellite-based remote sensing systems or edge computing devices used in remote sensing applications.

1. Introduction

In recent years, with artificial intelligence emerging as a focal point of research and application, remote sensing imagery has become an indispensable strategic asset for national security and socio-economic development [1]. It not only plays a critical role in optimizing urban planning, enhancing agricultural productivity, and monitoring environmental shifts, but also exerts a profound influence on safeguarding national security, propelling economic growth, and facilitating public information services [2,3]. As illustrated in Figure 1, which synthesizes multi-geographical scenes (urban, forest, aquatic, agricultural, and mountainous terrains), remote sensing technology, through multi-source data fusion, effectively captures the heterogeneity of surface features, with its multi-scale analytical capacity spanning from macro-environmental dynamics to localized feature characterization.
The remarkable advancements in remote sensing technology, particularly the substantial enhancement in spatial resolution, have made it feasible to discern intricate details that were previously imperceptible. However, these advancements simultaneously introduce unprecedented challenges in image analysis [4]. Unlike conventional natural imagery, remote sensing data presents distinctive processing demands: in high-resolution images, objects are frequently characterized by diminutive pixel footprints, significant intra-class variability, and minimal inter-class variability [5]. The environmental complexity, as exemplified by the spatial–contextual discrepancies in Figure 1, amplifies the susceptibility of detection models to background clutter and noise. Moreover, the inherent complexity of multi-scale and multi-scene environments severely impedes discriminative feature extraction, resulting in elevated rates of false positives (FP) and false negatives (FN). This underscores the critical need for robust frameworks capable of reconciling multi-source data synergies to address the unique challenges of small-object detection in high-resolution remote sensing imagery [6].
Although deep learning models such as Faster R-CNN [7], SSD [8], and YOLO [9] have achieved remarkable success on natural image datasets like COCO [10] and Pascal VOC [11], they often fail to maintain comparable performance when deployed in remote sensing contexts. This discrepancy stems not only from the shift in the data domain, but more critically from the unique challenges associated with small object detection in remote sensing imagery.
Remote sensing imagery is characterized by a high density of small objects, significant scale variation, and complex backgrounds, all of which pose substantial challenges to the development of detection models that are both efficient and accurate [12]. In practical scenarios, the performance of such models is often constrained by insufficient feature representation. Small objects typically exhibit weak textures, low contrast, and limited spatial extent, resulting in ambiguous responses on feature maps and making it difficult for conventional detection heads to effectively capture their discriminative characteristics [13]. Consequently, localization becomes inaccurate, and the miss rate remains high. Moreover, widely adopted downsampling operations, though beneficial for expanding the receptive field and improving computational efficiency, inevitably cause severe loss of spatial details, further degrading the model’s sensitivity to small objects [14]. In addition, current lightweight convolutional structures, while advantageous in terms of model size and inference speed, tend to compromise inter-channel information exchange, thereby limiting the representational capacity when facing complex scenes and multi-scale interference [13]. These structural limitations are often compounded, making it difficult to strike a balance between detection accuracy and inference efficiency in small object detection within remote sensing images and highlighting the need for more adaptive and integrated architectural design.
To address the aforementioned challenges, we propose RSNet (Remote Sensing-oriented Compact Network), a lightweight and efficient detection framework specifically designed to tackle the complex distribution characteristics of small objects in remote sensing imagery. The design of RSNet is guided by three core objectives: (1) enhancing the feature discriminability and detection accuracy of small objects through an adaptive feature alignment mechanism; (2) achieving effective spatial compression while preserving semantic integrity; and (3) strengthening the efficient fusion and representational capacity of multi-scale features under limited computational budgets. Through a set of task-specific architectural innovations, RSNet achieves a balanced optimization between detection performance and inference efficiency, providing a practical and effective solution for small object detection in complex remote sensing scenarios.
Our main contributions include the following.
1. We propose the Compact-Align Detection Head (CADH), a lightweight detection head tailored for small object detection in remote sensing imagery. Our method dynamically adjusts feature processing to accommodate objects of varying sizes and scales, ensuring accurate localization while minimizing computational overhead. By combining adaptive feature alignment with a compact architecture, CADH achieves high detection precision and is well-suited for real-time applications in resource-constrained environments.
2. We introduce ADown, an optimized downsampling module that balances computational efficiency and feature preservation. By integrating preprocessing, channel splitting, and feature fusion, ADown reduces computation while retaining both global semantics and fine-grained details, achieving compact yet expressive feature representations.
3. To address the limited cross-channel interactions in depthwise separable convolution, we integrate GSConv, which combines depthwise and pointwise convolutions. This design enhances feature extraction by improving both efficiency and representation capacity, enabling accurate detection with reduced computational cost.
4. We design a K-fold cross-validation strategy to obtain stable and statistically representative performance evaluations and to assess the robustness of the model under different data partition conditions. By conducting multiple rounds of data partitioning and training across different folds, this strategy effectively reduces the evaluation bias caused by a single partition scheme and improves the reliability of the experimental results.
The structure of this paper is organized as follows: Section 2 provides a comprehensive review of object detection techniques in remote sensing imagery, setting the stage for our investigation. Section 3 describes our proposed methodology in detail. Section 4 rigorously evaluates the effectiveness of our approach. Finally, Section 5 concludes the paper by discussing future research directions.

2. Related Work

The development of small object detection in remote sensing has undergone a significant transition from traditional methods to deep-learning-based techniques. Traditional methods laid the foundation for detection tasks in the early stages; however, with the advancement of deep learning technologies, the field of object detection has undergone substantial changes, particularly in terms of feature extraction and model optimization. This section reviews the core characteristics of traditional detection methods and further explores the breakthroughs brought about by deep learning technologies in remote sensing small object detection, with a particular focus on the advantages and potential applications of lightweight models in real-time detection tasks.

2.1. Traditional Object Detection Methods: Foundations and Limitations

Object detection, as a key research area in computer vision, has undergone a significant evolution from traditional methods to deep-learning-based approaches [15]. Before the advent of deep learning, traditional object detection methods relied on handcrafted feature extraction and shallow machine learning models to identify and localize objects. These methods utilized feature descriptors such as the Histogram of Oriented Gradients (HOG) and the Scale-Invariant Feature Transform (SIFT), coupled with classifiers like Support Vector Machines (SVMs) or similar techniques [16], to perform detection tasks. For example, the Viola–Jones framework, a representative approach, achieved real-time detection capabilities by leveraging simple features and cascade classifiers, demonstrating notable success in tasks like face detection [17].
However, traditional object detection methods exhibit several limitations in both theory and application. First, these approaches are heavily dependent on manually designed features, requiring researchers to possess an in-depth understanding of the objects and their environments. This dependency significantly limits their adaptability and scalability to diverse scenarios [18]. Second, traditional methods predominantly employ shallow models, which restrict their ability to capture multi-scale information and high-level semantic features of objects. These constraints severely hinder their performance in large-scale and highly variable application contexts.
With the rapid development of deep learning technologies, object detection methods based on Convolutional Neural Networks (CNNs) have become the dominant paradigm. These approaches enable end-to-end training, allowing models to autonomously learn hierarchical features from data [19]. This capability not only overcomes the limitations of handcrafted features but also significantly enhances detection accuracy for small objects in complex scenes. In comparison, while traditional methods laid the theoretical foundation for object detection, their inherent technical limitations make them inadequate for meeting the robustness, real-time processing, and precision demands of modern detection tasks.

2.2. Advancements in Object Detection Algorithms

Deep learning has substantially advanced object detection by overcoming the limitations of traditional handcrafted features and shallow models. Contemporary detectors exploit deep architectures to extract robust and discriminative representations, enabling accurate detection across diverse domains such as autonomous driving, surveillance, and remote sensing. Based on architectural paradigms, mainstream detection frameworks can be broadly classified into CNN-based and Transformer-based approaches, each with distinct strengths in accuracy, speed, and contextual modeling.
CNN-based frameworks remain widely adopted due to their efficiency and maturity. Among these, one-stage detectors such as SSD and the YOLO (You Only Look Once) series integrate object localization and classification into a unified pipeline [20]. The YOLO family, from its early versions to recent iterations like YOLOv8, has introduced various architectural improvements, including multi-scale feature aggregation, anchor-free mechanisms, and decoupled detection heads. These advancements contribute to its strong performance in real-time and resource-constrained environments, making it a preferred choice for practical applications.
Meanwhile, Transformer-based detectors have gained traction for their ability to model long-range dependencies and global contexts. DETR (DEtection TRansformer) initiated a new end-to-end paradigm by replacing anchor-based heuristics with self-attention mechanisms [21]. Subsequent developments, such as Deformable DETR and DINO [22], enhanced spatial flexibility and training efficiency through deformable attention and denoising-based learning. RtDetR [23] further introduced region-aware mechanisms to better adapt to structural variations. However, despite these innovations, Transformer-based models often involve high computational complexity and slower inference, limiting their practicality in latency-sensitive scenarios.
In summary, object detection has progressed from handcrafted techniques to algorithmically advanced frameworks with markedly improved accuracy and adaptability. While Transformer-based methods offer superior contextual reasoning, their deployment remains constrained by high resource demands. In contrast, CNN-based detectors—particularly the YOLO series—have achieved a favorable balance among detection performance, inference speed, and deployment flexibility through streamlined architectures and continuous optimization. As such, YOLO remains a practical and widely adopted solution for scenarios requiring both efficiency and reliability.

2.3. Object Detection Techniques in Remote Sensing Imagery

In the realm of remote sensing imagery, traditional object detection methods have been in use for many years, establishing a solid foundation for detecting changes in remote environments. Seo et al. introduced the Self-Pair method, which simplifies the detection of changes by generating alterations from a single source image [24]. Around the same time, Lang et al. presented an advanced detection system incorporating a neck attention module, significantly improving object detection performance, particularly in infrared imaging scenarios [25]. Despite such advancements, traditional methods face persistent challenges when dealing with complex scenarios. These techniques often struggle with background interference, exhibit reduced performance in real-world settings, and lack versatility when applied to different datasets, limiting their overall effectiveness in dynamic remote sensing contexts.
The rise of deep learning techniques has transformed the landscape of object detection in remote sensing imagery, offering solutions to some of the long-standing challenges. Deep-learning-based methods are particularly adept at identifying small and subtle objects with exceptional precision. For example, Zhang et al. developed SuperYOLO, a framework enhanced by super-resolution that significantly improves the detection of minute details in UAV (Unmanned Aerial Vehicle)-based remote sensing images [26]. Similarly, Bayrak et al. explored remote sensing image categorization via deep learning, emphasizing the critical role of image fidelity in detecting small objects within complex environments [27]. Wang et al. also contributed by introducing a specialized deep learning approach for detecting grazing livestock in UAV images, demonstrating superior performance with high mean Average Precision (mAP) in practical remote sensing applications [28]. Despite these advancements, deep learning methods remain computationally demanding, posing significant challenges for implementation in resource-constrained environments.

2.4. Lightweight Networks for Remote Sensing and Small Object Detection

With the increasing computational demands of remote sensing imagery, the development of lightweight detection models has become a trend. Gu et al. have made significant advancements in the development of shipborne remote sensing detectors by leveraging feature distillation and Efficient Spatial Pyramid Pooling (ESPP), effectively enhancing the detection capabilities for small objects [29]. Liu et al. proposed the LACF-YOLO model, an improved YOLOv8-based lightweight attention mechanism that enhances small object detection in remote sensing by integrating Triplet Attention, dilated inverted convolution, and cross-scale feature fusion, achieving higher accuracy, fewer parameters, and faster convergence [30]. Concurrently, Zhang et al. introduced the Guided Hybrid Quantization approach [31] and developed a model suitable for high-resolution remote sensing imagery. This model achieves a good balance between processing speed and model size, although there is still room for improvement in accuracy [32]. Bai et al. [33] proposed CPMFNet, which enhances small object detection in remote sensing by introducing adaptive spatial perception and multiscale fusion modules to improve context modeling and suppress background interference. Similarly, Lin et al. [34] introduced AMMBA, an attention-guided label assignment strategy that strengthens oriented object detection in optical remote sensing by improving sample matching and feature representation. Yang et al. [35] developed FM-RTDETR, a real-time detection framework that boosts small object detection performance through enhanced feature fusion and a loss function tailored for UAV-based scenarios. Moreover, Zhang et al. [36] proposed FFCA-YOLO, a lightweight detector incorporating feature enhancement, fusion, and context-aware modules to achieve high accuracy and efficiency in remote sensing small object detection.
Consequently, a model specifically designed for remote sensing image recognition is needed, one that achieves high recognition accuracy and rapid responses while keeping the parameter count small enough for efficient deployment.
Despite these advancements, current lightweight models still struggle to achieve an optimal balance between detection accuracy and computational efficiency in complex remote sensing environments. Specifically, issues such as background interference, insufficient feature representation, lack of effective feature fusion, and inaccurate localization of overlapping objects remain unresolved in remote sensing imagery. Additionally, while lightweight techniques can improve computational efficiency, they often result in the loss of spatial details, thereby diminishing the model’s sensitivity to small objects. Existing frameworks therefore face significant limitations when addressing these multi-layered and dynamically evolving challenges, motivating a more refined detection method that not only optimizes computational efficiency but also enhances adaptability and detection accuracy, ensuring robust performance even in high-complexity remote sensing tasks.

3. Methods

3.1. Baseline Detection Framework

To support the development of RSNet, we adopt a widely used one-stage object detection framework as the baseline, characterized by its balance between detection accuracy and computational efficiency. Among existing architectures, the YOLO (You Only Look Once) family has undergone continuous evolution, with improvements across precision, speed, and network design. In particular, YOLOv8 strikes a favorable balance between accuracy, efficiency, and architectural reliability, making it a representative and well-established baseline for remote sensing detection tasks. However, under complex conditions—such as densely packed objects and substantial inter-object occlusion—its performance in accurate localization tends to degrade. These limitations reveal challenges in preserving detection robustness within cluttered aerial imagery, motivating the need for a more specialized framework tailored to remote sensing scenarios.

3.2. RSNet

Detecting small objects in remote sensing images presents significant challenges, particularly in resource-constrained environments where high accuracy and efficiency are critical. Existing models struggle with three key limitations: (1) insufficient feature representation, making it difficult to distinguish small objects from complex backgrounds; (2) suboptimal feature fusion, which weakens the integration of multi-scale information; and (3) limited capacity to balance accuracy and efficiency in dense, resource-constrained scenarios, where reduced model complexity constrains feature representation and detection precision. To address these challenges, we propose RSNet, an improved detection framework built upon YOLOv8, designed to enhance both accuracy and efficiency in complex remote sensing scenarios. Figure 2 illustrates the RSNet framework, highlighting its key structural enhancements. RSNet addresses these limitations with targeted architectural enhancements to improve feature extraction, representation, and localization. Specifically, we optimize the backbone for better feature preservation, enhance feature fusion for more effective representation, and refine the detection head to improve localization accuracy in densely packed scenes. These improvements ensure that RSNet maintains high detection performance while remaining computationally efficient. Its main components are outlined below.

3.2.1. Input

RSNet begins by processing the input image through resizing to a standard dimension and normalizing pixel values. These steps ensure consistency across diverse datasets, providing a reliable foundation for robust feature extraction and accurate detection results.
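As a concrete illustration, the following minimal PyTorch sketch shows one way such a resize-and-normalize step can be implemented; the 640 × 640 target size matches the training configuration in Section 4.1.4, while the bilinear resize and simple [0, 1] scaling are illustrative assumptions rather than the exact pipeline used by RSNet.

```python
import torch
import torch.nn.functional as F


def preprocess(image: torch.Tensor, size: int = 640) -> torch.Tensor:
    """Turn an HxWx3 uint8 image into a normalized 1x3xSxS float tensor."""
    x = image.permute(2, 0, 1).unsqueeze(0).float()            # HWC -> 1xCxHxW
    x = F.interpolate(x, size=(size, size), mode="bilinear",
                      align_corners=False)                     # resize to the standard dimension
    return x / 255.0                                           # scale pixel values to [0, 1]
```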

3.2.2. Backbone

To balance efficiency and feature preservation, RSNet enhances the backbone with two key components: ADown and GSConv. ADown mitigates the loss of spatial detail during downsampling through structured preprocessing, channel splitting, and feature fusion, preserving both local and global information for small-object recognition in complex backgrounds. GSConv improves cross-channel interaction by combining depthwise and pointwise convolutions, enhancing multi-scale representation without increasing computational cost. These enhancements jointly improve detection performance in dense and cluttered remote sensing scenes.

3.2.3. Head

The detection head of RSNet is built upon the Compact-Align Detection Head (CADH), a novel structure specifically designed for small object detection in remote sensing imagery. CADH introduces an adaptive feature alignment mechanism that dynamically adjusts feature processing according to object size and scale, enhancing the model’s sensitivity to fine-scale variations. By integrating a lightweight design with high-precision feature adaptation, CADH improves object localization while minimizing computational overhead. This makes RSNet particularly well-suited for real-time remote sensing applications, ensuring reliable performance even in densely packed object distributions.

3.3. Compact-Align Detection Head (CADH)

Small object detection in remote sensing imagery is hindered by factors such as limited resolution, significant scale variation, and complex background interference. These characteristics often lead to inadequate feature representation and reduced detection reliability, particularly in dense and cluttered scenes. To address these limitations, we introduce the Compact-Align Detection Head (CADH), a lightweight and adaptive detection module tailored to the unique demands of remote sensing tasks. By combining efficient architectural design with adaptive feature alignment, CADH improves sensitivity to subtle scale differences while maintaining localization and classification precision under constrained computational budgets. This makes it well-suited for real-time applications in resource-limited remote sensing scenarios.
The CADH forward process is detailed in Algorithm 1, and its output is formulated in Equation (4).
Algorithm 1: Forward Pass of Compact-Align Detection Head (CADH)
Input: Feature map F from backbone
Output: Detection output O
// Shared Feature Extraction
1. F_shared ← ConvGN2(ConvGN1(F))
2. F_avg ← AdaptiveAvgPool2d(F_shared)
// Task Decomposition
3. F_cls ← TaskDecompCls(F_shared, F_avg)
4. F_reg ← TaskDecompReg(F_shared, F_avg)
// Regression Alignment
5. (O_offset, O_mask) ← Conv_offset(F_shared)
6. M ← σ(O_mask)
7. F_reg-aligned ← DyDCNv2(F_reg, O_offset, M)
// Classification Refinement
8. P_cls ← σ(Conv2(ReLU(Conv1(F_cls))))
9. F_cls-refined ← F_cls · P_cls
// Final Output
10. B ← Scale(Conv_bbox(F_reg-aligned))
11. C ← Conv_cls(F_cls-refined)
12. O ← Concat(B, C)
13. return O
To ensure clarity between the formal equations and the implementation details in Algorithm 1, we define the following symbol correspondence between the mathematical notation and the variable names used in the pseudocode, as shown in Table 1.
As shown in Figure 3, the classification task is handled through a task decomposition mechanism. In this phase, the classification task decomposition module (TaskDecompCls) extracts class-relevant features from the multi-scale input. To further enhance classification accuracy, these features are fused with global context obtained via global average pooling (AdaptiveAvgPool2d). This fusion of local and global information refines the classification representation and enhances class separability. Mathematically, the process is defined as
$\text{clsFeat} = \text{TaskDecompCls}(\text{feat}, \text{avgFeat})$
where clsFeat represents the classification features after decomposition, feat is the unified feature map, and avgFeat is the global feature from average pooling.
Meanwhile, in the regression branch, a similar decomposition process is applied using the TaskDecompReg module to extract location-specific features. To improve localization, a spatial transformation layer predicts both an offset map and a confidence mask. These are fed into a dynamic alignment module (DyDCNv2) to adaptively adjust the receptive fields and emphasize high-confidence regions within the regression features. The aligned regression feature map is expressed as
$\text{regFeat} = \text{DyDCNv2}(\text{TaskDecompReg}(\text{feat}, \text{avgFeat}), \text{offset}, \text{mask})$
where offset and mask are derived from the input via a spatial prediction layer, and regFeat denotes the aligned regression representation.
While the regression features are dynamically aligned, the classification features undergo further refinement to yield final prediction scores. Specifically, the classification features are processed by two convolutional layers (ClsConv1 and ClsConv2), with ReLU and Sigmoid activations applied sequentially. The resulting class probability map is computed as
$\text{clsProb} = \sigma(\text{ClsConv2}(\text{ReLU}(\text{ClsConv1}(\text{clsFeat}))))$
where $\sigma$ denotes the Sigmoid activation function, and clsProb represents the predicted probability for each class.
After feature refinement, both branches contribute to the final detection output. The aligned regression features are passed through a convolutional layer (CV2) to predict bounding box coordinates, while the classification features are element-wise multiplied with their corresponding class probabilities and then processed by another convolutional layer (CV3) to generate class scores. This confidence-weighted mechanism attenuates irrelevant activations and enhances final prediction reliability. The final output is given by
$\text{finalOutput} = \text{Concat}(\text{Scale}(\text{CV2}(\text{regFeat})), \text{CV3}(\text{clsFeat} \cdot \text{clsProb}))$
In summary, the proposed Compact-Align Detection Head (CADH) explicitly addresses the challenges of small-object detection and dense instance localization in complex remote sensing imagery. By unifying multi-scale feature integration and task-specific decomposition with a dynamic alignment mechanism, CADH enhances both semantic representation and spatial precision. The design employs adaptive global–local fusion to refine classification features and introduces deformable regression alignment via learned offsets and confidence masks, ensuring accurate bounding box prediction even under severe occlusion or clutter. Furthermore, the confidence-weighted classification refinement suppresses irrelevant responses, contributing to more discriminative and stable detection outputs. These targeted enhancements make CADH a lightweight yet effective detection head tailored for high-resolution aerial scenes with dense spatial distributions.
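To make the data flow of Algorithm 1 concrete, the following PyTorch sketch mirrors its thirteen steps. It is a simplified stand-in rather than the actual RSNet implementation: the TaskDecomp module, the GroupNorm width, and the activation choices are assumptions, and DyDCNv2 is approximated with torchvision's modulated deformable convolution (deform_conv2d with a mask).

```python
import torch
import torch.nn as nn
from torchvision.ops import deform_conv2d


class TaskDecomp(nn.Module):
    """Simplified stand-in for TaskDecompCls/TaskDecompReg:
    the global context gates the shared features, then a 3x3 conv projects them."""
    def __init__(self, channels: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Conv2d(channels, channels, 1), nn.Sigmoid())
        self.proj = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, feat, avg_feat):
        return self.proj(feat * self.gate(avg_feat))


class CADHSketch(nn.Module):
    """Follows the 13 steps of Algorithm 1; assumes `channels` is divisible by 16."""
    def __init__(self, channels: int, num_classes: int, reg_out: int = 4):
        super().__init__()
        self.conv_gn1 = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1),
                                      nn.GroupNorm(16, channels), nn.SiLU())
        self.conv_gn2 = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1),
                                      nn.GroupNorm(16, channels), nn.SiLU())
        self.decomp_cls = TaskDecomp(channels)
        self.decomp_reg = TaskDecomp(channels)
        # 3x3 deformable kernel, one offset group: 18 offset + 9 mask channels
        self.conv_offset = nn.Conv2d(channels, 27, 3, padding=1)
        self.dcn_weight = nn.Parameter(torch.randn(channels, channels, 3, 3) * 0.01)
        self.cls_conv1 = nn.Conv2d(channels, channels // 4, 1)
        self.cls_conv2 = nn.Conv2d(channels // 4, 1, 1)
        self.conv_bbox = nn.Conv2d(channels, reg_out, 3, padding=1)     # CV2 in the text
        self.conv_cls = nn.Conv2d(channels, num_classes, 3, padding=1)  # CV3 in the text
        self.scale = nn.Parameter(torch.ones(1))

    def forward(self, f):
        shared = self.conv_gn2(self.conv_gn1(f))                  # steps 1-2: shared features
        avg = nn.functional.adaptive_avg_pool2d(shared, 1)
        f_cls = self.decomp_cls(shared, avg)                      # steps 3-4: task decomposition
        f_reg = self.decomp_reg(shared, avg)
        off_mask = self.conv_offset(shared)                       # steps 5-7: regression alignment
        offset, mask = off_mask[:, :18], torch.sigmoid(off_mask[:, 18:])
        f_reg_aligned = deform_conv2d(f_reg, offset, self.dcn_weight, padding=1, mask=mask)
        p_cls = torch.sigmoid(self.cls_conv2(torch.relu(self.cls_conv1(f_cls))))
        f_cls_refined = f_cls * p_cls                             # steps 8-9: confidence weighting
        bbox = self.scale * self.conv_bbox(f_reg_aligned)         # steps 10-12: final output
        cls = self.conv_cls(f_cls_refined)
        return torch.cat([bbox, cls], dim=1)
```

With a 3 × 3 deformable kernel and a single offset group, the offset/mask predictor outputs 18 + 9 = 27 channels, matching steps 5-7 of Algorithm 1.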

3.4. GSConv: A Lightweight Hybrid Convolution with Feature Shuffle

In the domain of remote sensing, particularly for small object detection, models are required to simultaneously achieve fine-grained spatial feature extraction and high computational efficiency. This dual objective is especially challenging due to the ever-increasing resolution of remote sensing imagery and the associated computational burden. Conventional convolutional operations, although effective in capturing complex patterns, often entail substantial computational costs, making them suboptimal for real-time applications on resource-constrained platforms such as unmanned aerial vehicles (UAVs) and satellite systems. These constraints highlight the necessity of lightweight convolutional modules that can maintain representational expressiveness while significantly reducing computational overhead.
To address this, we adopt GSConv [37], which combines depthwise separable convolution with grouped convolutions and channel shuffle to enhance efficiency while maintaining spatial expressiveness [38]. Compared with MobileNet and ShuffleNet [39], GSConv offers greater structural flexibility and adaptability. While MobileNet focuses on reducing parameters through depthwise separable convolution, and ShuffleNet improves efficiency via fixed-pattern channel shuffle, GSConv combines grouped convolution with a dynamic feature fusion strategy. This design enhances cross-channel interaction while preserving fine spatial details, making GSConv more suitable for small object detection in remote sensing imagery, where rich spatial context and multi-scale feature representation are critical.
Structurally, GSConv comprises two parallel branches: the first directly propagates the input features, while the second applies depthwise separable convolution to model localized spatial dependencies and inter-channel interactions.
As illustrated in Figure 4, given an input feature map $x \in \mathbb{R}^{B \times C \times H \times W}$, the second branch performs the following operation:
$x_{\text{sep}} = \text{PWConv}(\text{DWConv}(x)),$
where DWConv(·) denotes a depthwise convolution that captures spatial features within each individual channel, and PWConv(·) represents a 1 × 1 pointwise convolution that projects the output into a joint feature space by aggregating cross-channel information.
Subsequently, the feature maps generated by the two branches are merged along the channel axis, followed by a channel shuffle operation to facilitate inter-group feature interaction. The resulting output of the GSConv module can be expressed as
$y = \text{Shuffle}(\text{Concat}(x, x_{\text{sep}})).$
The shuffle operation reorders the concatenated channels by performing a reshape–permute–reshape sequence, which enables interaction across previously isolated channel groups. Although this operation preserves the spatial resolution and channel dimensionality, it is crucial for mitigating the limitations of grouped convolutions and promoting holistic feature fusion.
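A minimal PyTorch sketch of this two-branch structure, written directly from the two equations above, is given below; the 3 × 3 depthwise kernel, the BatchNorm/SiLU choices, and the two-group shuffle are assumptions, not the exact GSConv configuration used in RSNet.

```python
import torch
import torch.nn as nn


class GSConvSketch(nn.Module):
    """Two-branch GSConv sketch built from the equations above:
    x_sep = PWConv(DWConv(x)) and y = Shuffle(Concat(x, x_sep))."""
    def __init__(self, channels: int):
        super().__init__()
        self.dwconv = nn.Sequential(                      # depthwise: per-channel spatial filtering
            nn.Conv2d(channels, channels, 3, padding=1, groups=channels, bias=False),
            nn.BatchNorm2d(channels), nn.SiLU())
        self.pwconv = nn.Sequential(                      # pointwise: 1x1 cross-channel projection
            nn.Conv2d(channels, channels, 1, bias=False),
            nn.BatchNorm2d(channels), nn.SiLU())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x_sep = self.pwconv(self.dwconv(x))               # depthwise-separable branch
        y = torch.cat([x, x_sep], dim=1)                  # concatenate with the identity branch
        b, c, h, w = y.shape                              # channel shuffle: reshape-permute-reshape
        return y.view(b, 2, c // 2, h, w).transpose(1, 2).reshape(b, c, h, w)


# Shape check: a 64-channel input yields a shuffled 128-channel output at the same resolution.
# out = GSConvSketch(64)(torch.randn(1, 64, 80, 80))      # -> torch.Size([1, 128, 80, 80])
```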
Overall, GSConv achieves an effective trade-off between efficiency and feature expressiveness by enhancing inter-channel information flow through depthwise spatial filtering and lightweight semantic fusion. Its integration into RSNet strengthens the network’s capacity to model complex patterns and fine-grained object details, leading to improved detection performance on high-resolution remote sensing imagery while maintaining real-time feasibility on resource-limited platforms.

3.5. Adaptive Downsampling Module (ADown)

In the context of remote sensing image analysis, the complexity of object detection tasks continues to increase, demanding highly efficient computational techniques that can retain essential spatial and semantic features [40]. Conventional downsampling operations often sacrifice fine-grained details for computational efficiency, leading to degraded performance in downstream tasks. In response to this, we introduce the ADown module, a convolutional block designed to optimize downsampling. This module efficiently reduces spatial dimensions while maintaining important global and local feature information, ensuring that the model retains high performance with minimal computational cost.
As shown in Figure 5, in the first stage the input feature map $X \in \mathbb{R}^{B \times C \times H \times W}$ undergoes average pooling for spatial downsampling and smoothing. The average pooling operation is applied with a 2 × 2 kernel and a stride of 2, which reduces the local variance between features by computing the mean within each local region. This process effectively removes redundancy and generates a more compact feature representation. The mathematical formulation of the operation is as follows:
$X_{\text{avg}}(b, c, i, j) = \frac{1}{4} \sum_{p=0}^{1} \sum_{q=0}^{1} X(b, c, 2i + p, 2j + q)$
where $(i, j)$ represents the position of the output feature. After this operation, the spatial resolution of the input feature map is reduced from $(H, W)$ to $(H/2, W/2)$, while the number of channels $C$ remains unchanged. This preprocessing stage ensures that the input features are both spatially compressed and smoothed, providing a refined basis for subsequent feature extraction.
Next, the processed feature map $X_{\text{avg}}$ is split along the channel dimension into two equal parts, denoted as $X_1$ and $X_2$. This channel-splitting strategy is based on a divide-and-conquer principle, allowing the module to independently process different types of features. Specifically, $X_1$ focuses on global feature extraction to capture large-scale background information, while $X_2$ is dedicated to mining local salient features, thereby enhancing the detail representation. These two feature subsets are processed through separate pathways to maximize diversity.
In the global feature pathway, $X_1$ is passed through a 3 × 3 convolutional layer with a stride of 2. This operation not only reduces the spatial resolution from $(H/2, W/2)$ to $(H/4, W/4)$ but also extracts smoothed global features using the local receptive field of the convolutional kernel. The convolution operation is mathematically expressed as
$X_{1,\text{out}}(b, c, i, j) = \sum_{c'=0}^{C/2 - 1} \sum_{p=0}^{2} \sum_{q=0}^{2} W_{\text{cv1}}(c, c', p, q) \cdot X_1(b, c', 2i + p, 2j + q) + b_{\text{cv1}}(c)$
where $W_{\text{cv1}}$ represents the convolutional kernel and $b_{\text{cv1}}$ is the bias term. This pathway effectively compresses spatial information while strengthening the semantic representation of global features.
In contrast, the local feature pathway processes $X_2$ to emphasize salient details. Initially, $X_2$ undergoes max pooling with a 3 × 3 kernel and a stride of 2. Max pooling selects the maximum value within each local region, highlighting significant features and suppressing noise from the background. The max pooling operation is mathematically defined as
$X_{2,\text{pool}}(b, c, i, j) = \max_{0 \le p \le 2} \ \max_{0 \le q \le 2} \ X_2(b, c, 2i + p, 2j + q)$
Subsequently, the pooled features are processed by a 1 × 1 convolutional layer, which compresses the channel dimension and extracts finer-grained local features. This combination of selective feature extraction through max pooling and channel compression via 1 × 1 convolution ensures that the local feature pathway excels at preserving and enhancing feature details.
After the two pathways independently process their respective features, the outputs $X_{1,\text{out}}$ and $X_{2,\text{pool}}$ are concatenated along the channel dimension. This operation fuses global and local features into a unified representation, where the number of output channels is restored to $C$, and the spatial resolution is fixed at $(H/4, W/4)$. By integrating these complementary features, the module ensures that both global semantic information and local details are effectively preserved.
In summary, the ADown module offers an efficient and effective solution to the downsampling problem in remote sensing object detection. By combining preprocessing, pathway-specific processing, and feature fusion, it achieves a balance between computational efficiency and feature preservation, ensuring minimal performance loss in subsequent detection tasks.
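The following PyTorch sketch assembles these stages in order (average pooling, channel split, the two pathways, and concatenation); the padding values and the BatchNorm/SiLU layers are assumptions added to keep the two pathway outputs spatially aligned, and the exact ADown configuration in RSNet may differ.

```python
import torch
import torch.nn as nn


class ADownSketch(nn.Module):
    """Minimal sketch of the ADown module described in Sec. 3.5."""
    def __init__(self, channels: int):
        super().__init__()
        half = channels // 2
        self.pre_pool = nn.AvgPool2d(kernel_size=2, stride=2)           # smoothing + 2x downsampling
        self.global_branch = nn.Sequential(                             # X1 pathway: 3x3 conv, stride 2
            nn.Conv2d(half, half, 3, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(half), nn.SiLU())
        self.local_branch = nn.Sequential(                              # X2 pathway: 3x3 max pool + 1x1 conv
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
            nn.Conv2d(half, half, 1, bias=False),
            nn.BatchNorm2d(half), nn.SiLU())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.pre_pool(x)                              # (B, C, H/2, W/2)
        x1, x2 = torch.chunk(x, 2, dim=1)                 # split channels into two halves
        out1 = self.global_branch(x1)                     # global pathway, (B, C/2, H/4, W/4)
        out2 = self.local_branch(x2)                      # local pathway,  (B, C/2, H/4, W/4)
        return torch.cat([out1, out2], dim=1)             # fused output,   (B, C,   H/4, W/4)
```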

3.6. Embedded Deployment and Optimization

Although personal computers and servers provide strong capabilities for training and inference, their large size, high power consumption, and lack of portability limit their suitability for remote sensing tasks that demand lightweight platforms and real-time processing. While cloud-based solutions support large-scale data processing, the inherent communication latency presents clear limitations in latency-sensitive scenarios such as emergency monitoring and target tracking in remote sensing applications.
In this study, the NVIDIA Jetson Nano is adopted as the embedded deployment platform. Featuring a compact form factor, low power consumption, and efficient computation, it is well-suited for edge-side visual analysis in resource-constrained environments. As shown in Figure 6, the complete deployment pipeline begins with constructing the remote sensing dataset and training the detection model on a high-performance computing platform. The training process is initialized with pretrained backbone weights to accelerate convergence and enhance detection accuracy. Once trained, the optimized RSNet model is deployed to an embedded edge device (e.g., Jetson Nano). During deployment, the TensorRT inference engine is employed to fuse and streamline the network structure, enabling a trade-off between precision and speed. This deployment strategy balances model performance and hardware constraints, allowing RSNet to operate reliably on low-power, high-frame-rate edge devices and effectively meeting the real-time and flexibility requirements of practical remote sensing applications.
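As a rough illustration of this pipeline, the sketch below exports a placeholder network to ONNX and notes how the resulting graph is typically built into a TensorRT engine on the Jetson; the file names, opset version, and trtexec invocation are illustrative assumptions, not the exact commands used in this work.

```python
import torch
import torch.nn as nn

# Placeholder module standing in for the trained RSNet weights.
model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.SiLU()).eval()

dummy = torch.zeros(1, 3, 640, 640)                 # the 640x640 training resolution
torch.onnx.export(model, dummy, "rsnet.onnx",
                  input_names=["images"], output_names=["preds"],
                  opset_version=12)                  # export a static-shape ONNX graph

# On the Jetson, the ONNX graph is then built into a TensorRT engine, for example:
#   trtexec --onnx=rsnet.onnx --saveEngine=rsnet_fp16.engine --fp16
# Layer fusion and reduced precision are what trade a little accuracy for inference speed.
```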

4. Experimental Validation

4.1. Dataset and Experimental Configuration

Remote sensing object detection presents unique challenges, including ultra-high spatial resolution, diverse imaging conditions, dense object distributions, and arbitrary object orientations. These factors demand detection models with strong generalization and robustness under complex aerial scenarios. To evaluate the effectiveness of the proposed method, we adopt two widely used benchmarks: DOTA and NWPU VHR-10. DOTA serves as the primary evaluation dataset due to its large scale, rich category diversity, and high intra-class variation. NWPU VHR-10 complements DOTA with additional scene complexity and high-resolution variability, enabling comprehensive evaluation across diverse remote sensing scenarios.

4.1.1. DOTA Dataset

The DOTA (Dataset for Object deTection in Aerial images) dataset [41] is a large-scale benchmark for object detection in remote sensing, comprising 188,282 annotated instances across 15 object categories such as planes, ships, and storage tanks. Each instance is annotated with a quadrilateral bounding box, and the images are collected from diverse sources such as Google Earth, JL-1, and GF-2 satellites, covering a wide range of scales, orientations, and scene types. As illustrated in Figure 7, the dataset shows a clear class imbalance. The frequency histogram and co-occurrence matrix reveal the distribution and contextual relationships among object classes, providing meaningful cues for model evaluation. The right panel presents representative samples, reflecting scene diversity in terms of object types, scales, and spatial layouts.
In addition to scale diversity and arbitrary orientations, as illustrated in the right images in Figure 7, DOTA features densely packed and overlapping objects, variable illumination, and different imaging altitudes—factors that collectively pose substantial challenges for object detection models. These characteristics make DOTA a rigorous and representative benchmark for assessing the robustness and generalization ability of detection algorithms in complex aerial environments.
Small objects are typically defined as instances with bounding box areas smaller than 32 × 32 pixels, that is, 1024 pixels². Considering the specific characteristics of aerial imagery, including arbitrary object orientations and large variations in the aspect ratio, the DOTA dataset adopts a definition based on the height of horizontal bounding boxes. This criterion provides a more consistent representation of object scale in remote sensing contexts. The detailed information is presented in Table 2. According to this definition, we classify objects into three categories: small (height between 10 and 50 pixels), medium (between 50 and 300 pixels), and large (greater than 300 pixels). Our statistical analysis shows that small objects account for over 57% of all annotated instances in DOTA, confirming its strong suitability for evaluating small object detection in complex aerial scenes.
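A small helper illustrating this height-based criterion is sketched below; the handling of boxes shorter than 10 pixels is an assumption, since the text above only defines the three categories.

```python
def size_category(box_height_px: float) -> str:
    """Bucket a horizontal-bounding-box height (pixels) with the criterion quoted above:
    small 10-50 px, medium 50-300 px, large > 300 px."""
    if box_height_px < 10:
        return "out_of_range"     # no bucket is defined below 10 px in the text above
    if box_height_px <= 50:
        return "small"
    if box_height_px <= 300:
        return "medium"
    return "large"
```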

4.1.2. NWPU VHR-10 Dataset

The NWPU VHR-10 dataset is a widely used benchmark for evaluating object detection performance in very-high-resolution (VHR) remote sensing imagery [42]. It consists of 800 images annotated with 10 object categories, including airplanes, ships, storage tanks, and harbors. All objects are labeled with horizontal bounding boxes. The dataset features substantial variations in object appearance, scale, and background complexity, making it suitable for assessing detection robustness in diverse urban and coastal environments.
Notably, the images are mainly cropped from Google Earth and the Vaihingen dataset and manually annotated by experts. While the annotations are precise, the overall dataset size is relatively limited, and the image content tends to be more regular and clean compared to large-scale real-world datasets.
As illustrated in Figure 8, compared with DOTA, NWPU VHR-10 includes fewer categories and instances but presents greater scene complexity and variation in object distribution. This complementary characteristic enables effective cross-dataset validation, facilitating a more comprehensive assessment of model generalization under diverse remote sensing conditions. In this study, NWPU VHR-10 is used as a supplementary evaluation set to DOTA, with the goal of testing the model’s robustness on representative object categories in VHR imagery. The size distribution of annotated objects in NWPU VHR-10 is summarized in Table 2, highlighting its complementarity to DOTA in terms of small object representation.

4.1.3. Technical Environment

The experiments were conducted on a machine running the Ubuntu 20.04 LTS operating system. The computational resources comprised an Intel® Xeon® Gold 6330 CPU, supported by an NVIDIA RTX 3090 GPU, providing substantial processing power for both general-purpose and GPU-accelerated tasks. The development environment was based on Python (version 3.8). Deep learning models were implemented using the PyTorch framework (version 1.10.0), with CUDA (version 11.3) support enabled to optimize GPU utilization.

4.1.4. Model Training Configuration

In our study, training images were resized to 640 × 640 pixels. A batch size of 36 images was used in each iteration, over a total training period of 200 epochs. Furthermore, Mosaic augmentation was incorporated to broaden the model’s exposure to diverse training data variations, thereby improving its generalization capability. Specific details regarding these parameters can be found in Table 3.

4.1.5. Training Hyperparameters

During the model optimization process, we adopted Stochastic Gradient Descent (SGD) as the core optimization strategy. To improve training efficiency and ensure stable parameter updates, a momentum coefficient of 0.937 was applied. This setup not only hastened the convergence of SGD toward the optimal solution but also effectively suppressed fluctuations in the parameter space. Given the risk of overfitting, we adopted a conservative weight decay coefficient of 5 × 10⁻⁴, designed to bolster the model’s generalization capabilities across diverse datasets. Moreover, we implemented a dynamic strategy for adjusting the learning rate, initially set at 1 × 10⁻² and gradually reduced to 1 × 10⁻⁴. This careful calibration of hyperparameters aimed to strike an optimal balance between the speed and stability of the learning process, thereby maximizing model performance. The optimized hyperparameters employed during model training are detailed in Table 4.
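The optimizer settings above translate directly into PyTorch as sketched below; the linear decay from 1 × 10⁻² to 1 × 10⁻⁴ is one plausible schedule consistent with the stated start and end values, since the exact decay curve is not specified, and the placeholder module stands in for RSNet.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 16, 3)                                  # placeholder for the RSNet parameters

# SGD with the reported momentum, weight decay, and initial learning rate
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2,
                            momentum=0.937, weight_decay=5e-4)

epochs, lr0, lrf = 200, 1e-2, 1e-4
# Linear decay from lr0 to lrf over the 200 epochs (assumed decay curve).
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lambda e: 1.0 - (1.0 - lrf / lr0) * e / max(epochs - 1, 1))

for epoch in range(epochs):
    # ... per-batch forward pass, loss, backward pass, and optimizer.step() go here ...
    scheduler.step()                                         # decay the learning rate each epoch
```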

4.2. Criteria for Evaluating Detection Performance

Our proposed detection technique is evaluated through a comprehensive set of metrics, namely detection accuracy, computational complexity, and parameter count.

4.2.1. Detection Accuracy

Detection accuracy is quantified using the mean Average Precision (mAP) metric, which is computed at an Intersection over Union (IoU) threshold of 0.5. The formula for calculating mAP is defined as:
$\text{mAP} = \frac{1}{n} \sum_{i=1}^{n} \text{AP}(i)$
Here, $n$ represents the total number of categories in the dataset, and $\text{AP}(i)$ denotes the Average Precision for class $i$. The Average Precision for each category is calculated as the area under the precision–recall curve:
$\text{AP} = \int_{0}^{1} P(R)\, dR$
$P = \frac{TP}{TP + FP}$
$R = \frac{TP}{TP + FN}$
Within these expressions, $P$ denotes precision, the proportion of correct predictions among all positively identified instances, while $R$ denotes recall, the proportion of genuine positive cases that are successfully detected. TP stands for True Positives, while FP and FN correspond to False Positives and False Negatives, respectively.
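For reference, a simplified NumPy sketch of these formulas is given below; it sorts detections by confidence, accumulates TP/FP counts, and integrates the precision–recall curve numerically, omitting the interpolated precision envelope that standard mAP toolkits apply.

```python
import numpy as np


def average_precision(scores, is_tp, num_gt):
    """Area under the precision-recall curve for one class, given per-detection
    confidence scores, true-positive flags, and the number of ground-truth boxes."""
    order = np.argsort(-np.asarray(scores, dtype=float))
    flags = np.asarray(is_tp, dtype=float)[order]
    tp = np.cumsum(flags)                                # cumulative true positives
    fp = np.cumsum(1.0 - flags)                          # cumulative false positives
    precision = tp / np.maximum(tp + fp, 1e-12)          # P = TP / (TP + FP)
    recall = tp / max(num_gt, 1)                         # R = TP / (TP + FN)
    return float(np.trapz(precision, recall))            # numerically integrate P(R) dR


# mAP@0.5 is then the mean of per-class APs computed with an IoU threshold of 0.5, e.g.
# mAP = float(np.mean([average_precision(s_c, tp_c, n_c) for (s_c, tp_c, n_c) in per_class]))
```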

4.2.2. Computational Complexity

The computational efficiency is evaluated using Giga Floating-point Operations (GFLOPs), a metric that quantifies the number of floating-point operations executed during a single forward propagation. This metric is particularly critical for evaluating the efficiency of small object detection algorithms in computationally constrained scenarios.

4.2.3. Parameter Count

The parameter count reflects the model’s architectural compactness and deployment efficiency. In remote sensing applications, models with fewer parameters are advantageous for low-memory environments while still needing to maintain high detection performance across diverse scenes and object scales.
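A minimal helper for the parameter-count metric is shown below; FLOPs, by contrast, are usually obtained from an external profiler, and the tools named in the comment are common options rather than the ones necessarily used here.

```python
import torch.nn as nn


def count_parameters_millions(model: nn.Module) -> float:
    """Trainable parameter count in millions, i.e. the 'Params (M)' figure reported later."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6


# FLOPs are typically measured with an external profiler (e.g. thop or fvcore) on a single
# 640x640 forward pass; the exact profiler behind the GFLOPs numbers here is not specified.
```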

4.3. Relative Performance Evaluation of Enhanced Algorithms

Our study presents a comparative performance evaluation of the proposed RSNet model against a conventional baseline model, employing two well-known remote sensing datasets, DOTA and NWPU VHR-10. This evaluation rigorously analyzes the model’s performance across these datasets, with detailed metrics compiled in Table 5.
In our analysis of the DOTA dataset, the RSNet model not only exhibited superior precision (89.6%), recall (77.3%), and mAP@50 (82.7%) but also demonstrated significant reductions in parameter count by 41.9% (from 11.12 M to 6.49 M) and computational cost by 13.7% (from 28.5 GFLOPs to 24.6 GFLOPs). These improvements underscore the model’s enhanced capability to efficiently handle the complexities of remote sensing tasks. In particular, it adapts well to various remote sensing scenarios, markedly improving performance and efficiency compared to the baseline model. On the NWPU VHR-10 dataset, the RSNet model surpassed the baseline in terms of precision (91.6%) and mAP@50 (92.2%) while maintaining a robust recall rate (86.7%) and a reduced parameter volume (6.49 M), thereby offering better support for real-time processing tasks. In the field of remote sensing monitoring, the RSNet model is capable of effective real-time detection under the challenging conditions of remote sensing image analysis.
Overall, the RSNet model not only demonstrated improved performance metrics but also effectively addressed the computational demand challenges inherent in remote sensing tasks. Notably, in the demanding context of remote sensing image processing, it maintains high monitoring accuracy while achieving a more streamlined parameter structure, reflecting significant advancements in the field.
The visual representations in Figure 9 illustrate the P-R curves of the baseline model and RSNet on both the DOTA and NWPU VHR-10 datasets. As shown in Figure 9a, the P-R curve of the baseline model on the DOTA dataset demonstrates a specific trade-off between precision and recall. In contrast, Figure 9b highlights the P-R curve of RSNet on the DOTA dataset, showing a noticeable improvement in both precision and recall compared to the baseline. Moving to the NWPU VHR-10 dataset, Figure 9c illustrates the P-R curve for the baseline model, while Figure 9d showcases the superior P-R curve of RSNet, indicating enhanced performance in object detection tasks. These comparisons underscore RSNet’s superior ability to balance precision and recall across both datasets, reflecting its improved efficiency and effectiveness in object detection.
To evaluate the deployment practicality of the proposed RSNet model on embedded platforms, we tested the trained RSNet model on the NVIDIA Jetson Nano. With its compact size and low power consumption, the Jetson Nano is an ideal platform for resource-constrained environments. The unoptimized model achieved an inference speed of 18.4 FPS on the Jetson Nano, which is suitable for basic applications but insufficient for tasks with high real-time requirements.
To enhance performance, TensorRT optimization was applied to RSNet. As a result, RSNet achieved a more than threefold increase in frame rate—reaching 72.6 FPS—while using fewer parameters and lower computational cost. Table 6 presents a detailed comparison before and after optimization.

4.4. Ablation Experiment Analysis

To evaluate the effectiveness of each component in RSNet, we conducted ablation experiments on two benchmark datasets: DOTA and NWPU VHR-10. We began with the original YOLOv8s model as the baseline, which achieves an mAP@50 of 80.9% on DOTA and 88.5% on NWPU. Based on this foundation, we progressively integrated our proposed modules to observe their respective contributions to detection performance. The detailed results are presented in Table 7.
The addition of the ADown module led to a moderate but consistent performance boost, with mAP@50 increasing to 81.3% on DOTA and 89.8% on NWPU. This suggests that ADown helps retain critical spatial features during the downsampling process, thereby improving the detection of small or weakly represented objects.
Incorporating GSConv significantly reduced computational cost while largely preserving accuracy. On DOTA, mAP@50 reached 80.5% with a notable reduction in GFLOPs from 28.5 to 16.1. Similarly, NWPU maintained a strong performance with 88.0% mAP@50. These results demonstrate that GSConv successfully balances feature expressiveness and model efficiency, making it particularly suitable for lightweight detection tasks in high-resolution scenarios.
The integration of CADH had a more substantial effect on detection accuracy. When applied independently, CADH improved the mAP@50 to 82.2% on DOTA and 91.9% on NWPU. These gains indicate that CADH effectively enhances the model’s capability in precise localization, especially under dense object distributions and complex visual conditions.
When combining GSConv and CADH, the model achieved further improvements, reaching 81.6% mAP@50 on DOTA and 91.6% on NWPU, with the lowest parameter count (4.53 M) and GFLOPs among all tested configurations. This combination underscores the synergy between efficient convolution and adaptive detection head design.
Finally, the complete RSNet model, integrating ADown, GSConv, and CADH, achieved the best overall results: 82.7% mAP@50 on DOTA and 92.2% on NWPU. These outcomes reflect not only the individual effectiveness of each component but also their collective optimization of feature extraction, fusion, and prediction. We note that current evaluations are based on structured benchmarks; generalization to more diverse or cross-domain scenarios will be considered in future work. The proposed RSNet thus demonstrates a strong capability to address the challenges of small-object detection, dense layouts, and constrained computational budgets in remote sensing imagery.

4.5. Cross-Validation Results

To validate the effectiveness and generalization ability of the proposed model, this section presents a comprehensive cross-validation study based on a K-fold evaluation scheme.

4.5.1. Model Validation Technique

To systematically evaluate the generalization performance of the proposed model across different data partitions, this study adopts K-fold cross-validation as the primary validation strategy. This widely used resampling method divides the dataset into K equally sized and mutually exclusive subsets (folds). During K-fold cross-validation, at every iteration, a distinct fold is designated as the validation subset, with the complementary K-1 folds constituting the training data. This procedure is iterated K times in a complete cycle, guaranteeing that every data partition is utilized precisely once for validation purposes.
By repeatedly training and evaluating the model on multiple data splits, this approach produces more representative and reliable performance estimates. Moreover, it facilitates the assessment of the model’s robustness under varying data distributions.

4.5.2. Data Partitioning and Validation Procedure

To comprehensively evaluate the performance and robustness of the proposed RSNet model, five-fold cross-validation (K = 5) is employed in this study. The entire dataset is first randomly shuffled to eliminate potential ordering bias and then partitioned into five approximately equal subsets. In each round, one subset is used as the validation set, while the remaining four subsets are used for training. This rotation ensures that all data points participate in both training and validation phases, thereby maximizing data utilization and supporting a comprehensive evaluation of model performance.
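The partitioning procedure corresponds to the following scikit-learn sketch; the placeholder image IDs and the fixed random seed are illustrative assumptions, and the per-fold training/evaluation loop is elided.

```python
import numpy as np
from sklearn.model_selection import KFold

image_ids = np.arange(1000)                                 # placeholder IDs for the shuffled dataset
kfold = KFold(n_splits=5, shuffle=True, random_state=0)     # K = 5, shuffled once before splitting

for fold, (train_idx, val_idx) in enumerate(kfold.split(image_ids)):
    train_ids, val_ids = image_ids[train_idx], image_ids[val_idx]
    # train RSNet on `train_ids`, evaluate mAP/P/R on `val_ids`, and record per-fold metrics
    print(f"fold {fold}: {len(train_ids)} train / {len(val_ids)} validation images")
```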

4.5.3. Cross-Validation Metrics

To evaluate the robustness and efficiency of the proposed model, we perform five-fold cross-validation on the DOTA dataset and report standard performance and complexity metrics. Detection accuracy is measured using mean Average Precision (mAP) at an IoU threshold of 0.5. In addition, precision (P) and recall (R) are reported to reflect the correctness and completeness of predictions. For computational assessment, we include the number of floating-point operations (GFLOPs) and total parameters (M) as indicators of model complexity and deployment feasibility. Together, these metrics provide a balanced evaluation of both effectiveness and efficiency.
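For completeness, the standard definitions behind these metrics, with TP, FP, and FN denoting true positives, false positives, and false negatives determined at an IoU threshold of 0.5, are:

```latex
P = \frac{TP}{TP + FP}, \qquad
R = \frac{TP}{TP + FN}, \qquad
AP = \int_{0}^{1} P(R)\,\mathrm{d}R, \qquad
mAP@50 = \frac{1}{N_{\mathrm{cls}}} \sum_{c=1}^{N_{\mathrm{cls}}} AP_{c}\Big|_{IoU \ge 0.5}
```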

4.5.4. Result Analysis

The cross-validation results in Table 8 demonstrate that the proposed model achieves consistently high performance across all five folds. The mAP50 scores range narrowly between 82.4% and 82.9%, with an average of 82.6%, indicating stable detection accuracy under varying data partitions. Precision remains uniformly high across folds (86.1–87.4%), while recall shows similarly small fluctuations (77.1–79.1%), suggesting a balanced and robust detection capability.
Importantly, the variance across folds is minimal for all metrics, highlighting the model’s strong generalization ability and resistance to overfitting. Meanwhile, the computational profile remains fixed at 24.6 GFLOPs and 6.49 million parameters across all folds, further supporting the framework’s consistency and suitability for deployment in real-world remote sensing scenarios.
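As a quick illustration of this stability, the per-fold mAP50 values reported in Table 8 can be aggregated directly; this is a small standalone computation, not part of the training code.

```python
import statistics

fold_map50 = [82.4, 82.9, 82.7, 82.5, 82.5]  # per-fold mAP50 (%) from Table 8

mean = statistics.mean(fold_map50)
std = statistics.pstdev(fold_map50)          # population std over the five folds
print(f"mean mAP50 = {mean:.1f}%, std = {std:.2f} points")  # mean mAP50 = 82.6%, std = 0.18 points
```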

4.6. Benchmarking Against Leading Algorithms

To comprehensively evaluate the effectiveness of the proposed RSNet, we benchmark it against a series of representative object detection models on two widely used remote sensing datasets: DOTA and NWPU VHR-10. These benchmarks cover diverse challenges such as cluttered backgrounds, dense target distributions, and significant scale variations, making them well-suited for real-world detection evaluation.
The comparative study includes both two-stage detectors (e.g., SCRDet, ICN, Faster R-CNN, FADet) and one-stage frameworks (e.g., SSD, YOLO series, FMSSD). The results, summarized in Table 9, provide a holistic comparison across multiple performance indicators, including precision, recall, mAP@50, and model size.
As observed in Table 9, RSNet consistently outperforms competing methods across key performance indicators. On the DOTA dataset, it achieves a precision of 89.6%, a recall of 77.3%, and an mAP@50 of 82.7%; on the NWPU VHR-10 dataset, it attains a precision of 91.6%, a recall of 86.7%, and an mAP@50 of 92.2%. These results surpass several state-of-the-art models, including FMSSD and FFCA-YOLO, and indicate that RSNet delivers stable, accurate detection of small objects in complex remote sensing imagery and is better suited to dense, scale-variant targets.
Moreover, RSNet delivers this strong performance with a highly compact architecture, requiring only 6.49 M parameters. In comparison, YOLOv7 contains 51.1 M parameters, Faster R-CNN has 107.8 M, and YOLOv5l has 45.5 M. This demonstrates RSNet’s remarkable balance between lightweight design and detection effectiveness.
A visual summary is provided in Figure 10, which illustrates the comparative performance of RSNet relative to other state-of-the-art models across both datasets. The results confirm that RSNet is both accurate and efficient, making it a strong candidate for real-time remote sensing applications under limited computational resources.

4.7. Feature Map Visualization and Analysis

To better evaluate the effectiveness of our architectural improvements, we visualize the intermediate feature maps generated on representative samples from the DOTA dataset. As shown in Figure 11, each row displays the original remote sensing image, the feature response after the CADH module, and the final output from the complete RSNet model.
After introducing the CADH module, the activation maps exhibit more concentrated spatial responses in small-object regions. The attention distribution becomes more focused, and the activation intensity in target areas is notably enhanced, indicating that the module effectively improves spatial alignment between features and objects.
In comparison, the final RSNet model generates even more concentrated and structured responses, with greater emphasis on object contours and reduced background noise. The activated regions corresponding to objects appear prominently in red, suggesting that the model is strongly focused on these targets, which facilitates subsequent detection. This indicates stronger semantic encoding and improved spatial discrimination capabilities.
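Such intermediate activations can be captured with standard PyTorch forward hooks. The sketch below shows one common way to do so for an arbitrary detector; the model and the hooked layer are placeholders (it assumes the layer returns a single tensor), and the channel-averaged map is normalized to [0, 1] for display, with warm colors indicating strong responses.

```python
import torch
import torch.nn as nn
import matplotlib.pyplot as plt

def visualize_feature_map(model: nn.Module, layer: nn.Module, image: torch.Tensor, out_path: str):
    """Run one image through the model and save a channel-averaged heatmap of `layer`'s output."""
    captured = {}

    def hook(_module, _inputs, output):
        captured["feat"] = output.detach()   # assumes the hooked layer returns a single tensor

    handle = layer.register_forward_hook(hook)
    model.eval()
    with torch.no_grad():
        model(image)          # forward pass; the predictions themselves are not needed here
    handle.remove()

    fmap = captured["feat"][0].mean(dim=0)                          # average over channels -> H x W
    fmap = (fmap - fmap.min()) / (fmap.max() - fmap.min() + 1e-6)   # normalize to [0, 1]
    plt.imshow(fmap.cpu().numpy(), cmap="jet")                      # red = strong activation
    plt.axis("off")
    plt.savefig(out_path, bbox_inches="tight", dpi=200)
    plt.close()

# Hypothetical usage: inspect the head's input features for one 640x640 DOTA patch.
# visualize_feature_map(model, model.head_input_layer, image_tensor, "cadh_response.png")
```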

4.8. Experimental Results in a Visual Format

To illustrate the improvements that RSNet brings to remote sensing object detection, we compare its outputs with those of the baseline algorithm and present the results visually. The comparison highlights clear differences in detection capability: RSNet recognizes small or visually intricate objects that the baseline tends to miss or misclassify, owing to the dedicated modules and architectural refinements described above. These improvements increase detection precision and make the model more dependable across varied operating conditions.
Figure 12 presents the visual comparison. RSNet detects smaller objects that the baseline overlooks and produces noticeably fewer false positives and false negatives, improving overall detection accuracy. At the same time, as shown in Table 9, the model retains a compact parameter profile, making it well suited to applications that require fast response. Taken together, these evaluations confirm that RSNet improves on the baseline in accuracy, adaptability, and real-time efficiency.

5. Conclusions

In this paper, we propose RSNet, a lightweight and high-precision detection framework tailored for small object detection in high-resolution remote sensing imagery. The framework was designed in response to the inherent challenges of remote sensing environments, such as high object density, complex backgrounds, and significant scale variation.
To address these challenges, RSNet integrates several core innovations: the Compact-Align Detection Head (CADH) for adaptive feature alignment and improved localization precision in dense scenes; the Adaptive Downsampling module (ADown) to preserve both global and local features during resolution reduction; and GSConv, a lightweight convolutional operator, which enhances feature representation while reducing computational complexity. Together, these components form a cohesive architecture that balances detection accuracy with inference efficiency. In addition, we adopt a K-fold cross-validation strategy to enhance the stability and credibility of model evaluation under spatially heterogeneous remote sensing data conditions.
Extensive experiments on two representative benchmarks, DOTA and NWPU VHR-10, demonstrate that RSNet consistently outperforms existing baseline models. It achieves higher mAP scores with fewer parameters and lower FLOPs, validating its effectiveness in real-time and resource-constrained remote sensing applications. Meanwhile, we acknowledge that RSNet’s current evaluation is limited to visible-spectrum data with predefined categories and scene types, which may not fully reflect the variability encountered in ultra-dense or cross-domain scenarios.
Building on this work, future efforts will focus on enhancing RSNet’s robustness in ultra-dense or occluded scenarios through advanced attention mechanisms, deformable convolutions, and self-supervised learning. We will also explore cross-domain generalization and real-time optimization. In addition, we plan to extend RSNet to multispectral and hyperspectral imagery to validate its spectral adaptability and further broaden its application scope.
In summary, RSNet presents a robust, scalable, and efficient solution for small object detection in remote sensing, offering significant advances in both accuracy and deployability across complex aerial imagery.

Author Contributions

Q.D. contributed to the conceptual design of the research. The initial manuscript was prepared by Q.D. and T.H. Data collection and organization were handled by G.W., while G.W. and L.S. provided technical input on data analysis. Language editing was performed by G.W. and T.H. All authors, including Q.D., T.H., G.W. and L.S., participated in the critical review and iterative revisions of the manuscript. Project supervision was undertaken by L.S. B.Q. contributed to data organization and provided valuable insights during manuscript preparation. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Key R&D Program of China (Grant No. 2019YFB1405302) and the National Natural Science Foundation of China (Grant No. 61872072).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

This study utilizes two publicly available datasets: the widely used DOTA benchmark and the NWPU VHR-10 dataset. The relevant data resources can be accessed at https://github.com/chaozhong2010/VHR-10_dataset_coco (accessed on 3 June 2025). Additional materials and experimental data supporting this work are available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
CADH: Compact-Align Detection Head
GSConv: Lightweight hybrid convolution with feature shuffle
ADown: Adaptive downsampling module
CNN: Convolutional Neural Network
mAP: mean Average Precision
AP: Average Precision
IoU: Intersection over Union
P: Precision
R: Recall
TPs: True Positives
FPs: False Positives
FN: False Negatives
GFLOPs: Giga Floating-point Operations
FPS: Frames Per Second

References

  1. Wang, Y.; Shao, Z.; Lu, T.; Wu, C.; Wang, J. Remote Sensing Image Super-Resolution via Multiscale Enhancement Network. IEEE Geosci. Remote Sens. Lett. 2023, 20, 1–5. [Google Scholar] [CrossRef]
  2. Huang, S.; Lin, C.; Jiang, X.; Qu, Z. BRSTD: Bio-Inspired Remote Sensing Tiny Object Detection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–15. [Google Scholar] [CrossRef]
  3. Xu, Q.; Shi, Y.; Yuan, X.; Zhu, X.X. Universal Domain Adaptation for Remote Sensing Image Scene Classification. IEEE Trans. Geosci. Remote Sens. 2023, 61, 4700515. [Google Scholar] [CrossRef]
  4. Li, R.; Zheng, S.; Duan, C.; Wang, L.; Zhang, C. Land cover classification from remote sensing images based on multi-scale fully convolutional network. Geo-Spat. Inf. Sci. 2022, 25, 278–294. [Google Scholar] [CrossRef]
  5. Shi, J.; Liu, W.; Shan, H.; Li, E.; Li, X.; Zhang, L. Remote Sensing Scene Classification Based on Multibranch Fusion Attention Network. IEEE Geosci. Remote Sens. Lett. 2023, 20, 1–5. [Google Scholar] [CrossRef]
  6. Neubeck, A.; Van Gool, L. Efficient Non-Maximum Suppression. In Proceedings of the 18th International Conference on Pattern Recognition (ICPR’06), Hong Kong, China, 20–24 August 2006; Volume 3, pp. 850–855. [Google Scholar] [CrossRef]
  7. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef]
  8. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single Shot MultiBox Detector. In Proceedings of the Computer Vision—ECCV 2016, Amsterdam, The Netherlands, 11–14 October 2016; Leibe, B., Matas, J., Sebe, N., Welling, M., Eds.; Springer: Cham, Switzerland, 2016; pp. 21–37. [Google Scholar]
  9. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar] [CrossRef]
  10. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common Objects in Context. In Proceedings of the Computer Vision—ECCV 2014, Zurich, Switzerland, 6–12 September 2014; Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T., Eds.; Springer: Cham, Switzerland, 2014; pp. 740–755. [Google Scholar]
  11. Everingham, M.; Gool, L.V.; Williams, C.K.I.; Winn, J.; Zisserman, A. The Pascal Visual Object Classes (VOC) Challenge. Int. J. Comput. Vis. 2010, 88, 303–338. [Google Scholar] [CrossRef]
  12. Liang, D.; Zhang, J.W.; Tang, Y.P.; Huang, S.J. MUS-CDB: Mixed Uncertainty Sampling with Class Distribution Balancing for Active Annotation in Aerial Object Detection. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–13. [Google Scholar] [CrossRef]
  13. Zhao, D.; Shao, F.; Liu, Q.; Zhang, H.; Zhang, Z.; Yang, L. Improved Architecture and Training Strategies of YOLOv7 for Remote Sensing Image Object Detection. Remote Sens. 2024, 16, 3321. [Google Scholar] [CrossRef]
  14. Zhu, Y.; Pan, Y.; Zhang, D.; Wu, H.; Zhao, C. A Deep Learning Method for Cultivated Land Parcels’ (CLPs) Delineation From High-Resolution Remote Sensing Images with High-Generalization Capability. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–25. [Google Scholar] [CrossRef]
  15. Han, T.; Dong, Q.; Wang, X.; Sun, L. BED-YOLO: An Enhanced YOLOv8 for High-Precision Real-Time Bearing Defect Detection. IEEE Trans. Instrum. Meas. 2024, 73, 1–13. [Google Scholar] [CrossRef]
  16. Cheng, A.; Xiao, J.; Li, Y.; Sun, Y.; Ren, Y.; Liu, J. Enhancing Remote Sensing Object Detection with K-CBST YOLO: Integrating CBAM and Swin-Transformer. Remote Sens. 2024, 16, 2885. [Google Scholar] [CrossRef]
  17. Viola, P.; Jones, M. Rapid object detection using a boosted cascade of simple features. In Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2001), Kauai, HI, USA, 8–14 December 2001. [CrossRef]
  18. Wan, D.; Lu, R.; Wang, S.; Shen, S.; Xu, T.; Lang, X. YOLO-HR: Improved YOLOv5 for Object Detection in High-Resolution Optical Remote Sensing Images. Remote Sens. 2023, 15, 614. [Google Scholar] [CrossRef]
  19. Shin, Y.; Shin, H.; Ok, J.; Back, M.; Youn, J.; Kim, S. DCEF2-YOLO: Aerial Detection YOLO with Deformable Convolution–Efficient Feature Fusion for Small Target Detection. Remote Sens. 2024, 16, 1071. [Google Scholar] [CrossRef]
  20. Zhang, J.; Chen, Z.; Yan, G.; Wang, Y.; Hu, B. Faster and Lightweight: An Improved YOLOv5 Object Detector for Remote Sensing Images. Remote Sens. 2023, 15, 4974. [Google Scholar] [CrossRef]
  21. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-End Object Detection with Transformers. In Proceedings of the Computer Vision—ECCV 2020, Glasgow, UK, 23–28 August 2020; Vedaldi, A., Bischof, H., Brox, T., Frahm, J.M., Eds.; Springer: Cham, Switzerland, 2020; pp. 213–229. [Google Scholar]
  22. Liu, S.; Li, F.; Zhang, H.; Yang, X.; Qi, X.; Su, H.; Zhu, J.; Zhang, L. DAB-DETR: Dynamic Anchor Boxes are Better Queries for DETR. In Proceedings of the International Conference on Learning Representations, Virtual, 25–29 April 2022. [Google Scholar]
  23. Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. DETRs Beat YOLOs on Real-time Object Detection. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 16965–16974. [Google Scholar] [CrossRef]
  24. Seo, M.; Lee, H.; Jeon, Y.; Seo, J. Self-Pair: Synthesizing Changes from Single Source for Object Change Detection in Remote Sensing Imagery. In Proceedings of the 2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 2–7 January 2023; pp. 6363–6372. [Google Scholar] [CrossRef]
  25. Lang, K.; Yang, M.; Wang, H.; Wang, H.; Wang, Z.; Zhang, J.; Shen, H. Improved One-Stage Detectors with Neck Attention Block for Object Detection in Remote Sensing. Remote Sens. 2022, 14, 5805. [Google Scholar] [CrossRef]
  26. Zhang, J.; Lei, J.; Xie, W.; Fang, Z.; Li, Y.; Du, Q. SuperYOLO: Super Resolution Assisted Object Detection in Multimodal Remote Sensing Imagery. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5605415. [Google Scholar] [CrossRef]
  27. Bayrak, O.C.; Erdem, F.; Uzar, M. Deep learning based aerial imagery classification for tree species identification. Int. Arch. Photogramm. Remote Sens. Spat. Inf. Sci. 2023, XLVIII-M-1-2023, 471–476. [Google Scholar] [CrossRef]
  28. Wang, Y.; Ma, L.; Wang, Q.; Wang, N.; Wang, D.; Wang, X.; Zheng, Q.; Hou, X.; Ouyang, G. A Lightweight and High-Accuracy Deep Learning Method for Grassland Grazing Livestock Detection Using UAV Imagery. Remote Sens. 2023, 15, 1593. [Google Scholar] [CrossRef]
  29. Gu, L.; Fang, Q.; Wang, Z.; Popov, E.; Dong, G. Learning Lightweight and Superior Detectors with Feature Distillation for Onboard Remote Sensing Object Detection. Remote Sens. 2023, 15, 370. [Google Scholar] [CrossRef]
  30. Liu, S.; Shao, F.; Chu, W.; Dai, J.; Zhang, H. An Improved YOLOv8-Based Lightweight Attention Mechanism for Cross-Scale Feature Fusion. Remote Sens. 2025, 17, 1044. [Google Scholar] [CrossRef]
  31. Zhang, J.; Lei, J.; Xie, W.; Li, Y.; Yang, G.; Jia, X. Guided Hybrid Quantization for Object Detection in Remote Sensing Imagery via One-to-One Self-Teaching. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5614815. [Google Scholar] [CrossRef]
  32. Dong, Z.; Wang, M.; Wang, Y.; Zhu, Y.; Zhang, Z. Object Detection in High Resolution Remote Sensing Imagery Based on Convolutional Neural Networks with Suitable Object Scale Features. IEEE Trans. Geosci. Remote Sens. 2020, 58, 2104–2114. [Google Scholar] [CrossRef]
  33. Bai, P.; Xia, Y.; Feng, J. Composite Perception and Multiscale Fusion Network for Arbitrary-Oriented Object Detection in Remote Sensing Imagery. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5645916. [Google Scholar] [CrossRef]
  34. Lin, Q.; Chen, N.; Huang, H.; Zhu, D.; Fu, G.; Chen, C.; Yu, Y. Attention-Based Mean-Max Balance Assignment for Oriented Object Detection in Optical Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2025, 63, 1–15. [Google Scholar] [CrossRef]
  35. Yang, Y.; Dai, J.; Wang, Y.; Chen, Y. FM-RTDETR: Small Object Detection Algorithm Based on Enhanced Feature Fusion with Mamba. IEEE Signal Process. Lett. 2025, 32, 1570–1574. [Google Scholar] [CrossRef]
  36. Zhang, Y.; Ye, M.; Zhu, G.; Liu, Y.; Guo, P.; Yan, J. FFCA-YOLO for Small Object Detection in Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5611215. [Google Scholar] [CrossRef]
  37. Li, H.; Li, J.; Wei, H.; Liu, Z.; Zhan, Z.; Ren, Q. Slim-neck by GSConv: A lightweight-design for real-time detector architectures. J. Real-Time Image Process. 2024, 21, 62. [Google Scholar] [CrossRef]
  38. Chollet, F. Xception: Deep Learning with Depthwise Separable Convolutions. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 1800–1807. [Google Scholar] [CrossRef]
  39. Zhang, X.; Zhou, X.; Lin, M.; Sun, J. ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6848–6856. [Google Scholar] [CrossRef]
  40. Wang, C.Y.; Yeh, I.H.; Mark Liao, H.Y. YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information. In Proceedings of the Computer Vision—ECCV 2024: 18th European Conference, Milan, Italy, 29 September–4 October 2024; Springer: Berlin/Heidelberg, Germany, 2024; pp. 1–21. [Google Scholar] [CrossRef]
  41. Xia, G.S.; Bai, X.; Ding, J.; Zhu, Z.; Belongie, S.; Luo, J.; Datcu, M.; Pelillo, M.; Zhang, L. DOTA: A Large-Scale Dataset for Object Detection in Aerial Images. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 3974–3983. [Google Scholar] [CrossRef]
  42. Cheng, G.; Zhou, P.; Han, J. Learning Rotation-Invariant Convolutional Neural Networks for Object Detection in VHR Optical Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2016, 54, 7405–7415. [Google Scholar] [CrossRef]
  43. Yang, X.; Yang, J.; Yan, J.; Zhang, Y.; Zhang, T.; Guo, Z.; Sun, X.; Fu, K. SCRDet: Towards More Robust Detection for Small, Cluttered and Rotated Objects. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019. [Google Scholar]
  44. Azimi, S.M.; Vig, E.; Bahmanyar, R.; Körner, M.; Reinartz, P. Towards Multi-class Object Detection in Unconstrained Remote Sensing Imagery. In Proceedings of the Computer Vision—ACCV 2018, Perth, Australia, 2–6 December 2018; Jawahar, C.V., Li, H., Mori, G., Schindler, K., Eds.; Springer: Cham, Switzerland, 2019; pp. 150–165. [Google Scholar]
  45. Zhou, Q.; Yu, C.; Wang, Z.; Wang, F. D2Q-DETR: Decoupling and Dynamic Queries for Oriented Object Detection with Transformers. In Proceedings of the ICASSP 2023—2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; pp. 1–5. [Google Scholar] [CrossRef]
  46. He, X.; Liang, K.; Zhang, W.; Li, F.; Jiang, Z.; Zuo, Z.; Tan, X. DETR-ORD: An Improved DETR Detector for Oriented Remote Sensing Object Detection with Feature Reconstruction and Dynamic Query. Remote Sens. 2024, 16, 3516. [Google Scholar] [CrossRef]
  47. Wang, P.; Sun, X.; Diao, W.; Fu, K. FMSSD: Feature-Merged Single-Shot Detection for Multiscale Objects in Large-Scale Remote Sensing Imagery. IEEE Trans. Geosci. Remote Sens. 2020, 58, 3377–3390. [Google Scholar] [CrossRef]
Figure 1. This figure showcases representative remote sensing images arranged from left to right by object scale—large, medium, and small. As the object size decreases, the relative proportion of targets becomes smaller, and detection difficulty increases due to cluttered backgrounds and reduced visual cues. This progressive composition illustrates the scale variation challenge central to small object detection in remote sensing scenarios.
Figure 2. Architecture of the RSNet.
Figure 3. The CADH process and architecture.
Figure 4. The GSConv process and architecture.
Figure 5. The ADown process and architecture.
Figure 6. Workflow of the embedded deployment process.
Figure 7. (Left) Dataset statistics visualization, including category distribution, spatial distribution, and bounding box aspect ratio analysis. (Right) Sample raw images from the DOTA dataset.
Figure 8. A few chosen raw images from NWPU VHR-10.
Figure 9. P-R curves of the baseline Model and the RSNet on DOTA and NWPU VHR-10 datasets.
Figure 10. Benchmarking performance comparison between RSNet and representative detection models on the DOTA and NWPU VHR-10 datasets.
Figure 11. Feature map visualization on the DOTA dataset. From left to right: original image, after CADH, and after RSNet. RSNet yields more focused and structured activations with clearer object localization.
Figure 12. Comparison of detection results between the baseline and the proposed method on the DOTA and NWPU VHR-10 datasets. From top to bottom: original remote sensing image, detection output of the baseline algorithm, and detection output of the proposed approach.
Table 1. Symbol correspondence between mathematical notation and pseudocode.
Equation | Pseudocode | Description
feat | F_shared | Shared feature representation
avgFeat | F_avg | Global context feature
clsFeat | F_cls | Classification-specific feature
regFeat_aligned | F_reg-aligned | Aligned regression feature
clsProb | P_cls | Class probability map
clsFeat · clsProb | F_cls-refined | Refined classification feature
finalOutput | O | Final detection output
Table 2. Statistical analysis of object sizes in datasets.
Dataset | 10–50 pixels | 50–300 pixels | Above 300 pixels
NWPU VHR-10 | 0.15 | 0.83 | 0.02
DOTA | 0.57 | 0.41 | 0.02
Table 3. Parameters for model training.
Training Hyperparameter | Selected Configuration
Number of Epochs | 200
Image Dimensions | 640 × 640
Batch Size | 16
Data Augmentation Method | Mosaic
Table 4. Hyperparameters for training.
Hyperparameter | Setup
Optimizer | SGD
Initial Learning Rate (Lr0) | 0.01
Final Learning Rate (Lrf) | 0.01
Momentum | 0.937
Weight Decay Coefficient | 0.0005
Image Scale | 0.5
Image Flip Left-Right | 0.5
Image Translation | 0.1
Mosaic | 1.0
Table 5. Comparison of performance metrics of models on the DOTA and NWPU VHR-10 datasets.
Module | Dataset | Par. (M) | GFLOPs | P (%) | R (%) | mAP@50 (%)
YOLOv8s | DOTA | 11.12 | 28.5 | 86.1 | 75.7 | 80.9
RSNet | DOTA | 6.49 | 24.6 | 89.6 | 77.3 | 82.7
YOLOv8s | NWPU | 11.12 | 28.5 | 91.4 | 81.9 | 88.5
RSNet | NWPU | 6.49 | 24.6 | 91.6 | 86.7 | 92.2
Table 6. Jetson Nano inference comparison.
Model | Parameters (M) | FPS (Jetson Nano) | GFLOPs
Baseline | 11.12 | 18.4 | 28.4
RSNet | 6.49 | 72.6 | 24.6
Table 7. Outcome analysis of ablation experiments.
Dataset | Model | ADown | GSConv | CADH | Parameters (M) | GFLOPs | P (%) | R (%) | mAP50 (%)
DOTA | YOLOv8s | – | – | – | 11.12 | 28.5 | 86.1 | 75.7 | 80.9
DOTA | +ADown | ✓ | – | – | 9.48 | 25.7 | 88.6 | 73.7 | 81.3
DOTA | +GSConv | – | ✓ | – | 5.92 | 16.1 | 86.6 | 75.3 | 80.5
DOTA | +CADH | – | – | ✓ | 8.87 | 33.0 | 88.1 | 76.7 | 82.2
DOTA | +GSConv+CADH | – | ✓ | ✓ | 4.53 | 10.9 | 87.0 | 76.3 | 81.6
DOTA | RSNet | ✓ | ✓ | ✓ | 6.49 | 24.6 | 89.6 | 77.3 | 82.7
NWPU | YOLOv8s | – | – | – | 11.12 | 28.5 | 91.4 | 81.9 | 88.5
NWPU | +ADown | ✓ | – | – | 9.48 | 25.7 | 90.1 | 81.2 | 89.8
NWPU | +GSConv | – | ✓ | – | 5.92 | 16.1 | 91.3 | 80.0 | 88.0
NWPU | +CADH | – | – | ✓ | 8.87 | 33.0 | 89.5 | 87.3 | 91.9
NWPU | +GSConv+CADH | – | ✓ | ✓ | 4.53 | 10.9 | 91.3 | 86.1 | 91.6
NWPU | RSNet | ✓ | ✓ | ✓ | 6.49 | 24.6 | 91.6 | 86.7 | 92.2
Table 8. Cross-validation results of the proposed model on the DOTA dataset.
Fold | mAP50 (%) | P (%) | R (%) | GFLOPs | Para. (M)
1 | 82.4 | 87.4 | 77.4 | 24.6 | 6.49
2 | 82.9 | 86.1 | 78.8 | 24.6 | 6.49
3 | 82.7 | 86.6 | 79.1 | 24.6 | 6.49
4 | 82.5 | 87.1 | 77.1 | 24.6 | 6.49
5 | 82.5 | 86.3 | 78.3 | 24.6 | 6.49
Average | 82.6 | 86.7 | 78.1 | 24.6 | 6.49
Table 9. Comparative performance analysis of advanced algorithms on the DOTA and NWPU datasets.
DOTA dataset:
Model | Precision (%) | Recall (%) | mAP50 (%) | Params (M)
SCRDet [43] | 75.4 | 84.2 | 71.7 | –
ICN [44] | 72.5 | 83.0 | 77.7 | –
Faster R-CNN | 62.6 | 76.7 | 58.3 | 107.8
FADet | 75.4 | 88.2 | 78.6 | –
SSD | 59.6 | 68.4 | 56.2 | 90.7
RT-DETR [23] | – | – | 68.1 | 32.8
D2Q-DETR [45] | – | – | 78.8 | –
DETR-ORD [46] | – | – | 72.1 | –
YOLO v3 | 60.9 | 69.0 | 57.1 | 60.4
YOLO v5l | 81.5 | 69.1 | 73.9 | 45.5
YOLO v7 | 77.9 | 71.7 | 75.5 | 51.1
YOLO v8s | 86.1 | 75.7 | 80.9 | 11.12
YOLO v9 | 79.6 | 72.0 | 76.5 | 61.8
YOLO v10s | 78.4 | 72.7 | 75.5 | 8.04
YOLO 12s | 84.2 | 69.5 | 76.0 | 9.23
FFCA-YOLO [36] | 84.4 | 75.1 | 81.5 | 7.12
FMSSD [47] | 72.4 | 82.1 | 67.5 | 61.3
RSNet (Ours) | 89.6 | 77.3 | 82.7 | 6.49
NWPU VHR-10 dataset:
Model | Precision (%) | Recall (%) | mAP50 (%) | Params (M)
Faster R-CNN | 76.4 | 86.4 | 70.8 | 60.77
FPN | 88.3 | 85.2 | 90.2 | 50.7
SSD | 79.1 | 90.2 | 78.2 | 90.7
YOLO v3 | 72.3 | 73.6 | 72.9 | 60.4
YOLO v4 | 86.9 | 87.8 | 89.7 | 64.4
YOLO v7 | 90.3 | 88.6 | 91.1 | 36.9
YOLO v7-tiny | 88.1 | 84.4 | 88.4 | 6.2
YOLO v8s | 91.4 | 81.9 | 88.5 | 11.12
YOLO v10s | 75.7 | 73.3 | 79.5 | 8.04
YOLO 12s | 85.1 | 77.1 | 83.2 | 9.23
RSNet (Ours) | 91.6 | 86.7 | 92.2 | 6.49