Article

DCEDet: Tiny Object Detection in Remote Sensing Images Based on Dual-Contrast Feature Enhancement and Dynamic Distance Measurement

1 School of Information and Communication Engineering, Hainan University, Haikou 570228, China
2 School of Computer Science and Cyber Engineering, Guangzhou University, Guangzhou 510006, China
3 Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100094, China
* Author to whom correspondence should be addressed.
Remote Sens. 2025, 17(16), 2876; https://doi.org/10.3390/rs17162876
Submission received: 7 June 2025 / Revised: 6 August 2025 / Accepted: 14 August 2025 / Published: 18 August 2025

Abstract

Recent advances in deep learning have significantly improved remote sensing object detection (RSOD). However, tiny object detection (TOD) remains challenging due to two main issues: (1) limited appearance cues and (2) the traditional Intersection over Union (IoU)-based label assignment strategy, which struggles to identify enough positive samples. To address these, we propose DCEDet, a new tiny object detector for remote sensing images that enhances feature representation and optimizes label assignment. Specifically, we first design a dual-contrast feature enhancement structure, i.e., the Group-Single Context Enhancement Module (GSCEM) and Global-Local Feature Fusion Module (GLFFM). Among them, the GSCEM is designed to extract contextual enhancement features as supplementary information for TOD. The GLFFM is a feature fusion module devised to integrate both global object distribution and local detail information, aiming to prevent information loss and enhance the localization of tiny objects. In addition, the Normalized Distance and Difference Metric (NDDM) is designed as a dynamic distance measurement that enhances class representation and localization performance in TOD, thereby optimizing the training process. Finally, we conduct extensive experiments on two typical tiny object datasets, i.e., AI-TODv2 and LEVIR-SHIP, achieving optimal results of 27.8% AP_t and 81.2% AP_50. The experimental results demonstrate the effectiveness and superiority of our method.

1. Introduction

Remote sensing object detection (RSOD) in visible remote sensing images aims to accurately classify and localize objects of interest, and it plays a vital role in applications such as military security [1], maritime rescue [2], and traffic monitoring [3]. Unlike hyperspectral images [4,5], which suffer from low spatial resolution, or synthetic aperture radar images [6], which require complex processing, visible images offer finer spatial details and better interpretability, making them the most widely used data source for object detection [7]. Traditional object detection methods usually rely on handcrafted features [8,9], which limit their ability to represent complex backgrounds and diverse objects [10]. In contrast to traditional methods, deep learning-based object detection methods [11,12,13] can automatically learn robust feature representations and often achieve significantly better performance. Affected by this, RSOD based on deep learning has grown by leaps and bounds. Unlike object detection in natural scenes, RSOD presents several unique challenges [14,15] that need to be addressed, including multi-scale detection, rotated detection, weak detection, tiny detection, and detection under limited supervision. Among these, tiny object detection (TOD) has attracted significant attention due to its widespread occurrence and substantial application value [16,17,18].
In order to analyze the scales of tiny objects more specifically in RSOD, researchers consider objects with absolute size in the range (2∼8 pixels) as very tiny, (8∼16 pixels) as tiny, (16∼32 pixels) as small, and (32∼64 pixels) as medium [19,20,21,22,23]. Some researchers focus on the TOD task. For example, Liu et al. [24] propose a multiple receptive field adaptive feature refinement module to enhance the detection of tiny objects. Wu et al. [25] adopt an alignment mechanism combined with a progressive optimization strategy to extract more discriminative features and achieve precise localization. Zhang et al. [20] design a gated context aggregation module along with a central region label assignment strategy to effectively integrate contextual information and generate more positive samples for TOD.
However, the detection performance of the aforementioned methods for tiny objects remains unsatisfactory. We identify two key challenges that need to be addressed: (1) Limited visual appearance. Tiny objects in remote sensing images occupy only a few pixels, resulting in blurred appearances and dense distributions, as shown in Figure 1a. These characteristics of tiny objects make it difficult for detectors to extract sufficient feature representations [25]. (2) High sensitivity of the IoU metric to localization deviations. Our baseline model, i.e., Faster R-CNN [11], adopts the IoU metric to assign positive and negative samples during training. As shown in Figure 1b, slight positional deviations have varying impacts on different object sizes, which is consistent with the analysis in [26]. Specifically, for a tiny object, a two-pixel shift significantly reduces the IoU (from 0.92 to 0.20). For a normal object, the same shift causes only a slight drop (from 0.93 to 0.81). This disparity in IoU variation makes it difficult to obtain enough positive samples for tiny objects, resulting in an imbalance between positive and negative samples. Some methods [20,27,28,29] replace the IoU metric with alternative criteria; however, these strategies typically fail to dynamically adjust sample weights, which limits their effectiveness in improving the performance of tiny object detectors.
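To make this sensitivity concrete, the IoU drop caused by a fixed pixel offset can be checked with a few lines of code. The boxes below are hypothetical (an 8×8 tiny object versus a 64×64 normal object), so the numbers only illustrate the trend rather than reproduce the exact values reported in Figure 1b.

```python
def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

# The same 2-pixel diagonal shift is far more damaging for the tiny box.
print(iou((0, 0, 8, 8), (2, 2, 10, 10)))      # ~0.39 for an 8x8 tiny object
print(iou((0, 0, 64, 64), (2, 2, 66, 66)))    # ~0.88 for a 64x64 object
```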
To address the above issues, we propose DCEDet, a tiny object detector that incorporates dual-contrast feature enhancement and dynamic distance measurement. We first design a dual-contrast feature enhancement structure, i.e., the Group-Single Context Enhancement Module (GSCEM) and Global-Local Feature Fusion Module (GLFFM). The term “dual-contrast” refers to the two complementary strategies to enhance feature discriminability for tiny object detection. The GSCEM enhances intra-layer contrast by integrating multi-scale contextual information at different levels of granularity. Specifically, the GSCEM consists of two core components: GSCEM-Group (GSCEM-G) and GSCEM-Single (GSCEM-S). GSCEM-G employs hierarchical connections across group-channel feature maps to capture hybrid receptive fields, while GSCEM-S applies attention mechanisms at the single-channel level to refine features with fine-grained spatial context. Together, they enable the model to better distinguish tiny objects from complex backgrounds within individual feature layers. The GLFFM enhances inter-layer contrast by improving the integration of semantic and spatial information across adjacent layers. It simultaneously models global object distribution and local detail information while mitigating information loss, thereby boosting the localization accuracy of tiny objects. This dual-contrast design strengthens both within-layer feature enhancement and cross-layer feature fusion, effectively addressing the limited appearance cues and scale variations characteristic of remote sensing tiny object detection. Additionally, we introduce a dynamic distance measurement based on the Gaussian Probability Distribution, termed the Normalized Distance and Difference Metric (NDDM), as an alternative to the IoU metric. It dynamically assigns positive samples for tiny objects during the training phase, further enhancing class representation and localization information to achieve balanced label assignment.
Our main contributions are summarized as follows:
  • We propose a new tiny object detector for remote sensing images named DCEDet, which improves detection performance by enhancing feature representation and aligning with a suitable label assignment strategy.
  • We present the GSCEM and the GLFFM, which respectively extract contextual information and fuse multi-view features, in order to improve the feature representation of tiny objects.
  • We devise the NDDM to replace the IoU-based label assignment in the Region Proposal Network (RPN), thereby facilitating the assignment of positive samples for tiny objects.
  • To demonstrate the effectiveness of our method, we conduct extensive experiments on two tiny object detection datasets, achieving optimal performance.
The rest of this paper is organized as follows. The related work is presented in Section 2. The proposed method is described in Section 3. The experiments are shown in Section 4. The conclusion is given in Section 5.

2. Related Work

2.1. Generic Object Detection

With the advancement of convolutional neural networks, object detection has made significant progress and is generally categorized into two main types: two-stage methods and one-stage methods [30,31]. Specifically, Faster R-CNN [11] and Cascade R-CNN [32] are representative two-stage detectors that include region proposal extraction and detection. SSD [12] and RetinaNet [33] are representative single-stage detectors that use a unified framework to directly predict object locations and categories. In addition to the anchor-based detectors mentioned above, anchor-free detectors, such as FCOS [34], CornerNet [35], and CenterNet [36], achieve object localization by predicting the centers or corners of objects.

2.2. Object Detection in Remote Sensing Images

Building upon general object detection, RSOD has seen substantial advancements [37]. However, object detection in remote sensing images still faces several challenges. First, due to the varying spatial resolutions of remote sensing images, scale variation remains a significant challenge in object detection tasks, attracting widespread attention [38,39,40]. Zhang et al. [38] design a scale-adaptive module to enrich feature representation. Ma et al. [39] tackle feature confusion caused by multi-scale objects through a feature split–merge–enhancement network. Liu et al. [40] integrate a context enhancement module to leverage rich semantic information for improved multi-scale detection. Second, bird’s-eye view in remote sensing images often results in objects with arbitrary orientations, which has spurred extensive research into oriented object detection [41,42,43]. Han et al. [41] propose a feature alignment module and an oriented detection module to improve detection performance. Cheng et al. [42] design an oriented proposal network and a localization-guided detection head to alleviate the feature misalignment between classification and localization. Yao et al. [43] introduce a rotated bounding box representation in the polar coordinate system to address the boundary discontinuity problem. Third, remote sensing images often contain substantial background noise, prompting various efforts to enhance and emphasize foreground regions [44,45,46]. Yu et al. [44] design a feature response separation module to separate background and objects as much as possible. Huang et al. [45] propose a nonlocal-aware pyramid attention mechanism that helps the network focus on salient features while suppressing background noise. Guo et al. [46] adopt a feature pyramid fusion module and a head enhancement module to improve foreground–background discrimination.
Furthermore, limited supervision [47,48] and few-shot learning [49,50] are emerging research directions that aim to reduce the burden of manual annotation. Despite these efforts, many existing studies overlook the challenge of tiny object detection (TOD), which is both common and critical in remote sensing imagery. To address this gap, we propose DCEDet to improve TOD performance.

2.3. Tiny Object Detection in Remote Sensing Images

In RSOD, numerous methods are proposed to improve the detection performance of tiny objects, which can be broadly categorized into three strategies: (1) data-based strategy, (2) feature enhancement, and (3) label assignment strategy.
Specifically, the first strategy focuses on transforming and expanding the training data to enhance the model’s generalization ability. For example, Akyon et al. [51] expand the dataset by extracting and resizing patches from images. Kisantal et al. [52] apply oversampling and use a “copy-paste” technique to increase the number of tiny objects. Yu et al. [53] propose a scale-match approach that aligns object scales between the training and pretraining datasets to enhance the representation of tiny objects.
The second strategy adopts feature enhancement techniques to improve detection performance, including multi-scale learning [54,55], contextual information [20,56], attention mechanism [57,58,59], and feature fusion [60,61,62]. For instance, Liu et al. [54] propose a denoising feature pyramid module to extract undistorted multi-scale features for precise detection. Tong et al. [55] design a multi-scale label supervision network to enhance small object detection. Zhang et al. [20] utilize a gated context aggregation module to enhance feature representation of tiny objects. Zhao et al. [56] propose a scene context detection network that distinguishes tiny objects from complex backgrounds. Li et al. [57] employ a mask augmented attention mechanism to detect tiny objects in remote sensing images. Shi et al. [58] present an attention mechanism with cross-dimensional interaction to enhance the perception of small objects. Liu et al. [59] apply explicit supervision to tiny object regions during training, generating attention maps that enhance relevant regions and suppress background noise. Gong et al. [60] propose fusion factors to control the information flow from deep to shallow layers, making the Feature Pyramid Network (FPN) [63] more suitable for TOD. Zhang et al. [61] introduce a feature fusion encoder and decoder, improving the performance of infrared small object detection. Chen et al. [62] introduce a saliency learning interaction module to improve feature interaction and reduce background interference for TOD. Additionally, methods based on GANs [64] and super-resolution [65] techniques can achieve feature enhancement for tiny objects.
Third, based on the previous analysis, an inappropriate label assignment strategy may lead to an imbalance between positive and negative samples. To address this issue, several methods focus on optimizing label assignment for tiny objects [20,21,22,23,27,28]. Zhang et al. [20] introduce a center-region-based label assignment to provide more positive samples for tiny objects. Xu et al. [21] adopt the Wasserstein Distance as a new evaluation metric and use a ranking-based assignment strategy for TOD. Ge et al. [22] propose a sample selection strategy based on adaptive dynamic label assignment, which assigns individual thresholds to ground-truth boxes to optimize training. Fu et al. [23] employ a Gaussian-similarity-based label assignment strategy to assign high-quality anchors to tiny objects. Xu et al. [27] propose a Dot Distance metric to reduce sensitivity to localization offsets. Xu et al. [28] introduce a Receptive Field Distance to measure similarity between the Gaussian receptive field and ground truth, facilitating more balanced learning for tiny objects.
Inspired by previous studies, our proposed DCEDet improves the detection performance of tiny objects in two key aspects: feature enhancement and label assignment strategy. Specifically, we first propose a multi-scale contextual information enhancement structure, namely GSCEM, and a localization feature enhancement module, namely GLFFM. Furthermore, we design a dynamic distance measurement, NDDM, aiming to balance the positive and negative samples during training.

3. Methodology

In this section, we first introduce the overall architecture of DCEDet. Then we explain GSCEM, GLFFM, and NDDM in detail. Finally, we present the loss function.

3.1. Overview

Figure 2 illustrates the overall architecture of our proposed DCEDet. DCEDet is based on Faster R-CNN and consists of five components: the backbone for feature extraction, the neck for feature fusion, the RPN [11] for anchor generation, the Region of Interest Align (RoI Align) [66] for feature alignment, and the detection head for classification and regression. More importantly, GSCEM, GLFFM, and NDDM are introduced in different components to improve TOD in remote sensing images. Specifically, the GSCEM is added after ResNet50 [67] at each layer C_i (i ∈ {2, 3, 4, 5}) to enhance the representation of tiny objects by aggregating multi-scale contextual information. The GLFFM is constructed in the FPN to combine global object distribution and local detail information. In the RPN module, we design a new label assignment strategy for TOD, called NDDM, to address the imbalance between positive and negative samples during training.

3.2. Group–Single Context Enhancement Module

In TOD, the low resolution and limited pixel count of tiny objects hinder detection modules from extracting sufficient features, resulting in suboptimal detection performance. On the other hand, contextual information is beneficial for detecting tiny objects. Therefore, we propose a GSCEM to obtain contextual enhancement features for information supplementation.
As shown in Figure 2, GSCEM consists of two core modules: GSCEM-G and GSCEM-S. GSCEM-G employs hierarchical connections to obtain hybrid receptive fields on feature maps at different group-channel levels, thereby capturing multi-scale contextual information. GSCEM-S leverages an attention mechanism to enhance contextual information on feature maps at the single-channel level, enabling GSCEM to utilize multi-scale contextual information at a finer granularity. Specifically, for the input feature map F_in ∈ R^{C×H×W}, we perform a channel split to divide F_in into four feature groups, denoted as x_i (i ∈ {1, 2, 3, 4}). The channel dimension in each group x_i is one-fourth of that in the input feature map, while the spatial sizes remain the same. After each x_i, there is an operation block C_i(·) that includes a convolutional layer, a batch normalization layer, and an activation layer. As shown in GSCEM-G, each block has a different convolution kernel size to model different receptive fields. In addition, we utilize hierarchical connections to fuse the feature maps of adjacent groups. Therefore, y_i can be written as
y_i = \begin{cases} C_i(x_i), & i = 1 \\ C_i(C_{i-1}(x_{i-1}) + x_i), & i = 2, 3, 4 \end{cases}
Then, we concatenate the y_i of different receptive fields with F_in and apply a 1×1 convolutional layer, obtaining the feature map F_m:
F_c = \mathrm{Cat}[y_1, y_2, y_3, y_4, F_{in}]
F_m = \delta(B(\mathrm{Conv}_{1\times 1}(F_c)))
where \mathrm{Cat}[\cdot] denotes the concatenation operation, \mathrm{Conv}_{1\times 1} represents the 1×1 convolution, B indicates batch normalization, and δ represents the ReLU activation function.
Finally, F_m is fed into the GSCEM-S, which is based on the channel attention mechanism. Specifically, GSCEM-S adopts a dual-branch residual structure similar to that proposed in [68], which merges feature weights and local information at the single-channel level. For the first branch, GSCEM-S applies a convolution along the spatial dimension to capture local contextual features, such as edges and textures, which can be modeled as
L = B(\mathrm{Conv}_{3\times 3}(\delta(B(\mathrm{Conv}_{3\times 3}(F_m)))))
where \mathrm{Conv}_{3\times 3} represents the 3×3 convolutional layer. The second branch performs global average pooling followed by a convolutional layer to generate channel-wise attention weights, reflecting the global semantic importance of each channel.
G = \sigma(B(\mathrm{Conv}_{1\times 1}(\mathrm{GAP}(F_m))))
where \mathrm{GAP} denotes the global average pooling and σ(·) denotes the Sigmoid activation function. In the end, these two complementary features, i.e., G ∈ R^{C×1×1} and L ∈ R^{C×H×W}, are fused to output the final feature map F_out via multiplication and residual connection.
F_{out} = (G \otimes L) \oplus F_m
where ⊕ denotes the element-wise sum and ⊗ denotes the element-wise multiplication.
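A minimal PyTorch sketch of GSCEM assembled from Equations (1)-(6) is given below. The group count of four and the 3/5/7/9 kernel sizes follow the text, while the padding choices and the Res2Net-style reading of the hierarchical connection (each group additionally receives the previous group's output) are our assumptions rather than the authors' released code.

```python
import torch
import torch.nn as nn

class GSCEM(nn.Module):
    """Group-Single Context Enhancement Module (sketch).

    GSCEM-G: split the input into four channel groups, apply 3/5/7/9
    convolution blocks with hierarchical connections (Eq. 1), then
    concatenate with the input and fuse with a 1x1 convolution (Eqs. 2-3).
    GSCEM-S: dual-branch channel attention with a residual connection (Eqs. 4-6).
    """

    def __init__(self, channels: int):
        super().__init__()
        assert channels % 4 == 0
        c = channels // 4
        self.blocks = nn.ModuleList([
            nn.Sequential(nn.Conv2d(c, c, k, padding=k // 2),
                          nn.BatchNorm2d(c), nn.ReLU(inplace=True))
            for k in (3, 5, 7, 9)])
        self.fuse = nn.Sequential(nn.Conv2d(channels * 2, channels, 1),
                                  nn.BatchNorm2d(channels), nn.ReLU(inplace=True))
        # GSCEM-S local branch: spatial context such as edges and textures (Eq. 4).
        self.local = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1), nn.BatchNorm2d(channels))
        # GSCEM-S global branch: channel-wise attention weights (Eq. 5).
        self.globl = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(channels, channels, 1),
            nn.BatchNorm2d(channels), nn.Sigmoid())

    def forward(self, f_in: torch.Tensor) -> torch.Tensor:
        xs = torch.chunk(f_in, 4, dim=1)
        ys = []
        for i, (x, block) in enumerate(zip(xs, self.blocks)):
            # Hierarchical connection: each group also receives the previous output.
            ys.append(block(x if i == 0 else x + ys[-1]))
        f_m = self.fuse(torch.cat(ys + [f_in], dim=1))   # Eqs. (2)-(3)
        return self.globl(f_m) * self.local(f_m) + f_m   # Eq. (6)
```

As described in Section 3.1, one GSCEM instance would be attached to each backbone stage output C_2 through C_5.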

3.3. Global–Local Feature Fusion Module

Most object detectors for remote sensing images rely on the FPN, which combines deep semantic features with shallow detail features through an element-wise addition. Specifically, FPN is a widely adopted architecture for multi-scale feature representation in object detection. It constructs a top-down pathway with lateral connections to fuse high-level semantic features and low-level spatially rich features. This hierarchical design enables the model to detect objects at various scales by progressively integrating information from different feature layers. However, the limited appearance cues of tiny objects lead to weak feature representation, and many feature fusion strategies are unsuitable for TOD. Therefore, we introduce a new feature fusion module, i.e., GLFFM, which improves the localization performance of tiny objects by effectively integrating global object distribution and local detail information while mitigating information loss.
As illustrated in Figure 2, for the features F_low ∈ R^{C×H×W} and F_high ∈ R^{C×H×W} from adjacent layers in the neck component, we first concatenate them along the channel dimension to obtain the feature map F_c ∈ R^{2C×H×W}, which incorporates both semantic and detailed information. The form of F_c is described as follows:
F_c = \mathrm{Cat}[F_{low}, F_{high}]
The feature map F_c is fed into three branches to capture multi-view feature maps. Specifically, the top branch adopts Global Average Pooling (GAP) to obtain global channel weights. The bottom branch follows a structure similar to that of the top branch, utilizing Global Max Pooling (GMP) to derive channel weights from a different perspective. Subsequently, the weights of the top branch and bottom branch are multiplied by the feature map F_c, yielding the feature maps F_t ∈ R^{2C×H×W} and F_b ∈ R^{2C×H×W}, which contain global object distribution information.
F_t = F_c \otimes \sigma(\mathrm{Conv}_{1\times 1}(\delta(\mathrm{Conv}_{1\times 1}(\mathrm{GAP}(F_c)))))
F_b = F_c \otimes \sigma(\mathrm{Conv}_{1\times 1}(\delta(\mathrm{Conv}_{1\times 1}(\mathrm{GMP}(F_c)))))
where \mathrm{GAP} represents the global average pooling and \mathrm{GMP} is the global max pooling.
The middle branch uses convolutions and an activation function to obtain the feature map F_m, which contains local detail information. The formulation of F_m is as follows:
F_m = \mathrm{Conv}_{3\times 3}(\delta(\mathrm{Conv}_{3\times 3}(F_c)))
Finally, the global feature maps F_t and F_b are added to the local feature map F_m. A 3×3 convolutional layer is then applied for information interaction and fusion, resulting in the final output feature map F_out:
F_{out} = \mathrm{Conv}_{3\times 3}(F_t \oplus F_m \oplus F_b)
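A corresponding PyTorch sketch of GLFFM following Equations (7)-(11) is shown below. The bottleneck width inside the two channel-attention branches and the reduction back to C channels at the output are assumptions the equations leave open.

```python
import torch
import torch.nn as nn

class GLFFM(nn.Module):
    """Global-Local Feature Fusion Module (sketch of Eqs. 7-11)."""

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        c2 = channels * 2            # channel count after concatenation, Eq. (7)
        mid = c2 // reduction        # bottleneck width of the 1x1 convs (assumed)

        def gate(pool: nn.Module) -> nn.Sequential:
            # Channel weights from a global pooling branch, Eqs. (8)-(9).
            return nn.Sequential(pool,
                                 nn.Conv2d(c2, mid, 1), nn.ReLU(inplace=True),
                                 nn.Conv2d(mid, c2, 1), nn.Sigmoid())

        self.gap_gate = gate(nn.AdaptiveAvgPool2d(1))    # top branch (GAP)
        self.gmp_gate = gate(nn.AdaptiveMaxPool2d(1))    # bottom branch (GMP)
        self.local = nn.Sequential(nn.Conv2d(c2, c2, 3, padding=1),
                                   nn.ReLU(inplace=True),
                                   nn.Conv2d(c2, c2, 3, padding=1))   # Eq. (10)
        self.out = nn.Conv2d(c2, channels, 3, padding=1)              # Eq. (11)

    def forward(self, f_low: torch.Tensor, f_high: torch.Tensor) -> torch.Tensor:
        f_c = torch.cat([f_low, f_high], dim=1)    # Eq. (7)
        f_t = f_c * self.gap_gate(f_c)             # global view (average statistics)
        f_b = f_c * self.gmp_gate(f_c)             # global view (max statistics)
        f_m = self.local(f_c)                      # local detail view
        return self.out(f_t + f_m + f_b)           # Eq. (11)
```

In the FPN, one GLFFM would replace the element-wise addition between each upsampled deep feature and its laterally connected shallow feature.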

3.4. Normalized Distance and Difference Metric

In the baseline model, the label assignment strategy adopts the IoU metric to distinguish positive and negative samples during training. However, for the TOD task, the high sensitivity of IoU results in an insufficient number of positive samples for tiny objects [28], leading to poor class representation and inadequate localization information. Therefore, we design a new label assignment strategy, i.e., NDDM, which employs a dynamic distance measurement to assign more positive samples for tiny objects during training.
Following the framework of curriculum learning [69], our proposed NDDM is defined by three key parameters: the initial distance measurement D_begin, the end distance measurement D_end, and the training scheduler α_t. Before introducing these parameters, we first explain how to model the bounding box using a Gaussian distribution. The horizontal bounding box is a rectangle enclosing the object’s edge, with foreground at the center and background at the edge. It can be modeled as a 2D Gaussian distribution, with the highest weight at the center and decreasing towards the edge. For a horizontal bounding box R = (c_x, c_y, w, h), the components (c_x, c_y), w, and h represent the center coordinates, width, and height, respectively. The equation of the inscribed ellipse can be expressed as
\frac{(x - \mu_x)^2}{\sigma_x^2} + \frac{(y - \mu_y)^2}{\sigma_y^2} = 1
where (μ_x, μ_y) are the center coordinates of the ellipse, and σ_x, σ_y are the semi-axis lengths along the x-axis and y-axis. Accordingly, μ_x = c_x, μ_y = c_y, σ_x = w/2, σ_y = h/2.
The probability density function of a 2D Gaussian distribution is given by
f(\mathbf{x} \mid \boldsymbol{\mu}, \Sigma) = \frac{\exp\left(-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^{\top} \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu})\right)}{2\pi |\Sigma|^{1/2}}
where \mathbf{x}, \boldsymbol{\mu}, and Σ represent the coordinates, mean vector, and covariance matrix of the Gaussian distribution, respectively.
The ellipse in Equation (12) represents a density contour of the 2D Gaussian distribution. Thus, the horizontal bounding box R = (c_x, c_y, w, h) can be modeled as a 2D Gaussian distribution N(μ, Σ) with
\boldsymbol{\mu} = \begin{pmatrix} c_x \\ c_y \end{pmatrix}, \qquad \Sigma = \begin{pmatrix} \frac{w^2}{4} & 0 \\ 0 & \frac{h^2}{4} \end{pmatrix}
For the anchor a and the ground truth g, the corresponding Gaussian distributions can be formulated as x_a = N(μ_a, Σ_a) and x_g = N(μ_g, Σ_g). Then, the similarity W_1 is as follows:
W_1(x_a, x_g) = \|\boldsymbol{\mu}_a - \boldsymbol{\mu}_g\|_2^2 + \|\Sigma_a^{1/2} - \Sigma_g^{1/2}\|_F^2
where \|\cdot\|_F is the Frobenius norm.
Then, we define D_begin as follows:
D_{begin}(x_a, x_g) = W_1(x_a, x_g)
For D_end, we first define a W_2 to exhibit scale invariance between two 2D Gaussian distributions, which is calculated as follows:
W_2(x_a, x_g) = \frac{1}{2}(\boldsymbol{\mu}_a - \boldsymbol{\mu}_g)^{\top} \Sigma_g^{-1} (\boldsymbol{\mu}_a - \boldsymbol{\mu}_g) + \frac{1}{2}\mathrm{Tr}(\Sigma_g^{-1} \Sigma_a) + \frac{1}{2}\ln\frac{|\Sigma_g|}{|\Sigma_a|} - 1
where
(\boldsymbol{\mu}_a - \boldsymbol{\mu}_g)^{\top} \Sigma_g^{-1} (\boldsymbol{\mu}_a - \boldsymbol{\mu}_g) = \frac{4(c_{x_a} - c_{x_g})^2}{w_g^2} + \frac{4(c_{y_a} - c_{y_g})^2}{h_g^2}
\mathrm{Tr}(\Sigma_g^{-1} \Sigma_a) = \frac{h_a^2}{h_g^2} + \frac{w_a^2}{w_g^2}
\ln\frac{|\Sigma_g|}{|\Sigma_a|} = \ln\frac{h_g^2}{h_a^2} + \ln\frac{w_g^2}{w_a^2}
Therefore, D_end can be written as
D_{end}(x_a, x_g) = \frac{2(c_{x_a} - c_{x_g})^2}{w_g^2} + \frac{2(c_{y_a} - c_{y_g})^2}{h_g^2} + \frac{1}{2}\left(\frac{h_a^2}{h_g^2} + \frac{w_a^2}{w_g^2}\right) + \ln\frac{h_g}{h_a} + \ln\frac{w_g}{w_a} - 1
where {c_{x_a}, c_{y_a}, w_a, h_a} and {c_{x_g}, c_{y_g}, w_g, h_g} represent the anchor box a and the ground truth box g, respectively.
Based on D_begin and D_end, we introduce the dynamic distance measurement, i.e., NDDM. As shown in Algorithm 1, the distance measurement D_ddm^t at training step t can be written as follows:
D_{ddm}^{t} = \alpha_t D_{begin}(x_a, x_g) + (1 - \alpha_t) D_{end}(x_a, x_g)
where α_t is the training scheduler in our label assignment strategy, defined as follows:
\alpha_t = \sqrt{\frac{r_1^2 - r_0^2}{T - 1}(t - 1) + r_0^2}
where r_0 and r_1 denote the value range of α_t, and T represents the total number of training epochs.
Finally, D_nddm^t is stated as follows:
D_{nddm}^{t} = \exp\left(-\frac{D_{ddm}^{t}}{\beta}\right)
It should be noted that the value ranges of D_begin(x_a, x_g) and D_end(x_a, x_g) are both [0, +∞), and thus the resulting D_ddm^t also has a range of [0, +∞). It cannot be directly employed as a similarity measurement for label assignment. To obtain a value range similar to the IoU (i.e., between 0 and 1), we heuristically select an exponential nonlinear transformation to normalize the value range of D_ddm^t to (0, 1], resulting in the final D_nddm^t.
Algorithm 1 Normalized Distance and Difference Metric
Require:
  G is a set of ground truth boxes on the image
  A is a set of all anchor boxes
  r_0, r_1, and β are the predefined hyperparameters
  T_p and T_n are the thresholds of positive and negative samples
  T is the total number of training epochs and t is the current training epoch
Ensure:
  P is a set of positive samples
  N is a set of negative samples
  I is a set of ignored samples
 1: P ← ∅, N ← ∅, I ← ∅
 2: for t = 1, 2, 3, …, T do
 3:   compute D_begin between G and A by Equation (16)
 4:   compute D_end between G and A by Equation (21)
 5:   compute the scheduler α_t by Equation (23)
 6:   compute D_ddm^t = α_t · D_begin + (1 − α_t) · D_end
 7:   compute D_nddm^t = exp(−D_ddm^t / β)
 8:   for each anchor box a ∈ A do
 9:     if D_nddm^t < T_n then
10:       N = N ∪ {a}
11:     end if
12:     if D_nddm^t ≥ T_p then
13:       P = P ∪ {a}
14:     end if
15:   end for
16: end for
17: I = A − P − N
18: return P, N, I
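A compact sketch of Algorithm 1 in PyTorch, written directly from Equations (15)-(24), is given below. The (cx, cy, w, h) box format, the anchors-by-ground-truths tensor layout, the per-anchor max reduction, and the reuse of the RPN thresholds 0.7/0.3 from Section 4.3 as T_p/T_n are assumptions about details the pseudocode leaves open.

```python
import torch

def d_begin(anchors: torch.Tensor, gts: torch.Tensor) -> torch.Tensor:
    """Eq. (15)/(16): Gaussian distance between anchors (N, 4) and
    ground truths (M, 4), both given as (cx, cy, w, h); returns (N, M)."""
    ac, gc = anchors[:, None, :2], gts[None, :, :2]
    asig, gsig = anchors[:, None, 2:] / 2, gts[None, :, 2:] / 2   # sigma = w/2, h/2
    return ((ac - gc) ** 2).sum(-1) + ((asig - gsig) ** 2).sum(-1)

def d_end(anchors: torch.Tensor, gts: torch.Tensor) -> torch.Tensor:
    """Eq. (21): scale-invariant distance; returns (N, M)."""
    cxa, cya = anchors[:, None, 0], anchors[:, None, 1]
    wa, ha = anchors[:, None, 2], anchors[:, None, 3]
    cxg, cyg, wg, hg = gts[None, :, 0], gts[None, :, 1], gts[None, :, 2], gts[None, :, 3]
    return (2 * (cxa - cxg) ** 2 / wg ** 2 + 2 * (cya - cyg) ** 2 / hg ** 2
            + 0.5 * (ha ** 2 / hg ** 2 + wa ** 2 / wg ** 2)
            + torch.log(hg / ha) + torch.log(wg / wa) - 1)

def nddm_assign(anchors, gts, t, T, r0=0.3, r1=0.85, beta=12.7, t_pos=0.7, t_neg=0.3):
    """One epoch of Algorithm 1: boolean positive/negative masks per anchor."""
    alpha = ((r1 ** 2 - r0 ** 2) / (T - 1) * (t - 1) + r0 ** 2) ** 0.5   # Eq. (23)
    d_ddm = alpha * d_begin(anchors, gts) + (1 - alpha) * d_end(anchors, gts)
    d_nddm = torch.exp(-d_ddm / beta)                                    # Eq. (24)
    best, _ = d_nddm.max(dim=1)      # best-matching ground truth per anchor
    return best >= t_pos, best < t_neg
```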

3.5. Loss Function

The loss function of the proposed DCEDet comprises RPN and Fast R-CNN losses [11]. The RPN loss is formulated as a multi-task loss, defined as follows:
L_{rpn} = \frac{1}{N_a}\left(\lambda_1 \sum_{i=1}^{N_a} L_{cls}(p_i, p_i^*) + \lambda_2 \sum_{i=1}^{N_a} p_i^* L_{reg}(t_i, t_i^*)\right)
where N_a represents the number of sampled anchors. p_i denotes the predicted probability of the foreground class. p_i^* denotes the ground truth label, with a value of 1 for positive samples and 0 for negative samples. Furthermore, t_i and t_i^* represent the predicted and ground truth regression offsets in the RPN, respectively. Finally, L_cls and L_reg represent the cross-entropy loss and the L1 loss, respectively.
The Fast R-CNN loss also adopts the form of a multi-task loss, expressed as follows:
L_{rcnn} = \frac{1}{N_r}\left(\lambda_3 \sum_{j=1}^{N_r} L_{cls}(c_j, c_j^*) + \lambda_4 \sum_{j=1}^{N_r} [c_j^* \geq 1] L_{reg}(t_j, t_j^*)\right)
where N_r denotes the number of sampled Regions of Interest (RoIs). c_j represents the predicted probability for each class, and c_j^* refers to the ground truth class. t_j and t_j^* represent the predicted and ground truth regression offsets in the Fast R-CNN head, respectively. Additionally, [c_j^* ≥ 1] indicates the Iverson bracket operation. Finally, the hyperparameters λ_1, λ_2, λ_3, and λ_4 control the balance of the loss function and are all set to 1.
Therefore, the loss function of the proposed DCEDet can be written as follows:
L = L_{rpn} + L_{rcnn}
In addition, we perform bounding box regression using the following parameterization:
t_x = (x - x_a)/w_a, \quad t_y = (y - y_a)/h_a, \quad t_w = \log(w/w_a), \quad t_h = \log(h/h_a)
t_x^* = (x^* - x_a)/w_a, \quad t_y^* = (y^* - y_a)/h_a, \quad t_w^* = \log(w^*/w_a), \quad t_h^* = \log(h^*/h_a)
where x, y, w, and h represent the center coordinates, width, and height of the bounding box, respectively. The variables x, x_a, and x^* refer to the predicted box, anchor box, and ground truth box, respectively (likewise for y, w, and h).
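Equation (28) is the standard Faster R-CNN box parameterization; a small self-contained sketch (boxes assumed to be center-size tuples) shows the encode step and its inverse used at inference:

```python
import math

def encode(box, anchor):
    """Eq. (28): offsets of a (cx, cy, w, h) box relative to an anchor."""
    x, y, w, h = box
    xa, ya, wa, ha = anchor
    return ((x - xa) / wa, (y - ya) / ha, math.log(w / wa), math.log(h / ha))

def decode(deltas, anchor):
    """Inverse transform: recover the box from predicted offsets."""
    tx, ty, tw, th = deltas
    xa, ya, wa, ha = anchor
    return (tx * wa + xa, ty * ha + ya, wa * math.exp(tw), ha * math.exp(th))

anchor = (10.0, 10.0, 8.0, 8.0)
deltas = encode((12.0, 11.0, 10.0, 8.0), anchor)
print(decode(deltas, anchor))   # recovers (12.0, 11.0, 10.0, 8.0)
```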

4. Experiments

In this section, we first introduce the datasets. Next, we describe the evaluation metrics and implementation details. Finally, we present the ablation studies, comparison experiments, analytical experiments, and visual results.

4.1. Datasets

  • AI-TODv2 [21]: This dataset is an enhanced version of the AI-TOD [19] dataset, designed for TOD in remote sensing images. It contains 28,036 images, each with a resolution of 800 × 800 pixels, along with 752,754 object instances annotated with horizontal bounding boxes (HBBs). These instances are divided into eight categories: airplane (AI), bridge (BR), storage-tank (ST), ship (SH), swimming-pool (SP), vehicle (VE), person (PE), and wind-mill (WM). The average absolute size of these instances is only 12.7 pixels. Based on their sizes, they can be further classified into four categories: very tiny (2∼8 pixels), tiny (8∼16 pixels), small (16∼32 pixels), and medium (32∼64 pixels). The proportions of these categories are 12.4%, 73.4%, 12.4%, and 1.8%, respectively. In addition, the numbers of images in the training set, validation set, and test set are 11,214, 2804, and 14,018, respectively. In this paper, we combine the training and validation sets to train models, while the test set is used to evaluate performance.
  • LEVIR-SHIP [70]: This is a tiny ship detection dataset comprising 3896 remote sensing images, each with a resolution of 512 × 512 pixels. The images are captured by the GaoFen-1 and GaoFen-6 satellites and have a spatial resolution of 16 m. The dataset includes 3219 ship instances, annotated with HBBs. Most instances have sizes below 20 × 20 pixels, with a concentration around 10 × 10 pixels. Additionally, the distribution of images across the training, validation, and test sets corresponds to 3/5, 1/5, and 1/5 of the total dataset, respectively. In our experiments, we utilize the training set for model training and the test set for performance evaluation.

4.2. Evaluation Metrics

The widely used evaluation metrics for object detection include average precision (AP) and mean average precision (mAP), which are crucial for assessing the performance of detectors [71]. The definition of mAP is as follows:
mAP = \frac{1}{N_{cls}} \sum_{i=1}^{N_{cls}} AP_i
in which
AP = \int_{0}^{1} P(R)\, dR
where P represents the precision rate, R represents the recall rate, and N_cls denotes the total number of classes.
We first adopt the COCO metrics (https://cocodataset.org (accessed on 10 May 2025)) to evaluate the performance of our proposed method. Specifically, AP represents the mean value of AP across IoU thresholds ranging from 0.5 to 0.95 with intervals of 0.05. AP_50 and AP_75 indicate the AP at IoU thresholds of 0.5 and 0.75, respectively. Moreover, AP_vt, AP_t, AP_s, and AP_m evaluate the detection performance for very tiny, tiny, small, and medium-sized objects, respectively [19,20,21,22,23].
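For reference, a minimal all-point AP computation corresponding to Equations (29)-(30) is sketched below; the COCO evaluator additionally matches detections to ground truths per image and averages over ten IoU thresholds, which is omitted here.

```python
import numpy as np

def average_precision(scores, is_tp, num_gt):
    """Integrate precision over recall (Eq. 30) for one class.

    scores : detection confidences; is_tp : 1 if a detection matches an
    unmatched ground truth at the chosen IoU threshold; num_gt : #ground truths.
    """
    order = np.argsort(-np.asarray(scores, dtype=float))
    tp = np.cumsum(np.asarray(is_tp, dtype=float)[order])
    fp = np.cumsum(1.0 - np.asarray(is_tp, dtype=float)[order])
    recall = tp / max(num_gt, 1)
    precision = tp / np.maximum(tp + fp, 1e-9)
    # Make precision monotonically non-increasing before integrating.
    precision = np.maximum.accumulate(precision[::-1])[::-1]
    return float(np.trapz(precision, recall))

# mAP (Eq. 29) is simply the mean of the per-class AP values.
print(average_precision([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 1], num_gt=4))
```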
Finally, we also utilize the Optimal Localization Recall Precision (oLRP) Error [72,73] to evaluate the detection performance. The oLRP of the detection set Y_s, whose confidence scores are greater than the threshold s ∈ [0, 1], and the corresponding ground truth set X is defined as follows:
oLRP(X, Y_s) = \min_{s} \frac{1}{Z}\left(\frac{N_{TP}}{1 - \tau} LRP_{IoU}(X, Y_s) + |Y_s|\, LRP_{FP}(X, Y_s) + |X|\, LRP_{FN}(X, Y_s)\right)
in which
LRP_{IoU}(X, Y_s) = \frac{1}{N_{TP}} \sum_{i=1}^{N_{TP}} \left(1 - IoU(x_i, y_{x_i})\right)
LRP_{FP}(X, Y_s) = 1 - \mathrm{Precision} = \frac{N_{FP}}{|Y_s|}
LRP_{FN}(X, Y_s) = 1 - \mathrm{Recall} = \frac{N_{FN}}{|X|}
where τ ∈ [0, 1) is the IoU threshold, IoU(x_i, y_{x_i}) denotes the IoU between x_i ∈ X and its assigned detection y_{x_i} ∈ Y_s, and Z = N_TP + N_FP + N_FN. LRP_IoU represents the IoU tightness of valid detections, i.e., the localization error. LRP_FP and LRP_FN measure the false positives (FPs) and false negatives (FNs), respectively. The oLRP is a comprehensive error metric, with its components coined as optimal box localization (oLRP_IoU), optimal FP (oLRP_FP), and optimal FN (oLRP_FN). It is worth noting that, unlike the AP series metrics, smaller values of the oLRP series indicate better detection performance.

4.3. Implementation Details

We conduct all experiments on a single NVIDIA A100-PCIE-40GB GPU. The model is built upon MMDetection 2.24.1 [74], an object detection benchmark implemented using the PyTorch 1.10.0 framework. We adopt ResNet50 [67] pretrained on ImageNet [75] as the backbone. Unless otherwise specified, all models are trained for 12 epochs using the stochastic gradient descent (SGD) optimizer with a momentum of 0.9, a weight decay of 0.0001, a batch size of 8, and an initial learning rate of 0.01, which is decayed at the 8th and 11th epochs. We also implement a linear warm-up strategy, with a learning rate of 0.001 for the first 500 iterations. Random horizontal flipping with a probability of 0.5 is used as the data augmentation strategy. In the RPN and Fast R-CNN modules, the thresholds for dividing positive and negative samples are set to 0.7/0.3 and 0.5/0.5; the sample sizes are 256 and 512, with positive-to-negative ratios of 1:1 and 1:3, respectively. In addition, the RPN module generates 3000 proposals. During the inference stage, the score threshold and IoU threshold for Non-Maximum Suppression (NMS) [76] are set to 0.05 and 0.5, respectively. In the visualization stage, the score threshold is set to 0.5. In NDDM, we set the hyperparameters r_0 = 0.3 and r_1 = 0.85, while the hyperparameter β is consistent with the reference value of 12.7 in [21].
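For orientation, the training schedule above roughly maps to the following MMDetection 2.x config fragment. This is a sketch assembled from the stated hyperparameters, not the authors' released configuration, and the registration of GSCEM, GLFFM, and NDDM as custom modules is omitted.

```python
# Schedule/runtime fragment of an MMDetection 2.24.1-style config reflecting
# the hyperparameters listed above (a sketch, not the authors' config file).
optimizer = dict(type='SGD', lr=0.01, momentum=0.9, weight_decay=0.0001)
optimizer_config = dict(grad_clip=None)
lr_config = dict(
    policy='step',
    warmup='linear',
    warmup_iters=500,
    warmup_ratio=0.1,          # 0.01 * 0.1 = 0.001 during the warm-up phase
    step=[8, 11])              # learning rate decayed at the 8th and 11th epochs
runner = dict(type='EpochBasedRunner', max_epochs=12)
# Batch size 8 on a single GPU; the worker count is an assumption.
data = dict(samples_per_gpu=8, workers_per_gpu=4)
```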

4.4. Ablation Studies

In this section, we conduct comprehensive ablation experiments to evaluate the effectiveness of the proposed components, i.e., GSCEM, GLFFM, and NDDM. All results are evaluated on the AI-TODv2 dataset.

4.4.1. Effectiveness of GSCEM

Tiny objects have limited sizes, making it difficult for the network to extract effective feature representations, leading to poor detection performance. To tackle this problem, we design the GSCEM and integrate it into the feature maps generated by ResNet. Specifically, GSCEM consists of two components: GSCEM-G and GSCEM-S. GSCEM-G employs hierarchical connections to generate mixed receptive fields across different groups of channel-level feature maps, thereby capturing multi-scale contextual information. GSCEM-S adopts a two-branch residual structure: one branch uses global average pooling and a channel-wise attention mechanism to highlight semantically important channels, while the other applies spatial convolution directly on the feature maps to retain fine-grained details such as edges and contours. As shown in Table 1, adding GSCEM to the baseline increases AP, AP_50, and AP_75 by 1.7%, 3.1%, and 1.9%, respectively. For tiny objects, AP_t increases by 2.0%. Meanwhile, oLRP and oLRP_IoU are significantly reduced (from 88.3% to 86.6% and from 30.1% to 28.4%). In addition, we present the detection performance for each category. As shown in Table 2, the model with GSCEM achieves performance gains across seven categories, including AI, BR, ST, SH, VE, PE, and WM, where AP increases by 0.3∼4.5% and oLRP decreases by 0.8∼4.5%. This demonstrates the effectiveness of the proposed GSCEM, which improves detector performance in TOD by leveraging context-enhanced features as supplementary information.

4.4.2. Effectiveness of GLFFM

To achieve more accurate localization of tiny objects, we design GLFFM to replace the original addition operation between deep and shallow feature layers in the FPN. As stated in Table 1, the model equipped with GLFFM achieves performance improvements of 0.8%, 1.6%, and 0.9% in AP, AP_50, and AP_75, respectively. For tiny objects, GLFFM improves AP_t over the baseline by 0.9%. In addition, the metrics of the oLRP series show a significant reduction, with oLRP and oLRP_IoU decreasing by 0.8% and 0.9%, respectively. Table 2 shows that the detection performance improves across seven categories, with AP increasing by at most 3% and oLRP decreasing by at most 2.6%. This highlights the effectiveness of our proposed GLFFM, which seamlessly integrates global and local features to enhance localization performance and mitigate information loss regarding tiny objects. Furthermore, it indicates that a proper connection structure is crucial for the TOD task.

4.4.3. Effectiveness of NDDM

As previously discussed, we introduce a dynamic label assignment strategy, NDDM, to increase the number of positive samples. Therefore, we conduct ablation studies to evaluate the detection performance of NDDM. As shown in Table 1 and Table 2, “baseline + NDDM” boosts detection performance on all metrics compared to the baseline. Specifically, the proposed NDDM improves AP, AP_50, and AP_75 by 7.9%, 21.6%, and 3.9%, respectively. For very tiny and tiny objects, AP_vt increases by 5.8%, and AP_t increases by 10.7%. The oLRP and oLRP_IoU decrease by 6.2% and 0.8%, respectively. Moreover, “baseline + NDDM” improves the detection performance across all categories, with AP increasing by up to 13.9% and oLRP declining by up to 12.2%. Similarly, the last two rows of Table 1 and Table 2 further demonstrate that the NDDM significantly enhances detection performance, improving AP, AP_50, AP_75, AP_vt, and AP_t by 9.1%, 21.8%, 5.9%, 8.4%, and 13.2%, respectively, while reducing oLRP and oLRP_IoU by 7.6% and 0.6%. Additionally, the model with “GSCEM + GLFFM + NDDM” improves detection performance across all categories compared to the model with “GSCEM + GLFFM”, with AP increasing by up to 22.2% and oLRP decreasing by up to 20.4%. This suggests that the NDDM enhances class representation and localization performance of tiny objects through dynamic label assignment, which reassigns positive and negative samples to optimize the training process.

4.5. Comparison Experiments

We compare the proposed DCEDet with other methods on the AI-TODv2 and LEVIR-SHIP datasets.

4.5.1. Results on the AI-TODv2 Dataset

We present the results of comparative performance on the AI-TODv2 dataset in Table 3 and Table 4. Specifically, under the same 1× training schedule, our DCEDet achieves 23.5% AP, 53.9% AP_50, 16.8% AP_75, 8.5% AP_vt, 24.1% AP_t, 28.1% AP_s, and 37.1% AP_m. Among these, the three most critical metrics (AP, AP_vt, and AP_t) are the best compared to other 1× schedule methods. In addition, our method outperforms others in five out of eight categories (AI, SH, SP, VE, and PE) in the per-class accuracy comparison. Under the same 3× training schedule, the performance of our DCEDet* also surpasses that of the SOTA methods, i.e., Cascade R-CNN w/NWD-RKA*, ORFENet*, and ESG_TODNet*, in the equivalent setting. It achieves the highest AP in six out of eight object categories (AI, ST, SH, SP, VE, and PE), falling short of the best results by only 1.3% and 1.6% in the BR and WM classes, respectively. The slightly lower performance in the BR and WM categories may be attributed to category-specific challenges. BR instances often appear in visually complex scenes such as rivers, coastlines, or urban fringes, where similar background textures make them difficult to distinguish. WM instances are extremely tiny, often appearing as indistinct blobs at low resolutions, which challenges most detectors. These results demonstrate that DCEDet consistently achieves superior performance under both training schedules. Furthermore, while the 3× schedule leads to modest improvements, DCEDet excels even under the 1× schedule, highlighting that the performance gains are primarily due to the proposed modules rather than extended training.

4.5.2. Results on the LEVIR-SHIP Dataset

In our experimental setting, we present a comparison of our proposed method and other methods on the LEVIR-SHIP dataset. The compared methods include Faster R-CNN [11], FCOS [34], SSD [12], RetinaNet [33], HSF-Net [83], CenterNet [36], and EfficientDet [84], evaluated using AP_50 and AP_t. Table 5 shows that our proposed DCEDet achieves the highest performance, with 81.2% AP_50 and 13.4% AP_t, outperforming the second-best method by 1.8% and 0.3%, respectively. This demonstrates the effectiveness of our proposed DCEDet and further highlights its strong generalizability.

4.6. Analytical Experiments

To fully explore and demonstrate the detection performance of the proposed DCEDet, we conduct a series of analytical experiments on the AI-TODv2 dataset.

4.6.1. Analysis of GSCEM-G

GSCEM-G employs a hierarchical connection strategy to integrate four convolutional blocks from bottom to top, thereby capturing multi-scale contextual information through varying receptive field sizes. It is evident that selecting the appropriate convolution kernel sizes is crucial. We select the “baseline + GSCEM-G” model for analytical experiments to determine the optimal structure of GSCEM-G. As shown in Table 6, the model achieves its highest performance (13.9% AP, 0.1% AP_vt, and 10.4% AP_t) when the convolution kernel sizes in GSCEM-G are configured as “3-5-7-9”. Compared to the “3-3-3-3” convolutional structure, there are improvements of 0.2% in AP, 0.1% in AP_vt, and 0.8% in AP_t, respectively. This highlights the importance of an appropriate hierarchical structure for TOD. Therefore, we adopt the “3-5-7-9” convolutional structure in GSCEM-G to capture multi-scale contextual information and improve the detection performance of tiny objects.

4.6.2. Analysis of GSCEM-S

GSCEM-S performs fine-grained, task-specific refinement, making the overall GSCEM particularly effective at enhancing the weak features characteristic of tiny objects. For experimental analysis, we use the “baseline + GSCEM” model and evaluate various attention mechanisms, including SE [85], ECA [86], MS-CAM [87], and our proposed GSCEM-S, to determine which is the most effective for TOD. The results in Table 7 reveal that our proposed mechanism achieves the best performance across all metrics, with an AP of 14.1% and an AP_t of 10.8%. Specifically, compared to SE, AP and AP_t are improved by 1% and 1.5%, respectively; compared to MS-CAM, AP and AP_t show improvements of 0.7% and 0.9%, respectively. Thus, we select our proposed attention mechanism as the final structure of GSCEM. In addition, we find that not all attention mechanisms improve performance. Adding SE, ECA, or MS-CAM actually reduces performance, which suggests that GSCEM-S is a more suitable attention mechanism for TOD. It effectively enhances contextual information and improves detection performance.

4.6.3. Analytical Experiments of GSCEM Internal Components

We conduct analysis experiments on the internal components of GSCEM and compare the detection performance under four different configurations. As shown in Table 8, using either GSCEM-G or GSCEM-S alone leads to noticeable improvements in AP and AP_t compared to the baseline. The complete GSCEM achieves the best results across the three key metrics: AP, AP_vt, and AP_t. These results indicate that GSCEM is internally consistent and that each submodule is essential and contributes to improved detection performance.

4.6.4. Analytical Experiments on Determining the Training Scheduler α_t in NDDM

We conduct analysis experiments to determine the training scheduler α_t in NDDM. Three different functions are compared: the linear function, the root function, and the exponential function. The corresponding formulas are as follows (Equations (35)–(37)). We set r_0 = 0, r_1 = 1, T as the total number of epochs, and t as the current epoch number to explore a more effective scheduler. As shown in Table 9, the root function yields the best detection performance. Therefore, we adopt the root function to calculate α_t.
\alpha_t^{(linear)} = \frac{r_1^2 - r_0^2}{T - 1}(t - 1) + r_0^2
\alpha_t^{(root)} = \sqrt{\frac{r_1^2 - r_0^2}{T - 1}(t - 1) + r_0^2}
\alpha_t^{(exponential)} = 2^{\frac{\log_2 r_1 - \log_2 r_0}{T - 1}(t - 1) + \log_2 r_0}
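The three candidates can be compared numerically; a small sketch following the forms above is given below (the exponential variant requires r_0 > 0, and the values are printed for the final NDDM setting r_0 = 0.3, r_1 = 0.85 over a 12-epoch schedule).

```python
import math

def alpha_linear(t, T, r0, r1):       # Eq. (35)
    return (r1 ** 2 - r0 ** 2) / (T - 1) * (t - 1) + r0 ** 2

def alpha_root(t, T, r0, r1):         # Eq. (36), the schedule adopted by NDDM
    return math.sqrt((r1 ** 2 - r0 ** 2) / (T - 1) * (t - 1) + r0 ** 2)

def alpha_exponential(t, T, r0, r1):  # Eq. (37), needs r0 > 0
    return 2 ** ((math.log2(r1) - math.log2(r0)) / (T - 1) * (t - 1) + math.log2(r0))

T = 12
for t in (1, 6, 12):
    print(t, round(alpha_root(t, T, 0.3, 0.85), 3))   # 0.3 -> ~0.61 -> 0.85
```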

4.6.5. Analysis of Hyperparameters in NDDM

NDDM improves label assignment and enhances the detection performance of tiny objects. To analyze the hyperparameters r_0 and r_1 involved in NDDM, we conduct experiments using the “baseline + NDDM” model. Table 10 shows that when r_0 is set to 0.3 and r_1 to 0.85, the model achieves the optimal detection results (20.6% AP, 5.2% AP_vt, and 20.3% AP_t). Based on these findings, selecting an appropriate value range for the scheduler in NDDM is crucial. Therefore, we set the hyperparameters of NDDM to r_0 = 0.3 and r_1 = 0.85.

4.6.6. Analysis of Curriculum Learning Strategy

We use the weight α_t in NDDM as the training scheduler within the curriculum learning strategy, following a square root function that increases from r_0 = 0.3 to r_1 = 0.85. By adjusting the weights of D_begin and D_end, the detector can be assigned appropriate distance measurements at different training steps. We first conduct an ablation study on using either D_begin or D_end alone, and then compare the two versions of NDDM. As shown in Table 11, α_t = 0 means the model uses only the distance measurement D_end, α_t = 1 means the model adopts only D_begin, “0.30–0.85” is our NDDM, and “0.85–0.30” is the decreasing-weight version of the NDDM. The last row in Table 11 achieves the optimal detection results (20.3% AP, 5.8% AP_vt, and 19.5% AP_t), and all indicators are better than those of the other three cases. It can be found that detection using a dynamically varying distance outperforms detection with either of the two fixed distances. In addition, for a tiny object detector based on curriculum learning, an appropriate scheduling trend is crucial for detection performance.

4.6.7. Analysis of Positive Samples Obtained by NDDM

We conduct analysis experiments and quantitatively demonstrate that the NDDM obtains significantly more positive samples compared to the traditional IoU strategy. As shown in Table 12, the IoU strategy obtains approximately 244K positive samples, while our proposed NDDM obtains around 669K, a substantial increase. Furthermore, this increase results in a significant improvement in detection performance (AP, AP_vt, and AP_t). These results clearly illustrate that the NDDM strategy effectively increases the number of positive samples, thereby providing more supervision information.

4.6.8. Analysis of Confusion Matrix

The confusion matrix provides a comprehensive overview to illustrate the relationship between predicted and true classes. We compare the confusion matrices of the baseline (Faster R-CNN) and DCEDet, as illustrated in Figure 3. For the correct detection rates (i.e., the diagonal elements), DCEDet shows a substantial improvement over the baseline, with an increase ranging from 8% to 50%. Regarding the missed detection rates (see the last column), DCEDet exhibits a notable reduction, decreasing by 9% to 50% compared to the baseline. The above analysis indicates that DCEDet achieves more accurate and higher-confidence detection results, emphasizing the effectiveness of our proposed method.

4.7. Visual Results

We visualize detection results on AI-TODv2 and LEVIR-SHIP datasets to demonstrate the performance of our DCEDet.

4.7.1. Detection Results on the AI-TODv2 Dataset

The visualization results of the DCEDet on the AI-TODv2 dataset are presented in Figure 4. It demonstrates that DCEDet effectively detects tiny objects across various categories, with notable performance in airplane, storage-tank, vehicle, and ship. Furthermore, we visualize the detection performance for some superior detectors, such as Faster R-CNN [11], RetinaNet [33], and the higher-performing Cascade R-CNN w/NWD-RKA [21]. Three representative categories are selected for comparison, i.e., storage-tank, ship, and vehicle. As shown in Figure 5, the green, blue, and red boxes represent correct detection results, false alarms, and missed detection results, respectively. It can be found that the DCEDet exhibits significant improvements over the other methods. Specifically, Faster R-CNN and RetinaNet suffer from numerous false alarms and missed detections, leading to low-quality detection performance. Although Cascade R-CNN w/NWD-RKA outperforms these two methods, its overall performance remains insufficient. In contrast, our proposed DCEDet, displayed in the last column, significantly reduces both false alarms and missed detections, yielding satisfactory detection results. It further proves the superiority of our DCEDet by enhancing feature representation and aligning with a suitable label assignment strategy.

4.7.2. Detection Results on the LEVIR-SHIP Dataset

We present a visual comparison of detection results between DCEDet and other competing models. As shown in Figure 6, DCEDet effectively detects tiny ships under various challenging conditions, including “calm sea,” “thin cloud,” “thick cloud,” and “fractal cloud.” It consistently outperforms the other methods, demonstrating the superior detection capability of our method.

5. Conclusions

This paper proposes a new approach for tiny object detection in remote sensing images, which focuses on addressing two critical issues, i.e., limited appearance cues and unbalanced label assignment. In particular, we design a GSCEM to extract contextual enhancement features at the group-channel and single-channel levels as supplementary information for tiny objects. Next, a GLFFM is developed to integrate both global distribution and local information, enhancing the features of tiny objects. Furthermore, to ensure balanced label assignment, we adopt a new dynamic distance measurement NDDM to enhance class representation and localization performance for the tiny object detection task. Finally, extensive experiments on two typical tiny object datasets demonstrate that the proposed DCEDet is effective and achieves optimal performance in TOD. In future work, we plan to explore category-specific enhancements, including resolution-aware modules to improve the discriminative properties of the foreground, structure-preserving attention mechanisms to amplify critical structural cues, and targeted denoising modules to enhance the weak features of tiny objects, particularly in real-world complex remote sensing scenarios.

Author Contributions

Conceptualization, X.H., Z.R. and M.H.; methodology, X.H., Z.R. and U.A.B.; validation, X.H., Z.R. and U.A.B.; formal analysis, M.H., U.A.B. and Y.W.; investigation, X.H. and Z.R.; resources, M.H., U.A.B. and Y.W.; writing—original draft preparation, X.H. and Z.R.; writing—review and editing, X.H., Z.R. and U.A.B.; supervision, M.H., U.A.B. and Y.W. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China under Grant 62463006, Grant 62394330, and Grant 623943304.

Data Availability Statement

The datasets in this study are available online from https://github.com/Chasel-Tsui/mmdet-aitod (accessed on 6 June 2025) and https://github.com/WindVChen/LEVIR-Ship (accessed on 6 June 2025).

Acknowledgments

The authors are grateful for the support from NVIDIA Corporation for providing the A100-PCIE-40GB GPU used in this research.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Wang, B.; Sui, H.; Ma, G.; Zhou, Y.; Zhou, M. Gmodet: A real-time detector for ground-moving objects in optical remote sensing images with regional awareness and semantic–spatial progressive interaction. IEEE Trans. Geosci. Remote Sens. 2025, 63, 1–23. [Google Scholar] [CrossRef]
  2. Ren, Z.; Tang, Y.; Yang, Y.; Zhang, W. Sasod: Saliency-aware ship object detection in high-resolution optical images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–15. [Google Scholar] [CrossRef]
  3. Wu, X.; Li, W.; Hong, D.; Tian, J.; Tao, R.; Du, Q. Vehicle detection of multi-source remote sensing data using active fine-tuning network. ISPRS J. Photogramm. Remote Sens. 2020, 167, 39–53. [Google Scholar] [CrossRef]
  4. Zhang, Q.; Zheng, Y.; Yuan, Q.; Song, M.; Yu, H.; Xiao, Y. Hyperspectral image denoising: From model-driven, data-driven, to model-data-driven. IEEE Trans. Neural Networks Learn. Syst. 2023, 35, 13143–13163. [Google Scholar] [CrossRef]
  5. Zhang, Q.; Yuan, Q.; Song, M.; Yu, H.; Zhang, L. Cooperated spectral low-rankness prior and deep spatial prior for HSI unsupervised denoising. IEEE Trans. Image Process. 2022, 31, 6356–6368. [Google Scholar] [CrossRef]
  6. Li, J.; Chen, J.; Cheng, P.; Yu, Z.; Yu, L.; Chi, C. A survey on deep-learning-based real-time SAR ship detection. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2023, 16, 3218–3247. [Google Scholar] [CrossRef]
  7. Li, K.; Wan, G.; Cheng, G.; Meng, L.; Han, J. Object detection in optical remote sensing images: A survey and a new benchmark. ISPRS J. Photogramm. Remote Sens. 2020, 159, 296–307. [Google Scholar] [CrossRef]
  8. Felzenszwalb, P.F.; Girshick, R.B.; McAllester, D.; Ramanan, D. Object detection with discriminatively trained part-based models. IEEE Trans. Pattern Anal. Mach. Intell. 2009, 32, 1627–1645. [Google Scholar] [CrossRef]
  9. Dalal, N.; Triggs, B. Histograms of oriented gradients for human detection. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), San Diego, CA, USA, 20–25 June 2005; IEEE: Piscataway, NJ, USA, 2005; Volume 1, pp. 886–893. [Google Scholar]
  10. Ming, Q.; Miao, L.; Zhou, Z.; Song, J.; Pizurica, A. Gradient calibration loss for fast and accurate oriented bounding box regression. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–15. [Google Scholar] [CrossRef]
  11. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1137–1149. [Google Scholar] [CrossRef]
  12. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. Ssd: Single shot multibox detector. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part I. Springer: Cham, Switzerland, 2016; pp. 21–37. [Google Scholar]
  13. Redmon, J. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 779–788. [Google Scholar]
  14. Ding, J.; Xue, N.; Xia, G.S.; Bai, X.; Yang, W.; Yang, M.Y.; Belongie, S.; Luo, J.; Datcu, M.; Pelillo, M.; et al. Object detection in aerial images: A large-scale benchmark and challenges. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 7778–7796. [Google Scholar] [CrossRef] [PubMed]
  15. Zhang, X.; Zhang, T.; Wang, G.; Zhu, P.; Tang, X.; Jia, X.; Jiao, L. Remote sensing object detection meets deep learning: A metareview of challenges and advances. IEEE Geosci. Remote Sens. Mag. 2023, 11, 8–44. [Google Scholar] [CrossRef]
  16. Muzammul, M.; Li, X. A survey on deep domain adaptation and tiny object detection challenges, techniques and datasets. arXiv 2021, arXiv:2107.07927. [Google Scholar] [CrossRef]
  17. Tong, K.; Wu, Y. Deep learning-based detection from the perspective of small or tiny objects: A survey. Image Vis. Comput. 2022, 123, 104471. [Google Scholar] [CrossRef]
  18. Zhu, Y.; Li, C.; Liu, Y.; Wang, X.; Tang, J.; Luo, B.; Huang, Z. Tiny object tracking: A large-scale dataset and a baseline. IEEE Trans. Neural Netw. Learn. Syst. 2023, 35, 10273–10287. [Google Scholar] [CrossRef]
  19. Wang, J.; Yang, W.; Guo, H.; Zhang, R.; Xia, G.S. Tiny object detection in aerial images. In Proceedings of the 2020 25th International Conference on Pattern Recognition (ICPR), Milan, Italy, 10–15 January 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 3791–3798. [Google Scholar]
  20. Zhang, T.; Zhang, X.; Zhu, X.; Wang, G.; Han, X.; Tang, X.; Jiao, L. Multistage enhancement network for tiny object detection in remote sensing images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–12. [Google Scholar] [CrossRef]
  21. Xu, C.; Wang, J.; Yang, W.; Yu, H.; Yu, L.; Xia, G.S. Detecting tiny objects in aerial images: A normalized wasserstein distance and a new benchmark. ISPRS J. Photogramm. Remote Sens. 2022, 190, 79–93. [Google Scholar] [CrossRef]
  22. Ge, L.; Wang, G.; Zhang, T.; Zhuang, Y.; Chen, H.; Dong, H.; Chen, L. Adaptive dynamic label assignment for tiny object detection in aerial images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 6201–6214. [Google Scholar] [CrossRef]
  23. Fu, R.; Chen, C.; Yan, S.; Heidari, A.A.; Wang, X.; Mansour, R.F.; Chen, H. Gaussian similarity-based adaptive dynamic label assignment for tiny object detection. Neurocomputing 2023, 543, 126285. [Google Scholar] [CrossRef]
  24. Liu, D.; Zhang, J.; Qi, Y.; Wu, Y.; Zhang, Y. Tiny object detection in remote sensing images based on object reconstruction and multiple receptive field adaptive feature enhancement. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5616213. [Google Scholar] [CrossRef]
  25. Wu, J.; Pan, Z.; Lei, B.; Hu, Y. Fsanet: Feature-and-spatial-aligned network for tiny object detection in remote sensing images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–17. [Google Scholar] [CrossRef]
  26. Lu, X.; Zhang, Y.; Yuan, Y.; Feng, Y. Gated and axis-concentrated localization network for remote sensing object detection. IEEE Trans. Geosci. Remote Sens. 2019, 58, 179–192. [Google Scholar] [CrossRef]
  27. Xu, C.; Wang, J.; Yang, W.; Yu, L. Dot distance for tiny object detection in aerial images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, Nashville, TN, USA, 19–25 June 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 1192–1201. [Google Scholar]
  28. Xu, C.; Wang, J.; Yang, W.; Yu, H.; Yu, L.; Xia, G.S. Rfla: Gaussian receptive field based label assignment for tiny object detection. In Proceedings of the European Conference on Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, 23–27 October 2022; Proceedings, Part IX. Springer: Cham, Switzerland, 2022; pp. 526–543. [Google Scholar]
  29. Zhou, Z.; Zhu, Y. Kldet: Detecting tiny objects in remote sensing images via kullback-leibler divergence. IEEE Trans. Geosci. Remote Sens. 2024, 62, 4703316. [Google Scholar] [CrossRef]
  30. Li, Z.; Dong, Y.; Shen, L.; Liu, Y.; Pei, Y.; Yang, H.; Zheng, L.; Ma, J. Development and challenges of object detection: A survey. Neurocomputing 2024, 598, 128102. [Google Scholar] [CrossRef]
  31. Ming, Q.; Miao, L.; Zhou, Z.; Vercheval, N.; Pižurica, A. Not all boxes are equal: Learning to optimize bounding boxes with discriminative distributions in optical remote sensing images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5622514. [Google Scholar] [CrossRef]
  32. Cai, Z.; Vasconcelos, N. Cascade r-cnn: Delving into high quality object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 6154–6162. [Google Scholar]
  33. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 2980–2988. [Google Scholar]
  34. Tian, Z.; Shen, C.; Chen, H.; He, T. Fcos: Fully convolutional one-stage object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 9627–9636. [Google Scholar]
  35. Law, H.; Deng, J. Cornernet: Detecting objects as paired keypoints. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; Springer: Cham, Switzerland, 2018; pp. 734–750. [Google Scholar]
  36. Duan, K.; Bai, S.; Xie, L.; Qi, H.; Huang, Q.; Tian, Q. Centernet: Keypoint triplets for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 6569–6578. [Google Scholar]
  37. Zhang, R.; Lei, Y. Afgn: Attention feature guided network for object detection in optical remote sensing image. Neurocomputing 2024, 610, 128527. [Google Scholar] [CrossRef]
  38. Zhang, Y.; Liu, T.; Yu, P.; Wang, S.; Tao, R. Sfsanet: Multiscale object detection in remote sensing image based on semantic fusion and scale adaptability. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–10. [Google Scholar] [CrossRef]
  39. Ma, W.; Li, N.; Zhu, H.; Jiao, L.; Tang, X.; Guo, Y.; Hou, B. Feature split–merge–enhancement network for remote sensing object detection. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–17. [Google Scholar] [CrossRef]
  40. Liu, Y.; Li, Q.; Yuan, Y.; Du, Q.; Wang, Q. Abnet: Adaptive balanced network for multiscale object detection in remote sensing imagery. IEEE Trans. Geosci. Remote Sens. 2021, 60, 1–14. [Google Scholar] [CrossRef]
  41. Han, J.; Ding, J.; Li, J.; Xia, G.S. Align deep features for oriented object detection. IEEE Trans. Geosci. Remote Sens. 2021, 60, 1–11. [Google Scholar] [CrossRef]
  42. Cheng, G.; Yao, Y.; Li, S.; Li, K.; Xie, X.; Wang, J.; Yao, X.; Han, J. Dual-aligned oriented detector. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–11. [Google Scholar] [CrossRef]
  43. Yao, Y.; Cheng, G.; Wang, G.; Li, S.; Zhou, P.; Xie, X.; Han, J. On improving bounding box representations for oriented object detection. IEEE Trans. Geosci. Remote Sens. 2022, 61, 1–11. [Google Scholar] [CrossRef]
  44. Yu, L.; Zhi, X.; Hu, J.; Zhang, S.; Niu, R.; Zhang, W.; Jiang, S. Improved deformable convolution method for aircraft object detection in flight based on feature separation in remote sensing images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 8313–8323. [Google Scholar] [CrossRef]
  45. Huang, Z.; Li, W.; Xia, X.G.; Wu, X.; Cai, Z.; Tao, R. A novel nonlocal-aware pyramid and multiscale multitask refinement detector for object detection in remote sensing images. IEEE Trans. Geosci. Remote Sens. 2021, 60, 1–20. [Google Scholar] [CrossRef]
  46. Guo, H.; Yang, X.; Wang, N.; Gao, X. A centerNet++ model for ship detection in sar images. Pattern Recognit. 2021, 112, 107787. [Google Scholar] [CrossRef]
  47. Li, M.; Gao, Y.; Cai, W.; Yang, W.; Huang, Z.; Hu, X.; Leung, V.C. Enhanced attention guided teacher–student network for weakly supervised object detection. Neurocomputing 2024, 597, 127910. [Google Scholar] [CrossRef]
  48. Zhang, J.; Ye, B.; Zhang, Q.; Gong, Y.; Lu, J.; Zeng, D. A visual knowledge oriented approach for weakly supervised remote sensing object detection. Neurocomputing 2024, 597, 128114. [Google Scholar] [CrossRef]
  49. Cheng, G.; Yan, B.; Shi, P.; Li, K.; Yao, X.; Guo, L.; Han, J. Prototype-cnn for few-shot object detection in remote sensing images. IEEE Trans. Geosci. Remote Sens. 2021, 60, 1–10. [Google Scholar] [CrossRef]
  50. Lu, X.; Sun, X.; Diao, W.; Mao, Y.; Li, J.; Zhang, Y.; Wang, P.; Fu, K. Few-shot object detection in aerial imagery guided by text-modal knowledge. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–19. [Google Scholar] [CrossRef]
  51. Akyon, F.C.; Altinuc, S.O.; Temizel, A. Slicing aided hyper inference and fine-tuning for small object detection. In Proceedings of the 2022 IEEE International Conference on Image Processing (ICIP), Bordeaux, France, 16–19 October 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 966–970. [Google Scholar]
  52. Kisantal, M.; Wojna, Z.; Murawski, J.; Naruniec, J.; Cho, K. Augmentation for small object detection. arXiv 2019, arXiv:1902.07296. [Google Scholar] [CrossRef]
  53. Yu, X.; Gong, Y.; Jiang, N.; Ye, Q.; Han, Z. Scale match for tiny person detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Snowmass Village, CO, USA, 1–5 March 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 1257–1265. [Google Scholar]
  54. Liu, H.I.; Tseng, Y.W.; Chang, K.C.; Wang, P.J.; Shuai, H.H.; Cheng, W.H. A denoising fpn with transformer r-cnn for tiny object detection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 4704415. [Google Scholar] [CrossRef]
  55. Tong, X.; Su, S.; Wu, P.; Guo, R.; Wei, J.; Zuo, Z.; Sun, B. Msaffnet: A multiscale label-supervised attention feature fusion network for infrared small target detection. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–16. [Google Scholar] [CrossRef]
  56. Zhao, Z.; Du, J.; Li, C.; Fang, X.; Xiao, Y.; Tang, J. Dense tiny object detection: A scene context guided approach and a unified benchmark. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5606913. [Google Scholar] [CrossRef]
  57. Li, S.; Tong, Q.; Liu, X.; Cui, Z.; Liu, X. Ma2-fpn for tiny object detection from remote sensing images. In Proceedings of the 2022 15th International Congress on Image and Signal Processing, BioMedical Engineering and Informatics (CISP-BMEI), Beijing, China, 5–7 November 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 1–8. [Google Scholar]
  58. Shi, T.; Gong, J.; Hu, J.; Zhi, X.; Zhu, G.; Yuan, B.; Sun, Y.; Zhang, W. Adaptive feature fusion with attention-guided small target detection in remote sensing images. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5623116. [Google Scholar] [CrossRef]
  59. Liu, D.; Zhang, J.; Qi, Y.; Wu, Y.; Zhang, Y. A tiny object detection method based on explicit semantic guidance for remote sensing images. IEEE Geosci. Remote Sens. Lett. 2024, 21, 1–5. [Google Scholar] [CrossRef]
  60. Gong, Y.; Yu, X.; Ding, Y.; Peng, X.; Zhao, J.; Han, Z. Effective fusion factor in fpn for tiny object detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Virtual, 5–9 January 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 1160–1168. [Google Scholar]
  61. Zhang, X.; Zhang, X.; Cao, S.Y.; Yu, B.; Zhang, C.; Shen, H.L. Mrf3net: An infrared small target detection network using multireceptive field perception and effective feature fusion. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5629414. [Google Scholar] [CrossRef]
  62. Chen, P.; Wang, J.; Zhang, Z.; He, C. Frli-net: Feature reconstruction and learning interaction network for tiny object detection in remote sensing images. IEEE Signal Process. Lett. 2025, 32, 2159–2163. [Google Scholar] [CrossRef]
  63. Lin, T.Y.; Dollar, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 2117–2125. [Google Scholar]
  64. Li, J.; Liang, X.; Wei, Y.; Xu, T.; Feng, J.; Yan, S. Perceptual generative adversarial networks for small object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 1222–1230. [Google Scholar]
  65. Lin, C.; Mao, X.; Qiu, C.; Zou, L. Dtcnet: Transformer-cnn distillation for super-resolution of remote sensing image. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 11117–11133. [Google Scholar] [CrossRef]
  66. He, K.; Gkioxari, G.; Dollar, P.; Girshick, R. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 2961–2969. [Google Scholar]
  67. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 770–778. [Google Scholar]
  68. Zhang, Q.; Dong, Y.; Zheng, Y.; Yu, H.; Song, M.; Zhang, L.; Yuan, Q. Three-dimension spatial-spectral attention transformer for hyperspectral image denoising. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5531213. [Google Scholar] [CrossRef]
  69. Wang, X.; Chen, Y.; Zhu, W. A survey on curriculum learning. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 4555–4576. [Google Scholar] [CrossRef]
  70. Chen, J.; Chen, K.; Chen, H.; Zou, Z.; Shi, Z. A degraded reconstruction enhancement-based method for tiny ship detection in remote sensing images with a new large-scale dataset. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–14. [Google Scholar] [CrossRef]
  71. Everingham, M.; Van Gool, L.; Williams, C.K.; Winn, J.; Zisserman, A. The pascal visual object classes (voc) challenge. Int. J. Comput. Vis. 2010, 88, 303–338. [Google Scholar] [CrossRef]
  72. Oksuz, K.; Cam, B.C.; Akbas, E.; Kalkan, S. Localization recall precision (LRP): A new performance metric for object detection. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; Springer: Cham, Switzerland, 2018; pp. 504–519. [Google Scholar]
  73. Oksuz, K.; Cam, B.C.; Kalkan, S.; Akbas, E. One metric to measure them all: Localisation recall precision (lrp) for evaluating visual detection tasks. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 9446–9463. [Google Scholar] [CrossRef]
  74. Chen, K.; Wang, J.; Pang, J.; Cao, Y.; Xiong, Y.; Li, X.; Sun, S.; Feng, W.; Liu, Z.; Xu, J.; et al. Mmdetection: Open mmlab detection toolbox and benchmark. arXiv 2019, arXiv:1906.07155. [Google Scholar] [CrossRef]
  75. Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; IEEE: Piscataway, NJ, USA, 2009; pp. 248–255. [Google Scholar]
  76. Neubeck, A.; Van Gool, L. Efficient non-maximum suppression. In Proceedings of the 18th International Conference on Pattern Recognition (ICPR’06), Hong Kong, China, 20–24 August 2006; IEEE: Piscataway, NJ, USA, 2006; Volume 3, pp. 850–855. [Google Scholar]
  77. Li, Y.; Chen, Y.; Wang, N.; Zhang, Z. Scale-aware trident networks for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 6054–6063. [Google Scholar]
  78. Qiao, S.; Chen, L.C.; Yuille, A. Detectors: Detecting objects with recursive feature pyramid and switchable atrous convolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Virtual, 19–25 June 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 10213–10224. [Google Scholar]
  79. Zhang, S.; Chi, C.; Yao, Y.; Lei, Z.; Li, S.Z. Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Virtual, 14–19 June 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 9759–9768. [Google Scholar]
  80. Yang, Z.; Liu, S.; Hu, H.; Wang, L.; Lin, S. Reppoints: Point set representation for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 9657–9666. [Google Scholar]
  81. Lu, X.; Li, B.; Yue, Y.; Li, Q.; Yan, J. Grid r-cnn. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 7363–7372. [Google Scholar]
  82. Kong, T.; Sun, F.; Liu, H.; Jiang, Y.; Li, L.; Shi, J. Foveabox: Beyound anchor-based object detection. IEEE Trans. Image Process. 2020, 29, 7389–7398. [Google Scholar] [CrossRef]
  83. Li, Q.; Mou, L.; Liu, Q.; Wang, Y.; Zhu, X.X. Hsf-net: Multiscale deep feature embedding for ship detection in optical remote sensing imagery. IEEE Trans. Geosci. Remote Sens. 2018, 56, 7147–7161. [Google Scholar] [CrossRef]
  84. Tan, M.; Pang, R.; Le, Q.V. Efficientdet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Virtual, 14–19 June 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 10781–10790. [Google Scholar]
  85. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 7132–7141. [Google Scholar]
  86. Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. Eca-net: Efficient channel attention for deep convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Virtual, 14–19 June 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 11534–11542. [Google Scholar]
  87. Dai, Y.; Gieseke, F.; Oehmcke, S.; Wu, Y.; Barnard, K. Attentional feature fusion. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Virtual, 5–9 January 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 3560–3569. [Google Scholar]
Figure 1. (a) Examples of tiny objects in remote sensing images. These objects are characterized by their tiny size and dense distribution, as illustrated in the blue boxes. (b) IoU tolerance analysis for normal and tiny objects. A i denotes the ground truth box, and B i and C i represent predicted bounding boxes with offsets of one pixel and three pixels towards the bottom right, respectively. Slight movement leads to a significant drop in IoU for the tiny object (from 0.92 to 0.20) compared to the normal object (from 0.93 to 0.81).
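For intuition, the IoU sensitivity illustrated in Figure 1b can be reproduced in a few lines. The sketch below shifts an axis-aligned box toward the bottom right by one and three pixels; the 8x8 (tiny) and 64x64 (normal) box sizes are illustrative assumptions, so the printed values follow the trend shown in Figure 1b rather than matching its exact numbers.

```python
def iou(box_a, box_b):
    # boxes are (x1, y1, x2, y2)
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

for size in (8, 64):            # assumed tiny vs. normal object sizes
    gt = (0, 0, size, size)
    for offset in (1, 3):       # one- and three-pixel shifts toward the bottom right
        pred = (offset, offset, size + offset, size + offset)
        print(f"{size}x{size} box, {offset}px shift: IoU = {iou(gt, pred):.2f}")
```

The small box loses most of its overlap after a three-pixel shift, while the large box remains well above typical positive-sample thresholds, which is the motivation for moving beyond plain IoU-based assignment for tiny objects.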
Figure 2. The overall framework of the proposed DCEDet. For an input image, the contextual enhancement features are first extracted by the Backbone and GSCEM. Then, the features are fed into the FPN equipped with GLFFM for feature fusion to generate a feature pyramid. Next, the RPN utilizes NDDM to provide reliable proposals. Finally, the classification and regression branches with cross-entropy loss and L1 loss classify and localize the object instances in the image, respectively.
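The sketch below mirrors the data flow described in Figure 2 (backbone with GSCEM, FPN with GLFFM, then an NDDM-guided RPN and the detection heads). It is a structural illustration only: the GSCEM and GLFFM bodies are shape-preserving placeholders rather than the modules defined in the paper, and the class names are hypothetical.

```python
# Structural sketch of the Figure 2 pipeline: backbone + GSCEM -> FPN with GLFFM ->
# feature pyramid fed to an RPN whose label assignment uses NDDM (not shown here).
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision

class GSCEMPlaceholder(nn.Module):
    """Stands in for the Group-Single Context Enhancement Module (internals assumed)."""
    def __init__(self, channels):
        super().__init__()
        self.context = nn.Conv2d(channels, channels, 3, padding=1)
    def forward(self, x):
        return x + self.context(x)          # contextual features as a residual supplement

class GLFFMPlaceholder(nn.Module):
    """Stands in for the Global-Local Feature Fusion Module (internals assumed)."""
    def __init__(self, channels):
        super().__init__()
        self.local = nn.Conv2d(channels, channels, 3, padding=1)
    def forward(self, x):
        g = torch.sigmoid(F.adaptive_avg_pool2d(x, 1))   # coarse global-distribution cue
        return self.local(x) * g + x                     # fused with local detail

class DCEDetSketch(nn.Module):
    def __init__(self, fpn_channels=256):
        super().__init__()
        resnet = torchvision.models.resnet50(weights=None)  # pretrained weights omitted in this sketch
        self.stem = nn.Sequential(resnet.conv1, resnet.bn1, resnet.relu, resnet.maxpool)
        self.stages = nn.ModuleList([resnet.layer1, resnet.layer2, resnet.layer3, resnet.layer4])
        stage_channels = [256, 512, 1024, 2048]
        self.gscem = nn.ModuleList(GSCEMPlaceholder(c) for c in stage_channels)
        self.lateral = nn.ModuleList(nn.Conv2d(c, fpn_channels, 1) for c in stage_channels)
        self.glffm = nn.ModuleList(GLFFMPlaceholder(fpn_channels) for _ in stage_channels)

    def forward(self, images):
        x = self.stem(images)
        enhanced = []
        for stage, gscem in zip(self.stages, self.gscem):
            x = stage(x)
            enhanced.append(gscem(x))                 # backbone feature + contextual enhancement
        laterals = [lat(f) for lat, f in zip(self.lateral, enhanced)]
        for i in range(len(laterals) - 1, 0, -1):     # standard top-down FPN pathway
            laterals[i - 1] = laterals[i - 1] + F.interpolate(
                laterals[i], size=laterals[i - 1].shape[-2:], mode="nearest")
        return [glffm(p) for glffm, p in zip(self.glffm, laterals)]

feats = DCEDetSketch()(torch.randn(1, 3, 256, 256))
print([tuple(f.shape) for f in feats])   # four pyramid levels, 256 channels each
```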
Figure 3. Confusion matrices of baseline (a) and DCEDet (b) on the AI-TODv2 test set. The proposed DCEDet achieves better detection performance.
Figure 4. Visualization of the detection results from our DCEDet on the AI-TODv2 test set, where each color represents a category. The proposed DCEDet accurately detects tiny objects in various scenarios. For clarity, we retain only the detection boxes, ignoring labels and confidence scores.
Figure 5. Visual comparison of the proposed method (DCEDet) and other methods (Faster R-CNN, RetinaNet, and Cascade R-CNN w/NWD-RKA) on the AI-TODv2 test set. The green, blue, and red boxes represent the true positive (TP), false positive (FP), and false negative (FN) predictions, respectively. For clarity, we retain only the detection boxes, ignoring labels and confidence scores.
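The TP/FP/FN split visualized in Figure 5 follows the usual IoU-matching convention for detection evaluation. The sketch below shows one common greedy assignment of predictions to ground-truth boxes; the 0.5 IoU threshold and the dictionary format of the predictions are assumptions for illustration (the iou helper from the earlier sketch can be passed as iou_fn).

```python
# Greedy TP / FP / FN split: each prediction, taken in descending score order,
# claims the best-overlapping unmatched ground-truth box if IoU >= thr.
def split_tp_fp_fn(preds, gts, iou_fn, thr=0.5):
    matched = set()
    tp, fp = [], []
    for p in sorted(preds, key=lambda d: d["score"], reverse=True):
        best_iou, best_gt = 0.0, None
        for i, g in enumerate(gts):
            if i in matched:
                continue
            v = iou_fn(p["box"], g)
            if v > best_iou:
                best_iou, best_gt = v, i
        if best_iou >= thr:
            matched.add(best_gt)    # true positive: matched to an unused ground truth
            tp.append(p)
        else:
            fp.append(p)            # false positive: no sufficiently overlapping ground truth
    fn = [g for i, g in enumerate(gts) if i not in matched]  # missed ground truths
    return tp, fp, fn
```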
Figure 6. Visual comparison of the proposed method (DCEDet) and other methods (Faster R-CNN, RetinaNet, and EfficientDet) on the LEVIR-SHIP test set. Each row from top to bottom corresponds to “calm sea,” “thin cloud,” “thick cloud,” and “fractal cloud,” respectively. The proposed DCEDet successfully detects tiny ships in various environments. For clarity, we retain only the detection boxes, ignoring labels and confidence scores.
Table 1. Ablation experiments on the effectiveness of each module of the proposed DCEDet. The table reports the all-class performance on the AI-TODv2 test set. Bold fonts indicate the best performance. In the last row, DCEDet performs best.
GSCEM | GLFFM | NDDM | AP | AP50 | AP75 | APvt | APt | APs | APm | oLRP | oLRP_IoU | oLRP_FP | oLRP_FN
– | – | – | 12.4 | 28.7 | 8.6 | 0.0 | 8.8 | 24.2 | 36.8 | 88.3 | 30.1 | 48.2 | 68.3
✓ | – | – | 14.1 | 31.8 | 10.5 | 0.1 | 10.8 | 27.1 | 38.0 | 86.6 | 28.4 | 42.8 | 65.4
– | ✓ | – | 13.2 | 30.3 | 9.5 | 0.0 | 9.7 | 25.3 | 37.5 | 87.5 | 29.2 | 33.9 | 67.2
– | – | ✓ | 20.3 | 50.3 | 12.5 | 5.8 | 19.5 | 26.3 | 35.1 | 82.1 | 29.3 | 36.0 | 49.2
✓ | ✓ | – | 14.4 | 32.1 | 10.9 | 0.1 | 10.9 | 27.1 | 38.6 | 86.4 | 28.3 | 43.8 | 65.2
✓ | ✓ | ✓ | 23.5 | 53.9 | 16.8 | 8.5 | 24.1 | 28.1 | 37.1 | 78.8 | 27.7 | 32.4 | 44.7
Table 2. Ablation experiments on the effectiveness of each module of the proposed DCEDet. The table reports the class-wise AP and class-wise oLRP on the AI-TODv2 test set. Bold fonts indicate the best performance. In the last row, DCEDet performs best.
GSCEM | GLFFM | NDDM | AI | BR | ST | SH | SP | VE | PE | WM
(each class cell reports AP / oLRP)
– | – | – | 25.6/77.0 | 3.4/96.3 | 19.1/81.8 | 20.2/82.1 | 12.7/86.8 | 13.8/87.2 | 4.2/95.3 | 0.1/99.9
✓ | – | – | 27.7/75.2 | 7.9/91.8 | 21.1/80.3 | 23.3/79.1 | 11.9/86.5 | 14.9/86.2 | 5.4/94.5 | 0.4/98.9
– | ✓ | – | 26.9/75.4 | 6.4/93.7 | 19.7/81.0 | 21.1/81.3 | 12.5/87.0 | 14.2/86.8 | 4.7/95.0 | 0.2/99.9
– | – | ✓ | 28.5/75.8 | 14.1/87.5 | 32.6/71.3 | 34.1/69.9 | 14.2/86.4 | 24.7/78.6 | 8.8/92.1 | 4.9/95.4
✓ | ✓ | – | 27.3/75.1 | 9.9/90.1 | 20.5/81.1 | 23.7/78.7 | 12.8/86.4 | 15.3/85.7 | 5.7/94.3 | 0.1/99.5
✓ | ✓ | ✓ | 32.4/71.5 | 17.5/83.5 | 34.2/69.5 | 45.9/58.3 | 15.0/86.0 | 26.9/76.1 | 10.9/90.4 | 5.3/94.7
Table 3. All-class performance of the proposed DCEDet and other methods on the AI-TODv2 test set. Underlined and bold fonts indicate the best results under the 1× and 3× training schedules, respectively. In the last four rows, * denotes that the method was trained for 36 epochs. The proposed DCEDet * achieves the best performance with 26.8% AP.
Method | Publication | Backbone | AP | AP50 | AP75 | APvt | APt | APs | APm
Anchor-based two-stage
Faster R-CNN [11] | TPAMI 2016 | ResNet-50 | 12.8 | 29.9 | 9.4 | 0.0 | 9.2 | 24.6 | 37.0
Faster R-CNN [11] | TPAMI 2016 | ResNet-101 | 13.1 | 30.7 | 9.2 | 0.0 | 9.7 | 24.6 | 35.5
Faster R-CNN [11] | TPAMI 2016 | HRNet-w32 | 14.5 | 32.8 | 10.6 | 0.1 | 11.1 | 27.4 | 37.8
Cascade R-CNN [32] | CVPR 2018 | ResNet-50 | 15.1 | 34.2 | 11.2 | 0.1 | 11.5 | 26.7 | 38.5
TridentNet [77] | ICCV 2019 | ResNet-50 | 10.1 | 24.5 | 6.7 | 0.1 | 6.3 | 19.8 | 31.9
DetectoRS [78] | CVPR 2021 | ResNet-50 | 16.1 | 35.5 | 12.5 | 0.1 | 12.6 | 28.3 | 40.0
DotD [27] | CVPR 2021 | ResNet-50 | 20.4 | 51.4 | 12.3 | 8.5 | 21.1 | 24.6 | 30.4
Cascade R-CNN w/NWD-RKA [21] | ISPRS 2022 | ResNet-50 | 22.2 | 52.5 | 15.1 | 7.8 | 21.8 | 28.0 | 37.2
Anchor-based one-stage
SSD [12] | ECCV 2016 | VGG-16 | 10.7 | 32.5 | 4.0 | 2.0 | 8.7 | 16.8 | 28.0
RetinaNet [33] | CVPR 2017 | ResNet-50 | 8.9 | 24.2 | 4.6 | 2.7 | 8.4 | 13.1 | 20.4
ATSS [79] | CVPR 2020 | ResNet-50 | 13.0 | 31.0 | 8.7 | 2.3 | 11.2 | 18.0 | 29.9
Anchor-free
FCOS [34] | ICCV 2019 | ResNet-50 | 12.0 | 30.2 | 7.3 | 2.2 | 11.1 | 16.6 | 26.9
RepPoints [80] | ICCV 2019 | ResNet-50 | 9.3 | 23.6 | 5.4 | 2.8 | 10.0 | 12.3 | 18.9
Grid R-CNN [81] | CVPR 2019 | ResNet-50 | 14.3 | 31.1 | 11.0 | 0.1 | 11.0 | 25.7 | 36.7
FoveaBox [82] | TIP 2020 | ResNet-50 | 11.3 | 28.1 | 7.4 | 1.4 | 8.6 | 17.8 | 32.2
FSANet [25] | TGRS 2022 | ResNet-50 | 17.6 | 45.0 | 10.5 | 5.4 | 15.8 | 22.9 | 33.8
ORFENet [24] | TGRS 2024 | ResNet-50 | 18.9 | 44.4 | 12.7 | 6.9 | 18.4 | 23.4 | 30.3
ESG_TODNet [59] | GRSL 2024 | ResNet-50 | 19.9 | 47.7 | 13.6 | 6.1 | 19.3 | 24.7 | 30.4
FRLI-Net [62] | SPL 2025 | ResNet-50 | 20.1 | 48.5 | 13.5 | 6.1 | 20.8 | 25.9 | 31.8
DCEDet (Ours) | – | ResNet-50 | 23.5 | 53.9 | 16.8 | 8.5 | 24.1 | 28.1 | 37.1
Cascade R-CNN w/NWD-RKA * [21] | ISPRS 2022 | ResNet-50 | 25.1 | 55.4 | 18.9 | 10.1 | 25.0 | 29.2 | 38.8
ORFENet * [24] | TGRS 2024 | ResNet-50 | 24.8 | 55.4 | 18.2 | 9.7 | 24.4 | 28.7 | 35.1
ESG_TODNet * [59] | GRSL 2024 | ResNet-50 | 24.6 | 55.1 | 18.1 | 9.5 | 24.0 | 29.4 | 35.6
DCEDet * (Ours) | – | ResNet-50 | 26.8 | 55.8 | 21.9 | 11.2 | 27.8 | 31.2 | 40.2
Table 4. Class-wise AP of the proposed DCEDet and other methods on the AI-TODv2 test set. Underlined and bold fonts indicate the best results under the 1× and 3× training schedules, respectively. In this table, AP is used as the evaluation metric for each class. The abbreviations for the categories are as follows: airplane (AI), bridge (BR), storage-tank (ST), ship (SH), swimming-pool (SP), vehicle (VE), person (PE), and wind-mill (WM). In the last four rows, * denotes that the method was trained for 36 epochs. The proposed DCEDet * achieves the best performance in the AI, ST, SH, SP, VE, and PE classes.
Method | Publication | Backbone | AI | BR | ST | SH | SP | VE | PE | WM
Anchor-based two-stage
Faster R-CNN [11] | TPAMI 2016 | ResNet-50 | 19.7 | 4.8 | 19.0 | 19.9 | 3.7 | 14.4 | 4.8 | 0.0
Faster R-CNN [11] | TPAMI 2016 | ResNet-101 | 25.3 | 8.5 | 19.4 | 19.9 | 12.5 | 14.6 | 4.5 | 0.0
Faster R-CNN [11] | TPAMI 2016 | HRNet-w32 | 27.9 | 9.5 | 21.5 | 21.4 | 13.0 | 16.7 | 5.9 | 0.0
Cascade R-CNN [32] | CVPR 2018 | ResNet-50 | 26.2 | 9.6 | 24.0 | 24.3 | 13.2 | 17.5 | 5.8 | 0.1
TridentNet [77] | ICCV 2019 | ResNet-50 | 19.3 | 0.1 | 17.2 | 16.2 | 12.4 | 12.5 | 3.4 | 0.0
DetectoRS [78] | CVPR 2021 | ResNet-50 | 28.5 | 11.7 | 23.2 | 26.4 | 14.9 | 17.6 | 6.5 | 0.2
DotD [27] | CVPR 2021 | ResNet-50 | 18.7 | 17.5 | 34.7 | 37.0 | 12.4 | 25.4 | 10.3 | 7.4
Cascade R-CNN w/NWD-RKA [21] | ISPRS 2022 | ResNet-50 | 28.5 | 17.5 | 36.9 | 38.3 | 13.7 | 26.6 | 10.4 | 5.7
Anchor-based one-stage
SSD [12] | ECCV 2016 | VGG-16 | 14.9 | 9.6 | 13.2 | 18.2 | 10.6 | 12.7 | 2.9 | 3.1
RetinaNet [33] | CVPR 2017 | ResNet-50 | 1.3 | 11.8 | 14.3 | 23.6 | 5.8 | 11.4 | 2.3 | 0.5
ATSS [79] | CVPR 2020 | ResNet-50 | 15.4 | 11.7 | 20.0 | 27.6 | 9.4 | 14.8 | 4.7 | 0.0
Anchor-free
FCOS [34] | ICCV 2019 | ResNet-50 | 7.2 | 13.4 | 20.2 | 26.7 | 8.4 | 16.3 | 3.5 | 0.0
RepPoints [80] | ICCV 2019 | ResNet-50 | 0.0 | 0.1 | 22.5 | 28.8 | 0.2 | 18.3 | 4.1 | 0.0
Grid R-CNN [81] | CVPR 2019 | ResNet-50 | 24.5 | 11.7 | 20.9 | 23.5 | 12.1 | 16.1 | 5.1 | 0.4
FoveaBox [82] | TIP 2020 | ResNet-50 | 15.6 | 3.3 | 21.1 | 20.8 | 9.7 | 16.3 | 4.0 | 0.0
FSANet [25] | TGRS 2022 | ResNet-50 | 19.2 | 16.0 | 28.3 | 33.0 | 12.9 | 20.4 | 6.0 | 5.3
ORFENet [24] | TGRS 2024 | ResNet-50 | 14.6 | 18.8 | 32.2 | 38.2 | 13.1 | 25.5 | 8.4 | 0.0
ESG_TODNet [59] | GRSL 2024 | ResNet-50 | 17.5 | 18.2 | 34.1 | 37.8 | 13.0 | 25.1 | 8.0 | 5.2
FRLI-Net [62] | SPL 2025 | ResNet-50 | 17.3 | 19.3 | 33.7 | 37.9 | 14.2 | 25.5 | 8.9 | 6.1
DCEDet (Ours) | – | ResNet-50 | 32.4 | 17.5 | 34.2 | 45.9 | 15.0 | 26.9 | 10.9 | 5.3
Cascade R-CNN w/NWD-RKA * [21] | ISPRS 2022 | ResNet-50 | 32.3 | 16.8 | 36.2 | 53.1 | 16.6 | 27.0 | 12.0 | 6.8
ORFENet * [24] | TGRS 2024 | ResNet-50 | 26.0 | 21.1 | 35.8 | 50.6 | 17.1 | 27.8 | 11.0 | 8.7
ESG_TODNet * [59] | GRSL 2024 | ResNet-50 | 26.6 | 20.5 | 35.5 | 50.7 | 16.5 | 27.8 | 11.2 | 8.0
DCEDet * (Ours) | – | ResNet-50 | 34.7 | 19.8 | 37.0 | 54.0 | 18.4 | 29.1 | 14.3 | 7.1
Table 5. The performance of the proposed DCEDet and other methods on the LEVIR-SHIP test set. Bold and underlined fonts indicate the best and second best performance. The proposed DCEDet achieves the best result with 81.2% AP 50 and 13.4% AP t .
Method | AP50 | APt
Faster R-CNN [11] | 69.9 | 7.0
FCOS [34] | 75.1 | 10.8
SSD [12] | 51.1 | 3.4
RetinaNet [33] | 73.7 | 10.6
HSF-Net [83] | 73.4 | 8.7
CenterNet [36] | 77.9 | 10.1
EfficientDet [84] | 79.4 | 13.1
DCEDet (Ours) | 81.2 | 13.4
Table 6. Analytical experiments on determining the size of each convolution kernel in GSCEM-G. The table reports the all-class performance on the AI-TODv2 test set. Bold fonts indicate the best performance.
Kernel Sizes | AP | AP50 | AP75 | APvt | APt | APs | APm
1-3-5-7 | 13.6 | 30.9 | 10.0 | 0.0 | 9.7 | 26.3 | 38.1
3-3-3-3 | 13.7 | 31.4 | 10.0 | 0.0 | 9.6 | 26.4 | 38.1
3-5-7-9 | 13.9 | 31.7 | 10.3 | 0.1 | 10.4 | 26.6 | 38.0
5-5-5-5 | 13.6 | 31.1 | 9.8 | 0.1 | 10.2 | 26.2 | 37.4
5-7-9-11 | 13.5 | 30.6 | 10.3 | 0.0 | 9.6 | 26.2 | 38.0
7-7-7-7 | 13.5 | 30.5 | 10.0 | 0.0 | 9.8 | 25.9 | 38.5
Table 7. Analytical experiments on determining the channel attention mechanism in GSCEM. The table reports the all-class performance on the AI-TODv2 test set. Bold fonts indicate the best performance.
Attention | AP | AP50 | AP75 | APvt | APt | APs | APm
– | 13.9 | 31.7 | 10.3 | 0.1 | 10.4 | 26.6 | 38.0
SE | 13.1 | 29.7 | 9.7 | 0.0 | 9.3 | 25.8 | 37.7
ECA | 13.3 | 30.3 | 9.8 | 0.1 | 9.5 | 25.8 | 37.8
MS-CAM | 13.4 | 30.6 | 9.6 | 0.0 | 9.9 | 26.2 | 38.0
GSCEM-S | 14.1 | 31.8 | 10.5 | 0.1 | 10.8 | 27.1 | 38.0
Table 8. Analytical experiments of GSCEM internal components. The table reports the all-class performance on the AI-TODv2 test set. Bold fonts indicate the best performance.
Model | AP | AP50 | AP75 | APvt | APt | APs | APm
Baseline | 12.4 | 28.7 | 8.6 | 0.0 | 8.8 | 24.2 | 36.8
Baseline_GSCEM-G | 13.9 | 31.7 | 10.3 | 0.1 | 10.4 | 26.6 | 38.0
Baseline_GSCEM-S | 13.8 | 31.2 | 10.3 | 0.0 | 10.0 | 26.7 | 38.8
Baseline_GSCEM-G_GSCEM-S | 14.1 | 31.8 | 10.5 | 0.1 | 10.8 | 27.1 | 38.0
Table 9. Analytical experiments on determining the training scheduler α t in NDDM. The table reports the all-class performance on the AI-TODv2 test set. Bold fonts indicate the best performance.
α t | AP | AP50 | AP75 | APvt | APt | APs | APm
Baseline_NDDM_linear | 18.8 | 47.0 | 11.5 | 3.6 | 18.3 | 25.2 | 34.3
Baseline_NDDM_root | 19.6 | 48.5 | 12.2 | 4.3 | 19.0 | 25.7 | 35.2
Baseline_NDDM_exponential | 15.9 | 41.4 | 8.8 | 2.8 | 14.4 | 22.1 | 33.5
Table 10. Hyperparameter determination experiments in NDDM. r 0 and r 1 represent the initial value and the end value of the training scheduler, respectively. The table reports the all-class performance on the AI-TODv2 validation set. Bold fonts indicate the best performance.
r 1 (rows) \ r 0 (columns) | 0.00 | 0.10 | 0.15 | 0.20 | 0.25 | 0.30
(each cell reports AP / APvt / APt)
0.80 | 19.8/4.0/19.2 | 20.3/4.8/19.8 | 20.3/3.5/19.8 | 20.2/3.8/19.8 | 20.3/4.2/20.0 | 19.9/3.9/20.0
0.85 | 20.0/3.3/19.8 | 20.3/4.2/20.0 | 20.0/4.6/20.1 | 20.1/5.0/20.0 | 20.3/4.7/20.2 | 20.6/5.2/20.3
0.90 | 19.5/3.3/19.0 | 19.9/3.2/20.0 | 20.0/3.9/19.9 | 20.4/3.5/19.7 | 20.0/4.8/19.7 | 20.1/3.4/19.8
0.95 | 19.7/3.6/19.3 | 20.1/4.2/20.2 | 20.2/3.5/19.9 | 20.2/3.3/20.1 | 20.1/3.0/20.1 | 20.3/3.1/20.0
1.00 | 19.7/2.6/19.4 | 20.1/2.7/20.0 | 20.2/3.3/20.2 | 20.2/3.3/19.6 | 20.3/2.6/19.7 | 20.2/2.4/19.8
Table 11. Analytical experiments on the training scheduler in NDDM. The table reports the all-class performance on the AI-TODv2 test set. Bold fonts indicate the best performance.
α t | AP | AP50 | AP75 | APvt | APt | APs | APm
0 | 13.6 | 34.5 | 8.2 | 2.0 | 11.0 | 20.5 | 34.4
1 | 18.8 | 47.2 | 11.1 | 2.7 | 18.2 | 26.1 | 34.9
0.85–0.30 | 19.2 | 48.2 | 11.4 | 5.0 | 18.5 | 26.0 | 35.1
0.30–0.85 | 20.3 | 50.3 | 12.5 | 5.8 | 19.5 | 26.3 | 35.1
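Tables 9–11 compare candidate profiles and ranges for the training scheduler α t in NDDM. As a rough illustration, the sketch below ramps a value from r 0 to r 1 over training with linear, root, and exponential profiles; these functional forms and the 0.30–0.85 default (the best setting in Tables 10 and 11) are assumptions for illustration, not the exact definitions used in the paper.

```python
import math

def alpha_t(epoch, total_epochs, r0=0.30, r1=0.85, mode="root"):
    """Ramp a scalar from r0 to r1 over training with one of the profiles compared in Table 9."""
    p = epoch / max(total_epochs - 1, 1)              # training progress in [0, 1]
    if mode == "linear":
        s = p
    elif mode == "root":
        s = math.sqrt(p)
    elif mode == "exponential":
        s = (math.exp(p) - 1.0) / (math.e - 1.0)      # normalized so s(0) = 0 and s(1) = 1
    else:
        raise ValueError(f"unknown mode: {mode}")
    return r0 + (r1 - r0) * s

print([round(alpha_t(e, 12, mode="root"), 3) for e in range(12)])  # climbs from 0.30 toward 0.85
```

Under this reading, the 0.30–0.85 row of Table 11 corresponds to ramping α t upward during training, and the 0.85–0.30 row to the reverse.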
Table 12. Analytical experiments on positive samples obtained by NDDM. Pos_num represents the average number of positive samples acquired per epoch during training, where K denotes 10 3 . The table reports the all-class performance on the AI-TODv2 test set.
Model | Pos_num | AP | APvt | APt
Baseline | 244,177 (∼244K) | 12.2 | 0.0 | 8.6
Baseline + NDDM | 669,011 (∼669K) | 20.0 | 5.4 | 19.4
Baseline + GSCEM + GLFFM | 244,229 (∼244K) | 14.4 | 0.0 | 10.8
Baseline + GSCEM + GLFFM + NDDM | 669,030 (∼669K) | 23.3 | 7.2 | 23.4
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
