Article

DCN-YOLO: A Small-Object Detection Paradigm for Remote Sensing Imagery Leveraging Dilated Convolutional Networks

Meilin Xie, Qiang Tang, Yuan Tian, Xubin Feng, Heng Shi and Wei Hao
1 University of Chinese Academy of Sciences, Beijing 100049, China
2 Xi’an Institute of Optics and Precision Mechanics of CAS, Xi’an 710119, China
* Author to whom correspondence should be addressed.
Sensors 2025, 25(7), 2241; https://doi.org/10.3390/s25072241
Submission received: 28 February 2025 / Revised: 28 March 2025 / Accepted: 1 April 2025 / Published: 2 April 2025
(This article belongs to the Special Issue Computer Vision and Pattern Recognition Based on Remote Sensing)

Abstract

With the rapid development of remote sensing technology, optical remote sensing images are increasingly used in areas such as military reconnaissance, environmental monitoring, and urban planning. Because small objects occupy few pixels, have fuzzy features, and sit in complex backgrounds, conventional convolutions struggle to extract their features effectively. To address this problem, we propose using multi-scale dilated convolutions to enlarge the model's receptive field, adapt to changes in object size, capture multi-scale contextual information from the feature map, and extract richer object features. First, we propose a Dilated Convolutional Residual (DCR) module for high-level feature extraction in the network. Second, the context aggregation (CONTEXT) module performs long-range associative computation over the image, allowing the model to understand its global semantic information. Building on these modules, we propose a novel object detection method, DCN-YOLO, which achieves an AP50 of 56.6 on the AI-TOD dataset, effectively improving the detection accuracy and robustness of small objects in remote sensing images and providing a new technical approach to small-object detection in remote sensing.

1. Introduction

Small-object detection in remote sensing images refers to the task of identifying and locating small objects in high-resolution remote sensing imagery. It has important application value in fields such as military reconnaissance, environmental monitoring, and urban planning. Because remote sensing images exhibit large changes in object scale, complex backgrounds, and high similarity between objects and backgrounds, small-object detection remains a technical challenge. Traditional object detection methods rely on manual feature extraction and shallow machine learning models, which often fail to meet high accuracy and real-time requirements.
In recent years, with the rapid development of deep learning technology, deep learning-based object detection algorithms have made significant progress in the field of small-object detection in remote sensing images. These algorithms are mainly divided into two categories: one-stage and two-stage object detection algorithms. One-stage algorithms, such as the YOLO [1,2,3,4,5,6,7] series and SSD [8,9], are preferred for their fast detection speed and ease of end-to-end training; two-stage algorithms, such as Faster R-CNN [10] and Mask R-CNN [11], are slower but have an advantage in detection accuracy.
AODN [12] simultaneously detects multiple object types in remotely sensed images with large scale changes; its feature extractor is redesigned with cascaded ReLU and Inception modules, which increase the diversity of receptive field sizes. MCGR [13] proposes a multi-class cyclic super-resolution generative adversarial network for benchmarking object detection based on image super-resolution. An optimised single-shot multi-box detector (SSD) [14] enhances the feature extraction of small objects in shallow networks and improves feature fusion, effectively improving remote sensing object detection in complex scenes. MSA R-CNN [15] uses an ultra-multi-scale feature extraction network to improve feature extraction from multi-scale images and to address information loss in the feature pyramid network.
LSKNet [16] can dynamically adjust its large spatial receptive field to better model the long-range context of various objects in remote sensing scenes. GLSANet [17] designs a global semantic information interaction module to mine and enhance the high-level semantic information in the deep feature map, thereby mitigating the obstruction of complex backgrounds to foreground objects, and optimises the feature pyramid network to improve multi-scale object detection in remote sensing images. RSADet [18] takes into account the spatial distribution, scale, and orientation/shape variation of objects in remote sensing images to alleviate object occlusion and overlap. Lu et al. proposed an end-to-end network called Attention and Feature Fusion SSD [19], in which a multi-layer feature fusion structure enhances the semantic information of shallow features and a dual-path attention module sifts feature information, suppresses background noise, and highlights key features. SME-Net [20] eliminates the salient information of large objects to highlight the features of small objects in shallow feature maps and to reduce feature confusion between multi-scale objects. ABNet [21] designs an enhanced and effective channel attention mechanism to improve the feature representation capability of the backbone network, thereby reducing the obstruction of complex backgrounds to foreground objects.
DNTR [22] significantly improves small-object detection through a noise-reduction module and a transformer-enhanced detection head, but its high computational requirements make high-speed inference difficult. NWD [23] models the bounding box as a Gaussian distribution, effectively solving the IoU's sensitivity to the position deviation of small objects and improving label-assignment accuracy; however, it does not perform well when embedded in a detector with NMS, and its effect has not been verified for objects that are not extremely small. RFLA [24] alleviates the sample-imbalance problem in small-object detection through Gaussian receptive field modelling and hierarchical label assignment, but its HLA module requires multiple sorting passes and iterations during training, and it brings no improvement for one-stage detectors. HS-FPN [25] improves small-object detection through high-frequency feature enhancement and spatial dependency modelling and has a flexible structure and strong compatibility; however, high-frequency enhancement may introduce noise, and biasing attention towards small objects may reduce the detection accuracy of other objects. SimD [26] improves the quality of label assignment for small-object detection by combining adaptive evaluation metrics for position and shape similarity and has been shown to be superior on several benchmark datasets, but its fixed threshold mechanism lacks adaptability to scenes with mixed object sizes. None of these methods directly addresses the problem of feature loss in small-object detection.
To deal with the small pixel footprint, fuzzy features, and complex backgrounds of small remote sensing objects, we propose using dilated convolutions to enlarge the model's receptive field, adapt to changes in object size, capture multi-scale contextual information from the feature map, and extract richer object features, thereby improving detection performance. The CONTEXT module performs correlation calculations over the whole image, allowing the model to understand its global semantic information. We propose a new object detection framework, DCN-YOLO, which achieves excellent detection results on the AI-TOD dataset and provides a novel technical approach for small-object detection in remote sensing.
The main contributions of this paper can be summarised as follows:
I. The Dilated Convolutional Residual (DCR) module is proposed and applied to the high-level feature extraction process of the network. DCR can effectively expand the receptive field of the model, enabling it to flexibly adapt to dynamic changes in the object size and accurately capture multi-scale contextual information in the feature map, thereby extracting richer and more discriminative object features.
II. Context aggregation (CONTEXT) can perform correlation calculations on the whole image, allowing the model to understand the global semantic information of the image.
III. DCN-YOLO achieves an AP50 of 56.6 on the AI-TOD dataset, effectively improving the detection accuracy and robustness of small objects in remote sensing images.

2. Method

2.1. YOLOv5

The YOLOv5 model is one of the most widely used frameworks in the field of object detection and has a wide range of application scenarios in one-stage object detection. The DCN-YOLO model is based on the basic framework architecture of YOLOv5. YOLOv5 offers four model variants from small to large, namely, YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x, depending on the number of parameters and the computational cost. Despite the differences in model size, the basic structure remains the same. Model size is adjusted by changing the depth and width of the model, which is reflected in changes to the number of bottleneck layers and convolutional kernels. This multi-model feature gives YOLOv5 greater flexibility and versatility in practical application scenarios, allowing the appropriate model version to be flexibly selected according to different task requirements and hardware conditions. Given the comprehensive consideration of model parameters and computational complexity, YOLOv5s was selected as the baseline model in this study.
As shown in Figure 1, the YOLOv5 network architecture consists mainly of two core components: the backbone network and the head network. The main task of the backbone network is to extract low-level texture features and high-level semantic features from the input image.
These extracted features are then passed to the head network, which achieves robust semantic feature transfer by constructing a top-down enhanced feature pyramid network while propagating local textures from bottom to top.
This bidirectional feature propagation mechanism effectively solves the problem of variable object scale by enhancing object detection capabilities at different scales, enabling the model to cope with the task of detecting objects of different sizes.
In the backbone and head networks, the Concentrated-Comprehensive Convolution (C3) module is a key component, as shown in Figure 2, and usually contains several bottleneck layers. These bottleneck layers consist of two successive convolutional layers, with a residual connection introduced between them, i.e., the input is added directly to the output. This residual connection mechanism significantly mitigates the vanishing-gradient problem, which speeds up training and helps the model converge to a good parameter configuration more quickly.
The C3 module splits the input feature map into two parts based on the channel dimension. One part goes directly to the bottleneck layer for complex feature transformation, while the other part remains unchanged. After going through their own processing flows, the two parts of the feature map are finally merged together. This unique design strategy effectively reduces the amount of computation and improves the operational efficiency of the model without sacrificing model performance, allowing the model to maintain good performance in resource-constrained environments.
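As a concrete illustration of this split-transform-merge design, the following PyTorch sketch shows a C3-style block. The class and argument names (ConvBNSiLU, Bottleneck, C3) and the layer widths are our own illustrative choices, so this should be read as a minimal reconstruction of the idea rather than the official YOLOv5 implementation.

```python
import torch
import torch.nn as nn


class ConvBNSiLU(nn.Module):
    """Convolution + batch normalisation + SiLU, the basic unit used throughout."""
    def __init__(self, c_in, c_out, k=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, padding=k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))


class Bottleneck(nn.Module):
    """Two successive convolutions with a residual connection (input added to output)."""
    def __init__(self, c):
        super().__init__()
        self.cv1 = ConvBNSiLU(c, c, k=1)
        self.cv2 = ConvBNSiLU(c, c, k=3)

    def forward(self, x):
        return x + self.cv2(self.cv1(x))


class C3(nn.Module):
    """Split the channels: one branch goes through n bottlenecks, the other is kept lightweight."""
    def __init__(self, c_in, c_out, n=1):
        super().__init__()
        c_half = c_out // 2
        self.cv1 = ConvBNSiLU(c_in, c_half, k=1)        # transformed branch
        self.cv2 = ConvBNSiLU(c_in, c_half, k=1)        # bypass branch
        self.m = nn.Sequential(*[Bottleneck(c_half) for _ in range(n)])
        self.cv3 = ConvBNSiLU(2 * c_half, c_out, k=1)   # merge the two branches

    def forward(self, x):
        return self.cv3(torch.cat([self.m(self.cv1(x)), self.cv2(x)], dim=1))
```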
The Spatial Pyramid Pooling Fast (SPPF) module is another important component of YOLOv5, as shown in Figure 3. It consists of several parallel max-pooling layers with different kernel sizes. The main function of this module is to extract deep multi-scale features from the input feature map. By extracting and fusing features at different scales, the model’s ability to detect objects of different sizes is further enhanced, allowing the model to better cope with complex and changing object scale situations.
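Below is a minimal sketch of the pooling scheme described above, assuming the parallel-pooling formulation with illustrative kernel sizes of 5, 9, and 13; the released YOLOv5 SPPF obtains the same receptive fields with repeated smaller poolings, so this is an approximation rather than the exact module.

```python
import torch
import torch.nn as nn


class SPPBlock(nn.Module):
    """Parallel max-pooling at several kernel sizes, followed by channel concatenation."""
    def __init__(self, c_in, c_out, kernel_sizes=(5, 9, 13)):
        super().__init__()
        c_hidden = c_in // 2
        self.cv1 = nn.Conv2d(c_in, c_hidden, 1, bias=False)
        self.pools = nn.ModuleList(
            nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2) for k in kernel_sizes
        )
        self.cv2 = nn.Conv2d(c_hidden * (len(kernel_sizes) + 1), c_out, 1, bias=False)

    def forward(self, x):
        x = self.cv1(x)
        # concatenate the un-pooled features with the multi-scale pooled features
        return self.cv2(torch.cat([x] + [p(x) for p in self.pools], dim=1))
```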

2.2. DCN-YOLO

In DCN-YOLO, as shown in Figure 4, the backbone mainly consists of the C3 module and convolutional downsampling operations; the C3 module continues to play an important role in reducing computation and improving the overall performance of the network. The head of DCN-YOLO consists of the DCR module, the Upsample layer, the Concat layer, and the CONTEXT module. DCR, an advanced convolutional residual feature extraction module, focuses on the extraction of high-level features. It effectively increases the receptive field of the model, allowing it to flexibly adapt to dynamic changes in object size and to accurately capture multi-scale contextual information in the feature map, thereby extracting richer and more discriminative object features. The CONTEXT module has excellent long-range dependency capture capabilities and can understand the overall structure and semantic information of images well, greatly enhancing the model's handling of small objects. This allows the model to maintain high detection accuracy and stability for small objects in complex backgrounds.
Finally, the detection head part uses a combination of various loss functions such as classification loss, localisation loss, and confidence loss to achieve accurate detection of small objects in remote sensing images. The complete process of the DCN-YOLO model from feature extraction to object detection provides an efficient and accurate solution to the task of detecting small objects in remote sensing images.

2.3. The Dilated Convolutional Residual (DCR) Module

The DCR module is designed using a residual method [27], which decomposes the acquisition of multi-scale contextual information into a two-step process, effectively reducing the difficulty of acquisition. DCR is implemented as follows. First, the input feature map is X ∈ ℝ^{C×H×W} (H, W, and C are the height, width, and number of channels, respectively), as shown in Figure 5. The channels are divided into two parts. One part is passed through a Conv_{1,1} to preserve the original features of the feature map:
$$ Y_1 = \mathrm{Conv}_{1,1}(X_{C/2}) \tag{1} $$
where Y_1 is the output feature map, X_{C/2} is one half of the input feature map, and Conv_{i,d} denotes convolution followed by batch normalisation and the SiLU activation function, with i the kernel size and d the dilation rate of the dilated convolution. The other half of the channels undergoes a Conv_{1,1} followed by a Conv_{3,1} operation, with both the input and output channel counts equal to C/2. This generates a series of concise feature maps covering regions of different sizes, providing material for morphological filtering in semantic residualisation.
$$ Y_2 = \mathrm{Conv}_{1,1}(X_{C/2}) \tag{2} $$
$$ Y_3 = \mathrm{Conv}_{3,1}(Y_2) \tag{3} $$
where the Conv_{1,1} is used to flexibly adjust the number of channels, reducing the amount of computation while increasing the expressiveness of the features. This is followed by Conv_{3,d} operations with d = 1, d = 3, and d = 5, respectively.
$$ Y_4 = \mathrm{Conv}_{1,1}\big(\mathrm{Concat}\,[\mathrm{Conv}_{3,1}(Y_3),\ \mathrm{Conv}_{3,3}(Y_3),\ \mathrm{Conv}_{3,5}(Y_3)]\big) \tag{4} $$
These operations process the features in different ways: different dilation rates are applied to different groups, and depth-wise convolution is used for morphological filtering, so that only one desired receptive field is applied to each channel feature and redundant receptive fields are avoided. Based on the receptive field size of this step, a concise region feature map is obtained, which is required for region residual learning to reverse-match the receptive field, so that the learning process is organised and multi-scale contextual information is preserved more effectively. The outputs of these convolutions are concatenated, with the three branches contributing C/2, C/4, and C/4 channels, respectively. A Conv_{1,1} and a channel Concat operation are then applied to combine the result with the preserved original features:
$$ Y = \mathrm{Conv}_{1,1}\big(\mathrm{Concat}\,[Y_4 + Y_2,\ Y_1]\big) \tag{5} $$
After the Conv_{1,1} operation, the features are restored to the original C × H × W size and output.
At the same time, the DCR module adjusts the dilation rate and the capacity of the dilated convolutions according to the network stage, making full use of feature maps with different region sizes while attending to small receptive fields at each stage, and expands the output channel capacity of the first branch to twice that of the other branches. By repeatedly using the concatenation operation, the features extracted by the previous layer can be fully reused, which improves feature utilisation and reduces information loss.
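The following PyTorch sketch assembles Equations (1)-(5) into a single module. The channel budget for the three dilated branches (C/2, C/4, and C/4) follows the text above, but plain dilated convolutions are used here where the paper applies depth-wise morphological filtering, and all module names are our own, so this is an approximate reconstruction rather than the authors' code.

```python
import torch
import torch.nn as nn


class ConvBNSiLU(nn.Module):
    """Conv_{k,d}: convolution + batch normalisation + SiLU (kernel size k, dilation d)."""
    def __init__(self, c_in, c_out, k=1, d=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, padding=(k // 2) * d, dilation=d, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))


class DCR(nn.Module):
    """Dilated Convolutional Residual block following Equations (1)-(5).
    Plain dilated convolutions approximate the depth-wise morphological filtering."""
    def __init__(self, c):
        super().__init__()
        assert c % 4 == 0, "channel count must be divisible by 4"
        half, quarter = c // 2, c // 4
        self.keep = ConvBNSiLU(half, half, k=1)         # Eq. (1): Y1 preserves original features
        self.reduce = ConvBNSiLU(half, half, k=1)       # Eq. (2): Y2
        self.pre = ConvBNSiLU(half, half, k=3, d=1)     # Eq. (3): Y3
        self.d1 = ConvBNSiLU(half, half, k=3, d=1)      # dilation-1 branch (double capacity)
        self.d3 = ConvBNSiLU(half, quarter, k=3, d=3)   # dilation-3 branch
        self.d5 = ConvBNSiLU(half, quarter, k=3, d=5)   # dilation-5 branch
        self.fuse = ConvBNSiLU(c, half, k=1)            # Eq. (4): 1x1 after concatenation
        self.out = ConvBNSiLU(c, c, k=1)                # Eq. (5): restore C channels

    def forward(self, x):
        x1, x2 = torch.chunk(x, 2, dim=1)
        y1 = self.keep(x1)
        y2 = self.reduce(x2)
        y3 = self.pre(y2)
        y4 = self.fuse(torch.cat([self.d1(y3), self.d3(y3), self.d5(y3)], dim=1))
        # residual add with Y2, then concatenate with the preserved branch Y1
        return self.out(torch.cat([y4 + y2, y1], dim=1))


# Shape check: DCR(256)(torch.randn(1, 256, 40, 40)) -> (1, 256, 40, 40)
```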

2.4. The Context Aggregation Module (CONTEXT)

The context aggregation (CONTEXT) module can implement long-range interaction like a transformer [28] while exploiting the inductive bias of local convolution operations to achieve faster convergence. In particular, CONTEXT can perform associative computations directly over all regions of an image, allowing the model to better understand its global semantic information. For regions of an image that are far apart, CONTEXT can effectively establish dependencies between them: different parts of an object may be far apart in space but semantically related, and CONTEXT captures this long-range dependency well, leading to a better understanding of the overall structure and semantics of the image.
Our specific implementation of CONTEXT is shown in Figure 6. The size of the input feature map is X ∈ ℝ^{C×H×W} (where H, W, and C are the height, width, and number of channels, respectively). The input feature map is first flattened into a sequence of N tokens, where N = HW, and fed into the network. The network consists of several building blocks with residual connections, defined as follows:
$$ Y = F(X, W_i) + X \tag{6} $$
where X and Y are the input and output vectors, and W_i denotes the parameters to be learned. F determines how the information in X is aggregated to compute the features at a given location. First, an affinity matrix A is defined, which represents the neighbourhood for contextual aggregation. Equation (6) can be rewritten as
$$ Y = (AV)W_1 + X \tag{7} $$
where V ∈ ℝ^{N×C} is the transformation of X obtained by the linear projection V = XW_2, and W_1 and W_2 are trainable parameters. A_{ij} is the affinity value between X_i and X_j. By multiplying the affinity matrix by V, information is propagated between features according to their affinity values. The modelling power of this context aggregation module can be increased by introducing multiple affinity matrices, giving the network multiple ways to gather contextual information from X. A multi-head version of Equation (7) is
$$ Y = \mathrm{Concat}(A_1 V_1, \ldots, A_M V_M)\,W_2 + X \tag{8} $$
Here, A_m denotes the affinity matrix of each head. Compared with the single-head version, different A_m are likely to capture different relationships in feature space, thereby improving the representational ability of contextual aggregation. Note that when an affinity matrix is used for contextual aggregation, only spatial information is propagated: there is no cross-channel information exchange in the affinity matrix multiplication and no non-linear activation function.
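A minimal sketch of the multi-head aggregation in Equation (8) is given below. The equations above do not specify how the affinity matrices A_m are produced, so this sketch assumes a softmax over scaled dot-products of learned projections, which is one common instantiation; the class and parameter names are our own.

```python
import torch
import torch.nn as nn


class ContextAggregation(nn.Module):
    """Multi-head context aggregation: Y = Concat(A_1 V_1, ..., A_M V_M) W_2 + X (Equation 8).
    The affinity matrices are assumed here to come from scaled dot-product projections."""
    def __init__(self, channels, num_heads=4):
        super().__init__()
        assert channels % num_heads == 0
        self.h = num_heads
        self.d = channels // num_heads
        self.to_q = nn.Linear(channels, channels, bias=False)   # used only to build A_m (assumption)
        self.to_k = nn.Linear(channels, channels, bias=False)
        self.to_v = nn.Linear(channels, channels, bias=False)   # per-head value projections V_m
        self.w2 = nn.Linear(channels, channels, bias=False)     # output projection W_2

    def forward(self, x):
        # x: (B, C, H, W) -> token sequence (B, N, C) with N = H * W
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)
        q = self.to_q(tokens).view(b, -1, self.h, self.d).transpose(1, 2)
        k = self.to_k(tokens).view(b, -1, self.h, self.d).transpose(1, 2)
        v = self.to_v(tokens).view(b, -1, self.h, self.d).transpose(1, 2)
        # Affinity matrices A_m: (B, M, N, N); they mix tokens spatially only,
        # with no non-linearity applied after the A_m V_m products.
        affinity = torch.softmax(q @ k.transpose(-2, -1) / self.d ** 0.5, dim=-1)
        agg = (affinity @ v).transpose(1, 2).reshape(b, -1, c)   # concatenation over heads
        out = self.w2(agg) + tokens                              # residual connection
        return out.transpose(1, 2).reshape(b, c, h, w)
```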

2.5. Loss Function

DCN-YOLO mainly consists of three types of loss: classification loss (L_cls), confidence loss (L_obj), and localisation loss (L_box). L_cls measures the difference between the object category in the prediction box and the real category in the ground-truth box. The Binary Cross-Entropy (BCE) loss function is used to calculate the classification error, which encourages the model to learn to correctly classify each object.
$$ \mathrm{CE}_{cls}(p, y) = \begin{cases} -\log(p), & \text{if } y = 1 \\ -\log(1 - p), & \text{otherwise} \end{cases} \tag{9} $$
$$ p_t = \begin{cases} p, & \text{if } y = 1 \\ 1 - p, & \text{otherwise} \end{cases} \tag{10} $$
$$ \mathrm{CE}_{cls}(p, y) = \mathrm{CE}_{cls}(p_t) = -\log(p_t) \tag{11} $$
Here, p denotes the predicted probability that the sample is positive, and y denotes the label, which takes values in (−1, +1). For a positive sample, the closer the predicted output is to the true label y = 1, the smaller the loss L; the closer the predicted output is to 0, the larger the loss L.
L o b j measures the difference between the confidence in the object in the prediction box and the confidence in the object in the real box. The object confidence indicates whether the object is present in the prediction box. By optimising the loss function of the object confidence, the model can learn how to accurately determine whether the object is present in the prediction box.
$$ \mathrm{CE}_{obj}(p, y) = \begin{cases} -\log(p), & \text{if } y = 1 \\ -\log(1 - p), & \text{otherwise} \end{cases} \tag{12} $$
$$ p_t = \begin{cases} p, & \text{if } y = 1 \\ 1 - p, & \text{otherwise} \end{cases} \tag{13} $$
$$ \mathrm{CE}_{obj}(p, y) = \mathrm{CE}_{obj}(p_t) = -\log(p_t) \tag{14} $$
where y is the positive/negative sample label (1 for a positive sample), and p is the object's IoU score.
L_box measures the difference between the model's predicted bounding box and the ground-truth bounding box, which helps the model locate objects accurately. CIoU [29] builds on the IoU and additionally takes into account the distance between the centre points of the ground-truth and predicted boxes, as well as the diagonal of the minimum enclosing box of the two boxes. When the two boxes do not overlap, the IoU equals 0, whereas CIoU alleviates the problem of backpropagation stalling because the loss is 0. CIoU is defined as follows:
$$ CIoU = IoU - \frac{\rho^2(b, b^{gt})}{c^2} - \alpha v \tag{15} $$
$$ v = \frac{4}{\pi^2} \left( \arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h} \right)^2 \tag{16} $$
$$ \alpha = \frac{v}{(1 - IoU) + v} \tag{17} $$
$$ L_{box} = 1 - CIoU \tag{18} $$
where v measures the consistency of the aspect ratio and α is a positive weighting parameter defined so that the overlap-area factor has higher priority in the regression. b and b^{gt} denote the centre points of the predicted and ground-truth boxes, w^{gt} and h^{gt} denote the ground-truth box's width and height, w and h denote the predicted box's width and height, ρ(·) is the Euclidean distance, and c is the diagonal length of the smallest enclosing box covering both boxes.
$$ L = a \cdot L_{obj} + b \cdot L_{cls} + c \cdot L_{box} \tag{19} $$
where a, b, and c are 0.05, 0.5, and 1.0, respectively. The DCN-YOLO loss function can effectively help the model to optimise parameters and improve the accuracy and robustness of object detection.
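For reference, the sketch below implements the CIoU regression loss of Equations (15)-(18) and the weighted total loss of Equation (19). The (x1, y1, x2, y2) box format, the helper names, and the eps stabiliser are our own choices, not part of the paper.

```python
import math
import torch


def ciou_loss(pred, target, eps=1e-7):
    """CIoU loss (Equations 15-18) for boxes given as (x1, y1, x2, y2) tensors of shape (N, 4)."""
    # intersection and union
    x1 = torch.max(pred[:, 0], target[:, 0])
    y1 = torch.max(pred[:, 1], target[:, 1])
    x2 = torch.min(pred[:, 2], target[:, 2])
    y2 = torch.min(pred[:, 3], target[:, 3])
    inter = (x2 - x1).clamp(0) * (y2 - y1).clamp(0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)
    # squared centre-point distance rho^2 and diagonal c^2 of the smallest enclosing box
    rho2 = ((pred[:, 0] + pred[:, 2] - target[:, 0] - target[:, 2]) / 2) ** 2 + \
           ((pred[:, 1] + pred[:, 3] - target[:, 1] - target[:, 3]) / 2) ** 2
    cw = torch.max(pred[:, 2], target[:, 2]) - torch.min(pred[:, 0], target[:, 0])
    ch = torch.max(pred[:, 3], target[:, 3]) - torch.min(pred[:, 1], target[:, 1])
    c2 = cw ** 2 + ch ** 2 + eps
    # aspect-ratio consistency term v and its weight alpha
    wp, hp = pred[:, 2] - pred[:, 0], pred[:, 3] - pred[:, 1]
    wt, ht = target[:, 2] - target[:, 0], target[:, 3] - target[:, 1]
    v = (4 / math.pi ** 2) * (torch.atan(wt / (ht + eps)) - torch.atan(wp / (hp + eps))) ** 2
    alpha = v / ((1 - iou) + v + eps)
    ciou = iou - rho2 / c2 - alpha * v
    return 1.0 - ciou            # L_box, Equation (18)


def total_loss(l_obj, l_cls, l_box, a=0.05, b=0.5, c=1.0):
    """Weighted sum of Equation (19) with the weights stated in the text."""
    return a * l_obj + b * l_cls + c * l_box
```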

3. Experiments

3.1. Experimental Dataset Description

3.1.1. Tiny Object Detection in Aerial Images (AI-TOD) Dataset

The Tiny Object Detection in Aerial Images (AI-TOD) dataset [30] is specifically designed for detecting very small objects in aerial images. It contains 28,036 aerial images with 700,621 object instances across eight categories. Compared with existing aerial object detection datasets, the average object size in AI-TOD is only about 12.8 pixels, far smaller than in other datasets, which poses a challenge to existing object detection algorithms, as shown in Figure 7. All objects are accurately labelled with bounding boxes, which is very important for model training and helps the model learn the position and shape of each object. The images come from different geographical environments, which improves the generalisation of trained models and enables them to adapt to real-world scenarios such as geographical information analysis, environmental monitoring, urban planning, and agricultural management. AI-TOD therefore provides an important resource and benchmark for research and applications in aerial tiny-object detection and is of great importance for promoting the development of related technologies.

3.1.2. Unicorn Small Object (USOD) Dataset

The USOD was created using the visible light data from UNICORN 2008 by filtering [31], segmenting, and manually adding annotations for small vehicle objects. The USOD contains a total of 3000 images and 433,788 vehicle instances. The ratio of training set to test set is 7:3. As shown in Figure 8, the proportion of objects smaller than 16 × 16 is 96.3%, and the proportion of objects smaller than 32 × 32 is 99.9%. The proportion of small- and medium-sized objects in the USOD (99.9%) is higher than in other small object datasets. The training set has a uniform distribution of small objects. The USOD dataset contains many examples of vehicles in low light and shaded conditions, which allows the model’s performance in detecting small objects to be better verified. The USOD was used to verify the robustness of the model, taking into account image degradation factors such as blur, Gaussian noise, stripe noise, and fog.

3.2. Experimental Configuration and Parameter Settings

The experiments were run on an RTX 3090 GPU under Ubuntu 18.04, with Python 3.9, PyTorch 1.13, and CUDA 11.7. The initial learning rate was set to 0.01 and the minimum learning rate to 0.001. The SGD optimiser was used to update the network parameters, with a batch size of 50 and 300 training epochs.
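The stated settings can be summarised in a small configuration dictionary, shown below purely for reproducibility bookkeeping; the key names are ours and do not correspond to any released configuration file.

```python
# Training hyperparameters as stated above (key names are illustrative).
train_cfg = {
    "optimizer": "SGD",
    "lr0": 0.01,         # initial learning rate
    "lr_min": 0.001,     # minimum learning rate
    "batch_size": 50,
    "epochs": 300,
    "device": "cuda:0",  # RTX 3090 in the reported setup
}
```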

3.3. Experimental Analysis

3.3.1. Ablation Study

To verify the effectiveness of each key module in the DCN-YOLO model, we independently embedded the DCR and CONTEXT modules in YOLOv5 and evaluated their impact on model performance. The experimental results are presented in Table 1.
First, when the DCR module was embedded in YOLOv5, the AP value increased from 17.2 to 23.0, and the AP50 value increased from 46.0 to 49.3, indicating that the module had significant advantages in feature extraction. DCR uses multi-scale depth-wise dilated convolutions, which can effectively increase the receptive field of the model, making it more adaptable to changes in object size and thus extract richer object features. This enhanced feature expression capability further improves detection accuracy, particularly the ability to detect small objects in complex scenes.
Secondly, when the CONTEXT module was introduced alone, the AP increased from 17.2 to 22.9 and the AP50 from 46.0 to 50.0 compared with the original YOLOv5s. This result shows that the CONTEXT module enables the model to better understand the global semantic information of an image by computing global feature associations, especially when modelling the dependency between distant regions. This global perceptual capability further enhances the model's discriminative ability and reduces missed and false detections of small objects caused by background clutter or insufficient local features.
When the DCR module was used in conjunction with the CONTEXT module, the performance of the model was further improved, with the AP50 value increasing significantly to 56.6 and the AP value increasing to 23.9. This result shows that the DCR module and the CONTEXT module complement each other well in feature extraction and global information modelling. The DCR module improves the ability to extract local features, allowing the model to capture more fine-grained object information, while the CONTEXT module improves the efficiency of using global information, making the detection results more robust. The combination of the two effectively improves the model’s ability to detect small objects, further reducing the rate of misses and false positives, and showing significant advantages in the AP50 metric.
Overall, the experimental results show that the introduction of the DCR and CONTEXT modules plays a key role in improving the overall performance of DCN-YOLO. Their synergy significantly enhances the model's performance in the small-object detection task, providing an effective solution for high-precision object detection.

3.3.2. Comparative Analysis by Categories

As shown in Table 2, the performance advantages and potential shortcomings of DCN-YOLO were analysed in detail by comparing the performance of different algorithms in each category. DCN-YOLO achieved high detection accuracy in several categories, reaching an accuracy of 27.2 in the aircraft category, higher than YOLOv6’s 11.7 and YOLOR’s 11.7; an accuracy of 40.9 in the storage tank category, higher than YOLOv7’s 36.7 and YOLOR’s 29.0; and an accuracy of 41.9 in the ship category, higher than YOLOv6’s 38.7 and YOLOv7’s 33.7. This demonstrates the effectiveness and generalisability of DCN-YOLO in detecting small objects of different types, and its ability to adapt to the characteristics of small objects in different categories to accurately identify and locate them. This shows that DCN-YOLO can effectively deal with the challenges of small object size, blurred features and complex backgrounds when processing small-object detection tasks.
However, in some categories, such as the Bridge category, DCN-YOLO’s accuracy was comparable to YOLOv7’s (22.9 vs. 22.2); in the Windmill category, DCN-YOLO’s accuracy of 3.9 was relatively low compared to some other algorithms. This may be related to the specific characteristics of the objects in these categories and the distribution and characteristics of the samples in that category in the dataset.

3.3.3. DCN-YOLO Analysis on AI-TOD

Experimental results on the AI-TOD dataset showed that DCN-YOLO achieved optimal performance in the small-object detection task, as shown in Table 3 [32,33,34,35,36,37,38,39,40,41]. It outperformed other object detection algorithms in the three core metrics, AP reaching 23.9, AP50 reaching 56.6, and AP75 reaching 15.4. DCN-YOLO’s inference performance is shown in Figure 9, demonstrating excellent detection capabilities.
DCN-YOLO achieved an AP improvement of more than 10 percentage points compared to traditional two-stage detectors (Faster R-CNN, Cascade R-CNN, and ATSS). This shows that DCN-YOLO can more effectively capture small object features and reduce missed detection by optimising feature extraction and multi-scale information fusion. At the same time, DCN-YOLO achieved an AP50 of 56.6, significantly higher than Cascade R-CNN’s 30.8 and Faster R-CNN’s 26.3, indicating its superior performance in terms of high recall detection.
Compared to other YOLO-series algorithms (YOLOv5s and YOLOv8s), DCN-YOLO improved the AP metric by 6.7 and 12.3 percentage points, respectively. DCN-YOLO also achieved higher accuracy than the more recent YOLOv9 through YOLOv11 [42], indicating that it is deeply optimised for small-object detection while maintaining the high efficiency of the YOLO structure.
DCN-YOLO still showed a clear advantage over the transformer-based DETR series (DETR, Deformable-DETR and DAB-DETR). DETR performed poorly on small-object detection tasks due to the limitations of its fixed position coding, with an AP of only 2.7, while even the improved Deformable-DETR only achieved an AP of 17.0, which was still 6.9 percentage points lower than that of DCN-YOLO. This shows that DCN-YOLO avoids the inherent defects of the transformer structure in detecting small objects while maintaining high detection accuracy and has better generalisation ability and stability.
DCN-YOLO even outperformed the specially optimised small-object detection network FSANet, demonstrating DCN-YOLO’s excellent global perception capabilities.
DCN-YOLO achieved 6.4 on AP_vt, lower than Faster R-CNN's 11.3, Cascade R-CNN's 9.9, and the highest value in the DETR series, Deformable-DETR's 7.2, but higher than QueryDet's 2.4 and FSANet's 6.3. This shows that DCN-YOLO's ability to detect very small objects is below that of the two-stage detectors, which we attribute to its limited feature extraction for extremely small targets buried in noise. For the three metrics AP_t, AP_s, and AP_m, DCN-YOLO achieved 22.1, 36.9, and 46.2, respectively, owing to its strong feature extraction within a global receptive field for targets with more pixels. DCN-YOLO also has the advantages of high computational efficiency and a lightweight model while maintaining high detection accuracy, making it more competitive in practical applications.

3.3.4. DCN-YOLO Analysis on USOD

Experimental results on the USOD dataset showed that DCN-YOLO achieved the best detection performance with extremely high parameter efficiency; the inference performance is shown in Figure 10. Compared to other mainstream object detection algorithms, DCN-YOLO performed well on key metrics such as precision (91.2), AP (48.1), and AP50 (88.5), demonstrating excellent detection capabilities and lightweight advantages, as shown in Table 4.
Compared with DSSD and RefineDet, DCN-YOLO’s precision was significantly better. At the same time, in terms of AP50, DCN-YOLO reached 88.5, which was 35.4 higher than DSSD, indicating that it had stronger detection capabilities on small objects and in complex background scenes.
Compared to YOLOv3, YOLOv4, YOLOv5m, and YOLOv8m, DCN-YOLO still maintained a leading advantage in the three core indicators: precision, AP50, and AP. Its precision was 0.2 higher than that of TPH-YOLOv5 and better than that of YOLOv8m and YOLOv5m. Its AP50 was slightly lower than that of TPH-YOLOv5, while its AP reached 48.1, far exceeding all other methods, such as YOLOv8m's 32.4 and YOLOv5m's 32.3, an improvement of more than 15 percentage points.
In addition, DCN-YOLO had only 7.6 M parameters, making it the lightest model of all the methods. In contrast, YOLOv5m (20.9 M) and YOLOv8m (29.7 M) were 2.7 and 3.9 times larger than DCN-YOLO, respectively. TPH-YOLOv5 had almost six times the parameters of DCN-YOLO, but its AP was still lower than that of DCN-YOLO (32.1 vs. 48.1).
Experimental results of DCN-YOLO on the USOD dataset show that it had the best balance between accuracy, detection performance, and model lightness. DCN-YOLO not only had the best AP and accuracy but also achieved performance far beyond other methods with a very low number of parameters (7.6 M), demonstrating extremely high computational efficiency and hardware adaptability. As a result, DCN-YOLO is currently the most advantageous object detection model for USOD tasks and is particularly suitable for application scenarios where accuracy and computational resources are tightly constrained.

4. Conclusions

In this study, we proposed the DCN-YOLO algorithm for the challenging task of small-object detection in remote sensing images. Through an innovative network structure design, especially the introduction of the dilated convolutional residual feature extraction module (DCR) and the context aggregation module (CONTEXT), DCN-YOLO achieved excellent performance on the AI-TOD dataset. However, this research has some limitations. There is room for improvement in the detection accuracy of some specific object categories, which may require further optimisation of the model structure or category-specific training strategies. Future work could explore how to better integrate multimodal information to further improve the model's ability to detect small objects in complex backgrounds. The model should also be applied to more practical scenarios and verified on larger datasets to further confirm its effectiveness and generalisability. In addition, computational efficiency and real-time performance are important considerations in practical applications; in future work, we will continue to explore more lightweight detectors and deploy them on platforms with limited computing resources to verify their performance in the real world.

Author Contributions

Methodology, M.X.; writing, Q.T. and Y.T.; funding acquisition, X.F., H.S. and W.H. All authors have read and agreed to the published version of the manuscript.

Funding

The APC was funded by Xi’an Institute of Optics and Precision Mechanics of CAS.

Institutional Review Board Statement

The study did not require ethical approval.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data and code can be obtained from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  2. Redmon, J.; Farhadi, A. YOLO9000: Better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7263–7271. [Google Scholar]
  3. Redmon, J.; Farhadi, A. YOLOv3: An incremental improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
  4. Bochkovskiy, A.; Wang, C.-Y.; Liao, H.-Y.M. Yolov4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934. [Google Scholar]
  5. Mahendrakar, T.; White, R.T.; Wilde, M.; Kish, B.; Silver, I. Real-time satellite component recognition with YOLO-V5. In Proceedings of the Small Satellite Conference, Virtual, 26–27 April 2021. [Google Scholar]
  6. Li, C.; Li, L.; Jiang, H.; Weng, K.; Geng, Y.; Li, L.; Ke, Z.; Li, Q.; Cheng, M.; Nie, W.; et al. YOLOv6: A single-stage object detection framework for industrial applications. arXiv 2022, arXiv:2209.02976. [Google Scholar]
  7. Wang, C.-Y.; Bochkovskiy, A.; Liao, H.-Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 7464–7475. [Google Scholar]
  8. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.; Berg, A.C. Ssd: Single shot multibox detector. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part I 14. Springer International Publishing: Berlin/Heidelberg, Germany, 2016; pp. 21–37. [Google Scholar]
  9. Jeong, J.; Park, H.; Kwak, N. Enhancement of SSD by concatenating feature maps for object detection. arXiv 2017, arXiv:1705.09587. [Google Scholar]
  10. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1137–1149. [Google Scholar] [CrossRef] [PubMed]
  11. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969. [Google Scholar]
  12. Cheng, G.; Han, J. A survey on object detection in optical remote sensing images. ISPRS J. Photogramm. Remote Sens. 2016, 117, 11–28. [Google Scholar] [CrossRef]
  13. Wang, Y.; Bashir, S.M.A.; Khan, M.; Ullah, Q.; Wang, R.; Song, Y.; Guo, Z.; Niu, Y. Remote sensing image super-resolution and object detection: Benchmark and state of the art. Expert Syst. Appl. 2022, 197, 116793. [Google Scholar]
  14. Wang, L.; Yin, S.; Alyami, H.; Laghari, A.A.; Rashid, M.; Almotiri, J.; Alyamani, H.J.; Alturise, F. A novel deep learning-based single shot multibox detector model for object detection in optical remote sensing images. Geosci. Data J. 2024, 11, 237–251. [Google Scholar]
  15. Sharifuzzaman Sagar, A.S.M.; Chen, Y.; Xie, Y.; Kim, H.S. MSA R-CNN: A comprehensive approach to remote sensing object detection and scene understanding. Expert Syst. Appl. 2024, 241, 122788. [Google Scholar]
  16. Li, Y.; Hou, Q.; Zheng, Z.; Cheng, M.; Yang, J.; Li, X. Large selective kernel network for remote sensing object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–6 October 2023; pp. 16794–16805. [Google Scholar]
  17. Gao, T.; Niu, Q.; Zhang, J.; Chen, T.; Mei, S.; Jubair, A. Global to local: A scale-aware network for remote sensing object detection. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5615614. [Google Scholar] [CrossRef]
  18. Yu, D.; Ji, S. A new spatial-oriented object detection framework for remote sensing images. IEEE Trans. Geosci. Remote Sens. 2021, 60, 4407416. [Google Scholar] [CrossRef]
  19. Lu, X.; Ji, J.; Xing, Z.; Miao, Q. Attention and feature fusion SSD for remote sensing object detection. IEEE Trans. Instrum. Meas. 2021, 70, 5501309. [Google Scholar] [CrossRef]
  20. Ma, W.; Li, N.; Zhu, H.; Jiao, L.; Tang, X.; Guo, Y.; Hou, B. Feature split–merge–enhancement network for remote sensing object detection. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5616217. [Google Scholar] [CrossRef]
  21. Liu, Y.; Li, Q.; Yuan, Y.; Du, Q.; Wang, Q. ABNet: Adaptive balanced network for multiscale object detection in remote sensing imagery. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5614914. [Google Scholar] [CrossRef]
  22. Liu, H.-I.; Tseng, Y.; Chang, K.; Wang, P.; Shuai, H.; Cheng, W. A denoising fpn with transformer r-cnn for tiny object detection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 4704415. [Google Scholar] [CrossRef]
  23. Wang, J.; Xu, C.; Yang, W.; Yu, L. A normalized Gaussian Wasserstein distance for tiny object detection. arXiv 2021, arXiv:2110.13389. [Google Scholar]
  24. Xu, C.; Wang, J.; Yang, W.; Yu, H.; Yu, L.; Xia, G. RFLA: Gaussian receptive field based label assignment for tiny object detection. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer Nature: Cham, Switzerland, 2022; pp. 526–543. [Google Scholar]
  25. Shi, Z.; Hu, J.; Ren, J.; Ye, H.; Yuan, X.; Ouyang, Y.; He, J.; Ji, B.; Guo, J. HS-FPN: High Frequency and Spatial Perception FPN for Tiny Object Detection. arXiv 2024, arXiv:2412.10116. [Google Scholar]
  26. Shi, S.; Fang, Q.; Xu, X.; Zhao, T. Similarity distance-based label assignment for tiny object detection. In Proceedings of the 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Abu Dhabi, United Arab Emirates, 14–18 October 2024; pp. 13711–13718. [Google Scholar]
  27. Wei, H.; Liu, X.; Xu, S.; Dai, Z.; Dai, Y.; Xu, X. DWRSeg: Rethinking efficient acquisition of multi-scale contextual information for real-time semantic segmentation. arXiv 2022, arXiv:2212.01173. [Google Scholar]
  28. Lu, J.; Mottaghi, R.; Kembhavi, A. Container: Context aggregation networks. Adv. Neural Inf. Process. Syst. 2021, 34, 19160–19171. [Google Scholar]
  29. Zheng, Z.; Wang, P.; Liu, W.; Li, J.; Ye, R.; Ren, D. Distance-IoU loss: Faster and better learning for bounding box regression. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 12993–13000. [Google Scholar]
  30. Wang, J.; Yang, W.; Guo, H.; Zhang, R.; Xia, G. Tiny object detection in aerial images. In Proceedings of the 2020 25th International Conference on Pattern Recognition (ICPR), Milan, Italy, 10–15 January 2021; pp. 3791–3798. [Google Scholar]
  31. Zhang, Y.; Ye, M.; Zhu, G.; Liu, Y.; Guo, P.; Yan, J. FFCA-YOLO for small object detection in remote sensing images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5611215. [Google Scholar] [CrossRef]
  32. Sunkara, R.; Luo, T. No more strided convolutions or pooling: A new CNN building block for low-resolution images and small objects. In Proceedings of the Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Grenoble, France, 19–23 September 2022; Springer Nature: Cham, Switzerland, 2022; pp. 443–459. [Google Scholar]
  33. Li, Y.; Chen, Y.; Wang, N.; Zhang, Z. Scale-aware trident networks for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6054–6063. [Google Scholar]
  34. Tian, Z.; Chu, X.; Wang, X.; Wei, X.; Shen, C. Fully convolutional one-stage 3d object detection on lidar range images. Adv. Neural Inf. Process. Syst. 2022, 35, 34899–34911. [Google Scholar]
  35. Zhou, X.; Wang, D.; Krähenbühl, P. Objects as points. arXiv 2019, arXiv:1904.07850. [Google Scholar]
  36. Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
  37. Zhang, S.; Chi, C.; Yao, Y.; Lei, Z.; Li, S.Z. Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 9759–9768. [Google Scholar]
  38. Rezatofighi, H.; Tsoi, N.; Gwak, J.; Sadeghian, A.; Reid, I.; Savarese, S. Generalized intersection over union: A metric and a loss for bounding box regression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 658–666. [Google Scholar]
  39. Cai, Z.; Vasconcelos, N. Cascade r-cnn: Delving into high quality object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6154–6162. [Google Scholar]
  40. Qiao, S.; Chen, L.; Yuille, A. Detectors: Detecting objects with recursive feature pyramid and switchable atrous convolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 10213–10224. [Google Scholar]
  41. Wang, C.-Y.; Yeh, I.-H.; Liao, H.-Y.M. You only learn one representation: Unified network for multiple tasks. arXiv 2021, arXiv:2105.04206. [Google Scholar]
  42. Duan, X.; Li, Z.; Zhang, J.; Li, S.; Lei, J.; Zhan, L. Weighted Learnable Recursive Aggregation Network for Visible Remote Sensing Image Detection. IEEE Trans. Geosci. Remote Sens. 2025; early access. [Google Scholar] [CrossRef]
Figure 1. The YOLOv5 network architecture consists of three components: backbone, head, and detect.
Figure 2. The C3 module consists mainly of bottleneck and K × K Conv layers, where K is the convolution kernel size.
Figure 3. SPPF module flowchart.
Figure 4. The network structure of DCN-YOLO. The DCR focuses on extracting high-level features from the network, while the CONTEXT module provides excellent long-range dependency capture, which significantly enhances the model’s ability to handle small objects.
Figure 5. Schematic showing the DCR network structure.
Figure 6. Schematic showing the CONTEXT network structure.
Figure 7. (a) Distribution of the number of categories in the AI-TOD dataset. (b) Distribution of object sizes in AI-TOD.
Figure 8. Distribution of object sizes in USOD.
Figure 9. Performance of the DCN-YOLO inference on the AI-TOD dataset.
Figure 10. Performance of the DCN-YOLO inference on the USOD dataset.
Table 1. Ablation study on AI-TOD testset-dev.

| Method | AP | AP50 |
|---|---|---|
| YOLOv5s | 17.2 | 46.0 |
| v5+CONTEXT | 22.9 | 50.0 |
| v5+DCR | 23.0 | 49.3 |
| DCR+CONTEXT | 23.9 | 56.6 |
Table 2. AP of different algorithms on the single categories of the AI-TOD dataset.

| Method | Airplane | Bridge | Storage-Tank | Ship | Swimming-Pool | Vehicle | Person | Windmill |
|---|---|---|---|---|---|---|---|---|
| YOLOv6 | 11.7 | 22 | 41.8 | 38.7 | 20 | 25.3 | 10.2 | 10.9 |
| YOLOv7 | 29.9 | 22.2 | 36.7 | 33.7 | 23.3 | 24.8 | 10.5 | 5.1 |
| YOLOR | 11.7 | 8.2 | 29 | 26.3 | 2.9 | 16.9 | 5.6 | 5.7 |
| FCOS | 0.9 | 17.4 | 29.9 | 43.5 | 4.7 | 24.5 | 4.5 | 1.1 |
| Faster R-CNN | 8.9 | 12.2 | 37.3 | 25.0 | 17.1 | 24.9 | 6.3 | 4.3 |
| TridentNet | 9.67 | 0.77 | 12.3 | 17.1 | 3.2 | 11.9 | 3.9 | 0.94 |
| M-CenterNet | 18.6 | 10.6 | 27.5 | 22.2 | 7.5 | 18.6 | 9.2 | 2.0 |
| DCN-YOLO | 27.2 | 22.9 | 40.9 | 41.9 | 25.0 | 26.4 | 11.2 | 3.9 |
Table 3. Test accuracy of different detectors on the AI-TOD dataset.

| Method | AP | AP50 | AP75 | AP_vt | AP_t | AP_s | AP_m |
|---|---|---|---|---|---|---|---|
| YOLOv9 | 17.9 | 40.7 | | | | | |
| YOLOv10 | 17.8 | 39.5 | | | | | |
| YOLOv11 | 17.9 | 41.4 | | | | | |
| FoveaBox | 8.1 | 19.8 | 5.1 | 0.9 | 5.8 | 12.6 | 15.9 |
| DoubleHead R-CNN | 10.1 | 24.3 | 6.7 | 0.0 | 7.0 | 20.0 | 30.2 |
| Faster R-CNN | 11.1 | 26.3 | 7.6 | 11.3 | 21.0 | 21.6 | 26.6 |
| YOLOv8s | 11.6 | 27.4 | 7.7 | | | | |
| QueryDet | 12.2 | 29.3 | 7.9 | 2.4 | 10.5 | 18.5 | 26.3 |
| ATSS | 12.8 | 30.6 | 8.5 | 4.0 | 14.5 | 21.5 | 31.9 |
| Cascade R-CNN | 13.8 | 30.8 | 10.5 | 9.9 | 21.3 | 24.1 | 30.3 |
| DETR | 2.7 | 10.3 | 0.7 | 0.7 | 2.1 | 3.0 | 12.4 |
| Conditional-DETR | 2.9 | 10.0 | 7.0 | 0.9 | 2.2 | 3.0 | 14.2 |
| DAB-DETR | 4.9 | 16.0 | 1.7 | 1.7 | 3.6 | 7.0 | 18.0 |
| Deformable-DETR | 17.0 | 45.9 | 8.8 | 7.2 | 17.1 | 22.7 | 28.2 |
| DAB-Deformable-DETR | 16.5 | 42.6 | 9.9 | 7.9 | 15.2 | 23.8 | 31.9 |
| FSANet | 20.3 | 48.1 | 14.0 | 6.3 | 19.0 | 26.8 | 36.7 |
| YOLOv5s | 17.2 | 46.0 | | | | | |
| DCN-YOLO | 23.9 | 56.6 | 15.4 | 6.4 | 22.1 | 36.9 | 46.2 |
Table 4. Performance of different detectors on USOD.

| Method | Precision | AP50 | AP | Parameters |
|---|---|---|---|---|
| DSSD | 64.5 | 53.1 | - | - |
| RefineDet | 88.1 | 85.1 | - | - |
| YOLOv3 | 71.2 | 57.5 | - | 60 M |
| YOLOv4 | 79.3 | 77.8 | - | 64 M |
| YOLOv5m | 89.2 | 87.3 | 32.3 | 20.9 M |
| YOLOv8m | 90.5 | 87.6 | 32.4 | 29.7 M |
| TPH-YOLOv5 | 91.0 | 89.5 | 32.1 | 45.4 M |
| DCN-YOLO | 91.2 | 88.5 | 48.1 | 7.6 M |
