YOLO-DSNet for Small Target Detection

Xu, Haokun; He, Huangleshuai; Zhi, Qike; Yang, Zhengyi; Han, Bocheng

doi:10.3390/app16031493


by Haokun Xu ¹, Huangleshuai He ², Qike Zhi ³, Zhengyi Yang ² and Bocheng Han ¹,²,*

¹ Vecton AI, Sydney, NSW 2000, Australia

² School of Computer Science and Engineering, University of New South Wales (UNSW), Sydney, NSW 2052, Australia

³ Computer Network Information Center, Chinese Academy of Sciences, Beijing 100190, China

* Author to whom correspondence should be addressed.

Appl. Sci. 2026, 16(3), 1493; https://doi.org/10.3390/app16031493

Submission received: 15 January 2026 / Revised: 26 January 2026 / Accepted: 30 January 2026 / Published: 2 February 2026

(This article belongs to the Section Computing and Artificial Intelligence)


Abstract

Small target detection in Unmanned Aerial Vehicle (UAV) applications is often plagued by inherent challenges such as small object sizes, sparse information, and complex background interference. Traditional detection algorithms and existing YOLO series models suffer from limitations in detection accuracy and fine-grained detail preservation. To address this, this paper proposes YOLO-DSNet, a small target detection network based on YOLOv13n. First, we introduce the dual-stream attention module (DSAM), which enhances discriminative features by leveraging bidirectional context modeling. Second, we design the Multi-scale Attention C2f (MSA-C2f) module, an adaptive architecture that optimizes feature extraction via multi-scale enhancement, effectively preserving and integrating small target information. Finally, through dataset augmentation, we significantly improve the model’s detection performance. The proposed YOLO-DSNet improves mAP@0.5 from 30.8% to 40.1% on the VisDrone2019 dataset with only 0.8 million additional parameters, a gain of 9.3 percentage points (about 30% relative to the baseline), while increasing computational overhead by merely 11.6 Gigaflops (GFLOPs). Experiments demonstrate YOLO-DSNet’s effectiveness in small target detection tasks such as UAV aerial photography and remote sensing imagery, successfully balancing accuracy and efficiency with high practical value.

Keywords:

small target detection; UAV imagery; attention mechanism; multi-scale feature fusion; YOLOv13; deep learning

1. Introduction

Target detection stands as a cornerstone task in computer vision, extensively applied across intelligent surveillance, autonomous driving, drone aerial photography, and remote sensing analysis. However, under complex backgrounds and distant viewing conditions, small targets—characterized by low pixel density, sparse texture details, and susceptibility to scale variations and occlusion—often suffer from feature degradation caused by convolutional kernel smoothing. This results in inadequate feature representation, posing significant challenges to detection models’ localization and classification accuracy. While mainstream frameworks like Faster R-CNN [1], RetinaNet [2], and the YOLO series [3] excel at medium-to-large targets, they frequently exhibit high false-negative rates and feature representation deficiencies in small target detection.

In recent years, researchers have proposed various improved approaches for small object detection. Regarding multi-scale feature fusion, Lin et al. [4] introduced Feature Pyramid Networks (FPNs) that enhance multi-level semantic information integration through top-down feature propagation, improving small object detection capabilities. However, its simple linear fusion method tends to drown out small object features in higher semantic layers. To address this, Liu et al. [5] implemented bottom-up path enhancement in PANet to improve bidirectional information flow, though it still loses the fine-grained structural information of small objects. In attention mechanisms, Hu et al. [6] proposed the Squeeze-and-Excitation (SE) module to boost feature representation through channel attention, while Woo et al. [7] optimized feature selection by combining spatial and channel attention in CBAM. Wang et al. [8] achieved lightweight improvements through efficient convolutional adaptive modeling of channel dependencies in ECA-Net. However, these methods primarily focus on global salient regions, with limited capture of local fine-grained features in small objects. In lightweight network architectures, models like Howard et al. [9]’s MobileNet and Bochkovskiy et al. [10]’s YOLOv4-tiny achieve superior detection performance while maintaining high inference speeds. Jocher [11]’s YOLOv8n further enhances real-time performance through an improved backbone network and decoupled head architecture. However, these methods typically sacrifice accuracy for speed, resulting in suboptimal small object detection in complex scenarios. Regarding data augmentation, Bochkovskiy, Wang and Liao [10] introduced Mosaic and MixUp to diversify samples, effectively improving model generalization. However, these approaches fail to address the imbalance in small object dimensions and density distributions, leading to limited improvements in detection performance.

To address the aforementioned challenges, this study proposes the YOLO-DSNet model based on YOLOv13n [12], focusing on three key improvements for detection performance: feature enhancement, attention guidance, and data optimization. First, we introduce a dual-stream attention fusion module (DSAM) that combines prediction-guided foreground–background separation with Atrous Spatial Pyramid Pooling (ASPP) [13] to improve the discrimination between small object and background features. Second, we develop a Multi-scale Attention module (MSA-C2f) that integrates cross-scale attention channels and spatial interactions within a lightweight architecture, enhancing small object information extraction and feature fusion. Finally, we implement a target density-guided data augmentation strategy to significantly increase the quantity and diversity of small object samples while maintaining sample balance, thereby improving the model’s robustness in complex scenarios. Experiments validate the feasibility of these improvements to the YOLOv13n [12] model.

The main contributions of this paper are as follows:

We propose a lightweight object detection model, YOLO-DSNet, which integrates a dual-stream attention mechanism and multi-scale feature enhancement module to achieve efficient feature learning;
We develop multi-stage data augmentation strategies for small-scale targets to optimize training sample distribution and improve model generalization performance;
We perform systematic experiments on the VisDrone2019 dataset [14], demonstrating that YOLO-DSNet maintains real-time performance while achieving a 9.3 percentage point improvement in mAP@0.5 over the baseline YOLOv13n, outperforming multiple mainstream lightweight models.
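The two ways of quoting the improvement (9.3 percentage points here, roughly 30% in the abstract) can be reconciled with a short calculation. The sketch below uses only the two mAP@0.5 values reported in the paper; the variable names are illustrative.

```python
# mAP@0.5 values (in percent) as reported in the paper.
baseline_map50 = 30.8   # YOLOv13n
proposed_map50 = 40.1   # YOLO-DSNet

# Absolute gain, measured in percentage points.
absolute_gain = proposed_map50 - baseline_map50

# Relative gain, measured as a percentage of the baseline value.
relative_gain = absolute_gain / baseline_map50 * 100.0

print(f"absolute gain: {absolute_gain:.1f} percentage points")
print(f"relative gain: {relative_gain:.1f} %")
```

The absolute figure is the one comparable across papers; the relative figure depends on how low the baseline is.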

2. Related Work

2.1. The Development and Evolution of Classical Object Detection Methods

The mainstream methods for object detection are categorized into single-stage detectors and two-stage detectors. The two-stage detectors, exemplified by the R-CNN series, first generate candidate regions before performing classification and regression. While demonstrating superior accuracy, they exhibit slower inference speeds, making them unsuitable for real-time applications.

Single-stage methods (e.g., SSD, YOLO series [11]) have become the mainstream approach in recent years by delivering both location and category predictions through end-to-end prediction, balancing detection accuracy and real-time performance. The YOLO series has evolved from the multi-scale prediction mechanism in YOLOv3 [3] to lightweight optimizations in YOLOv8 [11], continuously improving the balance between speed and accuracy. However, these methods still face challenges in small object detection, including low feature resolution, significant information loss, and difficulties in distinguishing foreground from background.

2.2. Limitations of Existing Small Target Detection Methods

Small object detection faces two core challenges: multi-scale feature loss and severe background interference [15]. As drones typically capture images from overhead or oblique angles, this unique imaging perspective causes small objects to cluster densely with less prominent features. When downsampling data, the loss of small object information in deep networks leads to feature degradation. Moreover, uneven feature distribution makes it difficult for detection heads to accurately distinguish target regions.

To address these challenges, several approaches are commonly employed: 1. Multi-scale feature fusion. This technique, widely used in small object detection, enhances model sensitivity by extracting features across different scales. Notable implementations include the FPN [4], specifically designed to handle significant scale variations. 2. Small object feature enhancement. Given their inherent ambiguity and weak visual saliency, small objects pose detection difficulties. Attention mechanisms are introduced to prioritize critical spatial regions, while channel-specific weight learning highlights key features in small object areas. 3. High-resolution network (HRNet) [16]. HRNet maintains high-resolution feature maps [16] and employs multi-scale fusion strategies to process features across resolutions, enabling fine-grained feature extraction at multiple scales. High-resolution feature maps are crucial for small target detection, as they better preserve the local features of small targets, thereby improving detection accuracy.

To address the challenge of small target detection, existing solutions face dual limitations. First, the DS-C3k2 module in YOLOv13 [12] lacks multi-scale branches and has a limited receptive field, failing to capture the contextual information required for small targets. Its lightweight design sacrifices small target representation capacity, which is crucial for high-resolution and context-sensitive detection. Second, conventional attention modules (e.g., SE, CBAM, ECA) focus on global saliency regions, unable to distinguish small targets from complex backgrounds. To resolve this, we propose a dual-stream attention mechanism combined with ASPP. Regarding datasets, traditional image augmentation methods like Mosaic and MixUp [10] prioritize sample diversity without addressing drone-view-specific challenges. Given image sensitivity to weather and camera angles, we employ six distinct augmentation techniques to enhance dataset quality. Experimental results demonstrate that this customized combination significantly improves model performance.

In addition, the existing methods have low accuracy and poor robustness in small target detection. YOLOv1 [3] pioneered the simultaneous execution of bounding box regression and category prediction, establishing a new paradigm for real-time object detection. YOLOv2 and YOLOv3 [3] enhanced model accuracy and robustness through innovations like batch normalization [17], anchor boxes, and multi-scale prediction. YOLOv4 [10] and YOLOv5 [11] further improved performance in complex scenarios by adopting novel architectures such as CSPDarknet and PANet [5]. While some advanced algorithms excel on specific datasets, they often lack robustness in challenging environments with dramatic lighting changes, dark backgrounds, or severe object occlusion. Yang, F. et al. [18] attempt to boost speed through lightweight designs, but these compromises typically sacrifice feature representation capabilities, resulting in reduced accuracy for small object detection [19].

2.3. Baseline YOLOv13

YOLOv13 [12] integrates multiple cutting-edge deep learning techniques, including depthwise separable convolution (DSConv) and anchor box optimization strategies, to enhance small object detection accuracy while preserving YOLO’s real-time processing capabilities. Its core innovation lies in introducing a hypergraph-enhanced adaptive visual perception mechanism, which overcomes the limitations of traditional local correlation modeling and achieves significant improvements in the balance between accuracy and efficiency. For lightweight design, YOLOv13n adopts the DS-C3k2 module, which is the key to its lightweight implementation: it replaces traditional large-kernel convolutions with depthwise separable convolutions.

3. Methodology

This section will detail several improvement methods we propose. At the data level, we design a multi-dimensional collaborative enhancement strategy prioritizing target detection objectives, generating high-quality augmented datasets (Enhanced_Data) to provide sufficient and high-quality small object samples for model training. At the network architecture level, we propose the MSA-C2f module to replace traditional modules in the original backbone network, combined with the small object enhancer SmallTargetEnhancer (STE) to strengthen small object feature extraction capabilities. At the feature representation level, we introduce the DSAM to refine feature enhancement for the detector input, improving object–background discrimination.

Through a comprehensive design process covering data augmentation, structural optimization, and feature refinement, we systematically address issues such as feature ambiguity and background interference in small object detection. The complete architecture is shown in Figure 1, where the red sections indicate modified or added modules.

3.1. Image Enhancement for Small Targets

Unmanned aerial vehicle (UAV) aerial photography faces three unique challenges in small target detection: targets are extremely small (typically 10–32 pixels) due to high-altitude shooting; complex backgrounds like buildings, trees, and clouds often obscure target features; and varying lighting conditions (e.g., backlighting or shadows) may exacerbate feature blurring. Traditional data augmentation methods, such as simple flipping and cropping, merely increase sample diversity without addressing these specific issues.

We performed the following operations on the VisDrone2019 dataset [14]: Random horizontal flipping: Flipping images horizontally with a specified probability. Center cropping: First cropping the central square region of the image with a side length equal to the minimum side of the original image, then resizing it to the target dimensions as needed. Brightness adjustment: Randomly adjusting pixel brightness within the range [−120, 120] and limiting pixel values to the [0, 1] range. Contrast adjustment: Randomly scaling pixel contrast within the range [0.5, 1.5] to enhance or reduce the difference between bright and dark regions of the image. Saturation adjustment: Randomly adjusting the green channel saturation within the range [0.5, 1.5] to affect color vibrancy. Gaussian noise addition: Adding Gaussian noise with a mean of 0 and standard deviation of 0.05 to simulate real-world image noise, thereby reducing detection errors caused by noise interference. Figure 2 presents the six image enhancement methods.
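The six operations can be sketched in NumPy as follows. This is a minimal illustration, not the authors' pipeline. It assumes images are float arrays in [0, 1] with shape (H, W, 3) and that the brightness offset is given in 8-bit units before normalization. It also simplifies in three ways: the flip is always applied rather than applied with a probability, saturation is scaled on the whole image rather than on the green channel, and the resize after cropping is omitted. All function names are hypothetical, and bounding-box updates are not shown.

```python
import numpy as np

def horizontal_flip(img):
    # Mirror the image along the width axis.
    return img[:, ::-1, :]

def center_crop_square(img):
    # Crop the central square whose side equals the shorter image side.
    h, w = img.shape[:2]
    side = min(h, w)
    top, left = (h - side) // 2, (w - side) // 2
    return img[top:top + side, left:left + side, :]

def adjust_brightness(img, delta):
    # delta is in 8-bit units (assumed); rescale it and clip to [0, 1].
    return np.clip(img + delta / 255.0, 0.0, 1.0)

def adjust_contrast(img, factor):
    # Scale each pixel's distance from the global mean intensity.
    mean = img.mean()
    return np.clip((img - mean) * factor + mean, 0.0, 1.0)

def adjust_saturation(img, factor):
    # Blend each pixel with its gray value (whole-image simplification).
    gray = img.mean(axis=2, keepdims=True)
    return np.clip(gray + (img - gray) * factor, 0.0, 1.0)

def add_gaussian_noise(img, rng, std=0.05):
    # Zero-mean Gaussian noise with the standard deviation from the text.
    return np.clip(img + rng.normal(0.0, std, img.shape), 0.0, 1.0)

def augment_six_ways(img, rng):
    # Produce one variant per operation, using the ranges from the text.
    return {
        "flip": horizontal_flip(img),
        "crop": center_crop_square(img),
        "brightness": adjust_brightness(img, rng.uniform(-120, 120)),
        "contrast": adjust_contrast(img, rng.uniform(0.5, 1.5)),
        "saturation": adjust_saturation(img, rng.uniform(0.5, 1.5)),
        "noise": add_gaussian_noise(img, rng),
    }

rng = np.random.default_rng(0)
image = rng.random((360, 480, 3))   # stand-in for a 480 x 360 drone frame
variants = augment_six_ways(image, rng)
```

For detection training, the flip and crop would also have to be applied to the box annotations; the photometric operations leave the boxes unchanged.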

Unlike traditional augmentation methods that indiscriminately process entire images, we analyzed the annotated information from the VisDrone2019 dataset to statistically determine the density distribution of small targets in each image. Within 100 × 100 pixel regions, we prioritized cropping images containing three or more small targets. This strategy ensures retained small target samples in cropped images, effectively avoiding the loss of small targets caused by conventional random cropping. Traditional augmentation methods often rely on single or limited strategies like flipping or cropping, which fail to comprehensively cover complex scenarios with drone small targets. We implemented a combination of geometric transformations, pixel adjustments, and noise simulation techniques that complement each other. Through this enhancement process, we generated a massive dataset of 38,826 augmented images, compared to the original 6471 images. This substantial increase in sample diversity significantly improved the model’s stability in recognizing small targets under varying perspectives and lighting conditions, reduced missed detections due to camera angles or lighting changes, and enhanced contrast between small targets and backgrounds. The optimized dataset also underwent moderate contrast adjustments to better capture small target features in complex environments, thereby boosting model robustness. The object classification and quantity in the dataset are shown in Figure 3 and Figure 4 respectively, presenting statistics for the original dataset and the improved dataset.
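The density-guided selection described above can be sketched in plain Python. This is one possible reading of the rule, not the authors' code. It assumes boxes are given as (left, top, width, height), that a target counts as small when both sides are at most 32 pixels, and that the 100 × 100 regions form a non-overlapping grid indexed by box centers. The function names and thresholds other than 100 and 3 are hypothetical.

```python
def is_small(box, max_side=32):
    # A box is treated as small when both sides fit within max_side pixels.
    _, _, w, h = box
    return w <= max_side and h <= max_side

def dense_windows(boxes, img_w, img_h, win=100, min_count=3):
    """Return (x0, y0, count) for grid windows holding at least
    min_count small-target centers, densest window first."""
    counts = {}
    for box in boxes:
        if not is_small(box):
            continue
        x, y, w, h = box
        cx, cy = x + w / 2, y + h / 2
        if 0 <= cx < img_w and 0 <= cy < img_h:
            # Snap the center to the top-left corner of its grid window.
            key = (int(cx // win) * win, int(cy // win) * win)
            counts[key] = counts.get(key, 0) + 1
    selected = [(x0, y0, n) for (x0, y0), n in counts.items() if n >= min_count]
    return sorted(selected, key=lambda item: -item[2])

boxes = [
    (10, 10, 12, 20),     # small, center in window (0, 0)
    (40, 30, 15, 15),     # small, center in window (0, 0)
    (70, 60, 10, 25),     # small, center in window (0, 0)
    (300, 200, 20, 20),   # small, but alone in its window
    (150, 150, 80, 60),   # too large to count as a small target
]
windows = dense_windows(boxes, img_w=480, img_h=360)
print(windows)
```

A sliding window with overlap would find dense clusters that straddle grid boundaries, at a higher counting cost.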

The original dataset was insufficient for small target detection, so we enhanced the data and named it Enhanced_Data.

3.2. Small Target Enhancer—MSA-C2f

3.2.1. MSA-C2f Architecture

MSA-C2f is an enhanced depthwise separable convolution branch based on the C3 module, integrating depthwise separable convolution with multi-scale feature extraction. By splitting a standard convolution into a depthwise operation and a pointwise (1 × 1) operation, it reduces computational and parameter costs while preserving more local features, providing a lightweight foundation for small object detection.
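The parameter saving from this split follows from a simple count. The sketch below compares a standard k × k convolution with its depthwise-plus-pointwise factorization, ignoring biases; the channel sizes are illustrative and not taken from the paper.

```python
def standard_conv_params(c_in, c_out, k):
    # One k x k filter per (input channel, output channel) pair.
    return k * k * c_in * c_out

def depthwise_separable_params(c_in, c_out, k):
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes channels
    return depthwise + pointwise

c_in, c_out, k = 64, 128, 3    # illustrative sizes
std = standard_conv_params(c_in, c_out, k)
dws = depthwise_separable_params(c_in, c_out, k)
ratio = dws / std              # equals 1 / c_out + 1 / k**2

print(std, dws, round(ratio, 4))
```

For these sizes the factorized layer keeps about 12% of the parameters, and the ratio approaches 1/k² as the number of output channels grows.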

MSA-C2f dynamically routes input features based on module indices. The system first determines whether the DSC3k2 module is in use. For modules utilizing DSC3k2, features are processed through two parallel branches: when the module index is even, features are processed by the Pzconv module; when the module index is odd, features are processed by the STE module. Features from these branches are aggregated into a single list for subsequent fusion. For modules not using DSC3k2, features are directed to DSC3k2_Pzconv before entering the feature fusion process. The overall architecture of MSA-C2f is illustrated in Figure 5.

The YOLO-DSNet model proposed in this paper adopts a three-stage architecture, consisting of a backbone network, a neck network, and a detection head. The backbone network comprises Conv, DSC3k2_MKP, DSConv, and A2C2f. We replace the DSC3k2 module with the MSA-C2f module, which demonstrates superior performance in small object detection.

3.2.2. Pzconv Module

We replaced the standard convolutional blocks in the original C3 with the Pzconv module, which maintains lightweight characteristics while expanding the receptive field for better performance in small object detection. The architecture of the Pzconv module is illustrated in Figure 6.

Pzconv further expands the receptive field while maintaining lightweight characteristics, achieving efficient feature aggregation through variable convolution kernels, mathematically described as

F_{pz} = \sigma(\mathrm{BN}(W_{pz} \times X))

(1)

where \sigma(\cdot) denotes the activation function and W_{pz} denotes the variable convolution kernel.

3.2.3. STE Architecture

STE consists of three modules. It enhances small object features through a triple enhancement approach: High-Res Path, MultiScaleFeaturePyramid (MSFP), and SmallObjectFocusAttention (SOFA). The STE architecture is illustrated in Figure 7.

The High-Res Path module employs 1 × 1 convolution and group convolution to preserve fine details of small targets. It first applies 1 × 1 convolution to adjust channel dimensions, then performs normalization and activation. Subsequently, 3 × 3 group convolution extracts spatial features, followed by another round of normalization and activation. This process efficiently achieves feature extraction and transformation while maintaining high resolution. The workflow diagram of this module is shown in Figure 8.

To preserve edge details of small targets, STE constructs high-resolution paths using 1 × 1 and group convolutions.

H = \sigma(\mathrm{BN}(W_{3} \times \sigma(\mathrm{BN}(W_{1} \times X))))

(2)

where \mathrm{BN} represents batch normalization and W_{i} refers to an i × i convolution.

MSFP architecture generates multi-scale features through convolutional branches with dilation factors of 1, 2, and 4, enabling effective detection of small targets across different sizes. SOFA, leveraging channel attention and spatial attention mechanisms with multi-scale pooling, enhances feature responses in small target regions. The core design principle of this module is multi-scale feature complementarity, allowing the model to simultaneously capture fine-grained and large-scale feature information for improved task performance. The workflow diagram of this module is shown in Figure 9.
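The effect of the three dilation factors on spatial coverage can be computed directly. The sketch below assumes 3 × 3 kernels in each branch, which the text does not state, and uses the standard formula for the effective extent of a dilated kernel.

```python
def effective_kernel_size(k, dilation):
    # A dilated kernel places its k taps `dilation` pixels apart,
    # so it spans k + (k - 1) * (dilation - 1) pixels per side.
    return k + (k - 1) * (dilation - 1)

# Dilation factors from the text; the 3 x 3 kernel size is an assumption.
sizes = {d: effective_kernel_size(3, d) for d in (1, 2, 4)}
print(sizes)
```

Under this assumption the branches cover 3, 5, and 9 pixels per side with the same number of weights, which is how dilation enlarges context without adding parameters.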

As depicted in Figure 10, SOFA is an attention module tailored to small objects that integrates multi-scale pooling, channel attention, and spatial attention. As an output enhancement mechanism, it first applies multi-scale pooling with 1 × 1, 2 × 2, 4 × 4, and 8 × 8 pooling kernels, then employs an attention mechanism to amplify small object features while suppressing background noise. The module operates as follows:

First, perform multi-scale pooling:

S_{i} = \mathrm{Pool}_{k_{i}}(X), \quad k_{i} \in \{1, 2, 4, 8\}

(3)

Then, calculate channel attention and spatial attention, respectively:

\alpha = \sigma(W_{2} \, \delta(W_{1} \, \mathrm{GAP}(X)))

(4)

where \delta is a nonlinear activation and \mathrm{GAP}(X) denotes global average pooling.

\beta = \sigma(f([X_{\max}; X_{avg}]))

(5)

where f(\cdot) denotes a convolution applied to the input features; X_{\max} is the output of max pooling the input features along the channel dimension; and X_{avg} is the output of averaging the input features along the channel dimension.

Finally, output the result:

F_{SOFA} = \alpha \cdot X + \beta \cdot X

(6)
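Equations (3) to (6) can be traced with a small NumPy sketch. It is an illustration under several assumptions rather than the authors' implementation: the pooling in Equation (3) is taken to be average pooling, \delta is taken to be ReLU, and f(\cdot) is reduced to a 1 × 1 convolution over the two stacked maps (two scalar weights). The text does not say how the pooled maps S_{i} enter the attention branches, so they are only computed here. All names and sizes are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def avg_pool(x, k):
    # Non-overlapping k x k average pooling on a (C, H, W) array, Eq. (3).
    c, h, w = x.shape
    return x.reshape(c, h // k, k, w // k, k).mean(axis=(2, 4))

def channel_attention(x, w1, w2):
    # Eq. (4): alpha = sigmoid(W2 * delta(W1 * GAP(X))), with delta = ReLU.
    gap = x.mean(axis=(1, 2))            # shape (C,)
    hidden = np.maximum(w1 @ gap, 0.0)   # shape (C // r,)
    return sigmoid(w2 @ hidden)          # shape (C,)

def spatial_attention(x, w_max, w_avg):
    # Eq. (5): channel-wise max and mean maps, mixed by a 1 x 1 convolution.
    x_max = x.max(axis=0)
    x_avg = x.mean(axis=0)
    return sigmoid(w_max * x_max + w_avg * x_avg)   # shape (H, W)

def sofa(x, w1, w2, w_max, w_avg):
    # Eq. (6): F = alpha * X + beta * X, with broadcasting over (C, H, W).
    alpha = channel_attention(x, w1, w2)[:, None, None]
    beta = spatial_attention(x, w_max, w_avg)[None, :, :]
    return alpha * x + beta * x

rng = np.random.default_rng(0)
C, H, W = 8, 16, 16
x = rng.standard_normal((C, H, W))
w1 = rng.standard_normal((C // 4, C))   # channel reduction, ratio 4 (assumed)
w2 = rng.standard_normal((C, C // 4))
pooled = {k: avg_pool(x, k) for k in (1, 2, 4, 8)}
out = sofa(x, w1, w2, w_max=0.5, w_avg=0.5)
```

Because both attention terms multiply the same input, Equation (6) is equivalent to scaling X by (\alpha + \beta) at each channel and position.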

3.3. Feature Enhancement Module DSAM

The DSAM is a feature enhancement module that processes foreground and background features separately, with its core architecture focusing on fine-grained feature extraction and fusion. It incorporates an Atrous Spatial Pyramid Pooling (ASPP) structure [13], which captures multi-scale spatial information through convolution operations with varying dilation rates. Each ASPP branch consists of convolution, batch normalization [17], and an activation function, enabling feature extraction across different receptive fields. Two parallel ASPP branches process foreground- and background-related features, respectively. The Pred_Layer further processes the fused features through convolution, batch normalization [17], and activation, ultimately outputting enhanced features. The workflow operates as follows: the module takes input features (feat) and predictions (pred). A 1 × 1 convolution first converts pred into a single-channel map, which is passed through a sigmoid activation and upsampled to match the spatial dimensions of feat, yielding a foreground weight map; its complement serves as the background weight map. These weight maps are used to split the original feature map: multiplying feat by the foreground weight map yields foreground-related features, while multiplying feat by the background weight map yields background-related features.

Apply 1 × 1 convolution and sigmoid to the prediction map:

W_{f} = \sigma(W_{1} \times pred)

(7)

W_{b} = 1 - W_{f}

(8)

where W_{f} is the foreground attention weight map (the higher the value, the more likely the position belongs to a target); pred is a result from the detection head; and W_{b} is the background attention weight map. If a region has a higher probability of being foreground, its background weight decreases accordingly, and vice versa.
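The foreground and background split in Equations (7) and (8) can be sketched in NumPy. This is a simplified illustration: the 1 × 1 convolution is written as a weighted sum over the prediction channels without a bias term, nearest-neighbor upsampling is assumed, and the ASPP branches and Pred_Layer that follow are not shown. The function names and tensor sizes are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def upsample_nearest(m, factor):
    # Repeat each value `factor` times along both spatial axes.
    return np.repeat(np.repeat(m, factor, axis=0), factor, axis=1)

def split_foreground_background(feat, pred, w1, factor):
    """feat: (C, H, W) features; pred: (P, h, w) prediction map;
    w1: (P,) weights of a 1 x 1 convolution producing one channel."""
    logits = np.tensordot(w1, pred, axes=([0], [0]))   # (h, w)
    w_f = upsample_nearest(sigmoid(logits), factor)    # Eq. (7), then resize
    w_b = 1.0 - w_f                                    # Eq. (8)
    return feat * w_f, feat * w_b, w_f, w_b

rng = np.random.default_rng(0)
feat = rng.standard_normal((16, 8, 8))
pred = rng.standard_normal((4, 4, 4))
w1 = rng.standard_normal(4)
feat_f, feat_b, w_f, w_b = split_foreground_background(feat, pred, w1, factor=2)
```

Since the two weight maps sum to one at every position, the two streams partition the input: feat_f + feat_b reproduces feat exactly.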

After processing through ASPP, the features from both branches are concatenated across channel dimensions and fed into the Pred_Layer for feature fusion and enhancement, ultimately outputting the enhanced feature map. The entire process implements feature segmentation, multi-scale extraction, and fusion, which enhances the expressive power of key features and is suitable for tasks requiring fine feature differentiation. Therefore, we place this module at the end of the head network, before the detection head, enabling the detection head to more easily learn the discriminative features between targets and backgrounds. This provides the detection head with higher-quality and more discriminative input features, ultimately improving detection accuracy. Figure 11 illustrates the structure diagram of the DSAM.

4. Experiments

4.1. Experimental Environment

Hardware-wise, NVIDIA RTX4090 graphics cards (NVIDIA Corporation, Santa Clara, CA, USA) were employed. To ensure reproducible experimental results, a standardized hyperparameter configuration was applied across training, testing, and validation phases. The specific settings included: 150 training epochs for the augmented dataset, 200 epochs for the original dataset, an input image size of 640 × 640 pixels, and the use of the SGD optimizer with a learning rate (lr) of 0.01. All networks were trained without pre-trained weights.

4.2. Data Set

In this experiment, we utilized the publicly available VisDrone2019 dataset, developed by Tianjin University and other research teams for object detection tasks. The dataset comprises labeled images divided into four subsets: training (6471 images), validation (548 images), testing (1610 images), and competition (1580 images). The image dimensions range from 480 × 360 to 2000 × 1500 pixels, with annotations covering 10 object types including pedestrians and various vehicles. The Enhanced_Data dataset features a larger image count and more diverse samples, simulating scenarios across different environments. For comparison, we conducted experiments on the same model using both the enhanced and original datasets to demonstrate the significant performance improvements achieved through data augmentation.

4.3. Comparative Experiment

Table 1 shows the experimental results of various models on the VisDrone2019 dataset, including mAP@0.5, number of parameters (M), and amount of computation (GFLOPs).

The performance analysis of different models through the table reveals that traditional two-stage and single-stage models (Faster-RCNN [1], RetinaNet-R50-FPN [2]) exhibit significant limitations, with mediocre performance and high resource consumption. When models of comparable parameter and computational scales are compared, the performance gap becomes apparent, where YOLOv13n [12] (30.8%) and D-Fine-N [22] (33.4%) demonstrate a clear advantage. The performance improvement is notably enhanced by incorporating the improved module, which significantly boosts the performance of lightweight models through the addition of attention mechanisms and optimized feature fusion strategies.

4.4. Ablation Experiment

Table 2 and Table 3 respectively present the ablation experiment results based on the original VisDrone-2019 dataset and the enhanced VisDrone-2019 dataset.

The experimental results demonstrate that image augmentation significantly improves model performance. The MSA-C2f module shows particularly notable enhancement in small target detection, while the DSAM provides some benefit, though its performance remains inferior to that of MSA-C2f.

Comparing the ablation results before and after image enhancement, we can clearly observe that using enhanced images for model training yields better performance. The results demonstrate that integrating the MSA-C2f and DSAM modules with enhanced data improves model accuracy by 8.4%, with the mAP@0.5 metric rising from 35.3% to 40.1% and precision increasing by 4.8 percentage points. These findings conclusively validate the effectiveness of the specialized image enhancement strategy designed for drone small object detection scenarios, which significantly enhances model feature learning capabilities and improves detection performance.

Secondly, high-quality and diverse samples enable the model to learn features with strong generalization capabilities more efficiently, achieving loss convergence and metric stabilization with fewer training rounds. Figure 12 and Figure 13 show the model training metrics before and after data augmentation, respectively.

Image enhancement significantly improves the model’s accuracy. However, because the proportion of each category in the improved dataset remains unchanged, the uneven sample distribution persists. To address this, we applied both oversampling and undersampling to different instances in the dataset, yet the experimental results showed no improvement, so we do not present these results here.

The performance comparison is based on five evaluation metrics: recall, precision, average precision (AP), mean average precision (mAP), and Gigaflops (GFLOPs).

The recall rate (R) formula is as follows:

\mathrm{Recall} = \frac{TP}{TP + FN}

(9)

where TP (True Positive) represents the number of samples where both the true label and the model’s prediction are positive; FN (False Negative) indicates the number of samples where the true label is positive but the model predicts negative.

The precision (P) formula is as follows:

\mathrm{Precision} = \frac{TP}{TP + FP}

(10)

where FP (False Positive) refers to the number of samples where the true label is negative but the model predicts positive.

The average precision (AP) formula is as follows:

AP = \int_{0}^{1} P(R) \, dR

(11)

Here, P(R) is the precision at recall R on the precision–recall curve (PR curve). The formula therefore represents the area under the PR curve.

The mean average precision (mAP) formula is as follows:

mAP = \frac{1}{n} \sum_{i=1}^{n} AP(i)

(12)

where n is the total number of categories in the task, and AP(i) denotes the average precision of the i-th category. The formula calculates the arithmetic mean of the AP values across all n categories.
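Equations (9) to (12) can be illustrated with a short plain-Python sketch. The integral in Equation (11) is approximated here with the trapezoidal rule over a handful of PR points; evaluation toolkits commonly use an interpolated variant instead, so this is a simplification. The counts and PR points are made-up illustrative values, not results from the paper.

```python
def recall(tp, fn):
    # Eq. (9): fraction of ground-truth positives that were found.
    return tp / (tp + fn) if tp + fn else 0.0

def precision(tp, fp):
    # Eq. (10): fraction of predicted positives that are correct.
    return tp / (tp + fp) if tp + fp else 0.0

def average_precision(recalls, precisions):
    # Eq. (11): area under the PR curve, trapezoidal approximation.
    points = sorted(zip(recalls, precisions))
    area = 0.0
    for (r0, p0), (r1, p1) in zip(points, points[1:]):
        area += (r1 - r0) * (p0 + p1) / 2.0
    return area

def mean_average_precision(ap_per_class):
    # Eq. (12): arithmetic mean of the per-category AP values.
    return sum(ap_per_class) / len(ap_per_class)

r = recall(tp=80, fn=20)
p = precision(tp=80, fp=10)
ap_a = average_precision([0.0, 0.5, 1.0], [1.0, 0.8, 0.4])
ap_b = average_precision([0.0, 1.0], [1.0, 0.5])
m_ap = mean_average_precision([ap_a, ap_b])
print(r, round(p, 3), ap_a, ap_b, m_ap)
```

For mAP@0.5, a detection counts as a true positive when its IoU with a ground-truth box is at least 0.5; mAP@0.5:0.95 averages the result over IoU thresholds from 0.5 to 0.95.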

GFLOPs denotes the number of floating-point operations expressed in units of one billion (10^9), a metric for evaluating the computational complexity of a model. GFLOPs is calculated as follows [23]:

\mathrm{GFLOPs} = \frac{\text{Total Floating-Point Operations}}{10^{9}}

(13)

This formula measures the computational complexity of the model. A higher value indicates greater computational effort required for a single inference or training cycle, thus demanding more hardware processing power.
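As a worked instance of Equation (13), the sketch below counts the operations of a single convolution layer. It assumes the common convention of two floating-point operations per multiply-accumulate and ignores bias, normalization, and activation costs; some tools report multiply-accumulates instead, which halves the figure. The layer sizes are illustrative and not taken from the paper.

```python
def conv2d_flops(c_in, c_out, k, h_out, w_out, groups=1):
    # Multiply-accumulates: one k x k x (c_in / groups) dot product
    # per output channel and per output position.
    macs = (c_in // groups) * k * k * c_out * h_out * w_out
    return 2 * macs   # 2 FLOPs per multiply-accumulate (assumed convention)

def to_gflops(flops):
    # Eq. (13): total floating-point operations divided by 10^9.
    return flops / 1e9

flops = conv2d_flops(c_in=64, c_out=128, k=3, h_out=80, w_out=80)
gflops = to_gflops(flops)
print(flops, gflops)
```

Summing this count over all layers for one 640 × 640 forward pass gives the kind of per-model figure reported alongside the parameter counts.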

4.5. Analysis of Experimental Results

Figure 14 and Figure 15 show the mAP@0.5 curve and the mAP@0.5:0.95 curve, respectively. When the DSAM is added, the mAP metric improves further compared to the model using only data augmentation, indicating that the DSAM enhances small object features and boosts detection accuracy. The addition of the MSA-C2f module leads to a significant performance improvement, demonstrating its positive impact on small object detection through feature transfer or fusion.

When both the DSAM and MSA-C2f modules are incorporated, the mAP@0.5 and mAP@0.5:0.95 metrics reach their peak performance. This demonstrates that the two modules work synergistically to enhance small target detection, enabling the model to achieve more comprehensive performance in identifying small targets.

Figure 16 shows the loss curves. All improved models demonstrate faster loss reduction and lower final loss values compared to the original YOLOv13. Notably, the YOLOv13+augdata+DSAM+MSA-C2f model exhibits the most stable loss curve and achieves the lowest final loss, indicating that the combined model converges more stably during training. It effectively learns features relevant to small object detection, resulting in superior training performance.

4.6. Comparison Results of Different Models

As shown in Figure 17, the YOLO-DSNet model demonstrates superior detection performance with higher overall accuracy, and can accurately identify partially occluded objects.

YOLO-DSNet also accurately identifies partially occluded objects in low-light nighttime scenes. While the original YOLOv13n and RTDETR-R18 models can detect most targets and classify object types, they are less precise: some objects are misidentified in low-light conditions, and certain objects remain undetected in complex images.

The performance gap between models stems from YOLO-DSNet’s three core enhancements, each targeting a specific challenge in small object detection. The data augmentation strategy provides diverse images that simulate varied shooting angles and weather conditions, giving the model more varied examples to learn from. The MSA-C2f module employs attention mechanisms to capture small object features precisely: its high-resolution path preserves object edges, while its multi-scale feature pyramid generates representations suited to different object sizes. The DSAM strengthens object–background discrimination so that feature regions are separated accurately and background interference is suppressed. Together, these improvements, spanning data and architecture, account for the superior training behavior and detection performance of our model.

5. Conclusions and Future Work

This study uses YOLOv13n as the baseline model, inheriting its HyperACE (hypergraph-based adaptive correlation enhancement) mechanism and FullPAD (full-pipeline aggregation-and-distribution) paradigm. Through module replacement and addition, along with targeted dataset augmentation, it balances improved small object detection accuracy against a lightweight model size.

The MSA-C2f module comprises three core components. The SmallTargetEnhancer (STE) addresses three key challenges in small object detection (detail loss, poor scale adaptability, and weak feature response) through three enhancements: a high-resolution path, a multi-scale feature pyramid (MSFP), and small object focus attention (SOFA). This establishes a complete pipeline from feature extraction to enhancement and focus. The DSAM further improves the model’s ability to capture and represent small object features by performing multi-scale extraction after feature segmentation, followed by fusion enhancement. This reduces the missed and false detections caused by insufficient feature information in small object detection.

Despite demonstrating strong performance on established benchmarks, this study has certain limitations. Although data augmentation expanded the sample size, it did not fully resolve the uneven distribution of small objects across categories in the VisDrone2019 dataset. We aim to further enhance the dataset using Generative Adversarial Networks (GANs) [24] to address this sample imbalance. Additionally, we plan to extend validation to other datasets such as COCO and to improve model robustness under adverse weather conditions.

In the future, our research will focus on optimizing computational efficiency and model compression to facilitate lightweight deployment on edge devices. On the one hand, the module structure can be further optimized: a dynamic receptive field adjustment mechanism can be introduced in the STE module of MSA-C2f so that adaptive convolution kernel sizes match small targets of different sizes. On the other hand, redundant module structures can be simplified with neural architecture search (NAS) [25], and more efficient lightweight feature extraction units can replace part of the convolution operations, further reducing computational overhead while maintaining accuracy.

Author Contributions

Conceptualization, H.X. and H.H.; methodology, H.X.; software, H.X.; validation, H.X., H.H. and Q.Z.; resources, B.H.; data curation, B.H.; writing—original draft preparation, H.X.; writing—review and editing, H.H.; visualization, Q.Z.; supervision, Z.Y.; project administration, Z.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original data presented in the study are openly available as the public VisDrone2019 dataset via the official GitHub repository at https://github.com/VisDrone/VisDrone-Dataset (accessed on 20 October 2025).

Conflicts of Interest

Authors Haokun Xu and Bocheng Han were employed by the company Vecton AI. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:

DSAM	Dual-Stream Attention Module
MSA-C2f	Multi-scale Attention C2f
UAV	Unmanned Aerial Vehicle
FPN	Feature Pyramid Networks
SE	Squeeze-and-Excitation
CBAM	Convolutional Block Attention Module
ECA	Efficient Channel Attention
ASPP	Atrous Spatial Pyramid Pooling
R-CNN	Region-based Convolutional Neural Networks
HRNet	High-resolution network
CSPDarknet	Cross Stage Partial Darknet
DSConv	Depthwise Separable Convolution
STE	SmallTargetEnhancer
MSFP	MultiScaleFeaturePyramid
SOFA	SmallObjectFocusAttention
SGD	Stochastic Gradient Descent
R	Recall rate
TP	True Positive
FN	False Negative
P	Precision
FP	False Positive
AP	Average Precision
mAP	Mean Average Precision
GAN	Generative adversarial networks

References

1. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149.
2. Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 2980–2988.
3. Redmon, J.; Farhadi, A. YOLOv3: An incremental improvement. arXiv 2018, arXiv:1804.02767.
4. Lin, T.-Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 2117–2125.
5. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path aggregation network for instance segmentation. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 8759–8768.
6. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 7132–7141.
7. Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. CBAM: Convolutional block attention module. In Computer Vision—ECCV 2018; Springer: Cham, Switzerland, 2018; pp. 3–19.
8. Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient channel attention for deep convolutional neural networks. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; IEEE: Piscataway, NJ, USA, 2020; pp. 11531–11539.
9. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv 2017, arXiv:1704.04861.
10. Bochkovskiy, A.; Wang, C.-Y.; Liao, H.-Y.M. Yolov4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934.
11. Yaseen, M. What is YOLOv8: An In-Depth Exploration of the Internal Features of the Next-Generation Object Detector. arXiv 2024, arXiv:2408.15857.
12. Lei, M.; Li, S.; Wu, Y.; Yang, H.; Peng, L. YOLOv13: Real-Time Object Detection with Hypergraph-Enhanced Adaptive Visual Perception. arXiv 2025, arXiv:2506.17733.
13. Chen, L.-C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 834–848.
14. Zhu, P.; Wen, L.; Bian, X.; Ling, H.; Hu, Q. Vision meets drones: A challenge. arXiv 2018, arXiv:1804.07437.
15. Xu, W.; Sun, L.; Zhen, C.; Liu, B.; Yang, Z.; Yang, W. Deep Learning-Based Image Recognition of Agricultural Pests. Appl. Sci. 2022, 12, 12896.
16. Wang, J.; Sun, K.; Cheng, T.; Jiang, B.; Deng, C.; Zhao, Y.; Liu, D.; Mu, Y.; Tan, M.; Wang, X. Deep high-resolution representation learning for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 43, 3349–3364.
17. Ioffe, S.; Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the International Conference on Machine Learning, Lille, France, 6–11 July 2015; ACM Digital Library: New York, NY, USA, 2015; pp. 448–456.
18. Yang, F.; He, M.; Liu, J.; Jin, H. RMH-YOLO: A Refined Multi-Scale Architecture for Small-Target Detection in UAV Aerial Imagery. Sensors 2025, 25, 7088.
19. Wu, J.; Tang, X.; Yang, Z.; Hao, K.; Lai, L.; Liu, Y. An Experimental Evaluation of LLM on Image Classification. In Australasian Database Conference; Springer: Singapore, 2024; pp. 506–518.
20. Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. DETRs Beat YOLOs on Real-time Object Detection. arXiv 2023, arXiv:2304.08069.
21. Huang, S.; Xu, Z.; Sun, X.; Wang, Z.; Jin, X.; Li, X.; Zhang, X. DEIM: DETR with Improved Matching for Fast Convergence. arXiv 2024, arXiv:2412.04234.
22. Yuan, Z.; Gong, J.; Guo, B.; Wang, C.; Liao, N.; Song, J.; Wu, Q. Small Object Detection in UAV Remote Sensing Images Based on Intra-Group Multi-Scale Fusion Attention and Adaptive Weighted Feature Fusion Mechanism. Remote Sens. 2024, 16, 4265.
23. Tan, M.; Le, Q. Efficientnet: Rethinking model scaling for convolutional neural networks. In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; ACM Digital Library: New York, NY, USA, 2019; pp. 6105–6114.
24. Goodfellow, I.J.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial nets. In Advances in Neural Information Processing Systems; Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N., Weinberger, K.Q., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2014; Volume 27.
25. Deng, L.; Li, G.; Han, S.; Shi, L.; Xie, Y. Model compression and hardware acceleration for neural networks: A comprehensive survey. Proc. IEEE 2020, 108, 485–532.

Figure 1. YOLO-DSNet network architecture diagram.

Figure 2. Method for image enhancement of small target objects.

Figure 3. Number and types of targets in the original dataset.

Figure 4. Number and types of targets in the augmented dataset.

Figure 5. Multi-scale Attention C2f (MSA-C2f) architecture diagram.

Figure 6. Pzconv module structure diagram.

Figure 7. Small Target Enhancer (STE) architecture diagram.

Figure 8. High-res path flowchart.

Figure 9. Multi Scale Feature Pyramid (MSFP) module flowchart.

Figure 10. Small Object Focus Attention (SOFA) module flowchart.

Figure 11. Dual-stream attention module (DSAM) structure diagram.

Figure 12. Model performance before data augmentation.

Figure 13. Training performance of the data-enhanced model.

Figure 14. mAP@0.5 curve.

Figure 15. mAP@0.5:0.95 curve.

Figure 16. Variation curve of the loss function.

Figure 17. Detection performance of small targets under different models.

Table 1. Comparison of performance and parameter count between YOLO-DSNet and other models.

Model	mAP@0.5	Params (M)	GFLOPs
Faster-RCNN [1]	25.9	41.4	292.3
RTDETR-R18 [20]	33.3	20	60
RetinaNet-R50-FPN [2]	27.6	36.5	210
DEIM-D-Fine-N [21]	32.2	3.73	7.12
D-Fine-N [22]	33.4	3.73	7.12
YOLOv8n [11]	25.9	3.0	8.1
YOLOv10n	26.1	2.28	6.5
YOLO12n	25.9	2.56	6.3
YOLOv13n [12]	30.8	2.45	6.2
YOLOv13s [12]	29.7	9.0	20.1
YOLOv13n + MSA-C2f	37.5	2.72	11.3
YOLOv13n + DSAM	36.2	2.98	13.1
YOLO-DSNet (ours)	40.1	3.23	17.8

Abbreviations: MSA-C2f: Multi-scale Attention C2f; DSAM: dual-stream attention module.
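The headline comparison between the baseline and the proposed model can be checked with simple arithmetic on the Table 1 rows. The sketch below uses only the YOLOv13n and YOLO-DSNet values from the table; the dictionary keys and variable names are illustrative choices, not part of the paper.

```python
# Values copied from Table 1 (mAP@0.5 in %, parameters in millions, GFLOPs).
baseline = {"map50": 30.8, "params_m": 2.45, "gflops": 6.2}   # YOLOv13n
proposed = {"map50": 40.1, "params_m": 3.23, "gflops": 17.8}  # YOLO-DSNet

# Absolute gain in percentage points and relative gain over the baseline.
abs_gain = proposed["map50"] - baseline["map50"]
rel_gain = abs_gain / baseline["map50"] * 100

# Cost of the added modules.
extra_params = proposed["params_m"] - baseline["params_m"]
extra_gflops = proposed["gflops"] - baseline["gflops"]

print(f"mAP@0.5 gain: +{abs_gain:.1f} points ({rel_gain:.1f}% relative)")
print(f"extra parameters: +{extra_params:.2f} M")
print(f"extra compute: +{extra_gflops:.1f} GFLOPs")
```

The result is a gain of 9.3 points (about 30% relative) for roughly 0.8 M additional parameters and 11.6 additional GFLOPs, which matches the figures quoted in the abstract.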

Table 2. Settings and performance comparison of YOLO-DSNet ablation experiments on the original dataset.

Group	YOLOV13n	DSAM	MSA-C2f	P%	R%	mAP@0.5	mAP@0.7
1	✓			41.6	31.3	30.8	17.7
2	✓	✓		42.0	31.9	31.5	17.9
3	✓	✓	✓	44.5	36.1	35.3	20.9

Table 3. Settings and performance comparison of YOLO-DSNet ablation experiments on the augmented dataset.

Group	YOLOV13n	DSAM	MSA-C2f	P%	R%	mAP@0.5	mAP@0.7
1	✓			45.7	33.7	34.0	19.8
2	✓	✓		47.6	35.8	36.2	21.4
3	✓	✓	✓	52.9	38.8	40.1	24.9
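As a reading aid for the two ablation tables, the following sketch computes the mAP@0.5 change contributed by each step. The values are copied from Tables 2 and 3; the row labels and the helper `step_gains` are illustrative names introduced here. Because the modules were added cumulatively, each difference reflects a module's contribution on top of the previous configuration, not in isolation.

```python
# mAP@0.5 (%) per ablation group, copied from Table 2 (original data)
# and Table 3 (augmented data).
original = [("YOLOv13n", 30.8), ("+ DSAM", 31.5), ("+ DSAM + MSA-C2f", 35.3)]
augmented = [("YOLOv13n", 34.0), ("+ DSAM", 36.2), ("+ DSAM + MSA-C2f", 40.1)]


def step_gains(rows):
    """Gain of each configuration over the previous one, in percentage points."""
    return [(rows[i][0], round(rows[i][1] - rows[i - 1][1], 1))
            for i in range(1, len(rows))]


# Gain from data augmentation alone, for each matching configuration.
augmentation_gains = [(name, round(aug - orig, 1))
                      for (name, orig), (_, aug) in zip(original, augmented)]

print("original dataset: ", step_gains(original))
print("augmented dataset:", step_gains(augmented))
print("augmentation gain:", augmentation_gains)
```

On these numbers the DSAM adds 0.7 points on the original data and 2.2 points on the augmented data, while adding MSA-C2f contributes a further 3.8 and 3.9 points, respectively. Augmentation alone lifts the three configurations by 3.2, 4.7, and 4.8 points.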

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Xu, H.; He, H.; Zhi, Q.; Yang, Z.; Han, B. YOLO-DSNet for Small Target Detection. Appl. Sci. 2026, 16, 1493. https://doi.org/10.3390/app16031493

