Author Contributions
Conceptualisation, T.T.; methodology, T.T.; software, T.T.; validation, T.T.; formal analysis, T.T.; investigation, T.T.; resources, T.T.; data curation, T.T.; writing—original draft preparation, T.T.; writing—review and editing, T.T., J.Z., H.Z., Y.Z., X.P., H.L. and Y.W.; visualisation, T.T.; supervision, T.T.; project administration, X.P.; funding acquisition, J.Z. All authors have read and agreed to the published version of the manuscript.
Figure 1.
The overall architectural framework of the proposed LDSNet. The red dashed boxes highlight the core modular innovations, including LDSDown, SRDC, and DGHead.
Figure 2.
Structure of LDSDown.
Figure 3.
Structure of SRDC. The orange squares in the grids denote the active sampling locations of the convolutional kernels, illustrating the progressively expanding receptive fields corresponding to different dilation rates (, , and ).
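The expanding sampling pattern described in this caption follows the standard relation for dilated convolution, where a kernel of size k with dilation d spans k + (k - 1)(d - 1) pixels along each axis. The sketch below is illustrative only: the 3×3 kernel and the dilation rates 1, 2, and 3 are placeholder assumptions, because the rates used by SRDC are not legible in this caption, and the helper names are hypothetical.

```python
def effective_kernel_size(kernel_size: int, dilation: int) -> int:
    """Span (in pixels, along one axis) covered by a single dilated kernel."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def sampling_offsets(kernel_size: int, dilation: int) -> list[int]:
    """Offsets from the kernel centre that are actually sampled (odd kernels only)."""
    half = kernel_size // 2
    return [i * dilation for i in range(-half, half + 1)]


# Hypothetical dilation rates for illustration; NOT taken from the paper.
ASSUMED_DILATIONS = [1, 2, 3]

for d in ASSUMED_DILATIONS:
    span = effective_kernel_size(3, d)
    print(f"dilation={d}: span={span}, offsets={sampling_offsets(3, d)}")
```

The offsets correspond to the active sampling locations along one axis; the number of sampled positions stays at three per axis while the span grows with the dilation rate.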
Figure 4.
Structure of YOLOv11 Head. In the output tensors, the 64 channels in the Box branch correspond to the Distribution Focal Loss (DFL) representation, calculated as 4 bounding box boundaries (left, top, right, bottom) discretized into 16 bins ($4 \times 16 = 64$). The channels in the Cls branch denote the classification probabilities for the total number of object classes.
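To make the 64-channel Box output concrete, the following minimal sketch decodes one spatial location into four boundary distances by taking the expected bin index under a softmax over each group of 16 logits. It is a plain-Python illustration, not the authors' implementation; the side-major channel layout (16 consecutive channels per boundary) and the function names are assumptions.

```python
import math

REG_MAX = 16  # bins per boundary, as stated in the caption
SIDES = 4     # left, top, right, bottom


def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode_dfl(box_channels: list[float]) -> list[float]:
    """Map 64 logits for one location to 4 expected distances (in bin units)."""
    if len(box_channels) != SIDES * REG_MAX:
        raise ValueError("expected 4 * 16 = 64 channels")
    distances = []
    for side in range(SIDES):
        # Assumed layout: channels [side*16, (side+1)*16) belong to one boundary.
        logits = box_channels[side * REG_MAX:(side + 1) * REG_MAX]
        probs = softmax(logits)
        distances.append(sum(i * p for i, p in enumerate(probs)))
    return distances
```

With uniform logits every boundary decodes to the midpoint of the bin range; a sharply peaked distribution decodes to roughly the index of its peak. In a full detector these distances would still need to be scaled by the feature-map stride.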
Figure 5.
Structure of DGHead. The input dimensions () for the P2 and P3 feature layers are and .
Figure 6.
Structure of Group Conv. The asterisk (*) denotes the convolution operation. The ellipses (⋮ and …) represent the omitted intermediate groups. Different colors indicate the distinct groups into which the channels are divided.
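The saving from splitting channels into groups can be checked with a short calculation: each output channel of a grouped convolution only sees C_in / g input channels, so the weight count drops by a factor of g. The sketch below is a generic illustration; the 64-channel width, the 3×3 kernel, and the function name are assumptions chosen for the example, not values from the paper.

```python
def conv_weight_count(c_in: int, c_out: int, kernel_size: int, groups: int = 1) -> int:
    """Number of weights (bias excluded) in a 2D convolution with `groups` groups."""
    if c_in % groups or c_out % groups:
        raise ValueError("channel counts must be divisible by groups")
    return (c_in // groups) * c_out * kernel_size * kernel_size


# Hypothetical layer sizes for illustration only.
standard = conv_weight_count(64, 64, 3, groups=1)
grouped = conv_weight_count(64, 64, 3, groups=4)
print(standard, grouped, standard // grouped)
```

Setting `groups` equal to the channel count gives the depthwise case, where each filter operates on a single input channel.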
Figure 7.
HIT-UAV dataset imagery.
Figure 8.
VisDrone2019 dataset imagery.
Figure 9.
Distribution of object sizes for (a) VisDrone2019 and (b) HIT-UAV. Definition of scales (pixels): small (< pixels), medium ( to pixels), and large (≥ pixels).
Figure 10.
Comparative visualisation of effective receptive field: (a) Highly concentrated ERF of Layer 5; (b) Centre-biased response of SPPF; (c) Expanded coverage of LSKA-SPPF; (d) Radiative outward expansion of the proposed SRDC module.
Figure 11.
Grad-CAM heatmap visualisation and comparative analysis across different detection models: (a) Original aerial images; (b) YOLOv11n baseline; (c) YOLOv12n baseline; and (d) the proposed LDSNet. Compared to the baselines in (b,c), the proposed LDSNet in (d) exhibits more concentrated and precise feature activation on minute objects while significantly suppressing non-object background clutter. The colors in the heatmaps indicate the intensity of feature activation: warmer colors (e.g., red) represent regions with high activation where the model’s attention is primarily focused, while cooler colors (e.g., blue) correspond to low-activation background regions.
Figure 12.
Qualitative detection results of LDSNet under diverse illumination conditions on the VisDrone2019 dataset: (a) daytime scenes; (b) dusk scenes; and (c) nighttime scenes. The results underscore the model’s robustness and its ability to maintain stable detection performance across significant temporal and lighting fluctuations. The colored frames represent the predicted bounding boxes, indicating the precise locations and categories of the detected objects.
Figure 13.
Detection performance of LDSNet in representative high-challenge aerial environments: (a) dense crowd scenes; (b) busy street scenes; and (c) blurred imagery. The visualisations demonstrate that LDSNet maintains detection completeness and localisation accuracy despite cluttered backgrounds and motion-induced degradation. The colored frames represent the predicted bounding boxes, indicating the precise locations and categories of the detected objects.
Figure 14.
Comparative visualisation of detection performance on clustered minute objects: (a) YOLOv11n; (b) YOLOv12n; and (c) the proposed LDSNet. The red magnified insets highlight LDSNet’s superior recall, successfully pinpointing infinitesimal objects that are frequently missed by standard baselines. The colored frames represent the predicted bounding boxes, indicating the precise locations and categories of the detected objects.
Figure 15.
Visualisation of detection results versus ground-truth annotations on the HIT-UAV infrared dataset: (a) LDSNet predictions; and (b) ground truth labels. The model achieves high-fidelity alignment and effective edge preservation, showcasing its sensitivity even in the absence of chromatic and textural cues. The colored frames represent the bounding boxes of the objects. Specifically, the cyan frames in (a) denote the detection results predicted by LDSNet, while the green frames in (b) indicate the ground-truth bounding boxes of the actual targets.
Figure 16.
Detailed detection performance comparison for dense and occluded infrared objects: (a) YOLOv11n; (b) YOLOv12n; (c) the proposed LDSNet; and (d) ground truth labels. Compared to baselines, LDSNet exhibits greater discriminability and superior alignment in complex thermal backgrounds. The colored frames represent the bounding boxes of the objects. Specifically, the white and blue frames in (a–c) denote the detection results predicted by the respective models, while the green frames in (d) indicate the ground-truth bounding boxes. The prominent red rectangles are utilized to highlight specific comparison regions, particularly showcasing instances where the baseline models (YOLOv11n and YOLOv12n) fail to detect the targets.
Figure 17.
Visualization of typical failure conditions for the proposed method: (a) Large objects: the restricted downsampling depth limits the global receptive field, resulting in mismatched spatial coverage and localization inaccuracy. (b) High occlusion: severe inter-object overlap and background obstruction lead to the loss of critical features, causing missed detections. The thin colored frames represent the predicted bounding boxes of the detected objects. The prominent red rectangles are specifically utilized to highlight the failure cases of the proposed LDSNet, such as inaccurate localization or missed detections in these challenging scenarios.
Table 1.
Experimental environment configuration.
| Component | Configuration |
|---|---|
| Operating System | Ubuntu 20.04 |
| GPU | NVIDIA GeForce RTX 3090 (24 GB) |
| CPU | Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90 GHz |
| Python Version | 3.10.14 |
| Deep Learning Framework | PyTorch 2.4.0 |
| CUDA Version | 12.4 |
Table 2.
Hyperparameter configurations for model training.
| Name | Value | Name | Value |
|---|---|---|---|
| Optimizer | SGD | Training Epochs | 300 |
| Input Image Size | | Data Loading Workers | 4 |
| Initial Learning Rate | 0.01 | Batch Size | 16 |
| Early Stopping | Enabled | Automatic Mixed Precision (AMP) | Enabled |
| Weight Decay | 0.0005 | Momentum Factor | 0.937 |
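
For readers who want to reproduce the setup, the values in Table 2 can be collected into a single configuration object. The sketch below is an assumption about tooling: the key names follow Ultralytics-style trainer arguments (`lr0`, `batch`, `workers`, `amp`), which the source does not confirm, and `sgd_kwargs` is a hypothetical helper. The input image size and the early-stopping patience are omitted because their values are not recoverable from the table.

```python
# Values transcribed from Table 2. Key names are an assumption
# (Ultralytics-style trainer arguments), not confirmed by the source.
TRAIN_HPARAMS = {
    "optimizer": "SGD",
    "epochs": 300,
    "batch": 16,
    "workers": 4,
    "lr0": 0.01,            # initial learning rate
    "momentum": 0.937,
    "weight_decay": 0.0005,
    "amp": True,            # automatic mixed precision enabled
}


def sgd_kwargs(hparams: dict) -> dict:
    """Pick out the arguments a PyTorch-style SGD optimizer would take."""
    return {
        "lr": hparams["lr0"],
        "momentum": hparams["momentum"],
        "weight_decay": hparams["weight_decay"],
    }


print(sgd_kwargs(TRAIN_HPARAMS))
```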
Table 3.
Deep ablation analysis of detection performance with various network structure adjustments.
| Scheme (S-D) | P2 | P3 | P4 | P5 | Conv5 | P (%) | R (%) | mAP50 (%) | mAP50:95 (%) | Param (M) | FLOPs (G) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| S345-D32× (Baseline) | | ✓ | ✓ | ✓ | ✓ | 38.6 | 29.4 | 27.2 | 15.1 | 2.6 | 6.5 |
| S34-D16× (Minimal) | | ✓ | ✓ | | | 37.9 | 28.9 | 25.6 | 14.3 | 0.7 | 4.6 |
| S34-D16× (Standard) | | ✓ | ✓ | | ✓ | 38.6 | 31.1 | 27.4 | 15.2 | 1.8 | 5.7 |
| S234-D16× (Lite) | ✓ | ✓ | ✓ | | | 39.8 | 30.7 | 29.1 | 16.4 | 0.9 | 10.0 |
| S234-D16× (Full) | ✓ | ✓ | ✓ | | ✓ | 40.4 | 32.2 | 30.0 | 16.8 | 2.1 | 11.1 |
| S23-D16× (Lite) | ✓ | ✓ | | | | 39.6 | 31.5 | 29.0 | 16.1 | 0.7 | 9.1 |
| S23-D16× (Full) | ✓ | ✓ | | | ✓ | 40.7 | 31.8 | 29.6 | 16.5 | 1.8 | 10.2 |
Table 4.
Comparison of different downsampling modules based on YOLOv11n-lite.
| Methods | P (%) | R (%) | mAP50 (%) | mAP50:95 (%) | Param (M) | FLOPs (G) |
|---|---|---|---|---|---|---|
| YOLOv11n-lite | 39.6 | 31.5 | 29.0 | 16.1 | 0.7 | 9.1 |
| YOLOv11n-lite-LDSDown | 39.8 | 30.7 | 28.5 | 15.8 | 0.5 | 8.2 |
| YOLOv11n-lite-HWD | 39.1 | 30.3 | 28.2 | 15.6 | 0.5 | 8.4 |
| YOLOv11n-lite-V7DS | 40.3 | 31.2 | 28.7 | 16.1 | 0.6 | 8.8 |
| YOLOv11n-lite-SPDDown | 41.4 | 31.4 | 29.7 | 16.6 | 1.3 | 12.6 |
| YOLOv11n-lite-GCDown | 40.1 | 31.4 | 29.0 | 16.4 | 0.7 | 9.1 |
Table 5.
Comparison of different SPPF modules based on YOLOv11n-lite.
| Methods | P (%) | R (%) | mAP50 (%) | mAP50:95 (%) | Param (M) | FLOPs (G) |
|---|---|---|---|---|---|---|
| YOLOv11n-lite-SPPF | 39.6 | 31.5 | 29.0 | 16.1 | 0.7 | 9.1 |
| YOLOv11n-lite-SRDC | 41.5 | 31.6 | 29.9 | 16.9 | 0.7 | 9.1 |
| YOLOv11n-lite-RDC | 40.8 | 31.8 | 30.1 | 17.0 | 0.8 | 9.4 |
| YOLOv11n-lite-AIFI | 39.9 | 30.7 | 29.2 | 16.4 | 0.8 | 9.3 |
| YOLOv11n-lite-FMSPPF | 40.3 | 30.9 | 29.1 | 16.4 | 0.7 | 9.1 |
| YOLOv11n-lite-LSKA-SPPF | 40.9 | 31.7 | 29.7 | 16.7 | 0.7 | 9.3 |
Table 6.
Sensitivity analysis of the grouping factor g in DGHead.
| Methods | P (%) | R (%) | mAP50 (%) | mAP50:95 (%) | Param (M) | FLOPs (G) |
|---|---|---|---|---|---|---|
| YOLOv11n-lite | 39.6 | 31.5 | 29.0 | 16.1 | 0.7 | 9.1 |
| YOLOv11n-lite-DGHead () | 39.3 | 30.2 | 28.0 | 15.4 | 0.5 | 4.7 |
| YOLOv11n-lite-DGHead () | 39.5 | 29.9 | 28.1 | 15.7 | 0.5 | 4.8 |
| YOLOv11n-lite-DGHead () | 39.8 | 30.4 | 28.4 | 15.8 | 0.5 | 5.0 |
| YOLOv11n-lite-DGHead () | 40.2 | 30.8 | 28.8 | 16.1 | 0.5 | 5.5 |
| YOLOv11n-lite-DGHead () | 40.3 | 31.3 | 28.8 | 16.2 | 0.6 | 6.4 |
Table 7.
Comparison of detection performance with different detection heads based on YOLOv11n-lite.
| Methods | P (%) | R (%) | mAP50 (%) | mAP50:95 (%) | Param (M) | FLOPs (G) |
|---|---|---|---|---|---|---|
| YOLOv11n-lite | 39.6 | 31.5 | 29.0 | 16.1 | 0.7 | 9.1 |
| YOLOv11n-lite-DGHead | 40.2 | 30.8 | 28.8 | 16.1 | 0.5 | 5.5 |
| YOLOv11n-lite-LADH | 38.5 | 30.5 | 27.9 | 15.5 | 0.5 | 5.9 |
| YOLOv11n-lite-LQE | 40.1 | 31.7 | 29.3 | 16.4 | 0.7 | 9.1 |
| YOLOv11n-lite-LSCD | 39.6 | 30.9 | 28.8 | 16.1 | 0.6 | 6.8 |
| YOLOv11n-lite-SEAM | 38.6 | 31.0 | 28.4 | 15.7 | 0.6 | 7.0 |
Table 8.
Ablation study of the proposed LDSNet modules on the VisDrone2019 dataset.
| YOLOv11n | Lite | LDSDown | DGHead | SRDC | P (%) | R (%) | mAP50 (%) | mAP50:95 (%) | Param (M) | FLOPs (G) |
|---|---|---|---|---|---|---|---|---|---|---|
| ✓ | | | | | 38.6 | 29.4 | 27.2 | 15.1 | 2.6 | 6.5 |
| ✓ | ✓ | | | | 39.6 | 31.5 | 29.0 | 16.1 | 0.7 | 9.1 |
| ✓ | ✓ | ✓ | | | 39.8 | 30.7 | 28.5 | 15.8 | 0.5 | 8.2 |
| ✓ | ✓ | ✓ | ✓ | | 39.7 | 30.1 | 28.2 | 15.7 | 0.4 | 4.6 |
| ✓ | ✓ | ✓ | ✓ | ✓ | 41.1 | 30.9 | 29.4 | 16.4 | 0.4 | 4.6 |
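
The cost and accuracy trade-off between the first row (YOLOv11n baseline) and the last row (all modules enabled) of Table 8 can be summarised with a small calculation. The sketch below only re-derives ratios from the tabulated values; the dictionary and function names are hypothetical.

```python
def relative_change(baseline: float, value: float) -> float:
    """Percentage change of `value` with respect to `baseline`."""
    return (value - baseline) / baseline * 100.0


# First and last rows of Table 8 (precision, recall, parameters, FLOPs).
BASELINE = {"P": 38.6, "R": 29.4, "params_m": 2.6, "flops_g": 6.5}
FULL_MODEL = {"P": 41.1, "R": 30.9, "params_m": 0.4, "flops_g": 4.6}

param_change = round(relative_change(BASELINE["params_m"], FULL_MODEL["params_m"]), 1)
flops_change = round(relative_change(BASELINE["flops_g"], FULL_MODEL["flops_g"]), 1)
precision_gain = round(FULL_MODEL["P"] - BASELINE["P"], 1)
recall_gain = round(FULL_MODEL["R"] - BASELINE["R"], 1)

print(f"Params: {param_change}%  FLOPs: {flops_change}%")
print(f"Precision: +{precision_gain} points  Recall: +{recall_gain} points")
```

On these figures the full model uses roughly 85% fewer parameters and 29% fewer FLOPs than the baseline while gaining 2.5 points of precision and 1.5 points of recall.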
Table 9.
Performance comparison of various detectors on the HIT-UAV dataset.
| Methods | P (%) | R (%) | mAP50 (%) | mAP50:95 (%) | Param (M) | FLOPs (G) | Inference Time (ms) |
|---|---|---|---|---|---|---|---|
| RT-DETR [78] | 91.0 | 89.4 | 93.0 | 58.7 | 41.9 | 125.6 | 12.1 |
| YOLOv5n [68] | 91.2 | 89.2 | 93.1 | 60.4 | 2.5 | 7.1 | 1.9 |
| YOLOv8n [69] | 92.1 | 88.6 | 93.3 | 60.9 | 3.0 | 8.1 | 1.9 |
| YOLOv10n [70] | 90.2 | 87.6 | 93.1 | 60.1 | 2.3 | 6.5 | 2.0 |
| YOLOv11n [55] | 91.1 | 89.3 | 93.3 | 61.0 | 2.6 | 6.5 | 1.6 |
| YOLOv12n [79] | 89.5 | 85.0 | 92.6 | 59.3 | 2.6 | 6.3 | 1.8 |
| YOLOv13n [80] | 90.7 | 88.1 | 93.4 | 59.8 | 2.5 | 6.1 | 1.9 |
| YOLOv26n [71] | 90.4 | 87.0 | 93.0 | 60.0 | 2.4 | 5.2 | 1.7 |
| ITD-YOLOv8 [72] | — | — | 93.5 | — | 1.8 | 6.0 | — |
| G-YOLO [73] | — | — | 91.4 | — | 0.8 | 3.7 | — |
| LRI-YOLO [75] | 90.7 | 89.1 | 94.1 | — | 1.6 | 3.8 | — |
| ELNet [5] | 91.5 | 90.1 | 94.7 | 60.5 | 0.3 | 3.1 | — |
| Ours | 91.7 | 90.3 | 94.5 | 62.0 | 0.4 | 4.6 | 1.6 |
Table 10.
Performance comparison of various detectors on the VisDrone2019 dataset.
| Methods | P (%) | R (%) | mAP50 (%) | mAP50:95 (%) | Param (M) | FLOPs (G) | Inference Time (ms) |
|---|---|---|---|---|---|---|---|
| RT-DETR [78] | 47.2 | 34.9 | 31.6 | 17.7 | 41.9 | 125.7 | 21.3 |
| YOLOv5n [68] | 38.4 | 30.3 | 26.4 | 14.6 | 2.5 | 7.1 | 1.8 |
| YOLOv8n [69] | 38.2 | 29.6 | 26.8 | 14.9 | 3.0 | 8.1 | 1.9 |
| YOLOv10n [70] | 38.7 | 29.9 | 26.9 | 14.8 | 2.7 | 6.5 | 2.1 |
| YOLOv11n [55] | 38.6 | 29.4 | 27.2 | 15.1 | 2.6 | 6.5 | 1.7 |
| YOLOv12n [79] | 39.9 | 28.5 | 27.1 | 15.0 | 2.5 | 6.5 | 1.9 |
| YOLOv13n [80] | 38.8 | 28.3 | 26.9 | 14.9 | 2.5 | 6.3 | 1.9 |
| YOLOv26n [71] | 37.7 | 29.8 | 26.6 | 14.7 | 2.4 | 5.4 | 1.8 |
| YOLOv5n+TDAM [74] | 38.2 | 29.04 | 27.4 | 14.2 | 1.8 | 4.4 | — |
| DLNet [76] | — | — | 26.9 | 14.3 | 1.0 | 1.6 | — |
| Drone-YOLO [77] | — | — | 31 | 17.5 | 3.1 | — | — |
| ELNet [5] | 38.6 | 31.2 | 28.4 | 15.5 | 0.3 | 3.1 | — |
| Ours | 41.1 | 30.9 | 29.4 | 16.4 | 0.4 | 4.6 | 1.8 |
Table 11.
Scale-wise detection results on VisDrone2019.
| Methods | AP-Small (%) | AP-Medium (%) | AP-Large (%) |
|---|---|---|---|
| YOLOv11n [55] | 5.8 | 22.4 | 33.4 |
| YOLOv12n [79] | 5.7 | 22.1 | 33.6 |
| Ours | 7.4 | 24.4 | 32.9 |