Recent Real-Time Aerial Object Detection Approaches, Performance, Optimization, and Efficient Design Trends for Onboard Deployment: A Survey
Abstract
1. Introduction
- Define small objects in aerial datasets and fine-grained datasets.
- Systematically review real-time aerial processing approaches, covering platform-level constraints, performance analyses, general-purpose detectors and their typical applications, and the adaptations required to mitigate aerial detection challenges while maintaining efficiency, with particular focus on real-time studies under limited resources.
- Analyze lightweight design strategies across the detection pipeline, covering backbone, neck, and attention mechanisms, as well as recent work adapting the RT-DETR design for edge deployment.
- Evaluate the performance of recent edge-oriented research.
- Explain additional optimization techniques, such as pruning and quantization, with details on newer methods including quantization-aware training.
- Present emerging compression and hardware-aware optimization methods, including their integration with large vision–language models and multimodal distillation, to enable efficient deployment on UAV and edge devices.
- We identify the key limitations in current real-time aerial object detection research and discuss open challenges, offering insights and future research directions to advance onboard, real-time UAV perception.
2. Datasets and Recent Real-Time Research Applications
| Dataset | Description | Images | Instances/Objects | Classes | Size/Resolution | Annotation Type/Notes | References |
|---|---|---|---|---|---|---|---|
| VisDrone [76,77] | Drone-captured images and videos | 263 video clips with 179,264 frames and additional 10,209 static images | 540k | 10 | 765 × 1360 to 1050 × 1400 | HBB; high proportion of small, occluded and truncated objects | [11,12,33,34,35,36,39,41,44,45,46,47,51,52,57,61,62,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98] |
| DOTA [70,99] | Aerial/satellite images | 2806 images (v1) | 188,282 instances (v1) | 15 | From 800 × 800 to about 4k × 4k | OBB; rotated and multi-scale objects | [14,38,40,82,84,100] |
| UAVDT [101] | UAV-based vehicle detection | 80k images (video) | 841.5k vehicles | 4 | 1080 × 540 | HBB; low-altitude, high-density, small objects | [79,87,88,95,102] |
| CARPK [103] | Parking lot vehicle counting | 1448 images | 89,777 vehicles | 1 | 1280 × 720 | HBB; small-to-medium vehicles | [6,91,104,105] |
| DroneVehicle [106] | RGB-Infrared vehicle detection | 28,439 RGB-Infrared pairs | 953,087 both modalities | 5 Vehicles | 840 × 712 | OBB; multimodal RGB-IR | [40,46,85] |
| NWPU VHR-10 [107] | High-resolution remote sensing | 715 Google Earth images and 85 images from the Vaihingen dataset | 2934 instances | 10 | From 533 × 597 to 1728 × 1028 px | HBB; planes, ships, vehicles | [46,105] |
| DIOR [108] | Optical remote sensing images | 23,463 images | 192,472 instances | 20 | 800 × 800, 0.5–30 m resolution | HBB; large-scale, cross-sensor variety | [14,33,84] |
| UA-DETRAC | Vehicle detection | 10 h of video (140,000 frames) | 1.21 million vehicles | 4 | 960 × 540 | HBB; traffic videos | [45] |
| UCAS-AOD | Aerial vehicle detection | 1510 images | 2500 instances | 2 | 1000 × 1000 | HBB | [84] |
| SODA-A [109] | Small object detection in aerial images | 24,000 images | 338,000 small instances | 10 | 800 × 800 | HBB; tiny objects | [38] |
| SEVE [110] | Small object detection | 17,992 pairs of images and labels | - | 10 | 1920 × 1080 | HBB; Special vehicles in construction sites | [110] |
| FSD | Small-target fire and smoke scenes | 7534 images | N/A | N/A | N/A | HBB; 3 fire-hazard scenarios, 3 non-hazard scenarios | [42] |
| PVD | Photovoltaic point defects (PDs) and line defects (LDs) | 1581 images | 2721 | 1 | N/A | HBB | [79] |
| SAR-SD | Ship detection dataset | 1160 images | 2456 | 1 | 1–15 m resolution | HBB | [111] |
| AU-Air | Low-altitude traffic surveillance | 32,823 images | 132,034 | 8 | 1920 × 1080 | HBB | [6] |
| Traffic-Net [112] | Traffic sign detection | 4400 images | 15,000 signs | 1 | 512 × 512 | HBB | [52] |
| Rail-FOD23 | Railway infrastructure detection | 2000 images | 10,000 instances | 1 | 512 × 512 | HBB | [43] |
| Dead Trees | Forest dead tree detection | 3000 images | 15,000 instances | 1 | 512 × 512 | HBB | [7] |
| MAR20 | Remote-sensing image dataset for fine-grained military aircraft recognition | 3842 images | 22,341 instances | 20 | 800 × 800 | HBB; fine-grained | [50] |
| VEDAI [113] | Vehicle detection in aerial imagery | 1210 images | 3640 vehicles | 9 | 12.5 cm × 12.5 cm per pixel, 1024 × 1024 | HBB; multimodal, visible and near-infrared | [114] |
| UAV Image | UAV-based detection | 10,000–50,000 images | 50,000+ instances | Various | 1024 × 1024 | HBB | [115] |
| PVEL-AD | Photovoltaic panel defect detection | 5000 images | 20,000 defects | 1 | 512 × 512 | HBB; defect-focused | [116] |
| RSOD | Remote sensing small object detection | 886 images | 5000 instances | 4 | 600 × 600 | HBB; small objects | [105] |
| SeaDronesSee [74] | Maritime search and rescue | 54,000 | 400,000 | 6 | 1280 × 960 to 5456 × 3632 | HBB; 91% small objects, fine-grained | [117] |
| AFOs | Maritime search and rescue | 3647 images | 39,991 | 1 | 1280 × 720 to 3840 × 2160 | HBB; small objects in open water | [117] |
- Aerial Object Detection Applications
2.1. Surveillance and General Object Detection
2.2. Environmental Monitoring
2.3. Remote Sensing and High-Resolution Geospatial Object Detection
2.4. Industrial and Public Safety
2.5. Search and Rescue (SAR) Operations
3. Real-Time Processing Platforms
3.1. Cloud Computing
3.2. Edge AI Object Detection
3.3. Embedded-Onboard Object Detection
Field-Programmable Gate Arrays (FPGAs) and Onboard Aerial Object Detection
4. Aerial Real-Time Deep Learning Object Detection Algorithms
4.1. Two-Stage and Single-Stage Detectors
4.1.1. Anchor-Based Methods
4.1.2. Anchor-Free Methods
4.2. Lightweight Networks
4.3. Neck Network
4.4. Attention Modules
4.5. Real-Time Aerial Object Detectors
5. Optimization Methods
5.1. Lightweight Design for Real-Time Aerial Detection
5.1.1. Lightweight Backbone Networks
- A. Convolution-Based Lightweight Designs
- B. Attention-Enhanced Lightweight Modules
- C. Reparameterization and Structural Re-Design
- D. Lightweight Pooling, Downsampling and Feature Fusion Modules
5.1.2. Efficient Neck Networks
- A. Enhancements to Classical FPN Structures for UAV Detection
- B. Attention- and Module-Based Feature Fusion Enhancements
- C. Lightweight Neck Variants for Embedded and UAV Platforms
- D. Detection-Oriented Modifications for Small and Tiny Objects
- E. Specialized Lightweight Integration Modules
5.1.3. Head Optimization Strategies for Small Object Detection
5.1.4. Loss Function
5.1.5. Lightweight Transformers for Real-Time Aerial Detection
5.2. Pruning
5.2.1. Unstructured Pruning
5.2.2. Structured Pruning
5.3. Quantization
5.3.1. Post-Training Quantization
5.3.2. Quantization-Aware Training (QAT)
5.4. Knowledge Distillation
5.5. Neural Architecture Search (NAS) and Real Time
5.5.1. Differentiable (Hardware-Aware) NAS
5.5.2. Direct Hardware-Aware Search
5.5.3. One-Shot/Weight-Sharing and Specialized Deployment Once-for-All (OFA)
5.5.4. Search Efficiency and Evaluation Methods
5.5.5. Other Real-Time Research
| Ref. | Base Model | Backbone | Neck | Head | Loss | Optimization | Performance/Platform (Dataset) |
|---|---|---|---|---|---|---|---|
| [46] | YOLOv7 | The DSDM-LFIM backbone enhances small object detection by combining efficient deep-shallow feature extraction (DSD) with lightweight dual-branch feature fusion (LFI) | Original multi-scale feature fusion | Adds a high-resolution P2 branch to improve small-object detection. Uses K-means to optimize anchors per detection head for the VisDrone dataset | Same as the Base model | Scaling the number of channels by 0.2 | 1.4 M parameters, mAP50 33.4%, 36.6 FPS inference on the Atlas 200I DK A2 edge device (VisDrone) |
| [5] | YOLOv3 | IRFM expands the receptive field using multi-scale dilated convolutions and fuses outputs via learnable weights and shortcut connections for better small object detection | Adaptively Spatial Feature Fusion (ASFF) to enhance multi-scale representation | Anchor optimization integrated into training loop using dynamic anchor generation to maximize IoU | N/A | Deconvolution replaces nearest-neighbor upsampling. Coordinate decoding handled outside the model. Quantized to 8-bit using the uds710 tool. NMS and decoding are implemented in C++ for execution on the NPU. | mAP50 89.7%, 35.7 FPS inference on T710 NPU (Neural Processing Unit) (UAV car custom dataset) |
| [115] | YOLOv5 | LFM: Uniform multi-branch design to reduce computation. ECTB: CMHSA for global feature capture and occlusion handling | Eliminates redundant nodes (e.g., P1tm, P4tm) to simplify the structure. Adds shortcut connections for better feature propagation. Introduces learnable weighted fusion to adaptively emphasize informative features. | The attention prediction head (APH) is designed based on the NAM attention mechanism to improve the ability of the model to extract attention regions in complex scenarios | Same as the base model | N/A | 21.7 M parameters, 33.4 FPS inference on NVIDIA Jetson TX2 (UAV air custom dataset) |
| [50] | YOLOv5n | – Combines ShuffleNet v2 with YOLOv5n. – Introduces a Coordinate Attention (CA) module at the end of the backbone to enhance spatial and orientation information. – Includes a custom CBRM module (Conv, BatchNorm, ReLU, and MaxPool layers) for efficient feature extraction. | Same as Base model | Same as Base model | Replaces CIoU with EIoU to improve bounding box regression and accelerate convergence, especially for small-scale aircraft targets. | N/A | 0.9 M parameters, mAP50 84.8%, 22.6 FPS (pre-/post-processing + inference) on NVIDIA Jetson Xavier NX (MAR20 dataset) |
| [6] | YOLOv7 | – Replaces certain ELAN modules in YOLOv7 with the lightweight G-FasterNet, combining FasterNet and GhostNet to reduce parameters and memory usage. – GhostConv is used in place of standard Conv to preserve feature extraction efficiency with lower computational cost. | – Replaces certain ELAN modules in YOLOv7 with the lightweight G-FasterNet, combining FasterNet and GhostNet to reduce parameters and memory usage. – GhostConv is used in place of standard Conv to preserve feature extraction efficiency with lower computational cost. | SimAM attention in the head; FasterNet blocks use partial convolution (PConv) | Same as Base model | N/A | 13.38 M parameters, mAP50 95.04% on Jetson Nano, Jetson Xavier NX, and NCS2 (WAID dataset) |
| [102] | YOLOv4 | Same as Base model | – E-FPN constructs a 4-level pyramid (F2–F5) for enhanced multiscale feature exchange. – An Enhance Block at the input improves semantic representation by splitting features into low- and high-resolution branches with depthwise conv and CBAM – A Refine Attention module at the output mitigates aliasing effects from repeated fusion, improving detection of small objects across scales. | – PixED Head uses a spatial-channel encoder–decoder with pixel-encode (PE) and pixel-decode (PD) to boost tiny object detection efficiency. – A Feature Extraction Module (FEM) with depthwise conv, pointwise conv, and CBAM refines features. – An auxiliary head (Aux Head) aids sample assignment during training only, adding no inference cost. | – Improved SimOTA label assigner with CIoU and Focal Loss addresses class and aspect-ratio imbalance. PLA loss aligns features between heads. | N/A | 0.7 M parameters, mAP50 22.7%, 103 FPS on NVIDIA Jetson Xavier NX, and 24.3 FPS on Jetson Nano GPU |
| [49] | YOLOv4 | MobileNetV3 replaces the original CSPDarkNet53 feature extraction network | – SPP+PAN+YOLO Head structure of YOLOv4 is still used in neck and head. – A portion of the original 3 × 3 standard convolutions in PANet is replaced with depthwise separable convolutions. This substitution reduces both computational cost and the number of parameters to approximately one-fourth of those in YOLOv4. | MobileNetV3 and self-attention are integrated to enhance feature extraction, while Softer-NMS replaces DIoU-NMS to address the mismatch between classification confidence and localization accuracy. Rather than discarding overlapping boxes, Softer-NMS reduces their confidence scores and predicts localization confidence, resulting in more accurate detections | Softer-NMS replaces DIoU-NMS | N/A | 23.8 FPS on NVIDIA Jetson TX2, 9.6 FPS on Raspberry Pi 4B |
| [48] | YOLOv8 | Uses Rep-ShuffleNet (based on ShuffleNetV2) to improve the original YOLOv8s backbone and adds the lightweight channel attention mechanism ECANet | Same as Base model | Same as Base model | Binary Cross-Entropy (BCE) Loss is used for classification, while box regression combines Distribution Focal Loss (DFL) and CIoU Loss. – To enhance accuracy, CIoU is improved to BIoU Loss, which directly compares the actual aspect ratios of predicted and ground truth boxes, rather than relying on a relative aspect ratio similarity. This shift from approximate to precise comparison improves the model’s localization accuracy. | Building on edge intelligence and federated learning concepts, the FI framework and the multilayer collaborative federated learning (MLC-FL) algorithm for efficient federated learning are introduced. – By using asynchronous communication and low-frequency data exchange, MLC-FL enables local models to optimize automatically and efficiently. This upgrades the traditional coal mine IoVT system into an intelligent, self-learning system. | 7.8 M parameters, mAP50 94.6%, 21.6 FPS on NVIDIA Jetson AGX Xavier (CMUOD, surveillance) |
| [92] | YOLOv8 | The Deformable Separable Convolution Block (DSCBlock) separates feature channels, and a channel weighting module is proposed to calculate weights for the separated feature maps, facilitating information exchange across channels and resolutions and compensating for the effect of point-wise (1 × 1) convolutions. A 3D channel weighting module efficiently extracts features by applying weighting operations along the channel dimension, avoiding the high cost of 1 × 1 convolutions and compensating for the accuracy loss with the efficient feature modeling capability of DCNv2 | The PA-FPN-CSPD framework introduces adaptive sampling and a novel channel weighting module to enhance feature interaction. Instead of using costly pointwise (1 × 1) convolutions, the channel weighting module operates along the third (channel) dimension, enabling efficient filtering of key features while reducing the impact of deformations. To strengthen information exchange across layers, it calculates adaptive weights for separated feature maps. Additionally, the newly designed lightweight network structure, named Cross-Stage Partially Deformable Network (CSPDBlock), built around the DSCBlock, further establishes multidimensional feature correlations, improving the representation and robustness of each layer. | Same as Base model | Same as Base model | N/A | 8.6 M parameters, mAP50 34.2%, 24.7 FPS on Jetson Xavier NX (VisDrone dataset) |
| [51] | YOLOv5n | Optimizes the entire network with DSConv (Distribution Shifting Convolution), which achieves lower memory consumption and higher computational speed; integrates SPDConv, which is sensitive to small targets; and enhances the C3 module with a cross-space learning multi-head self-attention mechanism. | Sparsely Connected Asymptotic Feature Pyramid Network (SCAFPN) introduces a sparse, asymptotic fusion strategy. It starts by merging adjacent low-level features and gradually incorporates higher-level features using an intermediate “Mild” module, which performs upsampling, weighted fusion, and 1 × 1 convolutions. This design limits fusion to neighboring layers, reduces parameter redundancy, and preserves semantic integrity | SimOTA label assignment strategy | Uses the SimOTA label assignment strategy | Employs the NVIDIA TensorRT runtime engine to perform FP16 quantization | 5.18 M parameters, mAP50 34.8%, 35 FPS on NVIDIA Jetson Xavier NX edge device (VisDrone dataset) |
| [11] | YOLOv7 | Same as Base model | Same as Base model | Uses a decoupled regression detection head | Combines Generalized IoU loss for precise localization with balanced cross-entropy losses for objectness and classification to handle class imbalance. It also introduces a Hybrid Random Loss strategy during training to improve the detection of small objects | Combines lossy reduction and lossless reduction (re-parameterization). To enhance small object detection, a scaling-and-stitching data augmentation approach is proposed and the loss function is redesigned to focus more on small objects; FP16 precision with TensorRT | 40.5 M parameters, mAP50 44.8%, inference on NVIDIA Jetson AGX Xavier (VisDrone dataset) |
| [121] | Mask R-CNN-ResNet18, YOLOv8, SSD-MobileNet | Same as base model | Same as base model | Same as base model | Pre-processing with a Sobel–Feldman filter enhances contrast along object boundaries, emphasizes edges for better feature extraction, reduces background noise, and improves the visibility of small objects in complex scenes | N/A | 83.3 FPS for YOLOv8, 47.6 FPS for Mask R-CNN, and 62.5 FPS for SSD-MobileNet on NVIDIA Jetson Nano; 2.56 FPS for YOLOv8, 17.86 FPS for Mask R-CNN, and 2.87 FPS for SSD-MobileNet on Raspberry Pi 4B |
| [34] | YOLOv4 and YOLOv7 | Same as Base model | Same as Base model | Same as Base model | Same as Base model | – A cloud–edge hybrid architecture in which AI tasks are handled locally at the edge, while the cloud is used for data storage, processing, and visualization. – TensorRT accelerator | 38–40 FPS on Jetson Xavier AGX edge (2688 × 1512 resolution) and 8–10 FPS for (3840 × 2160) resolution custom dataset |
| [100] | YOLOv7-Tiny | Same as Base model | Same as Base model | Truncated NMS (Non-Maximum Suppression) | Manhattan Intersection over Union (MIOU) loss | Satellite images are first divided into smaller tiles, and cloud-covered regions are filtered using the PID-Net method. The remaining clear tiles are then processed using a YOLOv7-Tiny model enhanced with MIOU loss to detect remote sensing objects. Finally, the results are mapped back to their original positions in the full image. Uses TensorRT | mAP50 76.9%, 160 FPS with TensorRT FP16 on NVIDIA Jetson AGX Orin; cloud pipeline: 8.3 ms latency, 6.3 ms object detection, 31.6 ms post-processing, 21.6 FPS overall (DOTA dataset) |
| [119] | SSD | MobileNet replaces VGG-16 or ResNet | Same as Base model | Same as Base model | Same as Base model | N/A | mAP50 92.7%, 26 FPS on NVIDIA Jetson Nano, 18 FPS on Raspberry Pi 3B (fire detection) |
| [93] | YOLOv5, YOLOv6 | EfficientRep (RepBlock, RepConv) | Rep-PAN (RepBlock, RepConv) | Same as Base model | Same as Base model | N/A | N/A |
| [125] | YOLOv8 | MaxPooling + Ghost Convolution | PAFPN; CoordBlock includes coordinate attention and CoordConv to enhance features and reduce the loss of spatial information. Additionally, Partial Convolution (PConv) is applied directly before the detection head. | Extra head for small objects | N/A | N/A | Nano variant: 1.2 M parameters, mAP50 39.7%, 56 FPS (FP16, batch = 1) and 147 FPS (FP16, batch = 16) inference on Jetson AGX Xavier (VisDrone) |
| [57] | YOLOv3 | Same as base model | Same as base model | The anchor boxes are resized proportionally to match different input resolutions, ensuring optimal performance for each model configuration. For example, an anchor box of (2, 5) at a 416 × 416 resolution would be adjusted to (4, 10) for an 832 × 832 resolution. | Same as base model | TensorRT inference engine with 16-bit quantization. | less than 0.1 ms latency on NVIDIA Jetson Xavier NX. (VisDrone) |
| [56] | YOLOv3 | Same as base model | Same as base model | Same as base model | Same as base model | – Joint quantization: reduces the precision of weights and activations to lower bit-widths, thereby minimizing memory and computational requirements. – Tiling: splits high-resolution images into smaller tiles to enable processing on limited hardware without sacrificing detection accuracy. | Quantization speeds up the baseline by 1.35× on NVIDIA Jetson TX2 (352 × 352 input) |
| [120] | YOLOv4 | Shallower CSPDarkNet53; parameters are shared between the object detection and semantic segmentation tasks | Same as base model | Adds a segmentation head to the object detector backbone | Same as base model | N/A | NVIDIA Jetson Xavier NX |
| [35] | YOLOv8 | – Integrating the Ghost module and dynamic convolution into the CSP Bottleneck with two convolutions (C2f). – Spatial Pyramid Pooling with Enhanced Local Attention Network (SPPELAN) replaces Spatial Pyramid Pooling Fast (SPPF) to expand the receptive field | Multi-Scale Ghost Convolution (MSGConv) and Multi-Scale Generalized Feature Pyramid Network (MSGFPN). – Triple attention is applied at the end of each information transfer branch to enhance the extraction of small-target information before sending the features to the network head | DyHead enhances detection precision for small targets. By incorporating three self-attention mechanisms into the detection head, DyHead redefines the four-dimensional tensor L×H×W×C as a three-dimensional tensor L×S×C. This approach applies scale-aware, space-aware, and task-aware attention in the L, S, and C dimensions, respectively. | Same as base model | N/A | 2.6 M parameters, mAP50 45.2%, 24.6 FPS on NVIDIA Jetson Orin Nano |
| [118] | YOLOv5 | Same as base model; uses pixel-level and spatial-level input augmentations, along with object mosaic and background fusion. | Same as base model | Same as base model | Same as base model | Uses TensorRT | 24–33 FPS inference on NVIDIA Jetson AGX Xavier (outfall dataset) |
| [214] | RT-DETR | Partition Split Spatial Attention (PSA) replaces global self-attention with local ROI attention using high/low-frequency decomposition. | Bidirectional Dynamic Feature Fusion Pyramid Network (BDFPN) adds multi-scale bidirectional fusion with learnable dynamic weights and supplies denser supervision for small targets | Same as base model | Remains RT-DETR-like but uses Inner-MPDIoU instead of L1 + GIoU (with a small-object penalty). The MPDIoU loss combines three components: an IoU-based overlap term, a minimum point–distance metric, and a normalization factor. | Same as base model | 10.29 M parameters, mAP50 53%, 68 FPS inference on NVIDIA Jetson AGX Xavier (VisDrone) |
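Several rows above re-estimate anchors for aerial data, either via K-means over the training boxes ([46], [5]) or by rescaling anchors with the input resolution ([57]). The snippet below is a minimal sketch of the K-means anchor-clustering idea using an IoU-style similarity; the function names and the synthetic box statistics are illustrative assumptions and are not taken from the cited implementations.

```python
import numpy as np

def iou_wh(boxes, anchors):
    """IoU between (w, h) pairs, assuming all boxes share the same top-left corner."""
    inter = np.minimum(boxes[:, None, 0], anchors[None, :, 0]) * \
            np.minimum(boxes[:, None, 1], anchors[None, :, 1])
    union = boxes[:, 0:1] * boxes[:, 1:2] + anchors[:, 0] * anchors[:, 1] - inter
    return inter / union

def kmeans_anchors(boxes_wh, k=9, iters=100, seed=0):
    """Cluster ground-truth (width, height) pairs with 1 - IoU as the distance."""
    rng = np.random.default_rng(seed)
    anchors = boxes_wh[rng.choice(len(boxes_wh), size=k, replace=False)]
    for _ in range(iters):
        assign = np.argmax(iou_wh(boxes_wh, anchors), axis=1)        # best-matching anchor per box
        new = np.array([boxes_wh[assign == i].mean(axis=0) if np.any(assign == i) else anchors[i]
                        for i in range(k)])
        if np.allclose(new, anchors):
            break
        anchors = new
    return anchors[np.argsort(anchors.prod(axis=1))]                 # sorted by area, small to large

# Synthetic small-object widths/heights in pixels (illustrative stand-in for dataset statistics)
rng = np.random.default_rng(1)
boxes = np.abs(rng.normal(loc=[30.0, 20.0], scale=[15.0, 10.0], size=(5000, 2))) + 2.0
print(np.round(kmeans_anchors(boxes, k=9), 1))
```

The resulting anchors would then be assigned per detection head (small anchors to high-resolution heads), which is the pattern the table entries describe for aerial datasets dominated by tiny objects.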
5.6. Leveraging CLIP Embeddings for Aerial Real-Time Visual Recognition
CLIP Optimization Techniques and Related Studies
- Knowledge distillation and efficient backbone replacement. One approach is to train a smaller “student” model using knowledge from a larger pre-trained CLIP “teacher.” Techniques such as CLIP Knowledge Distillation (CLIP-KD) [232] show that guiding the student to replicate the teacher’s feature embeddings (feature-level mimicry) is particularly effective; a minimal sketch of this idea follows this list. This allows the student model to maintain strong cross-modal alignment for zero-shot tasks while requiring far fewer resources. CLIP’s original design uses large Vision Transformers or ResNets, which are computationally heavy. By distilling knowledge into a student model, it is possible to adopt a lighter backbone, such as MobileViT or compact convolutional networks [232]. This flexibility is critical for aerial deployment, as smaller backbones drastically reduce computational cost while preserving the model’s ability to align images and text.
- Quantization and pruning. Further efficiency can be achieved by reducing the precision of model weights (quantization) and removing unnecessary parameters (pruning). For instance, TernaryCLIP [233] is a computationally efficient framework that represents the weights of both the vision and text encoders in ternary format rather than full-precision floating point, while applying distillation-aware training to maintain performance (a simplified ternarization sketch also follows this list). Combining pruning with quantization forms a multi-stage compression pipeline, yielding a compact, fast, and energy-efficient model capable of real-time operation.
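As referenced in the distillation item above, the following is a minimal sketch of feature-level mimicry in the spirit of CLIP-KD [232]: a lightweight student image encoder is trained to reproduce a frozen CLIP teacher’s image embeddings. The module names, toy encoders, and cosine-distance objective are illustrative assumptions; the published recipe combines several distillation objectives.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureMimicryKD(nn.Module):
    """Align a lightweight student's image embeddings with a frozen CLIP-style teacher's embeddings."""
    def __init__(self, student_encoder, teacher_encoder, student_dim=256, teacher_dim=512):
        super().__init__()
        self.student = student_encoder                      # e.g., a MobileViT-style backbone (assumed)
        self.teacher = teacher_encoder.eval()               # frozen teacher image encoder
        for p in self.teacher.parameters():
            p.requires_grad_(False)
        self.proj = nn.Linear(student_dim, teacher_dim)     # map student features into teacher space

    def forward(self, images):
        with torch.no_grad():
            t = F.normalize(self.teacher(images), dim=-1)   # teacher embeddings (no gradient)
        s = F.normalize(self.proj(self.student(images)), dim=-1)
        # Feature-level mimicry: pull student embeddings toward the teacher's embeddings.
        return (1.0 - (s * t).sum(dim=-1)).mean()           # cosine-distance loss

# Usage sketch with toy encoders standing in for real backbones (assumptions):
student = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 256))
teacher = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 512))
kd = FeatureMimicryKD(student, teacher)
loss = kd(torch.randn(4, 3, 224, 224))
loss.backward()
```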
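The quantization item above can likewise be illustrated with a simplified ternary weight mapping in the spirit of TernaryCLIP [233]: each weight tensor is approximated by a per-tensor scale times values in {-1, 0, +1}. The threshold rule and the sparsity_factor below are illustrative assumptions; the published framework additionally relies on distillation-aware training and pruning.

```python
import torch

def ternarize(w: torch.Tensor, sparsity_factor: float = 0.7):
    """Map a weight tensor to alpha * {-1, 0, +1} using a simple magnitude threshold (assumed rule)."""
    delta = sparsity_factor * w.abs().mean()                   # threshold below which weights become zero
    mask = (w.abs() > delta).float()                           # keep only sufficiently large weights
    alpha = (w.abs() * mask).sum() / mask.sum().clamp(min=1)   # per-tensor scale
    return alpha * torch.sign(w) * mask                        # ternary approximation of w

# Example: quantize one linear layer's weights and measure the approximation error.
layer = torch.nn.Linear(512, 512)
w_q = ternarize(layer.weight.data)
print("unique levels:", torch.unique(torch.sign(w_q)).tolist())
print("relative error:", (layer.weight.data - w_q).norm() / layer.weight.data.norm())
```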
5.7. Integrating NAS Methods with LLM Models for Edge Devices
6. Limitations, Open Challenges and Future Directions in Onboard Real-Time Aerial Object Detection
6.1. Limitations of the Current Research
6.2. Open Challenges
- Small-Object Detection and Fine-Grained Discrimination. Small objects in aerial imagery often lose essential visual details due to high-altitude imaging, leading to blurred edges, weak textures, and low-resolution feature maps. As a result, detectors struggle to capture fine-grained cues and maintain high recall.
- Multiscale Variation and Object Orientation. Aerial scenes exhibit extreme scale variation and arbitrary object orientations, in which tiny targets lose texture during downsampling and large objects dominate feature learning. At the same time, rotated objects frequently misalign with horizontal bounding boxes, reducing localization accuracy. These factors jointly complicate feature extraction, multiscale fusion, and stable regression, making real-time onboard detection particularly challenging for lightweight models.
- Loss Function Limitations for Tiny Objects. Conventional detection losses (e.g., IoU-based and focal variants) tend to favor large objects, produce weak or unstable gradients for tiny targets, and often fail to assign reliable positives to small boxes. As a result, small objects are frequently overlooked or poorly localized (a small numerical illustration follows this list).
- Transformer complexity and limited edge adaptation. Transformers provide strong global context but are computationally expensive, particularly for multi-scale features and high-resolution inputs. Only a limited number of works have redesigned attention mechanisms or used hybrid CNN–Transformer backbones that are efficient enough for real-time UAV deployment.
- Lack of unified compression and adaptive modeling strategies. Quantization, pruning, NAS, and CLIP-based distillation are often applied individually and post hoc. Unified compression-first pipelines, such as quantization during training rather than after it, are rarely investigated.
- High-resolution and dense-scene processing
- The multimodal dataset gap and onboard multimodal detection remain open challenges.
- Hardware and energy constraints on UAV processors. UAV onboard processors, such as Jetson Nano/Orin, NPUs, and embedded GPUs, face strict power limits (5–15 W) and limited memory bandwidth and cache sizes. Achieving real-time detection at ≥30 FPS under these constraints requires hardware-aware design, which remains insufficiently explored in many publications.
- Limited aerial datasets for fine-grained and multimodal detection, and the limited suitability of general-purpose datasets such as COCO for aerial detection.
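To make the tiny-object loss issue above concrete, the short snippet below (an illustrative calculation, not tied to any specific detector) applies the same 2-pixel localization error to a 12 × 12 box and a 120 × 120 box and compares the resulting IoU values.

```python
def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

for size in (12, 120):
    gt = (0, 0, size, size)
    pred = (2, 2, size + 2, size + 2)       # the same 2-pixel shift at both scales
    print(f"{size}x{size} box, 2 px shift -> IoU = {iou(gt, pred):.3f}")
# The 2 px shift drops IoU to about 0.53 for the 12x12 box but only to about 0.94 for the 120x120 box,
# which is why IoU-based losses penalize tiny objects disproportionately and yield unstable gradients.
```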
6.3. Future Directions
- Lightweight convolutional backbones. Backbones will continue moving toward operators such as depthwise convolution, Ghost modules, PConv, channel-split and shuffle, SPDConv and DSPConv (a minimal sketch of two such operators follows this list).
- Efficient multiscale and attention mechanisms. Developing more efficient attention modules (ECA, Coordinate Attention, NAM, SimAM) and lightweight multiscale fusion blocks will play a growing role in addressing dense and tiny objects. These modules preserve spatial–channel interactions with minimal computational overhead (an ECA sketch also follows this list).
- Lightweight and transformer detectors. Recent advances in Transformer-based object detection show a clear shift toward architectures that blend CNNs with ViTs to balance fine-grained edge information and global context, incorporate modules that retain crucial high-frequency details through techniques such as wavelet transforms or frequency-domain processing, streamline the decoding stage with lighter attention operations, and introduce gradual multi-scale feature fusion strategies to minimize semantic gaps while keeping computation manageable. The rise of lightweight and edge-oriented Vision Transformers (e.g., AUHF-DETR) is helping narrow the accuracy–efficiency gap between CNN-based and attention-based models, and ongoing advances in structured pruning and quantization-aware training are expected to reduce this gap even further.
- Improved training-time optimization. Methods such as SimOTA assignment, NMS-free training, balanced sample selection, and convergence-aware scheduling will become standard, as they enhance recall and robustness without adding to inference cost.
- Structured pruning is an effective approach for removing redundant layers, filters, or channels without significantly degrading performance. In addition, compact transformer modules can be integrated into lightweight backbones to further strengthen global feature modeling.
- Anchor-free detectors such as YOLOv8 and YOLOv11 are favored for aerial scenarios due to their strong small-object sensitivity, but practical onboard deployment typically relies on their lightweight variants, which often need further optimization through pruning and quantization to meet real-time constraints.
- Knowledge distillation from large models. Distillation from large vision or multimodal models transfers semantic richness into smaller models, improving robustness and fine-grained discrimination while keeping inference lightweight. CLIP-KD, for example, produces a more efficient model than the original CLIP.
- Hardware-aware NAS integrated with compression. Neural Architecture Search will evolve toward pipelines that simultaneously account for quantization, pruning, operator fusion, and memory access patterns. This produces architectures that are inherently optimized for real-time edge execution rather than compressed post hoc.
- Advanced hardware-aware deployment. Quantization-aware training, structured and channel-level pruning, and TensorRT/NPU acceleration will see broader adoption; together they reduce latency, model size, and energy consumption for onboard UAV processors.
- Cloud–edge tiling for high-resolution processing. High-resolution aerial and remote-sensing imagery will increasingly use tiling pipelines that split large frames into parallel patches. This enables cloud-side acceleration while the edge device maintains fast decision-making for navigation and safety-critical tasks.
- Federated edge learning for adaptive UAV perception. UAV systems will benefit from federated learning, updating models locally and sharing only compressed gradients. This reduces bandwidth usage, improves privacy, and allows adaptation to new environments without transmitting sensitive aerial data.
- Semantic enhancement via vision–language models. Integrating CLIP-based visual semantic embeddings and compressed multimodal backbones will improve cross-scene generalization and fine-grained recognition, even with limited aerial labels or domain shifts, but will need to be combined with other compression techniques.
- Unified compression-first model design. Combine NAS, pruning, quantization, and reparameterization from the start, producing architectures that naturally satisfy latency, memory, and power constraints rather than requiring heavy post-processing.
- One-shot and weight-sharing architecture exploration. Future UAV detectors will increasingly rely on one-shot NAS and weight-sharing search spaces to rapidly evaluate large numbers of lightweight architectures. This avoids full training for each candidate and enables hardware-specific optimization for Jetson, NPU, or FPGA platforms.
- Integrating CLIP embeddings with NAS and optimization (quantization or pruning) helps reduce model size, with the potential to improve performance.
- Super-resolution branches. Although not yet widely adopted, using a lightweight auxiliary super-resolution (SR) branch during training is emerging as a promising strategy for aerial small-object detection. The idea is to temporarily enhance the high-resolution texture and edge information lost during downsampling, without increasing inference cost, because the SR branch is removed during deployment (as in SuperYOLO); a minimal sketch follows this list. This approach improves feature quality for tiny objects while keeping the model efficient for onboard UAV hardware.
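For the lightweight-backbone item above, the block below is a minimal sketch of two representative operators: a depthwise-separable convolution and a Ghost-style convolution that produces half of its output channels with a cheap depthwise operation. Channel counts and kernel sizes are illustrative; the modules follow the general MobileNet/GhostNet pattern rather than any specific detector’s implementation.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """3x3 depthwise convolution followed by a 1x1 pointwise convolution (MobileNet-style)."""
    def __init__(self, c_in, c_out, stride=1):
        super().__init__()
        self.dw = nn.Conv2d(c_in, c_in, 3, stride, 1, groups=c_in, bias=False)
        self.pw = nn.Conv2d(c_in, c_out, 1, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.pw(self.dw(x))))

class GhostConv(nn.Module):
    """Produce half the channels with a normal conv and the rest with a cheap depthwise conv."""
    def __init__(self, c_in, c_out, k=1):
        super().__init__()
        c_half = c_out // 2
        self.primary = nn.Conv2d(c_in, c_half, k, 1, k // 2, bias=False)
        self.cheap = nn.Conv2d(c_half, c_half, 5, 1, 2, groups=c_half, bias=False)  # "ghost" features

    def forward(self, x):
        y = self.primary(x)
        return torch.cat([y, self.cheap(y)], dim=1)

x = torch.randn(1, 32, 160, 160)
print(DepthwiseSeparableConv(32, 64)(x).shape, GhostConv(32, 64)(x).shape)
```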
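For the attention item above, the following is a minimal sketch of ECA-style channel attention: a 1-D convolution over globally pooled channel descriptors replaces the fully connected layers of SE-style attention, keeping the overhead to a handful of parameters. The adaptive kernel-size rule follows the commonly cited ECA heuristic but should be treated as an assumption here.

```python
import math
import torch
import torch.nn as nn

class ECA(nn.Module):
    """Efficient Channel Attention: global average pool + 1-D conv across channels + sigmoid gate."""
    def __init__(self, channels, gamma=2, b=1):
        super().__init__()
        t = int(abs((math.log2(channels) + b) / gamma))
        k = t if t % 2 else t + 1                        # kernel size must be odd
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)

    def forward(self, x):                                # x: (B, C, H, W)
        y = x.mean(dim=(2, 3))                           # (B, C) channel descriptors
        y = self.conv(y.unsqueeze(1)).squeeze(1)         # local cross-channel interaction
        return x * torch.sigmoid(y)[:, :, None, None]    # re-weight channels

x = torch.randn(2, 64, 40, 40)
print(ECA(64)(x).shape)  # torch.Size([2, 64, 40, 40])
```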
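Finally, for the super-resolution item, a minimal sketch of a training-only auxiliary SR branch in the spirit of SuperYOLO [16]: a small decoder reconstructs the input-resolution image from downsampled backbone features, its reconstruction loss is added during training, and the branch is discarded at inference. The stand-in backbone, layer sizes, and loss weight are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AuxSRBranch(nn.Module):
    """Training-only branch that reconstructs the image from low-resolution backbone features."""
    def __init__(self, c_feat, scale=8):
        super().__init__()
        self.decode = nn.Sequential(
            nn.Conv2d(c_feat, 64, 3, padding=1), nn.SiLU(),
            nn.Upsample(scale_factor=scale, mode="bilinear", align_corners=False),
            nn.Conv2d(64, 3, 3, padding=1),
        )

    def forward(self, feat, hr_image):
        return F.l1_loss(self.decode(feat), hr_image)    # supervises features only during training

# Training-step sketch; the backbone and detection loss are placeholders (assumptions).
backbone = nn.Conv2d(3, 128, 3, stride=8, padding=1)     # stand-in for a real feature extractor
sr_branch = AuxSRBranch(c_feat=128, scale=8)
img = torch.randn(2, 3, 256, 256)
feat = backbone(img)
det_loss = feat.abs().mean()                             # placeholder for the real detection loss
loss = det_loss + 0.1 * sr_branch(feat, img)             # SR term is dropped at deployment time
loss.backward()
```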
7. Conclusions
Author Contributions
Funding
Conflicts of Interest
References
- Nex, F.; Armenakis, C.; Cramer, M.; Cucci, D.A.; Gerke, M.; Honkavaara, E.; Kukko, A.; Persello, C.; Skaloud, J. UAV in the advent of the twenties: Where we stand and what is next. ISPRS J. Photogramm. Remote Sens. 2022, 184, 215–242. [Google Scholar] [CrossRef]
- Aposporis, P. Object detection methods for improving UAV autonomy and remote sensing applications. In Proceedings of the 2020 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), The Hague, The Netherlands, 7–10 December 2020; pp. 845–853. [Google Scholar] [CrossRef]
- Revathi, A.; Madhuvanthi, T.; Addakula, Y.; Maurya, A.; Ravi, V. Obstacle Detection in Path Planning for Unmanned Aerial Vehicles based on YOLO. In Proceedings of the 2024 International Conference on Communication, Computing and Internet of Things (IC3IoT), Chennai, India, 17–18 April 2024; pp. 1–6. [Google Scholar] [CrossRef]
- Chang, B.R.; Tsai, H.F.; Lyu, J.L. Drone-aided path planning for unmanned ground vehicle rapid traversing obstacle area. Electronics 2022, 11, 1228. [Google Scholar] [CrossRef]
- Liu, M.; Tang, L.; Li, Z. Real-Time Object Detection in UAV Vision Based on Neural Processing Units. In Proceedings of the 2022 IEEE 6th Information Technology and Mechatronics Engineering Conference (ITOEC), Chongqing, China, 4–6 March 2022; Volume 6, pp. 1951–1955. [Google Scholar] [CrossRef]
- Mou, C.; Zhu, C.; Liu, T.; Cui, X. A Novel Efficient Wildlife Detecting Method with Lightweight Deployment on UAVs Based on YOLOv7. IET Image Process. 2024, 18, 1296–1314. [Google Scholar] [CrossRef]
- Wang, X.; Zhao, Q.; Jiang, P.; Zheng, Y.; Yuan, L.; Yuan, P. LDS-YOLO: A Lightweight Small Object Detection Method for Dead Trees from Shelter Forest. Comput. Electron. Agric. 2022, 198, 107035. [Google Scholar] [CrossRef]
- Luo, J.; Chang, K.; Huang, J.; Sun, X.; Ji, Y. A UAV aerial image small object detection algorithm based on fine-grained feature preservation and multi-scale feature pyramid balancing. Complex Intell. Syst. 2026, 12, 12. [Google Scholar] [CrossRef]
- Tang, G.; Ni, J.; Zhao, Y.; Gu, Y.; Cao, W. A Survey of Object Detection for UAVs Based on Deep Learning. Remote Sens. 2023, 16, 149. [Google Scholar] [CrossRef]
- Nikouei, M.; Baroutian, B.; Nabavi, S.; Taraghi, F.; Aghaei, A.; Sajedi, A.; Moghaddam, M.E. Small object detection: A comprehensive survey on challenges, techniques and real-world applications. Intell. Syst. Appl. 2025, 27, 200561. [Google Scholar] [CrossRef]
- Liu, S.; Zha, J.; Sun, J.; Li, Z.; Wang, G. EdgeYOLO: An edge-real-time object detector. In Proceedings of the 2023 42nd Chinese Control Conference (CCC), Tianjin, China, 24–26 July 2023; pp. 7507–7512. [Google Scholar] [CrossRef]
- Qu, J.; Tang, Z.; Zhang, L.; Zhang, Y.; Zhang, Z. Remote sensing small object detection network based on attention mechanism and multi-scale feature fusion. Remote Sens. 2023, 15, 2728. [Google Scholar] [CrossRef]
- Liu, S.; Zhu, M.; Tao, R.; Ren, H. Fine-grained Feature Perception for Unmanned Aerial Vehicle Target Detection Algorithm. Drones 2024, 8, 181. [Google Scholar] [CrossRef]
- Xu, Z.; Wang, X.; Huang, K.; Chen, R. Low resolution remote sensing object detection with fine grained enhancement and swin transformer. Sci. Rep. 2025, 15, 24183. [Google Scholar] [CrossRef]
- Liu, Z.; Gao, G.; Sun, L.; Fang, Z. HRDNet: High-resolution detection network for small objects. arXiv 2020, arXiv:2006.07607. [Google Scholar] [CrossRef]
- Zhang, J.; Lei, J.; Xie, W.; Fang, Z.; Li, Y.; Du, Q. SuperYOLO: Super resolution assisted object detection in multimodal remote sensing imagery. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5605415. [Google Scholar] [CrossRef]
- Bai, C.; Zhang, K.; Jin, H.; Qian, P.; Zhai, R.; Lu, K. SFFEF-YOLO: Small object detection network based on fine-grained feature extraction and fusion for unmanned aerial images. Image Vis. Comput. 2025, 156, 105469. [Google Scholar] [CrossRef]
- Ding, J.; Xue, N.; Xia, G.S.; Bai, X.; Yang, W.; Yang, M.Y.; Belongie, S.; Luo, J.; Datcu, M.; Pelillo, M.; et al. Object Detection in Aerial Images: A Large-Scale Benchmark and Challenges. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 7778–7796. [Google Scholar] [CrossRef] [PubMed]
- Muzammul, M.; Li, X. Comprehensive Review of Deep Learning-Based Tiny Object Detection: Challenges, Strategies, and Future Directions. Knowl. Inf. Syst. 2025, 67, 3825–3913. [Google Scholar] [CrossRef]
- Yu, X.; Lin, M.; Lu, J.; Ou, L. Oriented object detection in aerial images based on area ratio of parallelogram. J. Appl. Remote Sens. 2022, 16, 034510. [Google Scholar] [CrossRef]
- Li, Z.; Hou, B.; Wu, Z.; Ren, B.; Yang, C. FCOSR: A Simple Anchor-Free Rotated Detector for Aerial Object Detection. Remote Sens. 2023, 15, 5499. [Google Scholar] [CrossRef]
- Wang, Z.; Bao, C.; Cao, J.; Hao, Q. AOGC: Anchor-free oriented object detection based on Gaussian centerness. Remote Sens. 2023, 15, 4690. [Google Scholar] [CrossRef]
- Dong, Y.; Xie, X.; An, Z.; Qu, Z.; Miao, L.; Zhou, Z. NMS-free oriented object detection based on channel expansion and dynamic label assignment in UAV aerial images. Remote Sens. 2023, 15, 5079. [Google Scholar] [CrossRef]
- Liang, S.; Wang, W.; Chen, R.; Liu, A.; Wu, B.; Chang, E.C.; Cao, X.; Tao, D. Object Detectors in the Open Environment: Challenges, Solutions, and Outlook. arXiv 2024, arXiv:2403.16271. [Google Scholar] [CrossRef]
- Wang, A.; Sun, Y.; Kortylewski, A.; Yuille, A.L. Robust Object Detection under Occlusion with Context-Aware CompositionalNets. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 12645–12654. [Google Scholar] [CrossRef]
- Xu, X.; Chen, Z.; Zhang, X.; Wang, G. Context-aware content interaction: Grasp subtle clues for fine-grained aircraft detection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5641319. [Google Scholar] [CrossRef]
- Fonseca, J.; Douzas, G.; Bacao, F. Improving imbalanced land cover classification with k-means smote: Detecting and oversampling distinctive minority spectral signatures. Information 2021, 12, 266. [Google Scholar] [CrossRef]
- Cui, Y.; Jia, M.; Lin, T.Y.; Song, Y.; Belongie, S. Class-balanced loss based on effective number of samples. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 9268–9277. [Google Scholar] [CrossRef]
- An, Y.; Sun, Y.; Li, Y.; Yang, Y.; Yu, J.; Zhu, Z. Sample imbalance remote sensing small target detection based on discriminative feature learning and imbalanced feature semantic enrichment. Expert Syst. Appl. 2025, 273, 126753. [Google Scholar] [CrossRef]
- Zhao, B.; Wu, Y.; Guan, X.; Gao, L.; Zhang, B. An improved aggregated-mosaic method for the sparse object detection of remote sensing imagery. Remote Sens. 2021, 13, 2602. [Google Scholar] [CrossRef]
- Hong, S.; Kang, S.; Cho, D. Patch-level augmentation for object detection in aerial images. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), Seoul, Republic of Korea, 27–28 October 2019; pp. 127–134. [Google Scholar] [CrossRef]
- Alsharabi, N. Real-time object detection overview: Advancements, challenges, and applications. J. Amran Univ. 2023, 3, 12. [Google Scholar] [CrossRef]
- Hu, M.; Li, Z.; Yu, J.; Wan, X.; Tan, H.; Lin, Z. Efficient-Lightweight YOLO: Improving Small Object Detection in YOLO for Aerial Images. Sensors 2023, 23, 6423. [Google Scholar] [CrossRef]
- Koubaa, A.; Ammar, A.; Abdelkader, M.; Alhabashi, Y.; Ghouti, L. AERO: AI-Enabled Remote Sensing Observation with Onboard Edge Computing in UAVs. Remote Sens. 2023, 15, 1873. [Google Scholar] [CrossRef]
- Xiao, L.; Li, W.; Yao, S.; Liu, H.; Ren, D. High-Precision and Lightweight Small-Target Detection Algorithm for Low-Cost Edge Intelligence. Sci. Rep. 2024, 14, 23542. [Google Scholar] [CrossRef]
- Liu, C.; Yang, D.; Tang, L.; Zhou, X.; Deng, Y. A Lightweight Object Detector Based on Spatial-Coordinate Self-Attention for UAV Aerial Images. Remote Sens. 2022, 15, 83. [Google Scholar] [CrossRef]
- Liu, S.; Cao, L.; Li, Y. Lightweight Pedestrian Detection Network for UAV Remote Sensing Images Based on Strideless Pooling. Remote Sens. 2024, 16, 2331. [Google Scholar] [CrossRef]
- Xie, S.; Zhou, M.; Wang, C.; Huang, S. CSPPartial-YOLO: A Lightweight YOLO-Based Method for Typical Objects Detection in Remote Sensing Images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2023, 17, 388–399. [Google Scholar] [CrossRef]
- Cao, J.; Bao, W.; Shang, H.; Yuan, M.; Cheng, Q. GCL-YOLO: A GhostConv-Based Lightweight YOLO Network for UAV Small Object Detection. Remote Sens. 2023, 15, 4932. [Google Scholar] [CrossRef]
- Yu, C.; Jiang, X.; Wu, F.; Fu, Y.; Pei, J.; Zhang, Y.; Li, X.; Fu, T. A Multi-Scale Feature Fusion Based Lightweight Vehicle Target Detection Network on Aerial Optical Images. Remote Sens. 2024, 16, 3637. [Google Scholar] [CrossRef]
- Chen, H.; Wang, D.; Fang, J.; Li, Y.; Xu, S.; Xu, Z. LW-YOLOv8: An Lightweight Object Detection Algorithm for UAV Aerial Imagery. In Proceedings of the 2024 6th International Conference on Natural Language Processing (ICNLP), Xi’an, China, 22–24 March 2024; pp. 446–450. [Google Scholar]
- He, Y.; Sahma, A.; He, X.; Wu, R.; Zhang, R. FireNet: A Lightweight and Efficient Multi-Scenario Fire Object Detector. Remote Sens. 2024, 16, 4112. [Google Scholar] [CrossRef]
- Xiang, Y.; Du, C.; Mei, Y.; Zhang, L.; Du, Y.; Liu, A. BN-YOLO: A Lightweight Method for Bird’s Nest Detection on Transmission Lines. J. Real Time Image Process. 2024, 21, 194. [Google Scholar] [CrossRef]
- Yue, M.; Zhang, L.; Huang, J.; Zhang, H. Lightweight and Efficient Tiny-Object Detection Based on Improved YOLOv8n for UAV Aerial Images. Drones 2024, 8, 276. [Google Scholar] [CrossRef]
- Dong, C.; Jiang, X.; Hu, Y.; Du, Y.; Pan, L. EL-Net: An Efficient and Lightweight Optimized Network for Object Detection in Remote Sensing Images. Expert Syst. Appl. 2024, 255, 124661. [Google Scholar] [CrossRef]
- Xiao, Y.; Di, N. SOD-YOLO: A Lightweight Small Object Detection Framework. Sci. Rep. 2024, 14, 25624. [Google Scholar] [CrossRef]
- Chen, N.; Li, Y.; Yang, Z.; Lu, Z.; Wang, S.; Wang, J. LODNU: Lightweight Object Detection Network in UAV Vision. J. Supercomput. 2023, 79, 10117–10138. [Google Scholar] [CrossRef]
- Wu, J.; Zheng, R.; Jiang, J.; Tian, Z.; Chen, W.; Wang, Z.; Yu, F.R.; Leung, V.C. A Lightweight Small Object Detection Method Based on Multilayer Coordination Federated Intelligence for Coal Mine IOVT. IEEE Internet Things J. 2024, 11, 20072–20087. [Google Scholar] [CrossRef]
- Liu, J.; Hu, C.; Zhou, J.; Ding, W. Object Detection Algorithm Based on Lightweight YOLOv4 for UAV. In Proceedings of the 2022 7th International Conference on Intelligent Computing and Signal Processing (ICSP), Xi’an, China, 15–17 April 2022; pp. 425–429. [Google Scholar] [CrossRef]
- Wang, J.; Bai, Z.; Zhang, X.; Qiu, Y. A Lightweight Remote Sensing Aircraft Target Detection Network Based on Improved YOLOv5n. In Proceedings of the 2023 9th International Conference on Computer and Communications (ICCC), Chengdu, China, 8–11 December 2023; pp. 1678–1683. [Google Scholar] [CrossRef]
- Xue, C.; Xia, Y.; Wu, M.; Chen, Z.; Cheng, F.; Yun, L. EL-YOLO: An Efficient and Lightweight Low-Altitude Aerial Objects Detector for Onboard Applications. Expert Syst. Appl. 2024, 256, 124848. [Google Scholar] [CrossRef]
- Liu, Z.; Chen, C.; Huang, Z.; Chang, Y.C.; Liu, L.; Pei, Q. A Low-Cost and Lightweight Real-Time Object-Detection Method Based on UAV Remote Sensing in Transportation Systems. Remote Sens. 2024, 16, 3712. [Google Scholar] [CrossRef]
- Vandersteegen, M.; Van Beeck, K.; Goedemé, T. Super Accurate Low Latency Object Detection on a Surveillance UAV. In Proceedings of the 2019 16th International Conference on Machine Vision Applications (MVA), Tokyo, Japan, 27–31 May 2019; pp. 1–6. [Google Scholar] [CrossRef]
- Golcarenarenji, G.; Martinez-Alpiste, I.; Wang, Q.; Alcaraz-Calero, J.M. Efficient Real-Time Human Detection Using Unmanned Aerial Vehicles Optical Imagery. Int. J. Remote Sens. 2021, 42, 2440–2462. [Google Scholar] [CrossRef]
- Nascimento, M.G.d.; Fawcett, R.; Prisacariu, V.A. DSConv: Efficient Convolution Operator. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 5148–5157. [Google Scholar] [CrossRef]
- Plastiras, G.; Siddiqui, S.; Kyrkou, C.; Theocharides, T. Efficient Embedded Deep Neural-Network-Based Object Detection via Joint Quantization and Tiling. In Proceedings of the 2020 2nd IEEE International Conference on Artificial Intelligence Circuits and Systems (AICAS), Genova, Italy, 31 August–2 September 2020; pp. 6–10. [Google Scholar] [CrossRef]
- Suo, J.; Zhang, X.; Shi, W.; Zhou, W. E3-UAV: An Edge-Based Energy-Efficient Object Detection System for Unmanned Aerial Vehicles. IEEE Internet Things J. 2023, 11, 4398–4413. [Google Scholar] [CrossRef]
- Fang, G.; Ma, X.; Song, M.; Mi, M.B.; Wang, X. DepGraph: Towards Any Structural Pruning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 16091–16101. [Google Scholar] [CrossRef]
- Guan, T.; Zhang, C. Mixed Pruning Method for Vehicle Detection. In Proceedings of the 2020 IEEE 11th International Conference on Software Engineering and Service Science (ICSESS), Beijing, China, 16–18 October 2020; pp. 232–235. [Google Scholar] [CrossRef]
- Ko, H.; Kang, J.K.; Kim, Y. An Efficient and Fast Filter Pruning Method for Object Detection in Embedded Systems. In Proceedings of the 2024 IEEE 6th International Conference on AI Circuits and Systems (AICAS), Abu Dhabi, UAE, 22–25 April 2024; pp. 204–207. [Google Scholar] [CrossRef]
- Fan, Q.; Li, Y.; Deveci, M.; Zhong, K.; Kadry, S. LUD-YOLO: A Novel Lightweight Object Detection Network for Unmanned Aerial Vehicle. Inf. Sci. 2025, 686, 121366. [Google Scholar] [CrossRef]
- Ning, T.; Wu, W.; Zhang, J. Small Object Detection Based on YOLOv8 in UAV Perspective. Pattern Anal. Appl. 2024, 27, 103. [Google Scholar] [CrossRef]
- Jobaer, S.; Tang, X.s.; Zhang, Y.; Li, G.; Ahmed, F. A Novel Knowledge Distillation Framework for Enhancing Small Object Detection in Blurry Environments with Unmanned Aerial Vehicle-Assisted Images. Complex Intell. Syst. 2025, 11, 63. [Google Scholar] [CrossRef]
- Li, Z.; Xu, P.; Chang, X.; Yang, L.; Zhang, Y.; Yao, L.; Chen, X. When Object Detection Meets Knowledge Distillation: A Survey. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 10555–10579. [Google Scholar] [CrossRef]
- Zou, Z.; Chen, K.; Shi, Z.; Guo, Y.; Ye, J. Object Detection in 20 Years: A Survey. Proc. IEEE 2023, 111, 257–276. [Google Scholar] [CrossRef]
- Pagire, V.; Chavali, M.; Kale, A. A Comprehensive Review of Object Detection with Traditional and Deep Learning Methods. Signal Process. 2025, 237, 110075. [Google Scholar] [CrossRef]
- Hua, W.; Chen, Q. A Survey of Small Object Detection Based on Deep Learning in Aerial Images. Artif. Intell. Rev. 2025, 58, 162. [Google Scholar] [CrossRef]
- Wen, L.; Cheng, Y.; Fang, Y.; Li, X. A Comprehensive Survey of Oriented Object Detection in Remote Sensing Images. Expert Syst. Appl. 2023, 224, 119960. [Google Scholar] [CrossRef]
- Hozhabr, S.H.; Giorgi, R. A Survey on Real-Time Object Detection on FPGAs. IEEE Access 2025, 13, 38195–38238. [Google Scholar] [CrossRef]
- Xia, G.S.; Bai, X.; Ding, J.; Zhu, Z.; Belongie, S.; Luo, J.; Datcu, M.; Pelillo, M.; Zhang, L. DOTA: A Large-Scale Dataset for Object Detection in Aerial Images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 3974–3983. [Google Scholar] [CrossRef]
- Zhao, D.; Shao, F.; Liu, Q.; Yang, L.; Zhang, H.; Zhang, Z. A small object detection method for drone-captured images based on improved YOLOv7. Remote Sens. 2024, 16, 1002. [Google Scholar] [CrossRef]
- Kisantal, M.; Wojna, Z.; Murawski, J.; Naruniec, J.; Cho, K. Augmentation for small object detection. arXiv 2019, arXiv:1902.07296. [Google Scholar] [CrossRef]
- Wenqi, Y.; Gong, C.; Meijun, W.; Yanqing, Y.; Xingxing, X.; Xiwen, Y.; Junwei, H. MAR20: A benchmark for military aircraft recognition in remote sensing images. Natl. Remote Sens. Bull. 2024, 27, 2688–2696. [Google Scholar] [CrossRef]
- Varga, L.A.; Kiefer, B.; Messmer, M.; Zell, A. SeaDronesSee: A Maritime Benchmark for Detecting Humans in Open Water. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 3–8 January 2022; pp. 2260–2270. [Google Scholar] [CrossRef]
- Partheepan, S.; Sanati, F.; Hassan, J. Autonomous unmanned aerial vehicles in bushfire management: Challenges and opportunities. Drones 2023, 7, 47. [Google Scholar] [CrossRef]
- Du, D.; Zhu, P.; Wen, L.; Bian, X.; Lin, H.; Hu, Q.; Peng, T.; Zheng, J.; Wang, X.; Zhang, Y.; et al. VisDrone-DET2019: The vision meets drone object detection in image challenge results. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), Seoul, Republic of Korea, 27–28 October 2019; pp. 213–226. [Google Scholar] [CrossRef]
- Cao, Y.; He, Z.; Wang, L.; Wang, W.; Yuan, Y.; Zhang, D.; Zhang, J.; Zhu, P.; Van Gool, L.; Han, J.; et al. VisDrone-DET2021: The Vision Meets Drone Object Detection Challenge Results. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 2847–2854. [Google Scholar] [CrossRef]
- Chen, H.; Yang, W.; Zhou, G.; Zhang, G.; Nian, Z. MFRENet: Efficient Detection of Drone Image Based on Multiscale Feature Aggregation and Receptive Field Expanded. Pattern Anal. Appl. 2024, 27, 120. [Google Scholar] [CrossRef]
- Jiang, L.; Yuan, B.; Du, J.; Chen, B.; Xie, H.; Tian, J.; Yuan, Z. MFFSODNet: Multiscale feature fusion small object detection network for UAV aerial images. IEEE Trans. Instrum. Meas. 2024, 73, 5015214. [Google Scholar] [CrossRef]
- Muzammul, M.; Algarni, A.; Ghadi, Y.Y.; Assam, M. Enhancing UAV Aerial Image Analysis: Integrating Advanced SAHI Techniques with Real-Time Detection Models on the VisDrone Dataset. IEEE Access 2024, 12, 21621–21633. [Google Scholar] [CrossRef]
- Xue, H.; Wang, X.; Xia, Y.; Tang, Z.; Li, L.; Wang, L. Enhanced YOLOv8 for Small Object Detection in UAV Aerial Photography: YOLO-UAV. In Proceedings of the 2024 International Joint Conference on Neural Networks (IJCNN), Yokohama, Japan, 30 June–5 July 2024; pp. 1–8. [Google Scholar] [CrossRef]
- Ni, J.; Zhu, S.; Tang, G.; Ke, C.; Wang, T. A Small-Object Detection Model Based on Improved YOLOv8s for UAV Image Scenarios. Remote Sens. 2024, 16, 2465. [Google Scholar] [CrossRef]
- Niu, K.; Yan, Y. A Small-Object-Detection Model Based on Improved YOLOv8 for UAV Aerial Images. In Proceedings of the 2023 2nd International Conference on Artificial Intelligence and Intelligent Information Processing (AIIIP), Hangzhou, China, 27–29 October 2023; pp. 57–60. [Google Scholar] [CrossRef]
- Wu, S.; Lu, X.; Guo, C.; Guo, H. Accurate UAV Small Object Detection Based on HRFPN and EfficentVMamba. Sensors 2024, 24, 4966. [Google Scholar] [CrossRef] [PubMed]
- Ren, L.; Lei, H.; Li, Z.; Yang, W. AF-DETR: Efficient UAV small object detector via Assemble-and-Fusion mechanism. Pattern Anal. Appl. 2024, 27, 135. [Google Scholar] [CrossRef]
- Dong, D.; Li, J.; Liu, H.; Deng, L.; Gu, J.; Liu, L.; Li, S. EA-YOLO: An Efficient and Accurate UAV Image Object Detection Algorithm. IEEJ Trans. Electr. Electron. Eng. 2025, 20, 61–68. [Google Scholar] [CrossRef]
- Sang, M.; Tian, S.; Yu, L.; Wang, G.; Peng, Y. Environmentally adaptive fast object detection in UAV images. Image Vis. Comput. 2024, 148, 105103. [Google Scholar] [CrossRef]
- Xu, H.; Zheng, W.; Liu, F.; Li, P.; Wang, R. Unmanned aerial vehicle perspective small target recognition algorithm based on improved YOLOv5. Remote Sens. 2023, 15, 3583. [Google Scholar] [CrossRef]
- Qi, S.; Song, X.; Shang, T.; Hu, X.; Han, K. MSFE-YOLO: An improved YOLOv8 network for object detection on drone view. IEEE Geosci. Remote Sens. Lett. 2024, 21, 6013605. [Google Scholar] [CrossRef]
- Li, Y.; Fan, Q.; Huang, H.; Han, Z.; Gu, Q. A modified YOLOv8 detection network for UAV aerial image recognition. Drones 2023, 7, 304. [Google Scholar] [CrossRef]
- Hamzenejadi, M.H.; Mohseni, H. Real-time vehicle detection and classification in UAV imagery using improved YOLOv5. In Proceedings of the 2022 12th International Conference on Computer and Knowledge Engineering (ICCKE), Mashhad, Iran, 17–18 November 2022; pp. 231–236. [Google Scholar] [CrossRef]
- Hua, W.; Chen, Q.; Chen, W. A new lightweight network for efficient UAV object detection. Sci. Rep. 2024, 14, 13288. [Google Scholar] [CrossRef]
- Parkavi, A.; Alex, S.A.; Pushpalatha, M.; Shukla, P.K.; Pandey, A.; Sharma, S. Drone-based intelligent system for social distancing compliance using YOLOv5 and YOLOv6 with euclidean distance metric. SN Comput. Sci. 2024, 5, 972. [Google Scholar] [CrossRef]
- Wang, G.; Chen, Y.; An, P.; Hong, H.; Hu, J.; Huang, T. UAV-YOLOv8: A small-object-detection model based on improved YOLOv8 for UAV aerial photography scenarios. Sensors 2023, 23, 7190. [Google Scholar] [CrossRef]
- Xiao, Y.; Xu, T.; Xin, Y.; Li, J. FBRT-YOLO: Faster and Better for Real-Time Aerial Image Detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 25 February–4 March 2025; Volume 39, pp. 8673–8681. [Google Scholar] [CrossRef]
- Ben Rouighi, I.; Chtioui, H.; Jegham, I.; Alouani, I.; Ben Khalifa, A. FRD-YOLO: A faster real-time object detector for aerial imagery. J. Real Time Image Process. 2025, 22, 169. [Google Scholar] [CrossRef]
- Weng, S.; Wang, H.; Wang, J.; Xu, C.; Zhang, E. YOLO-SRMX: A Lightweight Model for Real-Time Object Detection on Unmanned Aerial Vehicles. Remote Sens. 2025, 17, 2313. [Google Scholar] [CrossRef]
- Song, Z.; Zhang, Y.; Abd Al Rahman, M. Ednet: Edge-optimized small target detection in UAV imagery—faster context attention, better feature fusion, and hardware acceleration. In Proceedings of the 2024 IEEE Smart World Congress (SWC), Denarau Island, Fiji, 2–7 December 2024; pp. 829–838. Available online: https://arxiv.org/html/2501.05885v1.
- Berner, C.; Brockman, G.; Chan, B.; Cheung, V.; Dębiak, P.; Dennison, C.; Farhi, D.; Fischer, Q.; Hashme, S.; Hesse, C.; et al. Dota 2 with Large Scale Deep Reinforcement Learning. arXiv 2019, arXiv:1912.06680. [Google Scholar] [CrossRef]
- Shen, Y.; Liu, D.; Chen, J.; Wang, Z.; Wang, Z.; Zhang, Q. On-board multi-class geospatial object detection based on convolutional neural network for High Resolution Remote Sensing Images. Remote Sens. 2023, 15, 3963. [Google Scholar] [CrossRef]
- Du, D.; Qi, Y.; Yu, H.; Yang, Y.; Duan, K.; Li, G.; Zhang, W.; Huang, Q.; Tian, Q. The Unmanned Aerial Vehicle Benchmark: Object Detection and Tracking. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 370–386. [Google Scholar]
- Min, X.; Zhou, W.; Hu, R.; Wu, Y.; Pang, Y.; Yi, J. Lwuavdet: A lightweight UAV object detection network on edge devices. IEEE Internet Things J. 2024, 11, 24013–24023. [Google Scholar] [CrossRef]
- Hsieh, M.R.; Lin, Y.L.; Hsu, W.H. Drone-Based Object Counting by Spatially Regularized Regional Proposal Network. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 4145–4153. [Google Scholar] [CrossRef]
- Wu, Z.; Wang, X.; Jia, M.; Liu, M.; Sun, C.; Wu, C.; Wang, J. Dense object detection methods in RAW UAV imagery based on YOLOv8. Sci. Rep. 2024, 14, 18019. [Google Scholar] [CrossRef]
- Yi, H.; Liu, B.; Zhao, B.; Liu, E. Small object detection algorithm based on improved YOLOv8 for remote sensing. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2023, 17, 1734–1747. [Google Scholar] [CrossRef]
- Sun, Y.; Cao, B.; Zhu, P.; Hu, Q. Drone-Based RGB-Infrared Cross-Modality Vehicle Detection via Uncertainty-Aware Learning. IEEE Trans. Circuits Syst. Video Technol. 2022, 32, 6700–6713. [Google Scholar] [CrossRef]
- Cheng, G.; Zhou, P.; Han, J. Learning Rotation-Invariant Convolutional Neural Networks for Object Detection in VHR Optical Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2016, 54, 7405–7415. [Google Scholar] [CrossRef]
- Li, K.; Wan, G.; Cheng, G.; Meng, L.; Han, J. Object Detection in Optical Remote Sensing Images: A Survey and a New Benchmark. ISPRS J. Photogramm. Remote Sens. 2020, 159, 296–307. [Google Scholar] [CrossRef]
- Cheng, G.; Yuan, X.; Yao, X.; Yan, K.; Zeng, Q.; Xie, X.; Han, J. Towards Large-Scale Small Object Detection: Survey and Benchmarks. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 13467–13488. [Google Scholar] [CrossRef]
- Qiu, Z.; Bai, H.; Chen, T. Special vehicle detection from UAV perspective via YOLO-GNS based deep learning network. Drones 2023, 7, 117. [Google Scholar] [CrossRef]
- Gautam, V.; Prasad, S.; Sinha, S. Joint-YODNet: A light-weight object detector for UAVs to achieve above 100 fps. In Computer Vision and Image Processing (Proceedings of the 8th International Conference, CVIP 2023), Chandigarh, India, 16–18 December 2023; Communications in Computer and Information Science; Springer: Cham, Switzerland, 2024; Volume 2010, pp. 567–578. [Google Scholar] [CrossRef]
- Cao, F.; Chen, S.; Zhong, J.; Gao, Y. Traffic Condition Classification Model Based on Traffic-Net. Comput. Intell. Neurosci. 2023, 2023, 7812276. [Google Scholar] [CrossRef] [PubMed]
- Razakarivony, S.; Jurie, F. Vehicle detection in aerial imagery: A small target detection benchmark. J. Vis. Commun. Image Represent. 2016, 34, 187–203. [Google Scholar] [CrossRef]
- Zeng, H.; Li, J.; Qu, L. Lightweight low-altitude UAV object detection based on improved YOLOv5s. Int. J. Adv. Netw. Monit. Control. 2024, 9, 87–99. [Google Scholar] [CrossRef]
- Ye, T.; Qin, W.; Zhao, Z.; Gao, X.; Deng, X.; Ouyang, Y. Real-time object detection network in UAV-vision based on CNN and transformer. IEEE Trans. Instrum. Meas. 2023, 72, 2505713. [Google Scholar] [CrossRef]
- Saeed, F.; Aldera, S.; Al-Shamma’a, A.A.; Farh, H.M.H. Rapid adaptation in photovoltaic defect detection: Integrating CLIP with YOLOv8n for efficient learning. Energy Rep. 2024, 12, 5383–5395. [Google Scholar] [CrossRef]
- Ma, S.; Zhang, Y.; Peng, L.; Sun, C.; Ding, L.; Zhu, Y. OWRT-DETR: A novel real-time transformer network for small object detection in open water search and rescue from UAV aerial imagery. IEEE Trans. Geosci. Remote Sens. 2025, 63, 4205313. [Google Scholar] [CrossRef]
- Xu, H.; Huang, Q.; Yang, Y.; Li, J.; Chen, X.; Han, W.; Wang, L. UAV-ODS: A real-time outfall detection system based on UAV remote sensing and edge computing. In Proceedings of the 2022 IEEE International Conference on Unmanned Systems (ICUS), Guangzhou, China, 28–30 October 2022; pp. 1–9. [Google Scholar] [CrossRef]
- Nguyen, A.; Nguyen, H.; Tran, V.; Pham, H.X.; Pestana, J. A visual real-time fire detection using single shot multibox detector for UAV-based fire surveillance. In Proceedings of the 2020 IEEE Eighth International Conference on Communications and Electronics (ICCE), Phu Quoc Island, Vietnam, 13–15 January 2021; pp. 338–343. [Google Scholar] [CrossRef]
- Antonakakis, M.; Tzavaras, A.; Tsakos, K.; Spanakis, E.G.; Sakkalis, V.; Zervakis, M.; Petrakis, E.G. Real-time object detection using an ultra-high-resolution camera on embedded systems. In Proceedings of the 2022 IEEE International Conference on Imaging Systems and Techniques (IST), Kaohsiung, Taiwan, 21–23 June 2022; pp. 1–6. [Google Scholar] [CrossRef]
- Vasishta, M.S.; Amireddy, A.T.R.; Shrivastava, P.; Raghuram, S.; Talasila, V.; Shastry, P.N. Small Object Detection for UAVs Using Deep Learning Models on Edge Computing: A Comparative Analysis. In Proceedings of the 2024 5th International Conference on Circuits, Control, Communication and Computing (I4C), Bangalore, India, 4–5 October 2024; pp. 106–112. [Google Scholar] [CrossRef]
- Lyu, M.; Zhao, Y.; Huang, C.; Huang, H. Unmanned aerial vehicles for search and rescue: A survey. Remote Sens. 2023, 15, 3266. [Google Scholar] [CrossRef]
- Li, Y.; Li, Q.; Pan, J.; Zhou, Y.; Zhu, H.; Wei, H.; Liu, C. SOD-YOLO: Small-object-detection algorithm based on improved YOLOv8 for UAV images. Remote Sens. 2024, 16, 3057. [Google Scholar] [CrossRef]
- Yuan, Y.; Gao, S.; Zhang, Z.; Wang, W.; Xu, Z.; Liu, Z. Edge-cloud collaborative UAV object detection: Edge-embedded lightweight algorithm design and task offloading using fuzzy neural network. IEEE Trans. Cloud Comput. 2024, 12, 306–318. [Google Scholar] [CrossRef]
- Nghiem, V.Q.; Nguyen, H.H.; Hoang, M.S. LEAF-YOLO: Lightweight Edge-Real-Time Small Object Detection on Aerial Imagery. Intell. Syst. Appl. 2025, 25, 200484. [Google Scholar] [CrossRef]
- Guo, S.; Zhao, C.; Wang, G.; Yang, J.; Yang, S. Ec2detect: Real-time online video object detection in edge-cloud collaborative IoT. IEEE Internet Things J. 2022, 9, 20382–20392. [Google Scholar] [CrossRef]
- Ganesh, P.; Chen, Y.; Yang, Y.; Chen, D.; Winslett, M. YOLO-ReT: Towards high accuracy real-time object detection on edge GPUs. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2022; pp. 3267–3277. [Google Scholar] [CrossRef]
- Guo, Y.; Tong, X.; Xu, X.; Liu, S.; Feng, Y.; Xie, H. An anchor-free network with density map and attention mechanism for multiscale object detection in aerial images. IEEE Geosci. Remote Sens. Lett. 2022, 19, 6516705. [Google Scholar] [CrossRef]
- Khan, W.Z.; Ahmed, E.; Hakak, S.; Yaqoob, I.; Ahmed, A. Edge computing: A survey. Future Gener. Comput. Syst. 2019, 97, 219–235. [Google Scholar] [CrossRef]
- Rey, L.; Bernardos, A.M.; Dobrzycki, A.D.; Carramiñana, D.; Bergesio, L.; Besada, J.A.; Casar, J.R. A performance analysis of You Only Look Once models for deployment on constrained computational edge devices in drone applications. Electronics 2025, 14, 638. [Google Scholar] [CrossRef]
- Imran, H.A.; Mujahid, U.; Wazir, S.; Latif, U.; Mehmood, K. Embedded development boards for edge-AI: A comprehensive report. arXiv 2020, arXiv:2009.00803. [Google Scholar] [CrossRef]
- Chen, N.; Chen, Y. Anomalous vehicle recognition in smart urban traffic monitoring as an edge service. Future Internet 2022, 14, 54. [Google Scholar] [CrossRef]
- Ren, X.; Sun, M.; Zhang, X.; Liu, L.; Zhou, H.; Ren, X. An improved mask-RCNN algorithm for UAV TIR video stream target detection. Int. J. Appl. Earth Obs. Geoinf. 2022, 106, 102660. [Google Scholar] [CrossRef]
- Shin, D.J.; Kim, J.J. A deep learning framework performance evaluation to use YOLO in Nvidia Jetson platform. Appl. Sci. 2022, 12, 3734. [Google Scholar] [CrossRef]
- Menshchikov, A.; Shadrin, D.; Prutyanov, V.; Lopatkin, D.; Sosnin, S.; Tsykunov, E.; Iakovlev, E.; Somov, A. Real-time detection of hogweed: UAV platform empowered by deep learning. IEEE Trans. Comput. 2021, 70, 1175–1188. [Google Scholar] [CrossRef]
- Balamuralidhar, N.; Tilon, S.; Nex, F. MultEYE: Monitoring system for real-time vehicle detection, tracking and speed estimation from UAV imagery on edge-computing platforms. Remote Sens. 2021, 13, 573. [Google Scholar] [CrossRef]
- Xiong, Y.; Liu, H.; Gupta, S.; Akin, B.; Bender, G.; Wang, Y.; Kindermans, P.J.; Tan, M.; Singh, V.; Chen, B. MobileDets: Searching for Object Detection Architectures for Mobile Accelerators. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Online, 19–25 June 2021; pp. 3825–3834. [Google Scholar]
- Hossain, S.; Lee, D.J. Deep learning-based real-time multiple-object detection and tracking from aerial imagery via a flying robot with GPU-based embedded devices. Sensors 2019, 19, 3371. [Google Scholar] [CrossRef] [PubMed]
- Nguyen, H.H.; Tran, D.N.N.; Jeon, J.W. Towards real-time vehicle detection on edge devices with nvidia jetson tx2. In Proceedings of the 2020 IEEE International Conference on Consumer Electronics-Asia (ICCE-Asia), Seoul, Republic of Korea, 1–3 November 2020; pp. 1–4. [Google Scholar] [CrossRef]
- Tijtgat, N.; Van Ranst, W.; Goedeme, T.; Volckaert, B.; De Turck, F. Embedded real-time object detection for a UAV warning system. In Proceedings of the IEEE International Conference on Computer Vision Workshops, Venice, Italy, 22–29 October 2017; pp. 2110–2118. [Google Scholar] [CrossRef]
- Chen, C.; Min, H.; Peng, Y.; Yang, Y.; Wang, Z. An intelligent real-time object detection system on drones. Appl. Sci. 2022, 12, 10227. [Google Scholar] [CrossRef]
- Sarkar, D.; Gunturi, S.K. Online health status monitoring of high voltage insulators using deep learning model. Vis. Comput. 2022, 38, 4457–4468. [Google Scholar] [CrossRef]
- Sali, S.M.; Meribout, M.; Majeed, A.A. Real Time FPGA Based CNNs for Detection, Classification, and Tracking in Autonomous Systems: State of the Art Designs and Optimizations. arXiv 2025, arXiv:2509.04153. [Google Scholar] [CrossRef]
- Carranza-García, M.; Torres-Mateo, J.; Lara-Benítez, P.; García-Gutiérrez, J. On the performance of one-stage and two-stage object detectors in autonomous vehicles using camera data. Remote Sens. 2020, 13, 89. [Google Scholar] [CrossRef]
- Zhou, X.; Wang, D.; Krähenbühl, P. Objects as points. arXiv 2019, arXiv:1904.07850. [Google Scholar] [CrossRef]
- Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. YOLOv4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934. [Google Scholar] [CrossRef]
- Chen, W.; Luo, J.; Zhang, F.; Tian, Z. A review of object detection: Datasets, performance evaluation, architecture, applications and current trends. Multimed. Tools Appl. 2024, 83, 65603–65661. [Google Scholar] [CrossRef]
- Tian, Z.; Shen, C.; Chen, H.; He, T. FCOS: A simple and strong anchor-free object detector. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 44, 1922–1933. [Google Scholar] [CrossRef] [PubMed]
- Yang, X.; Yan, J.; Feng, Z.; He, T. R3det: Refined single-stage detector with feature refinement for rotating object. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtually, 2–9 February 2021; Volume 35, pp. 3163–3171. [Google Scholar] [CrossRef]
- Wang, P.; Niu, Y.; Wang, J.; Ma, F.; Zhang, C. Arbitrarily oriented dense object detection based on center point network in remote sensing images. Remote Sens. 2022, 14, 1536. [Google Scholar] [CrossRef]
- Yang, X.; Yang, J.; Yan, J.; Zhang, Y.; Zhang, T.; Guo, Z.; Sun, X.; Fu, K. Scrdet: Towards more robust detection for small, cluttered and rotated objects. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 8232–8241. [Google Scholar] [CrossRef]
- Cheng, G.; Wang, J.; Li, K.; Xie, X.; Lang, C.; Yao, Y.; Han, J. Anchor-free oriented proposal generator for object detection. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5625411. [Google Scholar] [CrossRef]
- Yang, Z.; Liu, S.; Hu, H.; Wang, L.; Lin, S. Reppoints: Point set representation for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9657–9666. [Google Scholar] [CrossRef]
- Liu, J.; Li, S.; Zhou, C.; Cao, X.; Gao, Y.; Wang, B. SRAF-Net: A scene-relevant anchor-free object detection network in remote sensing images. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5405914. [Google Scholar] [CrossRef]
- Law, H.; Deng, J. Cornernet: Detecting objects as paired keypoints. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 734–750. [Google Scholar]
- Sun, Z.; Lin, M.; Sun, X.; Tan, Z.; Li, H.; Jin, R. MAE-DET: Revisiting maximum entropy principle in zero-shot NAS for efficient object detection. arXiv 2021, arXiv:2111.13336. [Google Scholar] [CrossRef]
- Iandola, F.N.; Han, S.; Moskewicz, M.W.; Ashraf, K.; Dally, W.J.; Keutzer, K. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size. arXiv 2016, arXiv:1602.07360. [Google Scholar] [CrossRef]
- Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv 2017, arXiv:1704.04861. [Google Scholar] [CrossRef]
- Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.C. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4510–4520. [Google Scholar] [CrossRef]
- Zhang, X.; Zhou, X.; Lin, M.; Sun, J. ShuffleNet: An extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6848–6856. [Google Scholar] [CrossRef]
- Ma, N.; Zhang, X.; Zheng, H.T.; Sun, J. ShuffleNet v2: Practical guidelines for efficient CNN architecture design. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 116–131. [Google Scholar]
- Gholami, A.; Kwon, K.; Wu, B.; Tai, Z.; Yue, X.; Jin, P.; Zhao, S.; Keutzer, K. SqueezeNext: Hardware-Aware Neural Network Design. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Salt Lake City, UT, USA, 18–22 June 2018; pp. 1638–1647. [Google Scholar] [CrossRef]
- Howard, A.; Sandler, M.; Chu, G.; Chen, L.C.; Chen, B.; Tan, M.; Wang, W.; Zhu, Y.; Pang, R.; Vasudevan, V.; et al. Searching for MobileNetV3. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1314–1324. [Google Scholar] [CrossRef]
- Han, K.; Wang, Y.; Tian, Q.; Guo, J.; Xu, C.; Xu, C. GhostNet: More features from cheap operations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 1580–1589. [Google Scholar] [CrossRef]
- Chen, J.; Kao, S.H.; He, H.; Zhuo, W.; Wen, S.; Lee, C.H.; Chan, S.H.G. Run, don't walk: Chasing higher FLOPS for faster neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 12021–12031. [Google Scholar] [CrossRef]
- Tang, Y.; Han, K.; Guo, J.; Xu, C.; Xu, C.; Wang, Y. GhostNetv2: Enhance cheap operation with long-range attention. Adv. Neural Inf. Process. Syst. 2022, 35, 9969–9982. [Google Scholar]
- Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar]
- Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path aggregation network for instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8759–8768. [Google Scholar] [CrossRef]
- Tan, M.; Pang, R.; Le, Q.V. EfficientDet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10781–10790. [Google Scholar] [CrossRef]
- Ghiasi, G.; Lin, T.Y.; Le, Q.V. NAS-FPN: Learning scalable feature pyramid architecture for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 7036–7045. [Google Scholar] [CrossRef]
- Chen, Q.; Su, X.; Zhang, X.; Wang, J.; Chen, J.; Shen, Y.; Han, C.; Chen, Z.; Xu, W.; Li, F.; et al. LW-DETR: A transformer replacement to YOLO for real-time detection. arXiv 2024, arXiv:2406.03459. [Google Scholar] [CrossRef]
- Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132–7141. [Google Scholar]
- Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
- Park, J.; Woo, S.; Lee, J.Y.; Kweon, I.S. BAM: Bottleneck attention module. arXiv 2018, arXiv:1807.06514. [Google Scholar] [CrossRef]
- Cao, Y.; Xu, J.; Lin, S.; Wei, F.; Hu, H. GCNet: Non-local networks meet squeeze-excitation networks and beyond. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, Seoul, Republic of Korea, 27–28 October 2019; pp. 1971–1980. [Google Scholar] [CrossRef]
- Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient channel attention for deep convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11534–11542. [Google Scholar] [CrossRef]
- Li, X.; Hu, X.; Yang, J. Spatial group-wise enhance: Improving semantic feature learning in convolutional networks. arXiv 2019, arXiv:1905.09646. [Google Scholar] [CrossRef]
- Hou, Q.; Zhou, D.; Feng, J. Coordinate attention for efficient mobile network design. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 13713–13722. [Google Scholar] [CrossRef]
- Liu, Y.; Shao, Z.; Teng, Y.; Hoffmann, N. NAM: Normalization-based attention module. arXiv 2021, arXiv:2111.12419. [Google Scholar] [CrossRef]
- Misra, D.; Nalamada, T.; Arasanipalai, A.U.; Hou, Q. Rotate to attend: Convolutional triplet attention module. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2021; pp. 3139–3148. [Google Scholar] [CrossRef]
- Zhang, Q.L.; Yang, Y.B. SA-Net: Shuffle attention for deep convolutional neural networks. In Proceedings of the ICASSP 2021–2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 2235–2239. [Google Scholar] [CrossRef]
- Yang, L.; Zhang, R.Y.; Li, L.; Xie, X. SimAM: A simple, parameter-free attention module for convolutional neural networks. In Proceedings of the International Conference on Machine Learning, Online, 18–24 July 2021; pp. 11863–11874. [Google Scholar]
- Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 779–788. [Google Scholar] [CrossRef]
- Redmon, J.; Farhadi, A. YOLO9000: Better, Faster, Stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7263–7271. [Google Scholar] [CrossRef]
- Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar] [CrossRef]
- Jocher, G. Ultralytics YOLOv5, Release v6.1. GitHub Repository, 2022. Available online: https://github.com/ultralytics/yolov5/releases/tag/v6.1 (accessed on 25 October 2025).
- Ultralytics. YOLOv5 vs YOLOv8: Evolution of Real Time Object Detection. 2023. Available online: https://docs.ultralytics.com/compare/yolov5-vs-yolov8/ (accessed on 29 November 2025).
- Ge, Z.; Liu, S.; Wang, F.; Li, Z.; Sun, J. YOLOX: Exceeding YOLO Series in 2021. arXiv 2021, arXiv:2107.08430. [Google Scholar] [CrossRef]
- Li, C.; Li, L.; Jiang, H.; Weng, K.; Geng, Y.; Li, L.; Ke, Z.; Li, Q.; Cheng, M.; Nie, W.; et al. YOLOv6: A single-stage object detection framework for industrial applications. arXiv 2022, arXiv:2209.02976. [Google Scholar] [CrossRef]
- Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 7464–7475. [Google Scholar] [CrossRef]
- Jocher, G. Ultralytics YOLOv8. GitHub Repository. 2023. Available online: https://docs.ultralytics.com/models/yolov8/ (accessed on 25 October 2025).
- Wang, C.Y.; Yeh, I.H.; Liao, H.Y.M. YOLOv9: Learning what you want to learn using programmable gradient information. In Computer Vision – ECCV 2024, Proceedings of the 18th European Conference, Milan, Italy, 29 September–4 October 2024; Springer: Cham, Switzerland, 2024; pp. 1–21. [Google Scholar] [CrossRef]
- Ultralytics. YOLOv10 vs. YOLOv9: A Comprehensive Technical Comparison. 2024. Available online: https://docs.ultralytics.com/compare/yolov10-vs-yolov9/ (accessed on 29 November 2025).
- Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J.; Ding, G. YOLOv10: Real-time end-to-end object detection. Adv. Neural Inf. Process. Syst. 2024, 37, 107984–108011. [Google Scholar] [CrossRef]
- Ultralytics. Ultralytics YOLO: GitHub Repository. GitHub. 2025. Available online: https://github.com/ultralytics/ultralytics (accessed on 25 October 2025).
- Tian, Y.; Ye, Q.; Doermann, D. YOLOv12: Attention-centric real-time object detectors. arXiv 2025, arXiv:2502.12524. [Google Scholar]
- Vina, A. YOLO12 Explained: Real-World Applications and Use Cases. 2025. Available online: https://www.ultralytics.com/blog/yolo12-explained-real-world-applications-and-use-cases (accessed on 29 November 2025).
- Wang, C.; He, W.; Nie, Y.; Guo, J.; Liu, C.; Han, K.; Wang, Y. Gold YOLO: Efficient Object Detector via Gather and Distribute Mechanism. In Proceedings of the Neural Information Processing Systems (NeurIPS), New Orleans, LA, USA, 10–16 December 2023. [Google Scholar]
- Alif, M.A.R.; Hussain, M. YOLOv1 to YOLOv10: A comprehensive review of YOLO variants and their application in the agricultural domain. arXiv 2024, arXiv:2406.10139. [Google Scholar] [CrossRef]
- He, Z.; Cao, L. SOD-YOLO: Small Object Detection Network for UAV Aerial Images. IEEJ Trans. Electr. Electron. Eng. 2025, 20, 431–439. [Google Scholar] [CrossRef]
- Bakirci, M. Performance evaluation of low-power and lightweight object detectors for real-time monitoring in resource-constrained drone systems. Eng. Appl. Artif. Intell. 2025, 159, 111775. [Google Scholar] [CrossRef]
- Pudasaini, D.; Abhari, A. Scalable object detection, tracking and pattern recognition model using edge computing. In Proceedings of the 2020 Spring Simulation Conference (SpringSim), Fairfax, VA, USA, 18–21 May 2020; pp. 1–11. [Google Scholar] [CrossRef]
- Du, X.; Lin, T.Y.; Jin, P.; Ghiasi, G.; Tan, M.; Cui, Y.; Le, Q.V.; Song, X. Spinenet: Learning scale-permuted backbone for recognition and localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11592–11601. [Google Scholar] [CrossRef]
- Tan, M.; Le, Q. EfficientNet: Rethinking model scaling for convolutional neural networks. In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; pp. 6105–6114. [Google Scholar]
- Liu, F.; Chen, R.; Zhang, J.; Xing, K.; Liu, H.; Qin, J. R2YOLOX: A lightweight refined anchor-free rotated detector for object detection in aerial images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5632715. [Google Scholar] [CrossRef]
- Lyu, Z.; Yu, T.; Pan, F.; Zhang, Y.; Luo, J.; Zhang, D.; Chen, Y.; Zhang, B.; Li, G. A survey of model compression strategies for object detection. Multimed. Tools Appl. 2024, 83, 48165–48236. [Google Scholar] [CrossRef]
- Han, S.; Jiang, X.; Wu, Z. An improved YOLOv5 algorithm for wood defect detection based on attention. IEEE Access 2023, 11, 71800–71810. [Google Scholar] [CrossRef]
- Thudumu, S.; Nguyen, H.; Du, H.; Duong, N.; Rasool, Z.; Logothetis, R.; Barnett, S.; Vasa, R.; Mouzakis, K. The M-factor: A Novel Metric for Evaluating Neural Architecture Search in Resource-Constrained Environments. arXiv 2025, arXiv:2501.17361. [Google Scholar] [CrossRef]
- Li, J.; Xie, C.; Wu, S.; Ren, Y. UAV-YOLOv5: A Swin-transformer-enabled small object detection model for long-range UAV images. Ann. Data Sci. 2024, 11, 1109–1138. [Google Scholar] [CrossRef]
- Liu, Z.; Mao, H.; Wu, C.Y.; Feichtenhofer, C.; Darrell, T.; Xie, S. A convnet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 11976–11986. [Google Scholar] [CrossRef]
- Zhang, H.; Liu, K.; Gan, Z.; Zhu, G.N. UAV-DETR: Efficient end-to-end object detection for unmanned aerial vehicle imagery. arXiv 2025, arXiv:2501.01855. [Google Scholar] [CrossRef]
- Mehta, S.; Rastegari, M. Separable self-attention for mobile vision transformers. arXiv 2022, arXiv:2206.02680. [Google Scholar] [CrossRef]
- Zhou, Y.; Wei, Y. UAV-DETR: An enhanced RT-DETR architecture for efficient small object detection in UAV imagery. Sensors 2025, 25, 4582. [Google Scholar] [CrossRef]
- Guo, H.; Wu, Q.; Wang, Y. AUHF-DETR: A Lightweight Transformer with Spatial Attention and Wavelet Convolution for Embedded UAV Small Object Detection. Remote Sens. 2025, 17, 1920. [Google Scholar] [CrossRef]
- Ge, X.; Qi, L.; Yan, Q.; Sun, J.; Zhu, Y.; Zhang, Y. Enhancing Real-Time Aerial Image Object Detection with High-Frequency Feature Learning and Context-Aware Fusion. Remote Sens. 2025, 17, 1994. [Google Scholar] [CrossRef]
- Wang, H.; Gao, J. SF-DETR: A scale-frequency detection transformer for drone-view object detection. Sensors 2025, 25, 2190. [Google Scholar] [CrossRef]
- Deng, J.; Shi, Z.; Zhuo, C. Energy-efficient real-time UAV object detection on embedded platforms. IEEE Trans. Comput. Aided Des. Integr. Circuits Syst. 2019, 39, 3123–3127. [Google Scholar] [CrossRef]
- Galliera, R.; Suri, N. Object detection at the edge: Off-the-shelf deep learning capable devices and accelerators. Procedia Comput. Sci. 2022, 205, 239–248. [Google Scholar] [CrossRef]
- Elsken, T.; Metzen, J.H.; Hutter, F. Neural architecture search: A survey. J. Mach. Learn. Res. 2019, 20, 1–21. [Google Scholar]
- Wu, B.; Dai, X.; Zhang, P.; Wang, Y.; Sun, F.; Wu, Y.; Tian, Y.; Vajda, P.; Jia, Y.; Keutzer, K. Fbnet: Hardware-aware efficient convnet design via differentiable neural architecture search. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 10734–10742. [Google Scholar] [CrossRef]
- Cai, H.; Zhu, L.; Han, S. Proxylessnas: Direct neural architecture search on target task and hardware. arXiv 2018, arXiv:1812.00332. [Google Scholar] [CrossRef]
- Cai, H.; Gan, C.; Wang, T.; Zhang, Z.; Han, S. Once-for-all: Train one network and specialize it for efficient deployment. arXiv 2019, arXiv:1908.09791. [Google Scholar] [CrossRef]
- Song, X.; Xie, X.; Lv, Z.; Yen, G.G.; Ding, W.; Lv, J.; Sun, Y. Efficient evaluation methods for neural architecture search: A survey. IEEE Trans. Artif. Intell. 2024, 5, 5990–6011. [Google Scholar] [CrossRef]
- Kang, Y.; Zheng, B.; Shen, W. Research on Oriented Object Detection in Aerial Images Based on Architecture Search with Decoupled Detection Heads. Appl. Sci. 2025, 15, 8370. [Google Scholar] [CrossRef]
- Team, R. YOLO-NAS by Deci achieves state-of-the-art performance on object detection using neural architecture search. Deci AI Blog, 2023. [Google Scholar]
- Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning, Online, 18–24 July 2021; pp. 8748–8763. [Google Scholar]
- van den Oord, A.; Li, Y.; Vinyals, O. Representation learning with contrastive predictive coding. arXiv 2018, arXiv:1807.03748. [Google Scholar] [CrossRef]
- Jia, C.; Yang, Y.; Xia, Y.; Chen, Y.T.; Parekh, Z.; Pham, H.; Le, Q.; Sung, Y.H.; Li, Z.; Duerig, T. Scaling up visual and vision-language representation learning with noisy text supervision. In Proceedings of the International Conference on Machine Learning, Online, 18–24 July 2021; pp. 4904–4916. [Google Scholar]
- Gan, Z.; Li, L.; Li, C.; Wang, L.; Liu, Z.; Gao, J. Vision-language pre-training: Basics, recent advances, and future trends. Found. Trends® Comput. Graph. Vis. 2022, 14, 163–352. [Google Scholar] [CrossRef]
- Liu, F.; Chen, D.; Guan, Z.; Zhou, X.; Zhu, J.; Ye, Q.; Fu, L.; Zhou, J. RemoteCLIP: A vision language foundation model for remote sensing. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5622216. [Google Scholar] [CrossRef]
- Li, Y.; Guo, W.; Yang, X.; Liao, N.; He, D.; Zhou, J.; Yu, W. Toward open vocabulary aerial object detection with clip-activated student-teacher learning. In Computer Vision – ECCV 2024, Proceedings of the European Conference, Milan, Italy, 29 September–4 October 2024; Springer: Cham, Switzerland; pp. 431–448. [CrossRef]
- Yang, C.; An, Z.; Huang, L.; Bi, J.; Yu, X.; Yang, H.; Diao, B.; Xu, Y. CLIP-KD: An empirical study of CLIP model distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 15952–15962. [Google Scholar] [CrossRef]
- Zhang, S.H.; Tang, W.C.; Wu, C.; Hu, P.; Li, N.; Zhang, L.J.; Zhang, Q.; Zhang, S.Q. TernaryCLIP: Efficiently Compressing Vision-Language Models with Ternary Weights and Distilled Knowledge. arXiv 2025, arXiv:2510.21879. [Google Scholar] [CrossRef]
- Cheng, T.; Song, L.; Ge, Y.; Liu, W.; Wang, X.; Shan, Y. YOLO-World: Real-time open-vocabulary object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 16901–16911. [Google Scholar] [CrossRef]
- Kong, F.; Shan, X.; Hu, Y.; Li, J. Automated UAV Object Detector Design Using Large Language Model-Guided Architecture Search. Drones 2025, 9, 803. [Google Scholar] [CrossRef]






| Reference | Year | Main Contributions |
|---|---|---|
| [65] | 2023 | Reviews the evolution of object detection from early computer vision approaches in the 1990s to modern deep learning driven techniques, including milestone detectors, datasets, metrics, speedup strategies, and recent state-of-the-art methods. |
| [68] | 2023 | Provides details on algorithms designed for oriented object detection tasks in aerial imagery. |
| [39] | 2023 | Reviews real-time processing algorithms, performance analysis, and sensor utilization for UAV-based object detection. |
| [9] | 2024 | Highlights challenges in aerial object detection and progress in deep learning algorithms addressing them, along with commonly used public aerial datasets. |
| [66] | 2025 | Classifies object detection methods into traditional and deep-learning paradigms, analyzing one-stage, two-stage, transformer-based, and lightweight models. Identifies gaps in lightweight research and compares algorithm performance using various metrics. |
| [67] | 2025 | Focuses on small object detection algorithms in aerial imagery, discussing techniques for improving precision and feature extraction. |
| [69] | 2025 | Focuses on real-time FPGA-based approaches and their performance. |
| This Survey | 2025 | This survey examines the key challenges of aerial object detection and outlines lightweight design solutions across all stages of the detection pipeline for real-time and onboard design. It details quantization methods, hardware-aware NAS and its emerging design trends, and the growing use of CLIP-style visual semantic embeddings for fine-grained recognition and improved generalization to unseen aerial scenes. The discussion also covers optimization strategies for real-time deployment, efficient Transformer architectures, the associated design trade-offs, and practical model recommendations, tracing the evolution of lightweight detectors from 2020 to 2025. |
| Ref. | Base Model | mAP50 | Params (M) | Inference FPS/Latency ms | Platform |
|---|---|---|---|---|---|
| LODNU [47] | YOLOv4 | 31.4 | 8.7 | 68 FPS | NVIDIA RTX 3060 |
| MFFSODNet [79] | YOLOv5 | 45.5 | 4.5 | 70 FPS | NVIDIA TITAN RTX |
| EA-YOLO [86] | YOLOv5s | 39.9 | 5.9 | N/A | NVIDIA RTX 3060 |
| EL-YOLO [33] | YOLOv5s | N/A | 7.6 | 29.4 FPS | NVIDIA RTX 3080 Ti |
| GCL-YOLO [39] | YOLOv5n | 31.7 | 0.4 | 58 FPS | NVIDIA RTX 3060 |
| | YOLOv5s | 39.6 | 1.6 | 53 FPS | |
| | YOLOv5m | 43.2 | 4.3 | 48 FPS | |
| | YOLOv5l | 45.7 | 8.8 | 42 FPS | |
| AMMFN [12] | YOLOv5s | 48.1 | 7.7 | 84.3 FPS | NVIDIA RTX 4090 |
| EL-Net [45] | YOLOv7-tiny | 38.7 | 2.0 | N/A | NVIDIA RTX 3080 |
| SOD-YOLO [46] | YOLOv7 | 50.7 | 30.3 | 72.5 FPS | NVIDIA RTX 4060 |
| SOD-YOLO [123] | YOLOv8n | 33.0 | 0.6 | 145 FPS | NVIDIA RTX 3090 |
| | YOLOv8s | 42.0 | 1.8 | 126 FPS | |
| [83] | YOLOv8s | 31.0 | 6.0 | 128 FPS | NVIDIA RTX 4090 |
| [82] | YOLOv8s | 47.1 | 10.2 | N/A | NVIDIA RTX 3090 |
| LW-YOLOv8 [41] | YOLOv8m | 42.3 | 13.4 | 72.3 FPS | NVIDIA Titan XP ×4 |
| HRMamba-YOLO [84] | YOLOv8-m | N/A | 33.5 | 31.0 ms | NVIDIA RTX 3090 |
| MSFE-YOLO [89] | YOLOv8-s | 41.4 | N/A | 101.0 FPS | NVIDIA RTX 3080 Ti |
| | YOLOv8n | 33.8 | N/A | 149.3 FPS | |
| MFRENet [78] | YOLOv8m | 53.4 | 27.1 | 50.5 FPS | NVIDIA RTX 2080 |
| Processing Environment | RTT Latency (ms) | Model Processing Latency (ms) | Communication Latency (ms) |
|---|---|---|---|
| Edge | 35.09 | 32.59 | 2.50 |
| Cloud | 348.21 | 6.82 | 341.41 |
| Embedded/Edge Platform | Base Model | Reference |
|---|---|---|
| NVIDIA Jetson Xavier NX | YOLOv5 | [50] |
| | YOLOv7 | [115] |
| | YOLOv8 | [92] |
| | YOLOv5 | [51] |
| | YOLOv3 | [57] |
| NVIDIA Jetson AGX Xavier | YOLOv8 | [48] |
| | YOLOv7 | [11] |
| | YOLOv5 | [118] |
| NVIDIA Jetson TX2 | YOLOv4 | [49] |
| | YOLOv5 | [115] |
| | YOLOv3 | [56] |
| NVIDIA Jetson Nano | YOLOv7 | [6] |
| | YOLOv4 | [102] |
| | Mask R-CNN (ResNet-18) | [121] |
| | YOLOv8 | [121] |
| | SSD-MobileNet | [121] |
| | SSD | [119] |
| NVIDIA Jetson AGX Orin | YOLOv7-Tiny | [100] |
| NVIDIA Jetson Orin Nano | YOLOv8 | [35] |
| Huawei Atlas 200I DK A2 | YOLOv5n | [46] |
| T710 NPU | YOLOv3 | [5] |
| Raspberry Pi 3 B | SSD | [119] |
| Network | Conv Type / Module | Key Features |
|---|---|---|
| SqueezeNet (2016) [157] | Fire module (1 × 1 conv replacing 3 × 3) | Reduced parameters and computation; flexible channel dimensions |
| MobileNetV1 (2017) [158] | Depthwise separable convolution | Efficient alternative to standard convolution; reduced computation |
| MobileNetV2 (2018) [159] | Inverted residual + pointwise conv | Expand-reduce feature maps; shortcut connections and linear activations |
| ShuffleNetV1 (2018) [160] | Pointwise group conv + channel shuffle | Bottleneck design; group-wise processing with shuffle to improve information flow |
| ShuffleNetV2 (2018) [161] | Split channels + conv + shuffle | Balanced computation; simplified structure; reduced element-wise ops |
| SqueezeNext (2018) [162] | Fire + depthwise separable conv | Improved parameter efficiency and model compactness |
| MobileNetV3 (2019) [163] | SE blocks + NAS + hard-swish | Enhanced efficiency and accuracy; optimized using neural architecture search |
| GhostNet (2020) [164] | Ghost modules with linear ops | Generate more features via cheap operations; reduce redundancy |
| FasterNet (2023) [165] | Partial convolution (PConv) + pointwise conv (PWConv) | Low memory access and FLOPs; high throughput with efficient spatial feature extraction |
| Attention Type | Year | Key Features | Limitations |
|---|---|---|---|
| SE (Squeeze-and-Excitation) | 2018 | Channel-wise attention using global average pooling and FC layers | Ignores spatial information; captures only channel dependencies |
| CBAM (Convolutional Block Attention Module) | 2018 | Sequential channel and spatial attention using convolutional operations | Limited to local context; ineffective for long-range dependencies |
| BAM (Bottleneck Attention Module) | 2018 | Parallel channel and spatial attention in a bottleneck structure | Convolution-based locality limits global context modeling |
| GCNet (Global Context Network) | 2019 | Global spatial attention via pooling and transform functions | Constrained by convolutional design; high computational load |
| ECA-Net (Efficient Channel Attention) | 2020 | Lightweight channel attention using 1D convolution without dimensionality reduction | Lacks spatial attention; only considers channel importance |
| SGE (Spatial Group-wise Enhance) | 2019 | Divides channels into groups for localized spatial attention; low overhead | Ignores inter-group channel dependencies; limited fusion |
| CA (Coordinate Attention) | 2021 | Positional encoding with efficient computation; suitable for mobile networks | Simplified spatial encoding may miss fine-grained relationships |
| NAM (Normalization-based Attention Module) | 2021 | Normalization-enhanced selection of important features | Underutilizes joint spatial-channel interactions |
| Triplet Attention | 2021 | Simultaneous spatial and channel attention with low cost | Slightly higher complexity than lightweight modules |
| SA (Shuffle Attention) | 2021 | Integrates channel and spatial attention using grouped Shuffle Units | Needs explicit shuffling; moderate implementation overhead |
| SimAM (Simple Attention Module) | 2021 | Parameter-free 3D neuron-level attention via energy-based formulation | Simple structure; lacks tunable attention depth |
| Model (Year) | Anchors | Backbone | Neck | Head | Key Advancements/Limitations | Typical Use | Platform |
|---|---|---|---|---|---|---|---|
| YOLOv1 (2016) [183] | No anchors (direct regression) | Custom CNN (24 conv + 2 FC layers) | None | Fully connected detection head | Real-time speed with end-to-end training; limited small object detection and struggles with multiple objects per grid cell. | Real-time video streaming | Conventional and Desktop GPUs such as the Titan X GPU |
| YOLOv2 (2017) [184] | Anchor-based (predefined) | Darknet-19 | None (passthrough layer) | Convolutional detection head | Introduced anchor boxes and batch normalization; improved accuracy; limited handling of overlapping objects. | Real-time video streaming | Conventional and Desktop GPUs, Titan X GPU |
| YOLOv3 (2018) [185] | Anchor-based (k-means clusters) | Darknet-53 | FPN-like (multi-scale) | Convolutional detection head | Multi-scale prediction enhances small object detection; deeper backbone boosts accuracy but increases complexity. | Real-time video streaming and industrial use; YOLOv3-tiny is suitable for real-time onboard deployment | Conventional and desktop GPUs; only YOLOv3-tiny is suitable for real-time onboard use |
| YOLOv4 (2020) [146] | Anchor-based | CSPDarknet53 | PAN + SPP | YOLOv3-style conv head | CSP backbone with Bag of Freebies and Specials; fast and accurate; complex training pipeline. | Real-time detection in production systems | Conventional and desktop GPUs (e.g., GTX 1080 Ti or RTX 2080 Ti) |
| YOLOv5 (2020) [186,187] | Anchor-based (auto-anchor learning) | Modified CSPDarknet53 | CSP-PAN | Decoupled head (cls, reg separate) | Deployment-focused modular design; widely adopted despite no official paper. | Powering security alarm systems or traffic monitoring, where low latency is non-negotiable. Some older NPU (Neural Processing Unit) drivers have highly optimized support specifically for the YOLOv5 architecture. YOLOv5n is suited for applications that demand extremely fast CPU inference with minimal latency. | Conventional GPUs (e.g., Tesla T4) and desktop GPUs |
| YOLOX (2021) [188] | Anchor-free | CSPDarknet53 variant | PAN | Decoupled head | Anchor-free design simplifies training; decoupled head improves accuracy and convergence; uses SimOTA for label assignment. | Industrial applications | Conventional and desktop GPUs; YOLOX-Nano for lightweight deployment |
| YOLOv6 (2022) [189] | Anchor-free | EfficientRep (RepVGG-style) | Rep-PAN | BiC-enhanced decoupled head | Self-distillation and Task Alignment Learning; Anchor-Aided Training (AAT) enhances lightweight accuracy without slowing inference. | Industrial applications in diverse scenarios, autonomous delivery robots | Conventional GPUs (e.g., Tesla T4) and desktop GPUs |
| YOLOv7 (2022) [190] | Anchor-based | E-ELAN (Extended Efficient Layer Aggregation Network) | Extended PANet | Decoupled head with dynamic labels | Re-parameterization improves training; introduces coarse-to-fine label assignment and compound scaling. | General-purpose object detection | Conventional GPUs (e.g., V100) and desktop GPUs; YOLOv7-tiny is an edge-GPU-oriented variant |
| YOLOv8 (2023) [187,191] | Anchor-free | CSPDarknet variant with CSPBottleneck and C2f module | Enhanced FPN + PAN | Decoupled head | Improved modular design and contextual feature extraction; enhanced scalability across model sizes. | Keypoint detection (e.g., sports analytics) and small object detection; high-risk applications where every object must be detected, such as autonomous driving or security monitoring. | Conventional and desktop GPUs |
| YOLOv9 (2024) [192,193] | Anchor-free | Lightweight backbone with GELAN, CSPNet, RepConv | PAN-FPN | Decoupled head | Programmable Gradient Information (PGI) and the GELAN backbone reduce information loss in deep networks; handles complex scenes effectively but with higher computation. | High-precision industrial inspection where false negatives are costly; small object detection in fields like satellite imagery or medical imaging; and complex scenes with occlusion or clutter that require preserving maximum feature information | Conventional and desktop GPUs |
| YOLOv10 (2024) [193,194] | Anchor-free | Enhanced CSPNet; spatial-channel decoupled downsampling; large-kernel conv + partial self-attention | PANet | Dual head: one-to-many (training), one-to-one (inference) | NMS-free training and inference. | High-FPS video analysis, such as traffic or sports monitoring, and real-time robotics requiring low-latency navigation and obstacle avoidance | Conventional and desktop GPUs (e.g., NVIDIA RTX 3090); YOLOv10n suits extremely resource-constrained environments |
| YOLOv11 (2025) [195] | Anchor-free | Enhanced backbone with C3k2 modules (a specification of GELAN) | PAN-FPN | Decoupled head | C3k2 improves gradient flow and semantics; 22% fewer parameters than YOLOv8m with higher mAP; versatile task support. | Real-time, high-accuracy tasks such as autonomous driving, edge-based detection, medical image analysis, and satellite imagery, where fast and precise object recognition is essential. | TensorRT 10 FP16 on an NVIDIA T4 GPU |
| YOLOv12 (2025) [196,197] | Anchor-free | Attention-centric backbone (FlashAttention) | PAN-FPN | Decoupled head with area attention | Attention-centric design with CNN-like speed; includes R-ELAN and position-aware attention; FlashAttention limits hardware compatibility. | Ideal for applications where precision matters more than raw speed, such as medical imaging and manufacturing quality control | FlashAttention is compatible only with GPUs from the Turing, Ampere, Ada Lovelace, or Hopper architectures, such as the T4, Quadro RTX series, RTX 20/30/40 series, RTX A5000/A6000, A30/A40, A100, and H100. |
| Gold-YOLO (2023) [198] | Anchor-free | Enhanced CSPDarknet with grouped attention and NAS | PAN-FPN + Selective Fusion | Lightweight hybrid head | Introduces the Gather-and-Distribute (GD) mechanism for improved multi-scale fusion; combines NAS and attention for real-time UAV and embedded use; low latency with competitive accuracy. | Targets small and medium-size objects | Conventional GPUs (e.g., NVIDIA Tesla T4 with TensorRT) and desktop GPUs. |
| Base Models | References |
|---|---|
| YOLOv8 | [17,35,41,42,43,44,48,52,61,62,78,81,82,83,84,87,89,90,92,94,95,96,97,98,104,105,116,121,123,200] |
| YOLOv5 | [7,12,14,33,36,37,39,40,50,51,79,86,88,91,93,111,114,115,118,120] |
| YOLOv7 | [6,11,34,45,46,100,110,201] |
| YOLOv4 | [34,47,49,102,120,124,127] |
| YOLOv3 | [5,56,57,202] |
| YOLOv10 | [201] |
| YOLOv9 | [201] |
| RT-DETR | [80,85] |
| SSD | [121] |
| R-CNN | [121] |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

