Advancements in Small-Object Detection (2023–2025): Approaches, Datasets, Benchmarks, Applications, and Practical Guidance
Abstract
1. Introduction
1.1. Background and Motivation
1.2. What Are Small Objects?
- Small objects: Area < 32 × 32 pixels;
- Medium objects: 32 × 32 pixels ≤ area < 96 × 96 pixels;
- Large objects: Area ≥ 96 × 96 pixels [7].
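For concreteness, these size buckets can be expressed directly in code; the following is a minimal sketch in which the threshold constants follow the COCO definition above and the helper function name is ours.

```python
# Minimal sketch of the COCO size buckets defined above.
# Thresholds are in pixels^2: small < 32*32, medium < 96*96, large otherwise.
SMALL_MAX_AREA = 32 * 32
MEDIUM_MAX_AREA = 96 * 96

def coco_size_bucket(box_w: float, box_h: float) -> str:
    """Classify a bounding box by area into COCO's small/medium/large buckets."""
    area = box_w * box_h
    if area < SMALL_MAX_AREA:
        return "small"
    if area < MEDIUM_MAX_AREA:
        return "medium"
    return "large"

print(coco_size_bucket(20, 20))  # -> "small"
print(coco_size_bucket(50, 50))  # -> "medium"
```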
1.3. What Are the Major Challenges in Small-Object Detection?
- Feature information loss: Deep neural networks almost always incorporate successive pooling layers or strided convolutions that intentionally reduce spatial resolution while building a hierarchy of increasingly semantic features. This typically poses no problem for larger objects; however, the fine-grained spatial information and weak features of small objects, which occupy only a few pixels, are generally lost in deeper, lower-resolution layers, rendering them indistinguishable from the background [10].
- Scale mismatch/imbalance: Small objects often appear alongside large objects in the same scene, leading to a serious scale-variation problem. Feature pyramid networks (FPNs) and their variants attempt to remediate this problem, and fusing feature representations of vastly different resolutions without diminishing small-target representations remains an active area of research. Furthermore, many datasets exhibit class imbalance, with small-object classes far less frequent than large-object classes, which biases training [11].
- Low signal-to-noise ratios: Because of their small pixel count, small objects carry far less signal than larger objects and are thus more susceptible to noise, blur, and other image degradations. Their appearance is easily confused with background textures or sensor noise, leading to high rates of both false positives and false negatives [8].
- Ambiguity in context: While context is critical for object detection, it is sometimes counterproductive for small objects. For example, in dense scenes, such as crowds or cluttered aerial views, small objects may be proximate to each other in shared contexts where they may be occluded or are difficult to localize as individual instances [12].
- The annotation problem: The manual annotation of small objects is labor-intensive, expensive, and error-prone. All bounding-box annotations contain labeling noise, and small-object labels are especially prone to higher error rates and inconsistencies between annotators [13].
1.4. Survey Scope and Contributions
- Systematic and reproducible methodology: We employed PRISMA to conduct a systematic literature search and screening process that was comprehensive, measurable, and reproducible [14] to reduce selection bias and simultaneously provide a substantial, evidence-based foundation for our conclusions.
- Critical taxonomy of contemporary methods: We have generated a critical taxonomy that classifies contemporary SOD approaches into five distinct yet coherent categories: (1) multiscale feature learning; (2) transformer networks; (3) context-aware methods; (4) data augmentation; (5) architectural improvements in standard detectors. This systematic approach affords a critical perspective on the progression of these efforts and helps clarify how contemporary SOD methods solve the original challenges.
- Comprehensive quantitative and qualitative examination: In this literature review, we not only summarize previously reported results but also provide a comprehensive comparison table (a master comparison table) that compiles accuracy/efficiency performance data on various key benchmarks. Moreover, we provide qualitative strengths and weaknesses for different methods, as well as discussions on which datasets or evaluation protocols have most contributed to the field to date.
- Practical recommendations for practitioners: We acknowledge the gap that often exists between the academic understanding of a topic and how it is enacted in practice. We therefore include a dedicated section of practical guidance, including a decision matrix for model selection based on application constraints (accuracy vs. latency), scenario-based playbooks for common SOD use cases, and deployment checklists.
1.5. Paper Overview
2. Methodology
2.1. Search Strategy
- IEEE Xplore;
- ACM Digital Library;
- SpringerLink;
- ScienceDirect (Elsevier);
- Scopus;
- arXiv (for preprints of the most notable top-tier conference articles and journal articles).
2.2. Inclusion/Exclusion Criteria
- Timeframe: We screened for articles published or preprinted between 1 January 2023, and 15 September 2025.
- Main contribution: The main contribution of the article needed to be a novel method, dataset, benchmark, or literature review that addresses the small-object detection problem.
- Methodological requirement: The article needed to be based on a deep learning method or methods.
- Evaluation requirement: The method needed to have been quantitatively evaluated on at least one public benchmark dataset (e.g., MS COCO, DOTA, VisDrone, or SODA-D).
- Language and publication type: The article needed to be in English and published as a full-length conference paper or journal article, or as a technical preprint on arXiv.
- Articles published outside the timeframe;
- Small-object detection was only a small part of the work and not a main goal (e.g., general object tracking or image segmentation);
- Works based on classical computer vision methods (e.g., non-deep learning);
- Articles that do not have any quantitative evaluation or are purely theoretical;
- Short papers, abstracts, posters, tutorials, and articles that are not in English;
- Patents and book chapters.
2.3. Screening and Data Extraction (PRISMA)
- Initial de-duplication: All records returned by the databases were combined, and duplicates were removed using a reference manager.
- Title and abstract screening: The titles and abstracts of the remaining articles were screened based on our inclusion/exclusion criteria, and any articles that were clearly not relevant were excluded in this screening.
- Full-text review: The full texts of the articles that were retained after the title and abstract screening were gathered and read in detail to verify their relevancy and make the final decision on their inclusion.
2.4. Quality and Bias Assessment
2.5. Quantitative Synthesis
3. Methods (2023–2025)—Critical Taxonomy
3.1. Multiscale Feature Learning and Fusion
3.2. Transformer-Based Models and Attention Mechanisms
3.3. Context-Aware Detection Strategies
3.4. Data Augmentation and Generation
3.5. Architectural Improvements for Mainstream Detectors (e.g., YOLO Series)
4. Datasets for SOD
4.1. General-Purpose Datasets with Small Objects
- COCO (Common Objects in Context): COCO remains the most influential benchmark for general object detection. Using a bounding-box area of less than 32 × 32 pixels to define “small” objects [9], COCO is highly challenging, with 80 object categories, dense scenes, and substantial size variation across classes. Its dedicated metric, the average precision for small objects (AP_S), is typically the lowest of the scale-specific scores and directly reflects a model’s ability to detect small objects. Most landmark advances in SOD report results on the COCO test-dev or validation sets as a measure of generalizability.
- LVIS (Large Vocabulary Instance Segmentation): LVIS extends the COCO challenge with a vocabulary of over 1200 categories following a long-tailed distribution, so small objects often appear only as rare instances in the training data. For SOD research, LVIS is important for examining how models cope with objects that are simultaneously rare and small in scale. Recent zero-shot and few-shot work has only begun to explore the detection of small objects from categories that are barely represented in the training set [39].
- Objects365: Objects365 offers a larger scale than COCO, with 365 categories, over 600,000 images, and more than 10 million bounding boxes. Its scale and diversity make it an extraordinary resource for general-purpose pretraining before fine-tuning on a more specific SOD dataset. The massive number of instances, including a large number of small objects, helps increase the robustness and generalizability of learned features and mitigates the risk of overfitting to the biases of a smaller dataset [40].
4.2. Specialized SOD Datasets (2023–Current)
- SODA-D: This dataset was developed specifically for small-object detection in driving scenes and uses 2K–4K-resolution images captured from a vehicle’s viewpoint. It focuses on nine common traffic-relevant categories, such as pedestrian, cyclist, and traffic sign, and captures the density of small objects in complex, cluttered urban environments, representing a realistic autonomous-driving application [41].
- SODA-A: The SODA-A is the aerial counterpart of the SODA-D [39] and targets small objects in aerial images. The top-down, bird’s-eye perspective of such imagery makes objects in familiar categories, such as vehicles, pedestrians, and boats, appear significantly smaller, as in remote sensing and UAV surveillance. The SODA-A dataset provides a valuable benchmark for the research and development of aerial-image detection models.
- PKUDAVIS-SOD: This dataset targets salient object detection, which shares the SOD acronym but is a distinct task from small-object detection [42]. Although its goal is to detect visually prominent objects, it contains many images with small, salient objects, showcasing the interaction between saliency and small scale and forcing a model to identify an object based not only on its size but also on its contextual prominence.
- Sod-UAV: This dataset is geared toward small-object detection from unmanned aerial vehicles (UAVs) and provides images taken from low-altitude flights, which is a typical operational case, as most UAVs are limited to less than 400 feet [43]. The Sod-UAV dataset contains a diverse range of small objects of interest for surveillance and monitoring, such as people and vehicles, within different environment types. Test datasets such as Sod-UAV are key to validating models for realistic deployment scenarios for aerial platforms.
4.3. Domain-Specific Datasets (Aerial, Medical, Maritime/Underwater, IR/Thermal)
- Aerial and remote sensing: This is one of the most active domains for SOD. DOTA (Dataset for Object Detection in Aerial Images) and DIOR (Dataset for Object Detection in Optical Remote Sensing Images) are the primary datasets in this domain, supporting a wide range of object categories, extreme image resolutions (often gigapixel scale), and extreme object-scale variation. Tiny-object detection is essential for traffic monitoring, urban design and planning, and security surveillance, and models built for this domain must handle rotational variance, dense clutter, and complex backgrounds [44,45].
- Medical imaging: In the medical imaging domain, SOD is important in the identification of small-scale pathologies, such as small microaneurysms detected in retinal fundus images, small polyps in colonoscopy images, and small cancerous lesions in radiology scans for cancer staging. Datasets in this domain are often private due to patient confidentiality and serve as training data in the development of clinical decision support systems. This is a highly valuable area to be working in; however, it has its own challenges, including poor contrast, ambiguous object boundaries, and high intraclass variation [46].
- Maritime and underwater surveillance: Detecting small objects such as debris, buoys, small vessels, or even people overboard, from ship-borne or aerial platforms, has important implications for maritime safety. The challenges extend to underwater datasets, where poor visibility, light scattering, and the color distortion of objects (backscattering) add further difficulty. This domain forces models to perform robustly under severe environmental degradation [47].
- IR/thermal imagery: Thermal sensors are essential for detection in low-light or adverse weather conditions, and datasets based on IR camera data are applicable to pedestrian detection, wildlife monitoring, and industrial inspection. Small objects in thermal imagery lack all color and texture information, forcing models to rely solely on thermal signatures and shapes, which can be ambiguous and highly dependent on the environment [48].
4.4. Curation and Annotation Challenges
- High annotation costs: The manual annotation of small objects is laborious, time-consuming, and error-prone. Annotating small objects requires great attention to detail, often zooming in to place a bounding box accurately at the pixel level. This substantial annotation cost limits the scale and number of manually annotated datasets available for SOD [2].
- Annotation ambiguity and inconsistency: Bounding boxes for small, blurry, low-resolution objects are often poorly defined, leading to inconsistencies between annotators. This label noise can disrupt model training, particularly for models that are sensitive to precise object localization [49].
- Class imbalance: Most datasets naturally contain far fewer small-object instances than large-object instances, so this area of study must address the class imbalance problem. The imbalance can arise not only between individual categories but also between objects of different scales, biasing model performance toward larger instances [50].
- Limited dataset diversity: Many of the current datasets used for SOD are collected under specific conditions (e.g., certain weather conditions or regions). To assess the robustness and generalization of SOD models in real-world situations, we need datasets that include diversity across weather, lighting, season, and sensor types. One area of ongoing research is the use of generative AI and simulation platforms to augment possible dataset diversity through synthetically generated datasets; however, there is still the open issue of the domain gap between synthetic and real data [51].
5. Benchmarks and Experimental Protocols
5.1. Evaluation Metrics
- Primary accuracy metric (AP_S): The most important SOD metric is the average precision for small objects (AP_S), defined by the COCO evaluation protocol. This metric calculates the mean average precision (mAP) over IoU (Intersection over Union) thresholds from 0.50 to 0.95 for objects with areas below 32 × 32 pixels [9]. The AP_S measures a model’s ability both to classify small objects correctly and to localize them precisely, making it the primary yardstick for SOD performance (a sketch of how AP_S is obtained with standard COCO tooling follows this list).
- Associated recall metrics (AR_S): In addition to the AP_S, the average recall for small objects (AR_S) has also been used to interpret the SOD performance. The AR_S measures the fraction of all ground-truth small objects detected by the model, averaged over IoU thresholds. A high AR_S is critical for tasks in which the missed detection of a small object could have severe consequences (e.g., medical diagnosis or security) [53].
- General performance metrics (mAP, AP_50, AP_75): While the AP_S is the primary metric of interest, the overall mAP (averaged over all object sizes), AP_50 (AP at IoU = 0.5), and AP_75 (AP at IoU = 0.75) provide useful context, indicating whether improvements in small-object performance come at the expense of medium- or large-object detection. Ideally, a well-rounded model improves the AP_S without degrading performance at other object scales [9].
- Efficiency/resource consumption metrics: For deployment in near-real-time settings, such as on edge devices, accuracy metrics alone are not a complete measure. A complete benchmark suite must also report the following efficiency metrics:
- GFLOPs (giga floating-point operations): This metric measures the computational complexity of a model, quantified by the number of floating-point operations (commonly counted as multiply–accumulates) required for a forward pass on a single image. GFLOPs provide a measure of computational cost that is agnostic to the underlying hardware platform [56,57].
- Inference speed (FPS): The inference speed is measured in frames per second, which is the number of images processed per second. FPS is a valuable real-time measure; however, it is critical that it is reported with the actual GPU or CPU that was used for testing (e.g., NVIDIA A100, RTX 4090, Jetson Orin) [44,54].
- Latency: The latency is the time required to process a single image (usually given in milliseconds (ms)). It is the inverse of FPS but is typically more relevant to applications that require immediate responses. Latency should be measured end to end, including preprocessing and postprocessing, so that the full system can be reproduced and evaluated [45,58].
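As an illustration of how the scale-specific accuracy metrics above are typically obtained in practice, the following is a minimal sketch using the pycocotools COCOeval API; the annotation and detection file paths are placeholders, and the indices into `stats` follow the standard 12-element COCO summary.

```python
# Sketch: extracting AP_S and AR_S with the standard COCO evaluation API.
# Assumes "annotations.json" (ground truth) and "detections.json" (model output)
# exist in COCO format; both paths are placeholders.
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO("annotations.json")
coco_dt = coco_gt.loadRes("detections.json")

evaluator = COCOeval(coco_gt, coco_dt, iouType="bbox")
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()  # prints the 12 standard COCO metrics

stats = evaluator.stats
print(f"AP  (all sizes, IoU .50:.95): {stats[0]:.3f}")
print(f"AP_S (area < 32*32 px):       {stats[3]:.3f}")
print(f"AR_S (area < 32*32 px):       {stats[9]:.3f}")
```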
5.2. Fair Protocols
- All models should be trained consistently: All models should be compared on the same training dataset with the same number of epochs or iterations. Optimizer choices (e.g., AdamW, SGD), learning-rate schedules, and data augmentation pipelines should remain consistent across models unless the goal is to compare those specific components. The baseline training recipe (e.g., an MMDetection or YOLOv8 baseline configuration) should also be reported and explained [59].
- Input resolution should be consistent: The input resolution often has a substantial impact on detection accuracy, especially for small objects, as well as on computational cost. Models should be evaluated at a consistent input resolution (e.g., 640 × 640 or 1280 × 1280). If a method requires a specific input resolution or operates at variable resolutions, it should be compared against baselines evaluated at that same resolution [58].
- Report hardware/software environment: Because the performance metrics (e.g., FPS and latency) are highly dependent on the testing environment, the hardware (e.g., GPU model, CPU, RAM) and software (e.g., CUDA version, deep learning framework (e.g., PyTorch or TensorFlow), library versions) stacks used for evaluation must be reported. This transparency is essential for reproducing results [60].
- Open-source implementation: The most reliable path to reproducible research is the release of public source code and pretrained model weights for the proposed method [43], allowing others to verify and build upon the results. Sharing code and experimental configurations on platforms such as GitHub is common practice and a sign of quality research [60]. Comprehensive benchmarks that provide toolkits for standardized evaluation can also help in this regard [52,61].
5.3. Systematic Ablations
- Component contributions: When a new method consists of multiple novel components (e.g., a new attention mechanism, a feature fusion module, and a new loss function), ablation studies should be carried out in a systematic process, starting with a strong baseline and adding each component one by one. The incremental improvement in the AP_S and any relevant metric should be reported for each stage, clearly showing which components contributed to the performance improvement [62].
- Hyperparameter sensitivity: Most methods include important hyperparameters that affect performance, and systematic ablations should show the model’s sensitivity to them. For example, if a new loss function has a weighting term, performance should be assessed across a range of values for that term. This also provides practical guidance for others wanting to implement or adapt the method [63].
- Baseline model: The baseline used in an ablation study matters, and it should be an established, strong model (e.g., YOLOv8, RT-DETR). Reporting improvements over a weak or outdated baseline can be misleading. The purpose of ablation studies is to show clearly that the proposed innovations add genuine value on top of existing state-of-the-art architectures.
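The following is a minimal sketch of the incremental ablation bookkeeping described above; `train_and_eval` is a hypothetical stub standing in for a full training and evaluation run, and the component names are illustrative.

```python
# Sketch of an incremental ablation: start from a strong baseline and enable
# one proposed component at a time, recording the AP_S delta at each stage.

def train_and_eval(enabled_components):
    """Placeholder for a real training/evaluation run that returns AP_S.
    Here it returns a dummy value so the bookkeeping loop is runnable."""
    return 0.20 + 0.02 * len(enabled_components)

components = ["p2_head", "bifpn_neck", "small_object_loss"]  # illustrative names

results = []
enabled = []
ap_s_prev = train_and_eval(enabled)          # strong baseline first
results.append(("baseline", ap_s_prev, 0.0))
for comp in components:
    enabled.append(comp)                     # add exactly one component per stage
    ap_s = train_and_eval(enabled)
    results.append((comp, ap_s, ap_s - ap_s_prev))
    ap_s_prev = ap_s

for name, ap_s, delta in results:
    print(f"{name:20s} AP_S={ap_s:.3f} (delta {delta:+.3f})")
```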
6. Comparative Quantitative Analysis
6.1. Master Comparison Table
6.2. Visual Performance Summaries
6.3. Trend and Significance Analysis
- Hybrid architectures lead the way: The highest-performing models (e.g., SOD-YOLOv8) follow a clear trend of augmenting existing high-performing CNN backbones (such as those in the YOLO series) with additional modules. These additions typically comprise multiscale feature fusion, attention mechanisms, and/or loss functions designed specifically for small objects [24,56]. This hybrid approach retains the powerful feature extraction capabilities of CNNs while using more sophisticated mechanisms to preserve and enhance fine-grained features, leading to large AP_S gains.
- The emergence of efficient transformers: Although the cost of transformer-based architectures was once prohibitive for real-time detection, architectures such as RT-DETR and, more specifically, AUHF-DETR show that transformers are becoming viable options. AUHF-DETR achieves a combination of fewer parameters and lower GFLOPs while retaining competitive detection accuracy [54]. Achieved through lightweight backbones, spatial attention, and architectural optimizations, these advances make transformers acceptable options for deployment on UAVs and embedded devices. Similar efficiency gains are being made with DETR variants for autonomous driving [57].
- Accuracy–efficiency tradeoff remains key: The visualizations make it clear that there is no single “best” model, as the best model depends on the application.
- High-accuracy cases: For applications that require the highest level of accuracy and when computational resources are available (e.g., processing offline satellite imagery), the SOD-YOLOv8 model is the best overall, as it achieves the highest AP_S but has the highest latency and GFLOPS values [56].
- Real-time cases: For high-framerate applications such as robotics and surveillance, models such as YOLOv8-S or domain-specific efficient architectures such as ORSISOD-Net [55] are preferable, as they favor high FPS at the cost of some small-object accuracy.
- Edge-constrained cases: Models must be efficient and lightweight for deployment on resource-limited drones and embedded devices. AUHF-DETR is a suitable choice for the edge-constrained case: it combines low GFLOPs, a small parameter count, and high FPS, and remains usable under strict latency constraints [32].
- High-resolution processing is effective: Small-object detection is closely tied to input resolution. ESOD highlights the efficient processing of relatively high-resolution images, since downsampling destroys the information needed to detect small objects [58]. This approach typically reduces the FPS; however, it preserves high accuracy on native high-resolution inputs, which is important for applications in remote sensing and quality inspection.
7. Edge and Resource-Constrained Evaluation
7.1. Testbeds and Setup
- NVIDIA Jetson AGX Orin: This high-performance module is often cited as the preferred hardware for demanding edge applications requiring heavy parallel processing capabilities and is often used to validate the real-time inference capabilities of complex models in applications such as aerial segmentation and precision agriculture [64,65,66].
- NVIDIA Jetson Xavier Series: This platform was common for real-time object detection research prior to the advent of the Orin series and continues to be a cited benchmark for legacy platforms [69].
- Raspberry Pi 5: Its GPU is limited compared with the Jetson series, but the platform is noted for its CPU performance in common tests [67]. Its accessibility keeps it relevant for some surveillance and monitoring use cases, particularly those built around CPU-bound models or heavy model optimization.
7.2. Practical Optimizations
7.2.1. Quantization
- FP16 quantization: The first step toward faster inference is conversion from FP32 to FP16. FP16 quantization is often the easiest and fastest path to improved inference on modern GPUs (including those on the Jetson platform), with very little cost to accuracy [70]. This is typically accomplished with the NVIDIA TensorRT optimization engine. In one study, researchers used INT8 quantization and reported a ~100× increase in inference speed, reaching five frames per second, with no decrease in accuracy when classifying weeds from an agricultural drone [66].
- INT8 quantization: Aggressive quantization to INT8 can provide the greatest performance gains, particularly on edge devices optimized for integer arithmetic, though it requires an intermediate calibration step with a representative dataset to derive the scaling factors that map the float range to the 8-bit integer range. Several works have successfully quantized models down to INT8 for inference, making quantization a powerful tool for efficient deployment to edge devices [67].
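The following is a minimal, framework-level sketch of the FP16 step described above (INT8 additionally requires a calibration dataset and is usually handled by an engine such as TensorRT); the toy CNN stands in for a real detector backbone, and the input size is arbitrary.

```python
# Sketch: FP32 -> FP16 conversion of a PyTorch model for faster GPU inference.
import torch
import torch.nn as nn

# Tiny CNN as a stand-in for a real detector backbone.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
).eval()

if torch.cuda.is_available():
    # FP16 is primarily a GPU optimization (e.g., on Jetson-class devices).
    model = model.cuda().half()
    x = torch.randn(1, 3, 640, 640, device="cuda", dtype=torch.float16)
else:
    # Fall back to FP32 on CPU, where FP16 gives little or no benefit.
    x = torch.randn(1, 3, 640, 640)

with torch.no_grad():
    out = model(x)
print(out.dtype)  # torch.float16 on GPU, torch.float32 on CPU
```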
7.2.2. Pruning
- Unstructured vs. structured pruning: Unstructured pruning removes individual weights based on their magnitudes, producing a sparse weight matrix that requires dedicated hardware or libraries to run efficiently. Structured pruning instead removes entire channels, filters, or layers, yielding a smaller, dense model that runs efficiently on standard hardware.
- Prune strategically: Advanced methods prune layers based on their contribution to the underlying task, for example, applying minimal pruning to important layers and aggressive pruning to less significant ones [66]. Whatever is retained must preserve the essential feature extraction capacity, particularly for small objects.
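To make the structured-pruning idea concrete, the following is a minimal sketch using PyTorch's built-in pruning utilities on a single convolutional layer; the 30% ratio and layer sizes are illustrative only, and a real pipeline would prune layer by layer with fine-tuning in between.

```python
# Sketch: structured pruning of entire output channels from a conv layer,
# using the L2 norm of each channel's weights to decide what to remove.
import torch.nn as nn
import torch.nn.utils.prune as prune

conv = nn.Conv2d(in_channels=64, out_channels=128, kernel_size=3, padding=1)

# Remove 30% of output channels (dim=0) with the smallest L2 (n=2) norm.
prune.ln_structured(conv, name="weight", amount=0.3, n=2, dim=0)

# Make the pruning permanent (folds the mask into the weight tensor).
prune.remove(conv, "weight")

# Whole channels are now zeroed; a follow-up step would rebuild a smaller,
# dense layer without those channels to obtain an actual speed-up.
zero_channels = (conv.weight.abs().sum(dim=(1, 2, 3)) == 0).sum().item()
print(f"{zero_channels} of {conv.out_channels} output channels pruned")
```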
7.2.3. Knowledge Distillation
7.2.4. Low-Rank Decomposition
7.3. Deployment Case Notes
- Integrated optimization pipelines: The best deployments often combine different optimization techniques. For example, the model can be pruned to eliminate unused parameters and then quantized to INT8 with an engine such as NVIDIA TensorRT for maximum acceleration on a Jetson AGX Orin [66].
- Hardware–software codesign: The model architecture and optimization strategy are chosen based on the target hardware. Jetson devices have consistently proven to be suitable choices because they have GPUs that are powerful and built-in support for optimization libraries such as TensorRT [65].
- Application-specific tradeoffs: The acceptable tradeoff between accuracy and speed is highly application-dependent. For instance, inference speed is prioritized in emergency safe-landing systems for drones for the sake of quick, reliable responses, even at the expense of reduced detection accuracy [64]. Conversely, medical diagnostic tools favor maximum accuracy and tolerate somewhat higher latency. Lightweight models built for specific tasks, such as identifying the best palm fruit on harvesting machinery, are tailored to the constraints their applications can tolerate [72].
8. Applications and Data-Driven Case Studies
8.1. Remote Sensing and Aerial Imagery
- Search and rescue and surveillance: Drones are regularly utilized for monitoring large areas. SOD models enable small-target detection (e.g., individuals or vehicles) from aerial imagery, which is important for emergency response and security. Specialized architectures and algorithms, such as the proposed SOD-YOLO, are rapidly being developed to improve the small-object detection performance in UAV scene images [38].
- Precision agriculture: In agriculture, drones equipped with SOD models can perform tasks related to weed monitoring, pest detection, and crop health [5]. For example, when drones identify small insects or the early onset of blight on leaves from a distance, interventions can be targeted to reduce costs and environmental impacts [1]. Researchers have developed specialized SOD models, such as DEMNet, for detecting small instances of tea leaf blight from slightly blurry UAV images for industry quality control [73].
- Infrastructure inspection: Drones are used to efficiently inspect large-scale infrastructure (e.g., wind turbines and energy transmission lines) while keeping inspectors safe. SOD is necessary for small-defect detection (e.g., small cracks, small corrosion areas, and damage from flying particles) on wind turbine blades [7,74]. Detecting these small defects requires UAV imagery taken by small quadcopters under varying weather and lighting conditions.
8.2. Autonomous Driving and Maritime Surveillance
- Drone and ship-based detection: Both UAVs and ship-based camera systems are used for surveillance. Detecting other drones or small boats from these platforms is a key application for security and situational awareness [47].
- Challenges: The maritime environment creates specific challenges, with issues such as wave clutter, sun glare, and weather conditions that easily obscure or mimic small objects. Models need to be resilient to the specific environmental challenges.
8.3. Industrial and Manufacturing Defects
- Defect targets: Defects include scratches, cracks, pinholes, and contamination on the surface of any material (e.g., semiconductors, textiles, metals, etc.). These small and sometimes subtle defects are precisely why humans have a difficult time consistently detecting them over long durations in manufacturing environments.
- Operational benefits: SOD systems based on computer vision offer objective, repeatable, and high-throughput inspection. This is especially relevant in wind turbine manufacturing, where automated deep learning-based quality control systems are now used to identify small defects and imperfections in blades prior to deployment [65]. The cost of failure and later maintenance is far greater than the initial cost of identifying potential issues early, making proactive inspection worthwhile. However, these detection models must achieve high accuracy and precision, as false positives can result in the needless disposal of good products during inspection, while false negatives compromise quality.
8.4. Medical Imaging and Agriculture
- Lesion and nodule detection: SOD models have been trained to identify small lesions via computerized tomography (CT), small polyps in colonoscopy videos, and microcalcifications in mammograms. These small characteristics are often the first indicators of cancer and can be easily missed by the naked eye in early-stage screening assessments.
- Cellular analysis: In the area of digital pathology, SOD techniques are used to count or classify small bodies/cells in high-resolution images of tissues, which is valuable for the determination of the disease grade, as well as for research purposes.
- Architectural developments: The task of detecting small, often ambiguous lesions has fueled the need for new architectures. Transformer-based models, specifically, have shown potential for detecting subtle rib fractures, supporting the value of long-range contextual attention for such detection tasks [75]. Much like agriculture, medical imaging clearly demonstrates the importance of high recall, as missing a small malignant lesion can seriously harm the patient [67]. Other advances in general deep learning-based object detection will ultimately feed into medical object detection (MOD) improvements [75].
- Pest and disease identification: Similar to aerial applications, ground robots and/or stationary cameras integrated with SOD models can be utilized to identify small pests, tiny insects, or disease spots on the leaves of plants that are potentially difficult to recognize unless physically observed [63].
- Automated harvesting: For commercially valuable crops, robotic harvesters may leverage vision systems with SOD to identify and localize individual fruits or vegetables for picking, where small-target detection must be accurate. Because such systems are integrated with mobile harvesting equipment that has limited onboard computing power, small, lightweight detection networks are essential [25].
8.5. Case Study Performance Summary
9. Environmental Adaptation and Multisensor Fusion
9.1. Weather, Terrain, and Illumination Challenges
9.2. Multisensor Fusion
- Early (data-level) fusion: Raw or minimally processed data from different sensors are combined as input, retaining all of the information; however, this assumes near-perfect sensor calibration and synchronization and produces high-dimensional inputs that can be expensive to process.
- Intermediate (feature-level) fusion: Features are extracted independently from each sensor stream and concatenated or fused in the intermediate layers of a neural network. This is a popular approach because it retains much of the information at moderate cost and allows the network to learn complex cross-modal correlations (see the sketch after this list).
- Late (decision-level) fusion: Each sensor stream is processed by an independent detection model, and the resulting outputs (e.g., bounding boxes, class probabilities) are fused together at the end, providing modularity and relatively easy implementation; however, it can be less effective because it loses valuable low-level correlations between sensor modalities.
- RGB + LiDAR: This is one of the most common combinations for autonomous driving applications. LiDAR provides accurate 3D spatial information (distance, geometry) that is invariant to illumination changes and is less affected by some weather conditions than RGB cameras. Fusing LiDAR point clouds with RGB imagery helps localize small objects accurately in 3D space, including disambiguating them from background clutter [75].
- RGB + thermal/infrared: Thermal cameras sense emitted heat radiation instead of reflected light and are particularly useful because they remain effective in low light and at night and can detect animate objects (e.g., pedestrians, animals) against cooler backgrounds, overcoming limitations for 24/7 operation in low-light and obfuscating conditions (e.g., smoke, light fog). Multiperspective RGB and infrared datasets are being established to facilitate research, with a focus on UAV detection in low-visibility conditions [78,79].
- RGB + radar: Radar is exceptionally robust to adverse weather conditions, including heavy rain, fog, and snow. Radar provides precise velocity information and can detect objects at long distances. Although radar data are generally spatially sparse, the fusion of RGB imagery provides valuable object detection and tracking capability in scenarios where vision and LiDAR may fail, which is critical for applications such as adaptive cruise control and early collision warning systems [80].
- RGB + depth (RGB-D): Depth sensors (e.g., stereo cameras, structured light) provide per-pixel depth information that can help the system segment objects from the background and improve scale estimation. This modality fusion is particularly effective in indoor robotics and outdoor applications requiring short-range depth estimation [79].
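As referenced in the intermediate-fusion item above, the following is a minimal sketch of feature-level fusion, assuming the two modality-specific feature maps have already been spatially aligned; the module and channel sizes are illustrative rather than a specific published fusion network.

```python
# Sketch: intermediate (feature-level) fusion of two aligned sensor streams,
# e.g., RGB features and projected LiDAR/thermal features of the same size.
import torch
import torch.nn as nn

class ConcatFusion(nn.Module):
    """Concatenate per-modality feature maps and mix them with a 1x1 conv."""
    def __init__(self, rgb_channels: int, aux_channels: int, out_channels: int):
        super().__init__()
        self.mix = nn.Sequential(
            nn.Conv2d(rgb_channels + aux_channels, out_channels, kernel_size=1),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, rgb_feat, aux_feat):
        return self.mix(torch.cat([rgb_feat, aux_feat], dim=1))

fusion = ConcatFusion(rgb_channels=256, aux_channels=64, out_channels=256)
rgb = torch.randn(1, 256, 80, 80)   # e.g., backbone features from the camera
aux = torch.randn(1, 64, 80, 80)    # e.g., projected LiDAR or thermal features
fused = fusion(rgb, aux)            # fed to the detection neck/head downstream
print(fused.shape)                  # torch.Size([1, 256, 80, 80])
```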
10. Field Taxonomy and Conceptual Maps
10.1. Concept Map of SOD in 2023–2025
- Core node (central): “Small Object Detection (2023–2025)” is the central theme of the concept map.
- Foundations: The “Core Challenges” represent the foundational problems that are the basis of all research in the area, such as low resolution, background clutter, scale differences, and data imbalance.
- Methods and architectures: The major branch on the concept map presents the main algorithmic solutions. For example, a “Multi-Scale Feature Learning” (with FPN, PANet, etc.) strategy directly addresses feature loss and scale differences. The “Transformer & Attention” strategies in models have made progress toward improving feature representation. The “Context Aware Strategies” have made progress toward addressing background clutter. Finally, some “YOLO Series Enhancements”, such as the addition of a P2 small-object head, are specific architectural responses to low resolution.
- Data and datasets: This pillar covers the data-centric aspect of SOD research. “Data Augmentation” (both geometric and generative) is a direct response to the data imbalance issue. The field has also advanced by developing “Specialized Datasets”, reflecting the recognition that better data is a prerequisite for further progress.
- Evaluation and benchmarking: This node connects the methods and the data, covering both the evaluation metrics used (accuracy, such as the AP_S, and efficiency, such as FPS) and the evaluation protocols that allow for fair comparisons.
- Applications and deployment: These last two pillars bring us to real-world applications. Applications such as remote sensing and autonomous driving categorize the main use-case scenarios and their requirements, while the deployment aspect addresses practical constraints: projects need to be “Edge Computing”-aware, employ model compression, and use multisensor fusion to enhance robustness. This is illustrated on the map by the autonomous driving example and its explicit link to multisensor fusion, which is an application-driven requirement.
10.2. Unified Practical Pipeline
- Problem and constraint definition:
- Define “Small”.
- Quantify the absolute and relative (pixel size) values of the target objects.
- Identify environment.
- Characterize the operational conditions (e.g., weather, lighting, and background).
- Define the performance KPIs.
- Define the initial target metrics (e.g., AP_S > 0.4, Latency < 30 ms, etc.).
- Specify constraints.
- Define the hardware (e.g., deployed on a Jetson Orin device), power, memory, etc.
- Data collection and curation:
- Collect data.
- Collect representative data from the target environment and use actual sensors if climate/environmental robustness is the goal.
- Annotate properly.
- Proper bounding boxes; poor annotation is detrimental to SOD.
- Create splits.
- Fixed training, validation, and test splits; these should be maintained throughout group projects, campaigns, etc.
- Baseline model selection:
- Start simple.
- Select a robust and performant detector to start as a baseline (YOLOv8-n/s; YOLOv9-c, etc.).
- Optional pretrained weights.
- Pretrained weights from a large-scale dataset, such as COCO, can provide a strong initialization, even for domain-specific targets.
- Data augmentation strategy:
- Standard augmentations.
- Geometric and photometric (scaling, rotation, color jitter, etc.).
- SOD-specific augmentations.
- Mosaic, MixUp, or copy–paste (to boost small-object frequency/diversity in training scenes).
- Iterative training and tuning:
- Train the baseline.
- Train the chosen baseline model and establish a reference performance.
- Hyperparameter tuning.
- Tune the key parameters (optimizer, learning rate, loss function weights, etc.), paying close attention to the anchor generation or matching strategy.
- Architecture refinement (if needed):
- Evaluate failure cases.
- Utilize validation set performance from the baseline to identify weaknesses (for example, missed detections in low light, false positives in cluttered scenes, etc.).
- Introduce advances.
- Add features based on the above analysis, such as an extra detection head for feature resolution issues (e.g., P2) and attention or context-aware detectors for missing context. Scale variance issues necessitate experimenting with advanced FPN structures (e.g., BiFPN).
- Quantitative and qualitative assessment:
- Primary metrics.
- Determine the AP, AP_S, and AR_S performance metrics, defined on the test set.
- Efficiency metrics.
- FPS, latency, and resource usage versus requirements on the target hardware.
- Qualitative assessment.
- Visualize predictions on difficult examples; analyze false positives and negatives to inform the next iteration. Return to stage 6 for another iteration if needed; if the target metrics are achieved, move on to stage 8 for optimization (a simple KPI gate check is sketched after this pipeline).
- Model optimization for deployment:
- Compression.
- Apply compression (e.g., INT8 quantization, structured pruning).
- Inference engine.
- Compile the model using an optimized runtime (e.g., TensorRT for NVIDIA GPU).
- Deployment and in-field monitoring:
- Deploy.
- Seamlessly integrate the optimized model into the targeted application.
- Monitor.
- Continuously monitor performance in the real world; gather hard cases and additional data to retrain the model periodically and enhance it over its lifecycle.
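To make the KPI-driven loop between stages 1, 7, and 8 concrete, the following is a minimal sketch of a deployment gate check; the threshold values mirror the examples given in stage 1 and are placeholders.

```python
# Sketch: a simple deployment gate that compares measured metrics against the
# KPIs defined in stage 1 (example thresholds from the pipeline above).
TARGETS = {
    "ap_s_min": 0.40,        # AP_S > 0.4
    "latency_ms_max": 30.0,  # latency < 30 ms on the target hardware
}

def passes_deployment_gate(measured):
    """Return True if the model meets both the accuracy and latency KPIs."""
    ok_accuracy = measured["ap_s"] >= TARGETS["ap_s_min"]
    ok_latency = measured["latency_ms"] <= TARGETS["latency_ms_max"]
    return ok_accuracy and ok_latency

# Placeholder numbers standing in for real test-set and on-device measurements.
measured = {"ap_s": 0.43, "latency_ms": 27.5}
print("proceed to optimization/deployment" if passes_deployment_gate(measured)
      else "return to stage 6 for another iteration")
```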
11. Practical Guidance
11.1. Model Selection Plan
- Identify the problem:
- Research/cloud-based: Choose this if maximum accuracy is the only concern and high-end GPU capabilities (A100 or similar) are available. Latency is not a concern. Example: the offline analysis of satellite imagery.
- Balanced/versatile: This is the sweet spot for many applications that need good performance on modern hardware (e.g., RTX 40 series) without strict real-time requirements. Example: a manufacturing quality control system.
- Edge/real time: Choose the model for deployment on resource-constrained devices, such as NVIDIA Jetson or FPGAs, where high FPS and low power consumption are key. Example: onboard drone-based detection.
- Niche/legacy: This quadrant should generally be avoided, as it represents models with both low accuracy and low efficiency.
- Select a starting model: Select a model that can be used as a baseline for the experiments. If the experiment is edge-based, for example, starting with YOLOv8-S would be a better starting point than starting with a large DETR model.
- Consider nuances: This plan is a guide. Ease of use, support from frameworks, and community documentation are also important. For instance, the YOLO series is very favorable in this regard and, as such, is a common choice for practical implementations.
11.2. Scenario Playbooks
11.2.1. Playbook 1: Aerial Imagery (UAV/Satellite)
- Challenge: Detecting extremely small, widely dispersed objects (e.g., cars, people) from high altitudes and varied perspectives.
- Playbook:
- Data preprocessing: Utilize tiled inference. Slice the high-resolution image (e.g., 8K) into smaller overlapping patches (e.g., 1280 × 1280 pixels); this acts as a zoom-in and effectively enlarges small objects relative to the model’s input (see the sketch after this playbook).
- Model choice: Choose a model from the “Balanced” or “Research” quadrant (e.g., YOLOv9-C/E). Inference is likely to be run offline, so latency is less of a concern than accuracy.
- Architectural tweak: The model needs to run on high-resolution feature maps for detection. If it is YOLO, make sure it has a P2 head (detecting on a stride-4 feature map), which is non-negotiable for this use case.
- Augmentation: Use scale-aware augmentation liberally. Implement copy–paste by adding objects from the dataset to synthetically increase the density of small objects in training scenes. Use Mosaic augmentation with large canvas sizes.
- Postprocessing: After running inference on all tiles, apply non-maximum suppression (NMS), or an advanced NMS variant, to the merged detections to remove duplicate boxes from overlapping areas.
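As referenced in the data preprocessing step, the following is a minimal sketch of the tile-and-merge flow; `run_detector` is a hypothetical stand-in for any single-image detector, and the tile size, overlap, and NMS threshold are illustrative.

```python
# Sketch: tiled inference over a large aerial image, followed by global NMS.
# run_detector() is a hypothetical placeholder returning (boxes, scores) per tile,
# with boxes as [x1, y1, x2, y2] in tile coordinates.
import torch
from torchvision.ops import nms

def run_detector(tile):
    """Placeholder: returns empty detections so the sketch runs end to end."""
    return torch.zeros((0, 4)), torch.zeros((0,))

def tiled_inference(image, tile=1280, overlap=256, iou_thr=0.5):
    _, H, W = image.shape
    stride = tile - overlap
    all_boxes, all_scores = [], []
    for y in range(0, max(H - overlap, 1), stride):
        for x in range(0, max(W - overlap, 1), stride):
            patch = image[:, y:y + tile, x:x + tile]
            boxes, scores = run_detector(patch)
            if boxes.numel():
                # Shift tile-local boxes into global image coordinates.
                boxes = boxes + torch.tensor([x, y, x, y], dtype=boxes.dtype)
                all_boxes.append(boxes)
                all_scores.append(scores)
    if not all_boxes:
        return torch.zeros((0, 4)), torch.zeros((0,))
    boxes, scores = torch.cat(all_boxes), torch.cat(all_scores)
    keep = nms(boxes, scores, iou_thr)  # remove duplicates from overlapping tiles
    return boxes[keep], scores[keep]

image = torch.zeros(3, 4000, 6000)  # placeholder for a high-resolution frame
boxes, scores = tiled_inference(image)
print(boxes.shape, scores.shape)
```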
11.2.2. Playbook 2: Autonomous Driving (Embedded Device)
- Challenge: The real-time detection of distant pedestrians, vehicles, and traffic signs on an embedded device (e.g., NVIDIA Jetson Orin) under dynamic weather and lighting conditions.
- Playbook:
- Model choice: Use a model from the “Edge/Real-Time” quadrant (e.g., YOLOv8-S or RT-DETR-R18). The hard constraint is latency.
- Sensor fusion: Do not rely on RGB as a stand-alone sensor; integrate LiDAR/radar. Use an intermediate fusion approach where camera and LiDAR features are fused in the network backbone to leverage both spatial and visual features (thermal camera fusion is also highly recommended for all-weather operation).
- Optimization: Use the NVIDIA TensorRT framework for the model conversion and optimization. Use INT8 quantization for fast inference; however, make sure to provide a representative calibration dataset to minimize accuracy loss.
- Training data: Ensure the training dataset is composed of diverse scenarios: night, rain, fog, and glare. If real-scene data are limited, leverage simulation environments such as CARLA or generative models to create synthetic examples of these cases to include in the training.
- Metrics: In addition to the AP_S, track the end-to-end detection latency (from sensor input to output bounding box) and the power consumption of the device (a latency-measurement sketch follows this playbook).
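The following is a minimal sketch of the end-to-end latency measurement referenced above; `preprocess`, `model`, and `postprocess` are hypothetical stand-ins for the real pipeline stages, and the warm-up and iteration counts are arbitrary. For GPU inference, the device should additionally be synchronized before reading the clock.

```python
# Sketch: end-to-end latency and FPS measurement around the full pipeline
# (preprocess -> inference -> postprocess), not just the forward pass.
import time
import numpy as np

def preprocess(frame):   # placeholder for resize/normalize/tensor conversion
    return frame

def model(inp):          # placeholder for the actual detector forward pass
    return inp

def postprocess(out):    # placeholder for decoding + NMS
    return out

frame = np.zeros((720, 1280, 3), dtype=np.uint8)  # dummy camera frame

# Warm-up iterations matter: caches, engine setup, and clock scaling stabilize.
for _ in range(10):
    postprocess(model(preprocess(frame)))

n_iters = 100
start = time.perf_counter()
for _ in range(n_iters):
    postprocess(model(preprocess(frame)))
elapsed = time.perf_counter() - start

latency_ms = 1000.0 * elapsed / n_iters
print(f"mean end-to-end latency: {latency_ms:.3f} ms "
      f"({1000.0 / latency_ms:.1f} FPS)")
```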
11.2.3. Playbook 3: Industrial Defect Detection
- Challenge: The identification of small, low-contrast defects (scratches, cracks, pinholes, etc.) on uniform surfaces, typically under high-speed production conditions.
- Playbook:
- Environment control: Before the model is touched, optimize the physical environment. Use controlled, consistent lighting and high-resolution, industrial cameras. Sometimes just changing the angle of the light can make a low-contrast defect highly visible.
- Data strategy: This is often a few-shot or one-class problem. You will have many images of normal products and very few defect images.
- Approach A (detection): Use strong augmentation (copy–paste) to place crops of known defects onto normal backgrounds, artificially creating a larger training set (see the sketch after this playbook).
- Approach B (anomaly detection): If the defects vary greatly, you can train an anomaly detection model (e.g., PatchCore, PaDiM) on normal samples only. The model will learn to identify anything that deviates from the norm.
- Model choice: A lightweight CNN model or a small YOLO model is typically sufficient. The background is simple, so complex context aggregation is often overkill.
- Input resolution: Do not downsample the original inputs if possible. Run the image through the model at its native resolution, or tiled, to ensure that defect pixels are not lost.
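As referenced in Approach A, the following is a minimal sketch of copy–paste augmentation for rare defects; the blending is deliberately naive (a hard paste), and the images, coordinates, and function name are all illustrative.

```python
# Sketch: naive copy-paste augmentation for rare small defects.
# Crops a labeled defect from a source image and pastes it at a random location
# on a defect-free background, returning the new image and its bounding box.
import numpy as np

rng = np.random.default_rng(0)

def copy_paste(defect_img, defect_box, background):
    """defect_box is (x1, y1, x2, y2) in defect_img pixel coordinates."""
    x1, y1, x2, y2 = defect_box
    patch = defect_img[y1:y2, x1:x2]
    ph, pw = patch.shape[:2]
    bh, bw = background.shape[:2]

    # Random paste location that keeps the patch fully inside the background.
    px = int(rng.integers(0, bw - pw))
    py = int(rng.integers(0, bh - ph))

    out = background.copy()
    out[py:py + ph, px:px + pw] = patch
    new_box = (px, py, px + pw, py + ph)  # label for the synthesized defect
    return out, new_box

defect_img = np.zeros((256, 256, 3), dtype=np.uint8)    # placeholder source image
background = np.zeros((1024, 1024, 3), dtype=np.uint8)  # placeholder normal product
aug_img, aug_box = copy_paste(defect_img, (100, 120, 130, 140), background)
print(aug_box)
```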
11.3. Deployment Checklist
- The final model performance (AP_S, AR_S) is evaluated on a held-out test dataset that was not used at any point in the training or tuning.
- The performance is evaluated across important subgroups (e.g., across object classes, day/night, types of weather).
12. Discussion: Limitations and Future Directions
12.1. Current Limitations
- Benchmarks and metrics: Although useful for standardization, the current benchmarks and metrics often do not fully capture the complexities of real-world SOD. The most common metric, the average precision for small objects (AP_S), groups all objects below the 32 × 32 pixel threshold together without regard to absolute size, so it does not distinguish between, say, a 5 × 5 pixel object and a 30 × 30 pixel one. A model can therefore score well on the AP_S by detecting the larger “small” objects while still failing on the tiniest ones. A further limitation is that standard metrics do not reflect the practical cost of false positives and false negatives, which matters in safety-critical applications such as autonomous driving or medical diagnosis.
- Generalization gap: Most recent SOD models perform excellently on a single benchmark or domain (e.g., aerial imagery) but generalize poorly to adverse conditions or unseen domains. They often overfit to dataset-specific properties such as consistent lighting, object density, or camera angles, and this limited generalization remains a bottleneck for real-world adoption. Transferring from a static, curated dataset to a dynamic, uncertain application is a critical SOD performance hurdle.
- Computational cost versus performance tradeoffs: High-performance SOD models, especially those using high-resolution inputs, complex feature fusion, and/or large transformer backbones, are computationally heavy. There is a tradeoff between performance and deployment efficiency on resource-constrained platforms such as UAVs and edge or embedded devices, where latency and power consumption are critical constraints [83].
- Annotation scarcity and quality bottlenecks: Deep learning performance is largely tied to the quality and quantity of available data, and for SOD, data scarcity is a severe bottleneck. Accurately annotating small objects is inherently difficult, time-consuming, and costly, which limits dataset quality, quantity, and diversity. Few-shot and zero-shot SOD are even harder, since the scarcity of small-object examples at scale makes it difficult for such models to operate robustly.
12.2. Promising Directions
- Contextual and explainable AI (XAI) models: Future models should move beyond simple feature extraction toward contextual understanding of scenes, object–object relationships, and common-sense-like reasoning to reduce ambiguity across SOD applications. Integrating GNNs or structured knowledge bases is particularly useful for resolving ambiguity when detecting objects in cluttered scenes. Finally, strong XAI tools are needed for SOD, both to debug models that fail in real applications and to build trust in critical applications by explaining model decisions.
- Cross-domain generalization/adaptation: Future research into unsupervised domain adaptation (UDA) and domain generalization (DG) for SOD will be necessary to close the gap between high benchmark scores and real-world performance. Techniques that allow models to adapt to new environments with little or no labeled data would make high-performance SOD far more practical in dynamic settings such as autonomous surveillance and environmental monitoring.
- Efficient architectures and hardware codesign: There is an urgent need for new lightweight SOD architectures for edge devices, including exploring neural architecture search (NAS) methods for efficient SOD backbones, current and future model quantization and pruning methods, and hardware codesign wherein algorithms are developed with hardware accelerators for performance and efficiency. Hybrid models that combine CNN- and transformer-based structures, such as RT-DETR, represent exciting developments in this space [83].
- Multimodal and multisensor fusion: The fusion of data from multiple sensors and modalities (e.g., RGB, thermal/infrared, LiDAR, radar) can provide complementary information that overcomes the deficiencies of any single sensor type. For example, thermal data can be used to detect small objects in dark conditions, while LiDAR data provide depth information. Robust and efficient fusion strategies will be necessary to build trustworthy SOD systems that operate around the clock and in all weather in domains such as autonomous driving and maritime observation [1].
12.3. The Role of Generative AI and Foundation Models
- Data augmentation and synthesis: Generative models (e.g., diffusion models and GANs) provide a potential solution to the data scarcity problem by generating realistic and varied synthetic training data from which large quantities of sparse or small objects can be created to augment existing data. The result is a more balanced classification distribution and robustness to environmental condition variability.
- Foundation models for vision: Large foundation models (e.g., Vision Transformers (ViT) and Vision-Language Models (VLMs)) pretrained on web-scale datasets have exhibited marvelous abilities to learn rich, generalizable visual representations that can be fine-tuned to SOD tasks [86,87,88], which may lead to significant performance improvements, especially for few-shot or open-world detection [89]. In addition, they may provide more interactive and controllable detection systems that understand semantics from text prompts. The union of generative and discriminative visual foundation models is anticipated to further bolster these capabilities into models that could generate and then detect and ultimately reason about the visual world in a more holistic sense [84,90]. The impact of these powerful models is already being explored in medical imaging, where they have reportedly identified small-scale anomalies, such as rib fractures, with success [91].
13. Conclusions
Supplementary Materials
Funding
Data Availability Statement
Conflicts of Interest
Abbreviations
| AI | Artificial Intelligence |
| AR | Average Recall |
| CNN | Convolutional neural network |
| YOLO | You Only Look Once |
| COCO | Common Objects in Context |
| IOU | Intersection over Union |
| PRISMA | Preferred Reporting Items for Systematic Reviews and Meta-Analyses |
| FPN | Feature pyramid network |
| UAV | Unmanned aerial vehicle |
| VIT | Vision Transformer |
| DETR | Detection Transformer |
| AUHF-DETR | Adaptive UAV Hardware-Focused Detection Transformer |
| GAN | Generative Adversarial Network |
| AP | Average precision |
| mAP | Mean average precision |
| FPS | Frames Per Second |
| GFLOPs | Giga Floating-Point Operations |
| BiFPN | Bidirectional Feature Pyramid Network |
| PANet | Path Aggregation Network |
| RAM | Random Access Memory |
| SOD | Small object detection |
| GNN | Graph Neural Network |
| NAS | Neural architecture search |
| VLM | Vision Language Model |
| XAI | Explainable Artificial Intelligence |
| NMS | Non-maximum suppression |
| LiDAR | Light Detection and Ranging |
References
- Wang, J.; Su, J. A review of object detection techniques in IoT-based intelligent transportation systems. Comput. Mater. Contin. 2025, 84, 125–152. [Google Scholar] [CrossRef]
- Muzammul, M.; Li, X. Comprehensive review of deep learning-based tiny object detection: Challenges, strategies, and future directions. Knowl. Inf. Syst. 2025, 67, 3825–3913. [Google Scholar] [CrossRef]
- Khan, Z.; Shen, Y.; Liu, H. Object Detection in Agriculture: A Comprehensive Review of Methods, Applications, Challenges, and Future Directions. Agriculture 2025, 15, 1351. [Google Scholar] [CrossRef]
- Iqra; Giri, K.J.; Javed, M. Small object detection in diverse application landscapes: A survey. Multimed. Tools Appl. 2024, 83, 88645–88680. [Google Scholar] [CrossRef]
- Mu, J.; Su, Q.; Wang, X.; Liang, W.; Xu, S.; Wan, K. A small object detection architecture with concatenated detection heads and multi-head mixed self-attention mechanism. J. Real-Time Image Process. 2024, 21, 184. [Google Scholar] [CrossRef]
- Memari, M.; Shakya, P.; Shekaramiz, M.; Seibi, A.C.; Masoum, M.A.S. Review on the advancements in wind turbine blade inspection: Integrating drone and deep learning technologies for enhanced defect detection. IEEE Access 2024, 12, 33236–33282. [Google Scholar] [CrossRef]
- Liu, X.; Liu, B. EGFE-Net: An edge-guided and feature elimination network for small object detection. Expert Syst. Appl. 2025, 299 Pt A, 129989. [Google Scholar] [CrossRef]
- Nikouei, M.; Baroutian, B.; Nabavi, S.; Taraghi, F.; Aghaei, A.; Sajedi, A.; Ebrahimi Moghaddam, M. Small object detection: A comprehensive survey on challenges, techniques and real-world applications. Intell. Syst. Appl. 2025, 27, 200561. [Google Scholar] [CrossRef]
- Tian, J.; Jin, Q.; Wang, Y.; Yang, J.; Zhang, S.; Sun, D. Performance analysis of deep learning-based object detection algorithms on COCO benchmark: A comparative study. J. Eng. Appl. Sci. 2024, 71, 76. [Google Scholar] [CrossRef]
- Yang, J.; Zhang, X.; Song, C. Research on a small target object detection method for aerial photography based on improved YOLOv7. Vis. Comput. 2025, 41, 3487–3501. [Google Scholar] [CrossRef]
- Chen, Z.; Ji, H.; Zhang, Y.; Zhu, Z.; Li, Y. High-resolution feature pyramid network for small object detection on drone view. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 475–489. [Google Scholar] [CrossRef]
- Zhao, Z.; Du, J.; Li, C.; Fang, X.; Xiao, Y.; Tang, J. Dense tiny object detection: A scene context guided approach and a unified benchmark. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5606913. [Google Scholar] [CrossRef]
- Zhu, H.; Xu, C.; Yang, W.; Zhang, R.; Zhang, Y.; Xia, G.S. Robust tiny object detection in aerial images amidst label noise. arXiv 2024, arXiv:2401.08056. [Google Scholar] [CrossRef]
- Rao, M.K.; Kumar, P.A. Exploring the advancements and challenges of object detection in video surveillance through deep learning: A systematic literature review and outlook. J. Theor. Appl. Inf. Technol. 2025, 103, 6. [Google Scholar]
- Dalal, M.; Mittal, P. A systematic review of deep learning-based object detection in agriculture: Methods, challenges, and future directions. Comput. Mater. Contin. 2025, 84, 57–91. [Google Scholar] [CrossRef]
- Albuquerque, C.; Henriques, R.; Castelli, M. Deep learning-based object detection algorithms in medical imaging: Systematic review. Heliyon 2025, 11, e41137. [Google Scholar] [CrossRef] [PubMed]
- Gonsalves, A.P.; Yadav, R.K. A systematic review of pedestrian detection techniques using deep learning. Intell. Comput. Commun. Tech. 2025, 1, 448–455. [Google Scholar]
- Sun, Y.; Zhang, C.; Li, X.; Jing, X.; Kong, H.; Wang, Q.-G. MDSF-YOLO: Advancing object detection with a multiscale dilated sequence fusion network. IEEE Trans. Neural Netw. Learn. Syst. 2025, early access, 1–12. [Google Scholar] [CrossRef] [PubMed]
- Mao, Y.; Zhang, H.; Li, R.; Zhu, F.; Sun, R.; Ji, P. HSF-DETR: Hyper Scale Fusion Detection Transformer for Multi-Perspective UAV Object Detection. Remote Sens. 2025, 17, 1997. [Google Scholar] [CrossRef]
- Lai, D.; Kang, K.; Xu, K.; Ma, X.; Zhang, Y.; Huang, F.; Chen, J. Enhancing UAV object detection with an efficient multi-scale feature fusion framework. PLoS ONE 2025, 20, e0332408. [Google Scholar] [CrossRef]
- Tang, Z.; Fang, L.; Sun, S.; Gong, Y.; Li, Q. ML-DETR: Multiscale-Lite Detection Transformer for identification of mature cherry tomatoes. IEEE Trans. Instrum. Meas. 2025, 74, 2547018. [Google Scholar] [CrossRef]
- Zhang, H.; Xiao, P.; Yao, F.; Zhang, Q.; Gong, Y. Fusion of multi-scale attention for aerial images small-target detection model based on PARE-YOLO. Sci. Rep. 2025, 15, 4753. [Google Scholar] [CrossRef]
- Wu, Z.; Zhen, H.; Zhang, X.; Bai, X.; Li, X. SEMA-YOLO: Lightweight Small Object Detection in Remote Sensing Image via Shallow-Layer Enhancement and Multi-Scale Adaptation. Remote Sens. 2025, 17, 1917. [Google Scholar] [CrossRef]
- Liu, Y.; Zhao, J.; Xu, C.; Hou, Y.; Jiang, Y. YOLO-Pika: A lightweight improved model of YOLOv8n incorporating Fusion_Block and multi-scale fusion FPN and its application in the precise detection of plateau pikas. Front. Plant Sci. 2025, 16, 1607492. [Google Scholar] [CrossRef]
- Sundaralingam, H. Advancing Object Detection Models: An Investigation Focused on Small Object Detection in Complex Scenes. Ph.D. Thesis, Lakehead University, Thunder Bay, ON, Canada, 2025. [Google Scholar]
- Tong, Y.; Ye, H.; Yang, J.; Yang, X. ACD-DETR: Adaptive Cross-Scale Detection Transformer for Small Object Detection in UAV Imagery. Sensors 2025, 25, 5556. [Google Scholar] [CrossRef] [PubMed]
- Yu, Y.; Huang, M.; Wang, K.; Tang, X.; Bao, J.; Fan, Y. LS-YOLO: A lightweight small-object detection framework with region scaling loss and self-attention for intelligent transportation systems. Signal Image Video Process. 2025, 19, 1005. [Google Scholar] [CrossRef]
- Rekavandi, A.M.; Rashidi, S.; Boussaid, F.; Hoefs, S.; Akbas, E.; Bennamoun, M. Transformers in small object detection: A benchmark and survey of state-of-the-art. ACM Comput. Surv. 2025, 58, 64. [Google Scholar] [CrossRef]
- Ding, G.; Liu, J.; Li, D.; Fu, X.; Zhou, Y.; Zhang, M.; Li, W.; Wang, Y.; Li, C.; Geng, X. A Cross-Stage Focused Small Object Detection Network for Unmanned Aerial Vehicle Assisted Maritime Applications. J. Mar. Sci. Eng. 2025, 13, 82. [Google Scholar] [CrossRef]
- Patel, R.; Chandalia, D.; Nayak, A.; Jeyabose, A.; Jijo, D. CGI-based synthetic data generation and detection pipeline for small objects in aerial imagery. IEEE Access 2025, 13, 61192–61206. [Google Scholar] [CrossRef]
- Chen, Y.; Yan, Z.; Zhu, Y. A unified framework for generative data augmentation: A comprehensive survey. arXiv 2023, arXiv:2310.00277. [Google Scholar] [CrossRef]
- Nisa, U.; Pozi, M.S.M.; Saip, M.A. A decade of research in small object detection: A comprehensive bibliometric analysis. Int. J. Data Sci. Anal. 2025, 20, 7331–7355. [Google Scholar] [CrossRef]
- Li, Y.; Dong, X.; Chen, C.; Zhuang, W.; Lyu, L. A simple background augmentation method for object detection with diffusion model. In Lecture Notes in Computer Science; Computer Vision—ECCV 2024; Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G., Eds.; Springer: Cham, Switzerland, 2025; Volume 15124. [Google Scholar] [CrossRef]
- Nisa, U. Image augmentation approaches for small and tiny object detection in aerial images: A review. Multimed. Tools Appl. 2025, 84, 21521–21568. [Google Scholar] [CrossRef]
- Liu, X.; Luo, X.; Ye, Y.; Huang, X. Potential of diffusion-generated data on salient object detection. IEEE Trans. Multimed. 2025, 1–13. [Google Scholar] [CrossRef]
- Li, P.; Zhang, T.; Qing, C.; Zhang, S. F2SOD: A Federated Few-Shot Object Detection. Electronics 2025, 14, 1651. [Google Scholar] [CrossRef]
- Ferreira, A.D.S.; Ramos, A.P.M.; Junior, J.M.; Gonçalves, W.N. Data augmentation and resolution enhancement using GANs and diffusion models for tree segmentation. arXiv 2025, arXiv:2505.15077. [Google Scholar] [CrossRef]
- Li, Y.; Li, Q.; Pan, J.; Zhou, Y.; Zhu, H.; Wei, H.; Liu, C. SOD-YOLO: Small-Object-Detection Algorithm Based on Improved YOLOv8 for UAV Images. Remote Sens. 2024, 16, 3057. [Google Scholar] [CrossRef]
- Zhu, M.; Zhong, H.; Zhao, C.; Du, Z.; Huang, Z.; Liu, M.; Chen, H.; Zou, C.; Chen, J.; Yang, M.; et al. Active-O3: Empowering multimodal large language models with active perception via GRPO. arXiv 2025, arXiv:2505.21457. [Google Scholar] [CrossRef]
- Feng, C.; Zhong, Y.; Jie, Z.; Xie, W.; Ma, L. InstaGen: Enhancing object detection by training on synthetic dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 17–21 June 2024; pp. 14121–14130. [Google Scholar]
- Flores-Calero, M.; Astudillo, C.A.; Guevara, D.; Maza, J.; Lita, B.S.; Defaz, B.; Ante, J.S.; Zabala-Blanco, D.; Armingol Moreno, J.M. Traffic Sign Detection and Recognition Using YOLO Object Detection Algorithm: A Systematic Review. Mathematics 2024, 12, 297. [Google Scholar] [CrossRef]
- Jing, S.; Guo, G.; Xu, X.; Zhao, Y.; Wang, H.; Lv, H.; Feng, Y.; Zhang, Y. ESVT: Event-based streaming vision transformer for challenging object detection. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5607113. [Google Scholar] [CrossRef]
- Tang, D.; Tang, S.; Wang, Y.; Guan, S.; Jin, Y. A global object-oriented dynamic network for low-altitude remote sensing object detection. Sci. Rep. 2025, 15, 19071. [Google Scholar] [CrossRef]
- Liu, D.; Zhang, J.; Qi, Y.; Xi, Y.; Jin, J. Exploring lightweight structures for tiny object detection in remote sensing images. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5623215. [Google Scholar] [CrossRef]
- Shi, S.; Fang, Q.; Xu, X.; Dong, D. Multiscale Gaussian attention mechanism for tiny-object detection in remote sensing images. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5635216. [Google Scholar] [CrossRef]
- Liu, Z.; Jiang, H.; Zhong, T.; Wu, Z.; Ma, C.; Li, Y.; Yu, X.; Zhang, Y.; Pan, Y.; Shu, P.; et al. Holistic evaluation of GPT-4V for biomedical imaging. arXiv 2023, arXiv:2312.05256. [Google Scholar] [CrossRef]
- Rekavandi, A.M.; Xu, L.; Boussaid, F.; Seghouane, A.-K.; Hoefs, S.; Bennamoun, M. A guide to image- and video-based small object detection using deep learning: Case study of maritime surveillance. IEEE Trans. Intell. Transp. Syst. 2025, 26, 2851–2879. [Google Scholar] [CrossRef]
- Farooq, M.A.; Shariff, W.; O’Callaghan, D.; Merla, A.; Corcoran, P. On the role of thermal imaging in automotive applications: A critical review. IEEE Access 2023, 11, 25152–25173. [Google Scholar] [CrossRef]
- Costa, D.; Silva, C.; Costa, J.; Ribeiro, B. Enhancing pest detection models through improved annotations. In Progress in Artificial Intelligence, Proceedings of the 22nd EPIA Conference on Artificial Intelligence, EPIA 2023, Faial Island, Azores, 5–8 September 2023; Moniz, N., Vale, Z., Cascalho, J., Silva, C., Sebastião, R., Eds.; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2023; Volume 14116. [Google Scholar] [CrossRef]
- Nie, Y.; Fang, C.; Cheng, L.; Lin, L.; Li, G. Adapting object size variance and class imbalance for semi-supervised object detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023; pp. 1966–1974. [Google Scholar] [CrossRef]
- Mudavath, T.; Mamidi, A. Object detection challenges: Navigating through varied weather conditions—A comprehensive survey. J. Ambient Intell. Humaniz. Comput. 2025, 16, 443–457. [Google Scholar] [CrossRef]
- Jiang, Y.; Yan, X.; Ji, G.-P.; Fu, K.; Sun, M.; Xiong, H.; Fan, D.-P.; Khan, F.S. Effectiveness assessment of recent large vision–language models. Vis. Intell. 2024, 2, 17. [Google Scholar] [CrossRef]
- Nie, J.; Wang, Y.; Yu, Z.; Zhou, S.; Lei, J. High-precision grain size analysis of laser-sintered Al2O3 ceramics using a deep-learning-based ceramic grains detection neural network. Comput. Mater. Sci. 2025, 250, 113724. [Google Scholar] [CrossRef]
- Guo, H.; Wu, Q.; Wang, Y. AUHF-DETR: A Lightweight Transformer with Spatial Attention and Wavelet Convolution for Embedded UAV Small Object Detection. Remote Sens. 2025, 17, 1920. [Google Scholar] [CrossRef]
- Han, J.; Sun, F.; Hou, Y.; Sun, J.; Li, H. Exploring a lightweight and efficient network for salient object detection in ORSI. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5631014. [Google Scholar] [CrossRef]
- Liu, J.; Wang, Y.; Cao, Y.; Guo, C.; Shi, P.; Li, P. Unified Spatial-Frequency Modeling and Alignment for Multi-Scale Small Object Detection. Symmetry 2025, 17, 242. [Google Scholar] [CrossRef]
- Zhao, H.; Zhang, S.; Peng, X.; Lu, Z.; Li, G. Improved object detection method for autonomous driving based on DETR. Front. Neurorobotics 2025, 18, 1484276. [Google Scholar] [CrossRef] [PubMed]
- Liu, K.; Fu, Z.; Jin, S.; Chen, Z.; Zhou, F.; Jiang, R.; Chen, Y.; Ye, J. ESOD: Efficient small object detection on high-resolution images. IEEE Trans. Image Process. 2025, 34, 183–195. [Google Scholar] [CrossRef] [PubMed]
- Qian, B.; Qian, J.; Wen, Z.; Wu, D.; He, S.; Chen, J.; Ranjan, R. DEEPCON: Improving distributed deep learning model consistency in edge–cloud environments via distillation. IEEE Trans. Cogn. Commun. Netw. 2025. [Google Scholar] [CrossRef]
- Radulov, N.; Zhang, Y.; Bujanca, M.; Ye, R.; Luján, M. A framework for reproducible benchmarking and performance diagnosis of SLAM systems. In Proceedings of the 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Abu Dhabi, United Arab Emirates, 14–18 October 2024; pp. 14225–14232. [Google Scholar] [CrossRef]
- Khan, A.; Ullah, H.; Munir, A. LiteCOD: Lightweight Camouflaged Object Detection via Holistic Understanding of Local-Global Features and Multi-Scale Fusion. AI 2025, 6, 197. [Google Scholar] [CrossRef]
- Sheikholeslami, S.; Ghasemirahni, H.; Payberah, A.H.; Wang, T.; Dowling, J.; Vlassov, V. Utilizing large language models for ablation studies in machine learning and deep learning. In Proceedings of the 5th Workshop on Machine Learning and Systems (EuroMLSys 2025), World Trade Center, Rotterdam, The Netherlands, 31 March 2025; Association for Computing Machinery: New York, NY, USA, 2025; pp. 230–237. [Google Scholar] [CrossRef]
- Sharma, L.; Rihan, M.; Rana, N.K.; Dube, S.K.; Asgher, M.S. Advanced modeling of forest fire susceptibility and sensitivity analysis using hyperparameter-tuned deep learning techniques in the Rajouri district, Jammu and Kashmir. Adv. Space Res. 2025, 76, 614–632. [Google Scholar] [CrossRef]
- Bong, H.M.; de Azambuja, R.; Beltrame, G. BlabberSeg: Real-time embedded open-vocabulary aerial segmentation. arXiv 2024, arXiv:2410.12979. [Google Scholar] [CrossRef]
- Yang, Z.; Khan, Z.; Shen, Y.; Liu, H. GTDR-YOLOv12: Optimizing YOLO for Efficient and Accurate Weed Detection in Agriculture. Agronomy 2025, 15, 1824. [Google Scholar] [CrossRef]
- Gao, X.; Gao, J.; Qureshi, W.A. Applications, Trends, and Challenges of Precision Weed Control Technologies Based on Deep Learning and Machine Vision. Agronomy 2025, 15, 1954. [Google Scholar] [CrossRef]
- Lokhande, H.; Ganorkar, S.R. Object detection in video surveillance using MobileNetV2 on resource-constrained low-power edge devices. Bull. Electr. Eng. Inform. 2025, 14, 357–365. [Google Scholar] [CrossRef]
- Xiao, L.; Li, W.; Yao, S.; Liu, H.; Ren, D. High-precision and lightweight small-target detection algorithm for low-cost edge intelligence. Sci. Rep. 2024, 14, 23542. [Google Scholar] [CrossRef]
- Ayoub, K. Dissertation Submitted to the Department of Computer Science in Partial Fulfillment of the Requirements for Engineer’s Degree in Computer Science Specialty. Engineer’s Thesis, Higher School of Computer Science, Amizour, Algeria, 2025. [Google Scholar] [CrossRef]
- Surantha, N.; Sutisna, N. Key Considerations for Real-Time Object Recognition on Edge Computing Devices. Appl. Sci. 2025, 15, 7533. [Google Scholar] [CrossRef]
- Raza, S.M.; Abidi, S.M.H.; Masuduzzaman, M.; Shin, S.Y. Survey on the application with lightweight deep learning models for edge devices. TechRxiv 2025, preprint. [Google Scholar] [CrossRef]
- Li, J.; Zhang, T.; Luo, Q.; Zeng, S.; Luo, X.; Chen, C.L.P.; Yang, C. A lightweight palm fruit detection network for harvesting equipment integrates binocular depth matching. Comput. Electron. Agric. 2025, 233, 110061. [Google Scholar] [CrossRef]
- Gu, Y.; Jing, Y.; Li, H.-D.; Shi, J.; Lin, H. DEMNet: A Small Object Detection Method for Tea Leaf Blight in Slightly Blurry UAV Remote Sensing Images. Remote Sens. 2025, 17, 1967. [Google Scholar] [CrossRef]
- Zhang, Z.; Shu, Z. Unmanned Aerial Vehicle (UAV)-Assisted Damage Detection of Wind Turbine Blades: A Review. Energies 2024, 17, 3731. [Google Scholar] [CrossRef]
- Palladin, E.; Dietze, R.; Narayanan, P.; Bijelic, M.; Heide, F. SAMFusion: Sensor-adaptive multimodal fusion for 3D object detection in adverse weather. In Computer Vision—ECCV 2024, Proceedings of the 18th European Conference, Milan, Italy, 29 September–4 October 2024; Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G., Eds.; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2025; Volume 15119, pp. 484–503. [Google Scholar] [CrossRef]
- Platel, A.; Sandino, J.; Shaw, J.; Bollard, B.; Gonzalez, F. Advancing Sparse Vegetation Monitoring in the Arctic and Antarctic: A Review of Satellite and UAV Remote Sensing, Machine Learning, and Sensor Fusion. Remote Sens. 2025, 17, 1513. [Google Scholar] [CrossRef]
- Zhang, X.; Li, J.; Li, Z.; Liu, H.; Zhou, M.; Wang, L.; Zou, Z. Multi-Sensor Fusion for Autonomous Driving; Springer: Singapore, 2023. [Google Scholar] [CrossRef]
- Tavaris, D.; de Zan, A.; Toma, A.; Foresti, G.L.; Scagnetto, I.; Martinel, N. Multi-perspective RGB and infrared dataset for UAV detection. IEEE Access 2025, 13, 168792–168803. [Google Scholar] [CrossRef]
- Brenner, M.; Reyes, N.H.; Susnjak, T.; Barczak, A.L.C. RGB-D and thermal sensor fusion: A systematic literature review. IEEE Access 2023, 11, 82410–82442. [Google Scholar] [CrossRef]
- Meydani, A. State-of-the-art analysis of the performance of the sensors utilized in autonomous vehicles in extreme conditions. In Artificial Intelligence and Smart Vehicles ICAISV 2023; Communications in Computer and Information Science; Ghatee, M., Hashemi, S.M., Eds.; Springer: Cham, Switzerland, 2023; Volume 1883, pp. 137–166. [Google Scholar] [CrossRef]
- Sharma, M. Multimodal Data Fusion and Model Compression Methods for Computer Vision. Ph.D. Thesis, Rochester Institute of Technology, Rochester, NY, USA, 2024. [Google Scholar]
- Cai, Z.; Wen, C.; Bao, L.; Ma, H.; Yan, Z.; Li, J.; Gao, X.; Yu, L. Fine-Scale Grassland Classification Using UAV-Based Multi-Sensor Image Fusion and Deep Learning. Remote Sens. 2025, 17, 3190. [Google Scholar] [CrossRef]
- El Zeinaty, C.; Hamidouche, W.; Herrou, G.; Menard, D. Designing object detection models for TinyML: Foundations, comparative analysis, challenges, and emerging solutions. ACM Comput. Surv. 2025, 58, 50. [Google Scholar] [CrossRef]
- Liu, X.; Zhou, T.; Wang, C.; Wang, Y.; Wang, Y.; Cao, Q.; Du, W.; Yang, Y.; He, J.; Qiao, Y.; et al. Toward the unification of generative and discriminative visual foundation model: A survey. Vis. Comput. 2025, 41, 3371–3412. [Google Scholar] [CrossRef]
- Giacalone, E. AI-Powered Autonomous Industrial Monitoring: Integrating Robotics, Computer Vision, and Generative AI. Ph.D. Thesis, Politecnico di Torino, Turin, Italy, 2025. Available online: https://webthesis.biblio.polito.it/35371/ (accessed on 15 October 2025).
- Edozie, E.; Shuaibu, A.N.; John, U.K.; Sadiq, B.O. Comprehensive review of recent developments in visual object detection based on deep learning. Artif. Intell. Rev. 2025, 58, 277. [Google Scholar] [CrossRef]
- Awais, M.; Naseer, M.; Khan, S.; Anwer, R.M.; Cholakkal, H.; Shah, M.; Yang, M.H.; Khan, F.S. Foundation models defining a new era in vision: A survey and outlook. IEEE Trans. Pattern Anal. Mach. Intell. 2025, 47, 2245–2264. [Google Scholar] [CrossRef] [PubMed]
- Guleria, A.; Varshney, K.; Jindal, S. A systematic review: Object detection. AI Soc. 2025, 1–18. [Google Scholar] [CrossRef]
- Sapkota, R.; Roumeliotis, K.I.; Cheppally, R.H.; Flores Calero, M.; Karkee, M. A review of 3D object detection with vision–language models. arXiv 2025, arXiv:2504.18738. [Google Scholar] [CrossRef]
- Hussain, A.; Ali, S.; Farwa, U.E.; Mozumder, M.A.I.; Kim, H.-C. Foundation models: From current developments, challenges, and risks to future opportunities. In Proceedings of the 27th International Conference on Advanced Communications Technology (ICACT), Pyeong Chang, Republic of Korea, 16–19 February 2025; pp. 51–58. [Google Scholar] [CrossRef]
- Saraei, M.; Lalinia, M.; Lee, E.-J. Deep learning-based medical object detection: A survey. IEEE Access 2025, 13, 53019–53038. [Google Scholar] [CrossRef]
| Model | Backbone | Input Size | AP_S (%) | mAP (%) | Params (M) | GFLOPs | FPS | Primary Contribution | Citation |
|---|---|---|---|---|---|---|---|---|---|
| SOD-YOLOv8 | CSPDarknet | 640 × 640 | 33.2 | 53.8 | 44.1 | 168.1 | 118 | Spatial-Frequency Modeling | [38] |
| Pika-YOLOv8n | CSP-Light | 640 × 640 | 18.5 | 36.2 | 32.1 | 32 | 30 | Lightweight Fusion Block | [52] |
| AUHF-DETR | MobileNetv3 | 640 × 1280 | 28.9 | 49.5 | 5.8 | 8.9 | 68 | Lightweight Transformer for UAV | [39] |
| ESOD-RetinaNet | ResNet-101 | 1200 × 800 | 25.6 | 43.1 | 44.6 | 239 | 15 | Efficient High-Res Processing | [55] |
| MGA-Net | ResNet-50 | 640 × 640 | 30.1 | 51.2 | 28.5 | 98 | 95 | Multiscale Gaussian Attention | [42] |
| Improved-DETR | ResNet-50 | 800 × 800 | 29.5 | 48.9 | 41.5 | 86 | 32 | Efficient DETR for Driving | [44] |
| Domain | Model | Input Size | AP_S (%) | mAP (%) | FPS | Latency (ms) | Notes | Citation |
|---|---|---|---|---|---|---|---|---|
| Remote Sensing/UAV Aerial Imagery | SOD-YOLOv8 | 640 × 640 | 33.2 | 53.8 | 118 | 8.5 | Spatial–frequency modeling; high AP_S; high accuracy | [38] |
| Autonomous Driving | Improved-DETR | 800 × 800 | 29.5 | 48.9 | 32 | 31.2 | Efficient DETR for driving; accuracy–efficiency tradeoff | [44] |
| UAV-Embedded (Edge-Constrained) | AUHF-DETR | 640 × 1280 | 28.9 | 49.5 | 68 | 14.7 | Lightweight transformer for UAVs; high FPS; strict latency | [39] |
| Industrial Inspection (High-Resolution) | ESOD-RetinaNet | 1200 × 800 | 25.6 | 43.1 | 15 | 66.7 | Efficient high-res processing; favors detail preservation | [55] |
| Surveillance/Real Time (lightweight YOLO family) | Pika-YOLOv8n | 640 × 640 | 18.5 | 36.2 | 30 | 33.3 | Lightweight fusion block; prioritizes FPS over AP_S | [52] |
| General SOD (multiscale attention) | MGA-Net | 640 × 640 | 30.1 | 51.2 | 95 | 10.5 | Multiscale Gaussian attention; strong AP_S with high FPS | [42] |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

