Review

Small Object Detection in Traffic Scenes for Mobile Robots: Challenges, Strategies, and Future Directions

1 School of Computer Science, Civil Aviation Flight University of China, Guanghan 618307, China
2 Guanghan Flight College, Civil Aviation Flight University of China, Guanghan 618307, China
* Author to whom correspondence should be addressed.
Electronics 2025, 14(13), 2614; https://doi.org/10.3390/electronics14132614
Submission received: 4 June 2025 / Revised: 25 June 2025 / Accepted: 26 June 2025 / Published: 28 June 2025
(This article belongs to the Special Issue New Trends in Computer Vision and Image Processing)

Abstract

Small object detection in traffic scenes presents unique challenges for mobile robots operating under constrained computational resources and highly dynamic environments. Unlike general object detection, small targets often suffer from low resolution, weak semantic cues, and frequent occlusion, especially in complex outdoor scenarios. This study systematically analyses the challenges, technical advances, and deployment strategies for small object detection tailored to mobile robotic platforms. We categorise existing approaches into three main strategies: feature enhancement (e.g., multi-scale fusion, attention mechanisms), network architecture optimisation (e.g., lightweight backbones, anchor-free heads), and data-driven techniques (e.g., augmentation, simulation, transfer learning). Furthermore, we examine deployment techniques on embedded devices such as Jetson Nano and Raspberry Pi, and we highlight multi-modal sensor fusion using Light Detection and Ranging (LiDAR), cameras, and Inertial Measurement Units (IMUs) for enhanced environmental perception. A comparative study of public datasets and evaluation metrics is provided to identify current limitations in real-world benchmarking. Finally, we discuss future directions, including robust detection under extreme conditions and human-in-the-loop incremental learning frameworks. This research aims to offer a comprehensive technical reference for researchers and practitioners developing small object detection systems for real-world robotic applications.

1. Introduction

1.1. Academic and Engineering Significance of Mobile Robot Perception

Mobile robots, the result of the deep integration of artificial intelligence and embedded systems, are now widely applied in smart transportation, urban patrol, autonomous driving, logistics and delivery, and disaster relief [1]. As urban traffic environments grow increasingly complex, mobile robots place higher demands on the performance of environmental perception systems. In particular, in traffic scenes, robots must accurately and in real time identify and interpret various dynamic and static objects in their surroundings, such as vehicles, pedestrians, traffic signs, guardrails, cones, and other traffic facilities. The perception of these objects directly impacts the robot’s behavioral decisions and safety [2].
Object detection, as one of the core technologies of perception systems, significantly influences the efficiency and reliability of the entire system. In recent years, object detection has achieved substantial advances in large-object recognition, thanks to the rapid development of deep learning technology [3]. However, detecting small objects efficiently on resource-constrained mobile platforms still presents numerous challenges [4]. Therefore, research on small object detection technologies suitable for mobile robots holds not only significant academic value but also practical importance in promoting the real-world deployment of intelligent transportation systems.

1.2. Characteristics and Technical Bottlenecks of Small Object Detection

Compared with conventional targets, small objects occupy image regions of smaller scale and limited pixel count. Their prominent characteristics include a small area ratio, blurred boundaries, and a tendency to be overwhelmed by background information; weak semantic content, with convolutional neural networks prone to losing critical features after multiple downsampling operations; vulnerability to environmental disturbances such as occlusion, blurring, and lighting variation; and high annotation difficulty, limited training data, and a tendency to overfit [5].
In traffic scenes, small objects are diverse in type, including distant pedestrians, traffic signs, warning cones, and road debris. Although small in size, these objects are closely related to traffic safety, and failure to identify them can result in serious consequences. Current mainstream object detection algorithms, such as You Only Look Once (YOLO), the Faster Region-based Convolutional Neural Network (Faster R-CNN), and the Single Shot MultiBox Detector (SSD), often struggle with insufficient accuracy, imprecise localization, and high false negative rates when detecting small objects [6]. Additionally, mobile robots typically operate on embedded platforms with limited computational resources, power budgets, and model capacity, further constraining the complexity and deployability of small object detection models [7]. Therefore, balancing detection accuracy and computational efficiency while enhancing mobile robots’ capabilities in perceiving small objects remains a key challenge and research focus in the field of intelligent perception [8].
This study aims to review recent research progress in small object detection within traffic scenes, with a particular focus on its application to resource-constrained mobile robot platforms. Specifically, it will systematically outline the definition, challenges, and evaluation criteria of small object detection; analyze the applicability of mainstream detection frameworks in traffic environments; and discuss key advancements in model design, feature enhancement, multi-scale fusion, and data augmentation. In addition, it will summarize optimization strategies suitable for embedded or mobile platforms, such as lightweight network structures, edge inference methods, and hardware–software co-design. By comparing the advantages, limitations, and applicable scenes of different methods, this study seeks to provide systematic references and technical support for future researchers engaged in designing small object detection systems for practical deployment.
This study is organized as follows. Section 2 analyzes the core challenges and system requirements from the perspectives of environmental complexity, object characteristics, and platform constraints. Building on this, Section 3 provides a detailed overview of core technological advancements, including feature enhancement, architectural innovation, and data-driven strategies. Section 4 shifts the focus to mobile robot-specific solutions, discussing sensor adaptation, multimodal perception, and deployment under embedded constraints. Section 5 examines the current evaluation systems and benchmarking datasets, highlighting the limitations of existing metrics in real-world traffic scenarios. Finally, Section 6 outlines open issues and future research directions, proposing improvements in generalization, learning efficiency, latency, and platform-specific optimization to guide subsequent developments.

2. Analysis of Challenges and Requirements in Traffic Scenes

In real-world traffic environments, the factors affecting small object detection performance extend well beyond the characteristics of the objects themselves, encompassing complex and dynamic scene conditions as well as system performance limitations. This section presents a systematic analysis of the challenges associated with small object detection in traffic scenes from three perspectives: environmental factors, object characteristics, and system constraints. It serves as a foundation for the subsequent discussion of proposed solutions.

2.1. Environmental Factors

Traffic scenes are characterized by highly dynamic and uncertain environmental conditions, particularly in outdoor settings where factors such as low light, complex illumination, and adverse weather pose significant challenges for object detection systems.
In low-light and nighttime environments, such as tunnels or poorly lit roads, reduced image contrast and increased noise severely impair the visibility of small objects such as traffic signs and reflective markings. Traditional image enhancement techniques often struggle to balance noise suppression with detail preservation, while deep learning-based methods, although more effective, require extensive training data and computational resources [9]. In contrast, bright daylight introduces its own complications: strong sunlight can cause intense reflections, glare, and deep shadows, resulting in blurred boundaries and partial object occlusion, thereby degrading detection accuracy [10]. As illustrated in Figure 1, in the right-hand image, captured under nighttime glare conditions, the detector erroneously identifies trees as giraffes, in contrast to the correct detection in the left-hand image. Moreover, adverse weather conditions such as rain, snow, and fog introduce a range of visual disturbances, including occlusion, motion blur, water droplet artifacts, and light scattering effects [11]. To ensure reliable performance in such scenarios, small object detection systems must exhibit strong environmental robustness. This can be achieved through advanced techniques such as targeted data augmentation, domain adaptation, and multimodal perception strategies that enhance model generalizability across diverse environmental conditions.

2.2. Target Characteristics

The geometric and semantic characteristics of small objects are critical intrinsic factors contributing to the difficulty of their detection, particularly in complex traffic scenes. These challenges typically manifest in three main forms. Firstly, small objects generally occupy very limited pixel areas (e.g., fewer than 32 × 32), making it difficult to preserve sufficient structural detail in deep feature maps. As deep neural networks conduct multiple downsampling operations, such information is often lost or blurred, which significantly hampers both classification and localization performance [12]. Secondly, small traffic objects exhibit considerable shape diversity. They encompass a broad range of categories, including traffic signs, cones, mirrors, vehicle tail lights, and signboards, and they often share strong visual similarity and low inter-class variance, thereby increasing the likelihood of misclassification and false detections [13]. As shown in the left-hand image of Figure 1, the detection accuracy for vehicles partially occluded by trees is notably lower than that for unobstructed vehicles. Thirdly, such objects are frequently subject to partial or complete occlusion by surrounding elements such as vehicles, vegetation, or pedestrians, or they may appear only momentarily in dynamic scenes. In such situations, accurate detection requires robust contextual modeling and spatio-temporal reasoning capabilities [14]. To address these intrinsic challenges, detection models must incorporate targeted advancements in feature extraction, scale-aware representation, attention mechanisms, and contextual enhancement techniques.

2.3. System Constraints

When deploying object detection models on embedded platforms such as mobile robots, system resources and task demands impose additional constraints on model performance, primarily in the following areas. Firstly, real-time responsiveness is essential: mobile robots operating in traffic scenes typically require detection to be performed at frame rates no lower than 10–30 FPS to ensure timely system responses. For example, in autonomous driving or unmanned patrol vehicles traveling at high speeds, low-latency detection is critical for obstacle avoidance and path planning [15]. Secondly, computational limitations pose a significant challenge. Mobile platforms are often equipped with low-power CPUs, ARM-based processors, or lightweight GPUs (e.g., Jetson Nano, Raspberry Pi), which are insufficient for large-scale computations. Although complex models may deliver superior accuracy, they are frequently impractical for deployment on embedded systems. Thirdly, energy and thermal constraints further restrict deployment possibilities. Due to limited space and battery capacity, embedded systems demand low power consumption and efficient heat dissipation during model inference. As such, models must sustain high detection performance while operating within strict power budgets [16]. Therefore, designing small object detection models for mobile platforms necessitates not only the optimization of accuracy and recall but also careful consideration of model compactness, inference speed, and hardware compatibility, ultimately aiming to balance the core metrics of accuracy, efficiency, and real-time performance.

3. Core Technological Developments

In response to the challenges of small object scale, weak feature representation, and high susceptibility to occlusion in traffic scene object detection, researchers have proposed a series of improvement strategies from various perspectives. These strategies can be broadly categorized into three main approaches: feature enhancement, network architecture innovation, and data-driven optimization. This section will provide a comprehensive summary and analysis of each category.

3.1. Feature Enhancement Methods

The core of feature enhancement methods lies in improving the presence and representational capacity of small objects within deep semantic features. Common techniques include multi-scale fusion, attention mechanisms, and context modeling.

3.1.1. Multi-Scale Feature Fusion

Small objects are more easily identifiable in high-resolution feature maps; however, they tend to exhibit weaker deep semantic information. To address this issue, the Feature Pyramid Network (FPN) effectively fuses spatial detail and semantic information by horizontally connecting shallow and deep features, serving as a foundational module in many small object detection frameworks. Further enhancements, such as the Bidirectional Feature Pyramid Network (BiFPN) and the Path Aggregation Network (PANet), introduce weighted or cross-scale fusion mechanisms to strengthen multi-layer semantic representation.
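To make the fusion concrete, the following is a minimal PyTorch sketch of an FPN-style top-down pathway with lateral connections; the stage channel widths, nearest-neighbour upsampling, and module names are illustrative assumptions rather than any specific published configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleFPN(nn.Module):
    """Minimal FPN-style top-down fusion with lateral connections.

    `in_channels` lists the channel widths of backbone stages C3-C5
    (illustrative values); every output level is reduced to `out_channels`.
    """
    def __init__(self, in_channels=(256, 512, 1024), out_channels=256):
        super().__init__()
        # 1x1 lateral convolutions align channel widths across stages.
        self.laterals = nn.ModuleList(
            nn.Conv2d(c, out_channels, kernel_size=1) for c in in_channels
        )
        # 3x3 convolutions smooth the fused maps.
        self.smooth = nn.ModuleList(
            nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1)
            for _ in in_channels
        )

    def forward(self, feats):
        # feats: [C3, C4, C5], ordered from high to low resolution.
        laterals = [l(f) for l, f in zip(self.laterals, feats)]
        # Top-down pathway: upsample deeper maps and add them to shallower ones.
        for i in range(len(laterals) - 1, 0, -1):
            laterals[i - 1] = laterals[i - 1] + F.interpolate(
                laterals[i], size=laterals[i - 1].shape[-2:], mode="nearest"
            )
        return [s(l) for s, l in zip(self.smooth, laterals)]

# Example with dummy backbone features at strides 8/16/32 of a 256x256 input.
c3 = torch.randn(1, 256, 32, 32)
c4 = torch.randn(1, 512, 16, 16)
c5 = torch.randn(1, 1024, 8, 8)
p3, p4, p5 = SimpleFPN()([c3, c4, c5])   # all outputs have 256 channels
```

The shallow, high-resolution level P3, enriched with deep semantics by the top-down addition, is the one most relevant for small objects.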
Seung-Wook Kim et al. [17] developed an object detector based on convolutional neural networks (CNNs), replacing traditional feature-based image pyramids with progressively constructed pyramid-shaped feature layers. However, variations in abstraction levels across CNN layers often limit detection performance, particularly for small objects. To address this limitation, they proposed the Parallel Feature Pyramid Network (PFPNet), where the pyramid is constructed by widening the network rather than deepening it. Initially, spatial pyramid pooling and several feature transformations are employed to produce feature maps of varying sizes. In PFPNet, these additional transformations are executed in parallel, resulting in feature maps with consistent semantic abstraction across different scales. The elements of the feature pools are then resized to a uniform scale, and contextual information is aggregated to construct the final feature pyramid. Experimental results show that PFPNet improves the performance of the latest Single-Shot MultiBox Detector (SSD) by 6.4% mAP on the MS COCO dataset, with a notable 7.8% increase in APsmall.
Yuqi Chen et al. [18] proposed an efficient Enhanced Semantic Feature Pyramid Network (ES-FPN), which integrates high-level semantic information with low-level contextual cues to improve multi-scale feature learning for small object detection. Specifically, the network exploits rich semantic content through lateral connections to enhance feature semantics, and it utilizes abundant contextual information from low-level, high-resolution features to recover details lost in higher-level, low-resolution representations. This approach mitigates contextual information loss during progressive fusion, effectively preventing object disappearance, a critical factor for small object detection. Finally, ES-FPN progressively fuses features across all layers, yielding final representations with stronger semantic properties conducive to accurate localization. Extensive experiments on three standard object detection benchmarks (MS COCO, PASCAL VOC, and Cityscapes) demonstrate that ES-FPN accurately localizes objects with clear boundaries and relatively complete structures, outperforming existing feature pyramid-based methods.
Ma P et al. [19] introduced an Improved Small Object Detection (ISOD) network aimed at achieving both speed and accuracy. This model employs an efficient channel attention mechanism to extract features from the backbone, and it incorporates an extended scale feature pyramid network to simplify computation by introducing additional high-resolution pyramid layers. These modifications enhance the network’s capability in detecting small objects. To evaluate ISOD’s effectiveness, experiments were conducted on a reflective vest scene dataset and the Tsinghua-Tencent 100K dataset, achieving 0.425 and 0.635 mAP@[0.5:0.95], respectively. These results surpass those of the state-of-the-art YOLOv7 model, demonstrating ISOD’s superior small object detection performance and scalability.

3.1.2. Attention Mechanisms

Attention mechanisms help models to focus on small objects by highlighting salient regions and suppressing redundant background information. Common methods include channel attention (e.g., SE, ECA), spatial attention (e.g., CBAM), and their combined designs. This is particularly beneficial in traffic scenes, where small objects are easily obscured by the background. Attention mechanisms have shown a significant impact on improving detection rates and reducing false positives.
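As an illustration, the following is a minimal PyTorch sketch of a Squeeze-and-Excitation (SE) style channel attention block; the reduction ratio and the placement of the block inside a detector are illustrative choices, not a prescribed design.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Minimal Squeeze-and-Excitation channel attention.

    Globally pools each channel ("squeeze"), passes the descriptor through a
    small bottleneck MLP ("excitation"), and rescales the feature map so that
    informative channels are emphasised and background-dominated channels
    are suppressed.
    """
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = x.mean(dim=(2, 3))            # squeeze: global average pooling -> (b, c)
        w = self.fc(w).view(b, c, 1, 1)   # excitation: per-channel weights in (0, 1)
        return x * w                      # reweight the original feature map

# Example: reweight a 64-channel feature map; output shape equals input shape.
out = SEBlock(64)(torch.randn(2, 64, 40, 40))
```

Spatial attention (as in CBAM) follows the same pattern but produces a per-location weight map instead of per-channel weights, and the two are often combined.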
To improve the detection accuracy of small objects in traffic scenes, Jing Lian et al. [20] proposed a detection method based on attention feature fusion. First, they designed a multi-scale channel attention block (MS-CAB), which utilizes both local and global scales to aggregate informative features from the feature maps. Based on this, they introduced an attention feature fusion block (AFFB), which more effectively integrates contextual information from different network layers. The linear fusion module in the object detection network was then replaced with the AFFB to obtain the final architecture. Experimental results demonstrated that, compared with the baseline model YOLOv5s, the proposed method achieved higher mean average precision (mAP) while maintaining real-time performance. On the validation set of the BDD100K traffic scene dataset, the overall mAP increased by 0.9 percentage points, and the mAP for small objects improved by 3.5%.
As shown in the detection results in Figure 2, sourced from [20], both the MS-CAB_YOLOv5s and AFFB_YOLOv5s models improved detection performance in traffic scenes, with the AFFB_YOLOv5s achieving the best results, particularly for small objects located far from vehicles. This has significant implications for enhancing the stability and efficiency of autonomous driving systems and for preventing traffic accidents.

3.1.3. Context Enhancement and Global Modeling

When small objects lack sufficient information, leveraging surrounding context can help to improve detection accuracy. In recent years, numerous studies have introduced global context modeling modules to expand the receptive field while preserving image details, making them suitable for dense object scenarios such as urban traffic monitoring and road perception.
Jia Chen et al. [21] proposed a graph-based context reasoning network (GCRN) to enhance object feature representation by modeling and reasoning about contextual information. Specifically, this approach first employs a context relationship module to encode contextual information, capturing local contextual features of objects as well as dependencies between them. Subsequently, the Graphormer aggregates all contextual information within the context inference module to enrich the visual features of small objects. Extensive experiments on two public datasets (MS COCO and TinyPerson) demonstrate that this method effectively enhances small object feature representation and performs well in feature information transmission.
Zhengkai Ma et al. [22] proposed a small object detection method named Context Information Enhancement YOLO (CIE-YOLO). The architecture mainly comprises a Context Information Enhancement Module (CIE), a Channel Spatial Joint Attention Module (CSJA), and a Pixel Feature Enhancement Module (PFEM). The CIE module extracts and strengthens contextual information to reduce confusion between small objects and the background. The CSJA suppresses background noise, highlighting important small object features. Finally, the PFEM mitigates feature loss during the upsampling process through feature and pixel resolution enhancement. Extensive experiments validate the effectiveness of CIE-YOLO in small object detection.

3.2. Network Architecture Innovations

To meet the deployment requirements of small object detection on embedded or edge devices, researchers have conducted extensive explorations in network architecture design, focusing primarily on lightweight design, modularization, and optimization of the detection head.

3.2.1. Lightweight Backbone Networks

Lightweight backbone architectures such as MobileNet, ShuffleNet, and GhostNet significantly reduce model parameters and computational complexity through techniques like separable convolutions and channel reuse, making them well-suited for deployment on resource-constrained devices such as the Jetson Nano and Raspberry Pi. Adapted versions of these architectures are available for frameworks including YOLOv5n, YOLOv6s, and YOLOv8n.
While redundancy in internal feature maps is a key characteristic of successful convolutional neural networks, it has received relatively little attention in neural architecture design. Kai Han et al. [23] proposed a novel Ghost module that generates additional feature maps using inexpensive operations. Based on a set of intrinsic feature maps, they apply a series of low-cost linear transformations to generate numerous Ghost feature maps, effectively revealing the underlying information of the intrinsic features. The Ghost module can serve as a plug-and-play component for upgrading existing convolutional neural networks. By designing Ghost bottlenecks to stack Ghost modules, a lightweight GhostNet can be readily established. Experiments on benchmark datasets demonstrate that the proposed Ghost module is an effective alternative to conventional convolutional layers in baseline models. On the ImageNet ILSVRC-2012 classification dataset, GhostNet achieves higher recognition performance than MobileNetV3 at a comparable computational cost (e.g., 75.7% top-1 accuracy).
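The following is a minimal PyTorch sketch in the spirit of the Ghost module described above; the kernel size, the ghost ratio, and the use of a depthwise convolution as the cheap operation are illustrative assumptions and not the exact GhostNet configuration.

```python
import torch
import torch.nn as nn

class GhostModule(nn.Module):
    """Sketch of a Ghost-style module: a few intrinsic feature maps plus
    cheaply generated "ghost" maps, concatenated along the channel axis."""
    def __init__(self, in_ch, out_ch, ratio=2, dw_kernel=3):
        super().__init__()
        intrinsic = out_ch // ratio
        ghost = out_ch - intrinsic
        # Ordinary convolution producing the intrinsic feature maps.
        self.primary = nn.Sequential(
            nn.Conv2d(in_ch, intrinsic, kernel_size=1, bias=False),
            nn.BatchNorm2d(intrinsic),
            nn.ReLU(inplace=True),
        )
        # Depthwise convolution as the cheap operation producing ghost maps.
        self.cheap = nn.Sequential(
            nn.Conv2d(intrinsic, ghost, kernel_size=dw_kernel,
                      padding=dw_kernel // 2, groups=intrinsic, bias=False),
            nn.BatchNorm2d(ghost),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        primary = self.primary(x)
        return torch.cat([primary, self.cheap(primary)], dim=1)

# Drop-in replacement for a convolution mapping 64 -> 128 channels.
y = GhostModule(64, 128)(torch.randn(1, 64, 56, 56))   # -> (1, 128, 56, 56)
```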

3.2.2. Detection Head Optimization

Traditional detection heads typically have fixed output sizes, which makes it challenging to handle multi-scale targets. In recent years, many models have adapted to small target distributions by introducing anchor-free methods (e.g., FCOS, YOLOv8) or dynamic anchor matching mechanisms (e.g., SimOTA). Additionally, the introduction of novel loss functions and post-processing strategies, such as Distribution Focal Loss (DFL) and Soft-NMS, has positively impacted the accuracy of small target localization.
Jianhong Mu et al. [24] innovatively designed a Concat detection head to effectively extract features. They also introduced a new attention mechanism, Multi-Head Mixed Self-Attention (MMSA), to enhance the feature extraction capability of the backbone network. To improve detection sensitivity for small objects, a combination of Normalized Wasserstein Distance (NWD) and Intersection over Union (IoU) is used to calculate localization loss and optimize bounding box regression. Experimental results on the TT100K dataset show that the mAP@0.5 reached 88.1%, representing an improvement of 13.5% over YOLOv8n. Further experiments on the BDD100K dataset validated the versatility of this method, with comparisons against various object detection algorithms indicating significant improvements and practical value in the field of small object detection.
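To illustrate why a Wasserstein-based measure is attractive for tiny boxes, the sketch below implements one commonly used formulation of NWD, in which each box is modelled as a 2-D Gaussian and the closed-form Wasserstein distance is mapped into (0, 1] by an exponential; the normalising constant is a dataset-dependent hyperparameter, and this is not necessarily the exact variant used in [24].

```python
import math

def nwd(box_a, box_b, c=12.8):
    """Normalized Wasserstein Distance between two boxes given as (cx, cy, w, h).

    Each box is modelled as a 2-D Gaussian with mean (cx, cy) and standard
    deviations (w/2, h/2); the squared 2nd-order Wasserstein distance between
    such Gaussians has a closed form, and exp(-sqrt(.)/c) maps it to (0, 1].
    The constant `c` is a dataset-dependent hyperparameter.
    """
    cxa, cya, wa, ha = box_a
    cxb, cyb, wb, hb = box_b
    w2_sq = ((cxa - cxb) ** 2 + (cya - cyb) ** 2
             + (wa / 2 - wb / 2) ** 2 + (ha / 2 - hb / 2) ** 2)
    return math.exp(-math.sqrt(w2_sq) / c)

# Two 10x10 boxes offset by 4 px: IoU is only about 0.43, while NWD is about
# 0.73, so the similarity degrades much more smoothly for tiny objects.
print(nwd((50, 50, 10, 10), (54, 50, 10, 10)))
```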
Yi Gao et al. [25] proposed a PCB defect detection algorithm, YOLOv5_ES, which improves the YOLOv5 framework with an efficient multi-scale attention mechanism (EMA) and space-to-depth convolution (SPD-Conv). First, the detection head was optimized by removing the medium and large detection layers, fully leveraging the small detection head’s ability to identify micro-object defects, thereby improving model accuracy while achieving model lightweighting. Second, SPD-Conv was introduced to enhance feature extraction capability by minimizing information loss, further reducing parameter count and computational cost. Third, the EMA module was incorporated to fuse context information across different scales, enhancing the model’s generalization capability. Compared to the YOLOv5s model, the average precision (mAP@0.5) improved by 3.1%, the number of model parameters decreased by 55.8%, and the giga floating-point operations (GFLOPs) decreased by 4.8%, demonstrating significant improvements in both accuracy and parameter efficiency.

3.2.3. Task Collaboration and Branch Design

Some attempts have been made to train detection tasks jointly with segmentation, keypoint localization, depth estimation, and other tasks to improve the semantic accuracy of small object localization. For example, YOLO-Pose incorporates pose estimation information into the detection framework, which helps to enhance the ability to recognize occluded pedestrians [26].

3.3. Data-Driven Optimization

Data plays a decisive role in small object detection, particularly in traffic scenes. Data-driven methods are applied not only in model training but also across various aspects such as data augmentation, data synthesis, and transfer learning.

3.3.1. Small Object Enhancement Strategies

To address the issue of insufficient quantities of small targets, researchers have proposed various image enhancement methods, including:
  • Copy-Paste: cutting targets from the background and pasting them into other images to increase the density of small targets [27] (see the sketch after this list);
  • Mosaic/MixUp: mixing multiple images to maintain distribution diversity [28];
  • Random Zoom-In: randomly zooming in on targets to increase their relative size and improve model attention [29].
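The sketch below illustrates the Copy-Paste idea in its simplest form; the (x1, y1, x2, y2) box format, the random placement policy, and the absence of blending and occlusion checks are simplifying assumptions that a production pipeline would refine.

```python
import random
import numpy as np

def copy_paste(dst_img, dst_boxes, src_img, src_boxes, n_paste=3):
    """Toy Copy-Paste augmentation for small objects on numpy images.

    Crops up to `n_paste` annotated objects from `src_img` and pastes them at
    random locations in `dst_img`, appending the corresponding boxes.
    """
    h, w = dst_img.shape[:2]
    out_img = dst_img.copy()
    out_boxes = list(dst_boxes)
    for x1, y1, x2, y2 in random.sample(list(src_boxes),
                                        k=min(n_paste, len(src_boxes))):
        patch = src_img[int(y1):int(y2), int(x1):int(x2)]
        ph, pw = patch.shape[:2]
        if ph == 0 or pw == 0 or ph >= h or pw >= w:
            continue
        nx, ny = random.randint(0, w - pw), random.randint(0, h - ph)
        out_img[ny:ny + ph, nx:nx + pw] = patch          # paste the object patch
        out_boxes.append((nx, ny, nx + pw, ny + ph))     # add its new box
    return out_img, out_boxes

# Example: paste one 20x20 annotated object from a source frame into a target.
dst = np.zeros((480, 640, 3), dtype=np.uint8)
src = np.random.randint(0, 255, (480, 640, 3), dtype=np.uint8)
img, boxes = copy_paste(dst, [], src, [(100, 100, 120, 120)])
```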

3.3.2. Synthetic Data and Simulation-Based Generation

High-quality small target scenes are generated using simulation platforms (such as CARLA and AirSim) or image synthesizers to overcome the challenges of real-world data collection. These scenes are widely employed in traffic sign and obstacle detection tasks. Additionally, generative adversarial networks (GANs) are utilized for cross-domain image enhancement to improve model robustness [30].

3.3.3. Transfer Learning and Few-Shot Training Strategies

In the vision systems of mobile robots operating in traffic environments, small objects, such as traffic signs, pedestrians, and non-motorized vehicles, often exhibit characteristics including limited size, sparse distribution, and severe occlusion. In addition, manual annotation is costly, and data acquisition remains challenging. These factors substantially hinder the training effectiveness and generalization capability of detection models. To overcome these issues, researchers have proposed various training strategies aimed at enhancing detection performance under data-scarce conditions, primarily encompassing three categories: transfer learning, few-shot learning, and self-supervised pretraining.
Firstly, transfer learning is among the most widely adopted strategies. By pretraining models on large-scale datasets (e.g., ImageNet, COCO) and transferring them to the target task, it becomes possible to markedly improve the model’s initial performance, accelerate convergence, and enhance adaptability to specific traffic scenes. For mobile robots operating across diverse environments, such as urban roads, tunnels, and rural areas, transfer learning significantly improves detection robustness under limited data conditions. Secondly, few-shot learning provides an effective solution in situations of extreme data scarcity. These methods typically rely on a support–query task structure, enabling models to rapidly recognize new categories from only a handful of examples. In traffic scenes where the system may encounter emergent objects, such as temporary traffic signs or unforeseen hazards, few-shot detection models exhibit strong adaptability and real-time deployment potential. In addition, self-supervised learning facilitates the extraction of robust visual representations without the need for manual annotation, using pretraining tasks such as contrastive learning and image reconstruction. Representative approaches such as SimCLR, MoCo, and MAE have been shown to significantly enhance feature discriminability in small-object contexts, particularly under challenging traffic conditions such as night-time driving or adverse weather (e.g., rain or fog). Recent research has also explored the integration of self-supervised learning with domain adaptation to address domain shifts arising when mobile robots transition from training environments to real-world road conditions [31].
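As a concrete example of the transfer-learning recipe, the sketch below adapts a COCO-pretrained torchvision Faster R-CNN to a small traffic-object label set by replacing the box predictor and freezing the backbone; the class list, the hyperparameters, and the choice of detector are illustrative, and the `weights="DEFAULT"` argument assumes a recent torchvision release.

```python
import torch
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

# Start from COCO-pretrained weights and adapt the box head to a small
# traffic-object label set (class count here is illustrative).
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
num_classes = 1 + 4  # background + {sign, cone, pedestrian, cyclist}
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

# Freeze the backbone so scarce target-domain data only updates the heads.
for p in model.backbone.parameters():
    p.requires_grad = False

optimizer = torch.optim.SGD(
    [p for p in model.parameters() if p.requires_grad],
    lr=5e-3, momentum=0.9, weight_decay=1e-4,
)
```

Unfreezing the last backbone stage after a few epochs is a common refinement when slightly more target-domain data is available.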

3.4. Empirical Comparative Analysis

Although the preceding sections have reviewed a range of network architectures and optimization strategies tailored for small-object detection, a systematic comparison of their performance in real-world traffic scenes, particularly on resource-constrained mobile-robotic platforms, remains lacking. To address this gap, we compile and contrast the key performance metrics of five representative categories of recent small-object detection methods, evaluated on public datasets. The metrics include the mean Average Precision (mAP) at IoU thresholds from 0.5 to 0.95 (mAP@0.5:0.95), APsmall, parameter count, computational complexity measured in floating-point operations (FLOPs), and inference speed in frames per second (FPS) on edge devices such as the Jetson Nano and Xavier NX, as shown in Table 1.
As Table 1 indicates, feature-enhancement structures based on pyramidal designs (e.g., PFPNet, ES-FPN) markedly improve the detection accuracy of small objects, yielding a 7–10 percentage point increase in APsmall over baseline models. However, this improvement is accompanied by a substantial rise in model size and computational load, limiting their suitability for real-time deployment on mobile-robotic systems. In contrast, BiFPN strikes a favorable balance between receptive-field enhancement and model compactness, offering an optimal accuracy–speed trade-off in detectors such as EfficientDet-D0 when deployed on edge platforms. With regard to lightweight models, the incorporation of GhostNet reduces the parameter count of YOLOv5s to below 70 percent of the original model while sustaining a real-time inference speed of roughly 40 FPS on the Jetson Nano, at the cost of only a minor mAP decrease of about 3 percentage points. These findings underscore the importance of selecting architectures with a high accuracy–efficiency ratio for deployment on resource-constrained platforms.

4. Mobile Robot-Specific Solutions

In response to the challenges of small object detection in traffic scenes, mobile robot systems must provide robust and efficient perception capabilities under hardware constraints, diverse tasks, and complex environments. Unlike cloud or PC platforms, mobile robots rely on local computing to accomplish end-to-end tasks, necessitating optimization in algorithm design, sensor fusion, and deployment implementation. This section focuses on three aspects: hardware perception adaptation design, multimodal collaborative perception, and embedded deployment practices.

4.1. Hardware-Aware Algorithm Design

In practical applications, mobile robots commonly face challenges such as limited computing power, storage constraints, and power limitations. The design of small object detection algorithms must therefore consider the following aspects. Firstly, the algorithm should control the number of parameters and computational complexity (FLOPs) to ensure compatibility with embedded chips such as the NVIDIA Jetson Nano, Xavier NX, or RK3588; detection models specifically designed for edge devices, such as YOLOv5n, PP-PicoDet, and NanoDet, significantly reduce computational resource requirements while maintaining detection accuracy [32]. Secondly, when processing high-resolution images, a multi-resolution input strategy or dynamic ROI cropping mechanism can be employed so that processing is focused only on areas containing potential targets, reducing the overhead of analyzing the entire image; MobileDet, for example, uses a lightweight backbone for coarse screening of candidate regions at the front end and a reinforced branch that concentrates on small targets at the back end [33]. Thirdly, the focus of small-target detection may vary across tasks (pedestrian detection emphasizes posture, whereas traffic sign detection emphasizes edge geometry), so some studies adopt task-aware heads to customize the design of detection branches, thereby improving detection efficiency and accuracy [34].
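A coarse-to-fine inference loop of the kind described above can be sketched as follows; `coarse_detector` and `fine_detector` are hypothetical callables standing in for a lightweight screening model and the main detector, and the downscale factor and margin are illustrative.

```python
def coarse_to_fine_detect(image, coarse_detector, fine_detector,
                          downscale=4, margin=16):
    """Two-stage coarse-to-fine inference sketch for high-resolution frames.

    Both detectors map an image to a list of (x1, y1, x2, y2, score, cls)
    boxes. The coarse pass runs on a downscaled frame; the fine pass only
    sees cropped regions of interest, keeping per-frame cost low.
    """
    h, w = image.shape[:2]
    small = image[::downscale, ::downscale]   # cheap nearest-neighbour downscale
    detections = []
    for x1, y1, x2, y2, score, cls in coarse_detector(small):
        # Map the candidate back to full resolution and pad it with a margin.
        X1 = max(0, int(x1 * downscale) - margin)
        Y1 = max(0, int(y1 * downscale) - margin)
        X2 = min(w, int(x2 * downscale) + margin)
        Y2 = min(h, int(y2 * downscale) + margin)
        crop = image[Y1:Y2, X1:X2]
        for cx1, cy1, cx2, cy2, cscore, ccls in fine_detector(crop):
            detections.append((cx1 + X1, cy1 + Y1, cx2 + X1, cy2 + Y1,
                               cscore, ccls))
    return detections

# Example with trivial stand-in detectors on a 1080p frame.
import numpy as np
frame = np.zeros((1080, 1920, 3), dtype=np.uint8)
dets = coarse_to_fine_detect(
    frame,
    coarse_detector=lambda img: [(10, 10, 40, 40, 0.9, 0)],
    fine_detector=lambda crop: [(2, 2, 10, 10, 0.8, 0)],
)
```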

4.2. Multi-Sensor Spatio-Temporal Alignment

In dynamic traffic environments, relying solely on a single camera often makes it difficult to maintain continuous and stable perception of small targets. Integrating information from multiple sensors can significantly improve robustness and spatial perception capabilities.

4.2.1. Camera–LiDAR Fusion

Cameras provide texture information, while Light Detection and Ranging (LiDAR) supplies depth and geometric data. These two modalities complement each other in detecting small targets such as cones, pedestrians, and bicycles. Typical fusion methods include BEV (Bird’s Eye View) fusion and point cloud projection completion. For low-density point clouds, an attention fusion module must be designed to avoid information redundancy or misalignment.
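A basic building block of such fusion is projecting LiDAR points into the image so that depth can be attached to detection boxes; the sketch below assumes calibrated extrinsics and intrinsics and omits lens distortion handling.

```python
import numpy as np

def project_lidar_to_image(points, T_cam_lidar, K):
    """Project LiDAR points (N, 3) into the image plane.

    `T_cam_lidar` is the 4x4 extrinsic transform from the LiDAR frame to the
    camera frame and `K` the 3x3 camera intrinsic matrix, both obtained from
    offline calibration. Returns pixel coordinates and depths for points in
    front of the camera, which can then be matched to detection boxes.
    """
    pts_h = np.hstack([points, np.ones((points.shape[0], 1))])   # homogeneous (N, 4)
    pts_cam = (T_cam_lidar @ pts_h.T)[:3]                        # (3, N) in camera frame
    in_front = pts_cam[2] > 0.1                                  # drop points behind camera
    pts_cam = pts_cam[:, in_front]
    uvw = K @ pts_cam
    uv = (uvw[:2] / uvw[2]).T                                    # (M, 2) pixel coordinates
    return uv, pts_cam[2]                                        # pixels, depths in metres

# Example: one point 10 m ahead of a camera with identity extrinsics
# projects to the principal point (640, 360).
K = np.array([[800., 0., 640.], [0., 800., 360.], [0., 0., 1.]])
uv, depth = project_lidar_to_image(np.array([[0., 0., 10.]]), np.eye(4), K)
```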
Fusing camera and LiDAR data has become the de facto standard for 3D object detection tasks. Current approaches rely on LiDAR point clouds as queries to extract features from image space. However, this fundamental assumption means that existing fusion frameworks fail to make predictions when LiDAR malfunctions, whether severely or mildly, fundamentally limiting their deployment in real-world autonomous driving scenarios. In contrast, Tingting Liang et al. [35] proposed a simple yet novel fusion framework called BEVFusion, in which the camera data stream operates independently of LiDAR input, thus addressing the shortcomings of previous methods. Empirical studies demonstrate that this framework outperforms state-of-the-art methods under conventional training settings. Moreover, under robust training conditions simulating various LiDAR faults, Tingting Liang et al.’s framework achieves mAP improvements ranging from 15.7% to 28.9% over state-of-the-art methods. They are the first to explicitly address real-world LiDAR faults, enabling deployment in practical scenarios without any post-processing.
While many multimodal methods merely augment the original LiDAR point cloud with camera features before feeding them into existing 3D detection models, Yingwei Li et al.’s [36] research shows that fusing camera features with depth-enhanced LiDAR features (rather than raw point clouds) yields better performance. Since these features are typically enhanced and aggregated, a key challenge is how to effectively align the transformed features from both modalities. Yingwei Li et al. proposed two novel techniques: InverseAug, which reverses geometry-related enhancement operations (e.g., rotation) to achieve precise geometric alignment between LiDAR point clouds and image pixels; and LearnableAlign, which employs a cross-attention mechanism to dynamically capture correlations between image and LiDAR features during fusion. Building on InverseAug and LearnableAlign, they developed a series of generalized multimodal 3D detection models called DeepFusion, which surpass previous methods in accuracy. For example, DeepFusion improves the pedestrian detection performance of the PointPillars, CenterPoint, and 3D-MAN baselines by 6.7, 8.9, and 6.2 LEVEL_2 APH, respectively. Notably, the model achieves state-of-the-art performance on the Waymo Open Dataset and demonstrates strong robustness, effectively handling input damage and out-of-distribution data.

4.2.2. IMU and GPS Spatio-Temporal Synchronization

In high-dynamic scenarios (such as autonomous delivery robots or patrol vehicles), it is necessary to precisely align image frames with Inertial Measurement Unit (IMU)/Global Positioning System (GPS) data using timestamps to ensure consistency in spatial transformations across consecutive frames. This alignment is critical for motion compensation, target tracking, and mapping. Common methods include fusion algorithms based on the Extended Kalman Filter (EKF) or Visual Inertial Odometry (VIO).
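At its simplest, timestamp alignment amounts to interpolating the inertial stream at each image timestamp on a common clock, as in the sketch below; real systems additionally estimate and compensate a camera–IMU time offset, which is omitted here.

```python
import numpy as np

def align_imu_to_frames(frame_ts, imu_ts, imu_gyro):
    """Linearly interpolate IMU angular velocity at each image timestamp.

    `frame_ts` (F,) and `imu_ts` (N,) are timestamps in seconds on a common
    clock (hardware- or PTP-synchronised); `imu_gyro` is (N, 3) in rad/s.
    The per-frame angular velocity is a typical input to motion compensation
    or an EKF/VIO update step.
    """
    frame_ts = np.asarray(frame_ts)
    return np.stack(
        [np.interp(frame_ts, imu_ts, imu_gyro[:, k]) for k in range(3)], axis=1
    )   # (F, 3)

# Example: a 200 Hz IMU stream aligned to a 30 FPS camera.
imu_t = np.arange(0.0, 1.0, 0.005)
gyro = np.random.randn(imu_t.size, 3) * 0.01
frame_t = np.arange(0.0, 1.0, 1 / 30)
gyro_per_frame = align_imu_to_frames(frame_t, imu_t, gyro)
```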
In many extended object tracking applications (e.g., using millimeter-wave radar to track vehicles), the shape of the extended object (EO) remains constant, while the heading angle changes over time. Therefore, it is reasonable to track the shape and heading angle as separate parameters. Additionally, the close coupling between the heading angle and yaw angle contains information that can improve estimation performance. Consequently, Zheng Wen et al. [37] proposed a constrained filtering method that utilizes this information. First, an EO model is constructed using a direction vector with a heading constraint, represented by the relationship between the direction vector and the velocity vector. Second, based on this model, a variational Bayesian (VB) method is proposed to estimate the kinematic, shape, and direction vector states. Pseudo-measurements are constructed based on the heading constraint and incorporated into the VB framework. The proposed method can also address the ambiguity issue in heading angle estimation. Simulation and real-data results validate the effectiveness of the proposed model and estimation method.
Yanli Gao et al. [38] proposed an adaptive distributed Student’s t-extended Kalman filter (EKF) based on Allan variance for ultra-wideband (UWB) positioning. First, state equations are established using the target’s position and velocity in the east and north directions, while measurement equations are derived from the distance between the UWB base station (BS) and the target object. Then, the adaptive distributed filter adopts a federated structure: a local t-extended Kalman filter estimates the target position by fusing the distance measurements, while the main filter fuses the outputs of the local filters to produce the final estimate. To overcome the assumption of white noise in traditional Kalman filtering, which limits adaptability to real-world environments, the noise is modeled using a t-distribution. Additionally, Allan variance is calculated to assist the local filter, enhancing its adaptability. Experimental results demonstrate that compared to the distributed EKF, the proposed method significantly improves navigation accuracy.
Vision-based localization is a critical challenge for autonomous systems, and the performance of vision-based odometry degrades in challenging environments. Pengfei Gu et al. [39] proposed S-VIO, an RGB-D visual inertial odometry system that fully utilizes multi-sensor measurements (depth, RGB, and IMU), heterogeneous landmarks (points, lines, and surfaces), and environmental structural patterns to achieve robust and accurate localization. To detect underlying structural patterns, a two-step Atlanta world inference method is proposed. Using the gravity direction estimated by the VIO system, the algorithm first generates horizontal Atlanta axis hypotheses from recently optimized planar landmarks. Then, remaining planar landmarks and line clusters are used to filter out occasionally observed axes based on observed persistence. The remaining axes are retained and stored in the Atlanta map for future re-observation. Specifically, Pengfei Gu et al. employ an efficient mining and insertion (MnS) method to classify structural lines and extract missing points from each line cluster. Additionally, a closed-form initialization method for structural line features is proposed, leveraging known directions to obtain more optimal initial estimates. S-VIO is tested on two publicly available real RGB-D inertial datasets, demonstrating higher accuracy and robustness compared to state-of-the-art VIO and RGB-D VIO algorithms.
The integration of dynamic effects has proven significant in enhancing the accuracy and robustness of visual inertial odometry (VIO) systems in dynamic scenes. Existing methods either prune dynamic features or heavily rely on prior semantic knowledge or dynamic models, which are unsuitable for scenes with numerous dynamic elements. Nan Luo et al. [40] proposed a novel monocular VIO dynamic feature fusion method called DFF-VIO, which does not require prior models or scene preferences. By combining IMU-predicted pose with visual cues, dynamic features are preliminarily identified during tracking using consistency and motion intensity constraints. Nan Luo et al. then innovatively designed a dynamic transformation operation (DTO) to separate the influence of dynamic features across multiple frames into pairwise effects and constructed dynamic feature cells (DFC) to preserve valid information. Subsequently, they reformulated the VIO nonlinear optimization problem and constructed dynamic feature residuals using the transformed DFC as units. Based on the proposed motion feature inter-frame model, a motion compensation method was developed to address the reprojection issues of dynamic features, integrating their effects into tightly coupled VIO optimization to achieve robust localization in dynamic scenes. Evaluations on ADVIO and VIODE, performance degradation tests on the EuRoC dataset, and ablation studies demonstrate that DFF-VIO outperforms state-of-the-art methods in pose accuracy and robustness across various dynamic environments.

4.2.3. Millimeter-Wave Radar and Sonar-Based Blind Spot Compensation

In low-light or adverse weather conditions (such as fog or rain), millimeter-wave radar and ultrasonic sensors can compensate for visual blind spots [41]. Integrating their detection results with visual models through late fusion or deep fusion can aid confidence reconstruction and reduce false detections for small targets [42].
Vehicle detection using visual sensors such as LiDAR and cameras is a key function for achieving autonomous driving. While these sensors generate detailed point clouds or high-resolution images with rich information under good weather conditions, their performance deteriorates significantly under adverse conditions (e.g., fog) due to light distortion caused by opaque particles that reduce visibility. Consequently, methods relying solely on LiDAR or cameras experience marked performance declines in these rare but critical scenarios. To address this issue, Kun Qian et al. [43] adopted complementary radar, which is less affected by adverse weather and increasingly prevalent in vehicles. They proposed the Multi-modal Vehicle Detection Network (MVDNet), a two-stage deep fusion detector that first generates candidate regions from two sensors and then fuses region features across multi-modal streams to improve final detection results.
Multimodal sensor fusion is fundamental for autonomous robots, enabling robust object detection and decision-making under input errors or uncertainty. While recent fusion methods perform well under normal environmental conditions, they fail under adverse weather such as fog, heavy snow, or occlusions caused by dirt. Edoardo Palladin et al. [44] introduced a novel multi-sensor fusion method tailored for adverse weather. In addition to fusing RGB and LiDAR sensors commonly used in autonomous driving, their fusion stack also incorporates NIR gated cameras and radar modalities to address low-light and adverse weather conditions.

4.3. Embedded Deployment

Deploying a small object detection system on a real robotic platform involves multiple steps, including model compression, adaptation to the deployment framework, and integration of communication interfaces. Typical examples include mobile robots based on the Jetson series and solutions integrated within the ROS (Robot Operating System) ecosystem.
Devices such as the Jetson Nano and Xavier NX provide CUDA and TensorRT support for deploying deep learning models. Common optimization techniques include:
  • Using TensorRT to accelerate model inference (e.g., converting PyTorch (v1.13.1) models to ONNX and then to TensorRT; see the export sketch after this list) [45];
  • Applying INT8/FP16 quantization to compress models [46];
  • Utilizing the DeepStream SDK to build asynchronous pipelines for improved processing throughput [47];
  • Deploying lightweight detection networks such as YOLOv5-Nano to meet the battery life requirements of mobile power supplies [48].
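The export step in the first bullet can be sketched as follows; the stand-in network, input resolution, and opset version are illustrative, and the `trtexec` command shown in the comment reflects common usage rather than a fixed recipe.

```python
import torch
import torch.nn as nn

# Stand-in network for the trained detector (assumption: the real model is a
# torch.nn.Module in eval mode with a fixed deployment input resolution).
model = nn.Sequential(
    nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
).eval()

dummy = torch.zeros(1, 3, 640, 640)          # deployment input resolution
torch.onnx.export(
    model, dummy, "detector.onnx",
    opset_version=12,
    input_names=["images"], output_names=["features"],
    dynamic_axes={"images": {0: "batch"}},
)

# The ONNX file is then built into a TensorRT engine on the Jetson, e.g.:
#   trtexec --onnx=detector.onnx --saveEngine=detector_fp16.engine --fp16
# (flags as commonly used; exact options depend on the TensorRT version)
```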
In the ROS architecture, the detection module can be encapsulated as a DetectionNode and decoupled from the perception, decision-making, and control modules [49]. Data synchronization is achieved via the publish/subscribe mechanism, and the detection and localization processes can be visualized using tools such as RViz and rqt_graph [50]. The use of rosbag for playback testing further facilitates algorithm debugging [51].
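A minimal ROS 1 DetectionNode of the kind described above might look as follows; the topic names, the hypothetical `run_detector` callable, and the JSON-over-String output are simplifying assumptions, and a production system would typically publish vision_msgs and synchronise sensors with message_filters.

```python
#!/usr/bin/env python
import json
import rospy
from sensor_msgs.msg import Image
from std_msgs.msg import String
from cv_bridge import CvBridge

class DetectionNode:
    """Minimal ROS 1 detection node: subscribe to images, publish detections."""
    def __init__(self, run_detector):
        self.run_detector = run_detector          # callable: image -> list of boxes
        self.bridge = CvBridge()
        self.pub = rospy.Publisher("/detections", String, queue_size=10)
        rospy.Subscriber("/camera/image_raw", Image, self.on_image, queue_size=1)

    def on_image(self, msg):
        frame = self.bridge.imgmsg_to_cv2(msg, desired_encoding="bgr8")
        boxes = self.run_detector(frame)          # [(x1, y1, x2, y2, score, cls), ...]
        self.pub.publish(String(data=json.dumps(
            {"stamp": msg.header.stamp.to_sec(), "boxes": boxes})))

if __name__ == "__main__":
    rospy.init_node("detection_node")
    DetectionNode(run_detector=lambda img: [])    # plug in the real detector here
    rospy.spin()
```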
In practical applications, small-object detection models have been successfully deployed on various embedded platforms. For instance, Ravi and El-Sharkawy [4] implemented a real-time detector on the Jetson Nano, achieving efficient inference under limited power and memory constraints. Xie et al. [47] further demonstrated UAV-based deployment using YOLOv5s and NVIDIA DeepStream for aerial object inspection. Similar studies include the deployment of edge AI systems in forestry robotics [8], as well as lightweight MobileNet-SSD models on low-resource mobile platforms [48]. These cases highlight both the feasibility and the challenges of integrating small-object detectors into real-world robotic systems.

5. Benchmarking Systems and Metric Limitations in Small Object Detection

In the field of small object detection research, the completeness and scientific rigor of evaluation frameworks are of critical importance for validating model performance and informing methodological advancements. Compared to general object detection tasks, small object detection places particular emphasis on the perception of low-resolution, high-density, and multi-scale objects, necessitating a systematic analysis that combines specialized datasets with graded evaluation metrics. This section explores two key aspects: the annotation and structural characteristics of mainstream datasets in traffic scenes, and the adaptability and limitations of commonly used detection metrics.

5.1. Dataset Comparison

In traffic scenes, typical small objects include pedestrians, traffic signs, cones, bicycles, pets, etc., with pixel sizes generally less than 1%–5% of the image width and height [52]. To systematically analyze the performance of different datasets with respect to small objects, Table 2 presents several representative datasets and compares their key characteristics, such as the proportion of small objects, object categories, and image resolution.
It can be observed that:
  • Although COCO is a general-purpose dataset, it exhibits limited generalizability to traffic scenes;
  • KITTI has low object density and a limited proportion of small objects, making it difficult to comprehensively assess the small object detection capabilities of detectors;
  • VisDrone and DOTA, due to their aerial perspectives and dense object distributions, have become important benchmarks for small object detection;
  • TT100K is specifically designed for traffic sign detection, featuring a wide range of categories and a significant proportion of small objects, making it well-suited for research into micro-object detection in traffic environments.
Therefore, in evaluating small object detection for mobile robots, datasets such as VisDrone, TT100K, and BDD100K, which have a high proportion of small objects and are closely aligned with real-world applications, should be prioritized. These should be complemented by multi-view and multi-scale configurations to more objectively reflect model performance.

5.2. Evaluation Methods for Small Object Detection in Real-World Traffic Scenes

In object detection tasks, the mean Average Precision (mAP) is the most commonly used performance metric, primarily employed to measure a model’s detection accuracy across different confidence thresholds. Two mainstream mAP computation schemes are widely adopted: mAP@0.5 and mAP@0.5:0.95. The former calculates the average precision at a single Intersection over Union (IoU) threshold of 0.5; this relatively lenient criterion is suitable for the preliminary assessment of early detectors such as YOLOv3 and SSD. The latter, by contrast, computes average precision at ten IoU thresholds ranging from 0.5 to 0.95 and takes the mean, offering a much stricter standard. It has become the benchmark metric for mainstream datasets such as COCO, providing a more realistic reflection of a model’s localization capability.
However, in small-object detection tasks, mAP@0.5 tends to overestimate model performance. Due to the limited size of small targets, detection boxes with certain deviations may still exceed an IoU of 0.5 and thus receive high scores. Yet, when evaluated at higher IoU thresholds such as mAP@0.75 or above, performance often deteriorates significantly, revealing deficiencies in precise localization. This issue is particularly pronounced for small objects with blurred edges or a tendency to be occluded. More importantly, in intelligent control tasks within traffic scenes, such as obstacle avoidance or interactive grasping, systems demand significantly higher localization precision than 0.5. Therefore, reporting model performance solely based on mAP@0.5 is insufficient and potentially misleading.
It is thus recommended that, in both research and engineering practice, mAP@0.5 be reported alongside mAP@0.5:0.95, APsmall, APmedium, and APlarge, complemented by deployment-related indicators such as inference speed (FPS), parameter count, and computational complexity (FLOPs), to comprehensively reflect the model’s practicality and adaptability in real-world scenarios.
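For reference, the IoU computation and the COCO size buckets behind APsmall, APmedium, and APlarge are simple to state; the sketch below also shows why a modest localization shift on a 20 × 20 box passes at IoU = 0.5 but fails at the stricter 0.75 threshold.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def coco_size_category(box):
    """COCO size bucket used for APsmall/APmedium/APlarge reporting."""
    area = (box[2] - box[0]) * (box[3] - box[1])
    if area < 32 ** 2:
        return "small"
    return "medium" if area < 96 ** 2 else "large"

# A 20x20 ground-truth sign and a prediction shifted by 6 px: IoU is about
# 0.54, so it counts as correct at IoU=0.5 but is rejected at 0.75.
gt, pred = (100, 100, 120, 120), (106, 100, 126, 120)
print(iou(gt, pred), coco_size_category(gt))
```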
Given the complexity of real-world traffic environments, traditional evaluation approaches remain somewhat detached from practical conditions. For instance, most evaluation settings assume ideal weather and lighting (e.g., clear daylight), which deviates considerably from adverse conditions encountered during real deployments, such as rain, snow, fog, or night-time. In foggy conditions, images are affected by light scattering and become blurred, making small objects such as traffic signs or pedestrians more susceptible to omission or misrecognition. To improve model robustness across diverse environmental conditions, evaluation protocols should incorporate multi-weather test sets, such as Foggy Cityscapes and RainCOCO, or apply data augmentation strategies (e.g., occlusion simulation, low-light emulation) to test model performance under challenging conditions, thereby offering a more accurate representation of real-world reliability.
Moreover, sensor configurations in traffic applications vary considerably. Detection systems relying solely on monocular cameras often struggle to maintain stability and generalizability. In recent years, multi-sensor fusion has emerged as a vital strategy to enhance detection robustness. For example, fusing LiDAR and camera data can effectively mitigate the limitations of single-modal perception, such as occlusion and viewpoint blind spots. Accordingly, when evaluating small-object detection models, it is important not only to assess accuracy metrics but also to consider performance variations under different sensor fusion strategies (e.g., early fusion vs. late fusion). Furthermore, evaluation should take into account hardware resource constraints, such as computational power and energy consumption, common in embedded systems, to better assess deployment efficiency. This is of particular relevance for real-world deployment of mobile robots in complex traffic environments.
Regarding dataset selection, widely used benchmarks show significant variability in their suitability for small-object detection. While the COCO dataset is commonly employed, it contains only around 10% small-object annotations, making it less representative for assessing small-object detection capabilities. In contrast, datasets such as VisDrone, TT100K, and BDD100K feature dense small-object annotations and are collected under realistic traffic conditions, rendering them more suitable as evaluation benchmarks. Thus, dataset selection should align with specific application demands to ensure that evaluation remains task-relevant and representative.

6. Open Issues and Future Directions

Despite significant progress in small object detection technology in recent years, particularly in lightweight model design, feature enhancement, and cross-modal fusion, mobile robots operating in complex traffic scenes continue to face numerous challenges. To further enhance system robustness, practical applicability, and long-term adaptability, researchers must achieve breakthroughs in several key areas. This paper proposes four noteworthy future development directions: enhancing generalization capabilities in extreme environments; implementing human–machine collaborative annotation and continuous learning mechanisms; enabling ultra-low-latency detection; and developing differentiated optimization strategies for specific platforms.

6.1. Generalization Ability in Extreme Scenarios

Mobile robots are often deployed in non-ideal, dynamically changing environments, requiring their visual systems to demonstrate strong robustness and high generalization capabilities. However, existing models are primarily trained under ideal lighting conditions with clear images, making it challenging to maintain stable performance in the following extreme scenarios:
  • Low-light and nighttime environments: for instance, urban night navigation, tunnels, and areas without street lighting, where image noise is high, contrast is low, and small objects are easily obscured by the background;
  • Adverse weather conditions: Such as rain, snow, and haze, which may cause image blurring and color distortion, thereby affecting the model’s perceptual reliability;
  • Dynamic scene variations: Including rapid motion leading to motion blur and frequent scene transitions, requiring the detector to adapt quickly;
  • Domain shift issues: When models are generalized from training datasets (e.g., urban traffic) to real-world deployment scenarios (e.g., suburban or mountainous areas), performance tends to degrade significantly, necessitating solutions to domain adaptation challenges.
To enhance generalization capabilities, future research may focus on the following directions: employing self-supervised pretraining, domain adaptation, and style transfer techniques; integrating redundant modalities (e.g., infrared and RGB, depth and RGB) to leverage complementary information in visually degraded regions; and constructing large-scale, multi-source heterogeneous datasets that encompass a wide range of extreme conditions to facilitate robust model fitting and generalization.

6.2. Human–Machine Collaborative Annotation and Incremental Learning

Data scarcity and shifts in target distribution are fundamental issues that affect the performance of small object detection, particularly during long-term task execution following real-world deployment. The system must be capable of continuously adapting to new environments and categories, placing greater demands on both annotation efficiency and learning efficiency.
Current detection systems predominantly rely on fully supervised, static learning approaches, which present the following limitations:
  • Annotating small objects is expensive and demands high levels of precision;
  • Systems lack adaptability to new target categories, requiring full retraining whenever the environment changes (e.g., introduction of new object types);
  • Model updates are resource-intensive and cannot be conducted at scale in an online manner.
To address these challenges, future development should focus on establishing a human–machine collaborative annotation framework in which models perform automatic pre-annotation while human operators are only required to review or correct the results, thereby significantly reducing manual effort. In addition, active learning and weakly supervised methods can be introduced to prioritize high-value samples for precise annotation, improving overall training efficiency. Further efforts should enhance incremental and lifelong learning mechanisms, enabling models to rapidly adapt to new categories and scenarios without experiencing catastrophic forgetting. Finally, the incorporation of memory modules and parameter reorganization strategies may help to mitigate forgetting effects and improve model adaptability and long-term stability.
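A minimal sketch of the uncertainty-based sample selection mentioned above is given below. It ranks unlabeled images by the fraction of low-confidence detections so that annotators review the least certain cases first; `run_detector`, the 0.5 threshold, and the budget of 100 images are illustrative assumptions rather than parts of any specific framework.

```python
# Hedged sketch: prioritize uncertain images for human review (active learning).
from typing import Callable, Dict, List, Sequence

def select_for_annotation(
    image_paths: Sequence[str],
    run_detector: Callable[[str], List[float]],  # returns per-detection scores
    budget: int = 100,
    low_conf: float = 0.5,
) -> List[str]:
    uncertainty: Dict[str, float] = {}
    for path in image_paths:
        scores = run_detector(path)
        # Images with no detections are treated as maximally uncertain,
        # since small objects are often missed entirely.
        uncertainty[path] = (
            1.0 if not scores else sum(s < low_conf for s in scores) / len(scores)
        )
    ranked = sorted(image_paths, key=lambda p: uncertainty[p], reverse=True)
    return ranked[:budget]  # send the most uncertain images to annotators
```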

6.3. Ultra-Low-Latency Detection and System Coordination Optimization

For mobile robots operating in traffic scenes, system response speed directly impacts both safety and usability. Therefore, the construction of high-precision, ultra-low-latency detection systems has become a core objective for real-world deployment. However, deploying detection models on embedded platforms still faces several challenges:
  • Embedded devices offer limited computational resources and are often unable to support real-time inference of complex models.
  • System bottlenecks beyond model inference (e.g., preprocessing, I/O latency) frequently result in excessive overall latency.
  • Multi-stage processing chains lack collaborative optimization, leading to unstable end-to-end response times.
To address these issues and achieve ultra-low latency, future efforts may focus on three levels:
  • Hardware acceleration: edge accelerators such as the Edge TPU, FPGAs, and Jetson Orin should be fully exploited. Combined with toolchains such as TensorRT and ONNX Runtime, techniques like model quantization and layer fusion can raise inference efficiency while reducing power consumption, and algorithm–hardware co-design can jointly optimize the model architecture and the deployment platform.
  • Model structure and perception mechanisms: input-aware networks can dynamically adjust resolution or inference paths and, combined with early-exit mechanisms and regional attention modules, enable low-cost perception of key regions. Reusing historical frames and suppressing inter-frame redundancy can further reduce real-time computational overhead.
  • System-level coordination: an asynchronous, pipeline-based processing architecture, supported by end-to-end latency analysis tools, helps locate performance bottlenecks. With edge–cloud collaboration, latency-sensitive, high-frequency tasks are processed locally while computationally intensive tasks are offloaded to remote servers, minimizing the overall response time of the perception–computation system.
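The asynchronous, pipeline-based architecture referred to above can be approximated with a few worker threads connected by bounded queues, as in the simplified sketch below. `grab_frame`, `preprocess`, `infer`, and `publish` are hypothetical stage functions standing in for camera I/O, resizing, model inference, and result forwarding; because the queues are small and stale frames are dropped at the capture stage, end-to-end latency stays close to the cost of processing a single frame rather than growing with backlog.

```python
# Simplified asynchronous pipeline: capture, preprocessing, and inference run
# in separate threads so that a slow stage does not block the others.
import queue
import threading

raw_q = queue.Queue(maxsize=2)    # bounded queues cap end-to-end latency
ready_q = queue.Queue(maxsize=2)

def capture_stage(grab_frame):
    while True:
        frame = grab_frame()
        try:
            raw_q.put(frame, timeout=0.01)
        except queue.Full:
            pass  # drop stale frames instead of letting latency accumulate

def preprocess_stage(preprocess):
    while True:
        ready_q.put(preprocess(raw_q.get()))

def inference_stage(infer, publish):
    while True:
        publish(infer(ready_q.get()))  # e.g., forward detections to the planner

def start_pipeline(grab_frame, preprocess, infer, publish):
    for target, args in [
        (capture_stage, (grab_frame,)),
        (preprocess_stage, (preprocess,)),
        (inference_stage, (infer, publish)),
    ]:
        threading.Thread(target=target, args=args, daemon=True).start()
```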

6.4. Differentiated Optimization Strategies for Specific Applications

Different types of mobile robot platforms present fundamentally distinct requirements in terms of performance, latency, and computational resources when deploying detection systems. Consequently, future research should adopt an application-oriented perspective to develop differentiated optimization strategies tailored to the specific characteristics and constraints of each platform.
For high-computing-power platforms such as autonomous vehicles, which are equipped with ample hardware resources and rich sensor configurations, the deployment of multi-modal fusion perception and complex detection models is feasible. In such scenarios, it is crucial to prioritize all-weather reliability, cross-domain generalization, and safety redundancy mechanisms to enhance system robustness across diverse conditions, including urban, suburban, and adverse weather environments. In contrast, unmanned aerial vehicle (UAV) platforms are limited by onboard computational capacity and battery life, necessitating a strong emphasis on model lightweighting and inference efficiency. In these applications, focus should be placed on model compression, hardware-efficient architecture design, and the integration of edge intelligence techniques to ensure reliable real-time detection and tracking of small targets, even during high-speed flight and under conditions of strong environmental disturbance.
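As a hedged illustration of the lightweighting step for UAV-class platforms, the sketch below exports a stand-in model to ONNX and applies post-training weight quantization with ONNX Runtime. The model, file names, and input size are assumptions only; for convolution-heavy detectors, static quantization with a calibration set is usually preferable, and accuracy must be re-validated on small-object benchmarks after any compression.

```python
# Hedged sketch: ONNX export followed by post-training weight quantization.
import torch
from onnxruntime.quantization import quantize_dynamic, QuantType

# Stand-in for a trained detection backbone (illustration only).
model = torch.nn.Sequential(torch.nn.Conv2d(3, 16, 3), torch.nn.ReLU())
model.eval()

# Export to ONNX so the model can enter edge toolchains (ONNX Runtime, TensorRT).
torch.onnx.export(model, torch.randn(1, 3, 640, 640), "detector_fp32.onnx",
                  input_names=["images"], output_names=["features"])

# Quantize weights to 8-bit integers to shrink the file and lower memory traffic.
quantize_dynamic("detector_fp32.onnx", "detector_int8.onnx",
                 weight_type=QuantType.QUInt8)
```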

7. Conclusions

With the rapid development of autonomous driving, intelligent transportation, and service robotics technologies, mobile robots are facing unprecedented challenges and demands in terms of their ability to detect small objects. In particular, in complex traffic scenes, small targets such as traffic signs, pedestrians, non-motorized vehicles, and animals are not only diverse in type and shape but also present challenges such as high dynamics, low resolution, and severe occlusion. Achieving high-precision, low-latency small object detection on resource-constrained embedded platforms has become one of the key research directions in the field of computer vision.
This study presents an overview of small object detection for mobile robots in traffic scenes, systematically summarizing recent advancements and key technological challenges. The main conclusions are as follows:
  • Traffic scenes are highly dynamic and complex, subject to multiple environmental disturbances such as lighting variation, adverse weather, object occlusion, and motion blur, all of which significantly increase the difficulty of small object detection.
  • The development of small object detection methods has progressed along multiple dimensions, including feature enhancement (e.g., FPN, attention mechanisms), lightweight network architectures (e.g., MobileNet, YOLOv7-tiny), and data-driven optimization (e.g., Mosaic and copy–paste data augmentation, pseudo-label learning).
  • For practical deployment, it is essential to consider the joint design of perception capability and computational constraints, such as deployment optimization on hardware platforms like Jetson Nano and Raspberry Pi. Meanwhile, multi-sensor fusion and ROS integration have become crucial approaches to enhancing system robustness.
  • Current mainstream evaluation systems do not yet fully reflect the small object detection needs in real-world deployment. It is therefore necessary to construct more representative datasets and adopt fine-grained metrics (e.g., mAP@0.5:0.95, small object recall rate) for comprehensive assessment.
  • In the future, advances in generalization, incremental learning, low-latency inference, and platform-aware optimization will drive the development of small object detection systems that are not only adaptive over time but also capable of autonomous evolution across complex real-world scenarios.
In conclusion, research into small object detection in traffic scenes holds both significant theoretical value and broad engineering application prospects. By integrating multidisciplinary technologies, such as computer vision, embedded systems, and intelligent perception, it is possible to realize safer, more efficient, and more intelligent mobile robotic systems.

Author Contributions

Conceptualization, Z.W. and Y.Z.; resources, Z.W., Y.Z., H.X. and S.W.; writing—original draft preparation, Z.W. and Y.Z.; writing—review and editing, Z.W., Y.Z., H.X. and S.W.; supervision, Z.W. All authors have read and agreed to the published version of the manuscript.

Funding

This work was partially supported by the Fundamental Research Funds for the Central Universities under grant Nos. 25CAFUC03036, 25CAFUC03037, 25CAFUC03038, 25CAFUC09010, 24CAFUC04015, and 24CAFUC03042, Civil Aviation Professional Project under grant No. MHJY2023027, 2024 Statistical Education Reform Project under grant No. 2024JG0226, and Sichuan Engineering Research Center Project for Intelligent Operation and Maintenance of Civil Aviation Airports under grant No. JCZX2024ZZ16.

Data Availability Statement

No new data were generated in this paper.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Raj, R.; Kos, A. A Comprehensive Study of Mobile Robot: History, Developments, Applications, and Future Research Perspectives. Appl. Sci. 2022, 12, 6951. [Google Scholar] [CrossRef]
  2. Antonyshyn, L.; Silveira, J.; Givigi, S.; Marshall, J. Multiple Mobile Robot Task and Motion Planning: A Survey. ACM Comput. Surv. 2023, 55, 1–35. [Google Scholar] [CrossRef]
  3. Kaur, R.; Singh, S. A Comprehensive Review of Object Detection with Deep Learning. Digit. Signal Process. 2023, 132, 103812. [Google Scholar] [CrossRef]
  4. Ravi, N.; El-Sharkawy, M. Real-Time Embedded Implementation of Improved Object Detector for Resource-Constrained Devices. J. Low Power Electron. Appl. 2022, 12, 21. [Google Scholar] [CrossRef]
  5. Mirzaei, B.; Nezamabadi-Pour, H.; Raoof, A.; Derakhshani, R. Small Object Detection and Tracking: A Comprehensive Review. Sensors 2023, 23, 6887. [Google Scholar] [CrossRef]
  6. Liu, Y.; Sun, P.; Wergeles, N.; Shang, Y. A Survey and Performance Evaluation of Deep Learning Methods for Small Object Detection. Expert Syst. Appl. 2021, 172, 114602. [Google Scholar] [CrossRef]
  7. Mittal, P. A Comprehensive Survey of Deep Learning-Based Lightweight Object Detection Models for Edge Devices. Artif. Intell. Rev. 2024, 57, 242. [Google Scholar] [CrossRef]
  8. da Silva, D.Q.; dos Santos, F.N.; Filipe, V.; Sousa, A.J.; Oliveira, P.M. Edge AI-Based Tree Trunk Detection for Forestry Monitoring Robotics. Robotics 2022, 11, 136. [Google Scholar] [CrossRef]
  9. Tian, Z.; Qu, P.; Li, J.; Sun, Y.; Li, G.; Liang, Z.; Zhang, W. A Survey of Deep Learning-Based Low-Light Image Enhancement. Sensors 2023, 23, 7763. [Google Scholar] [CrossRef]
  10. Matveev, I.; Karpov, K.; Chmielewski, I.; Siemens, E.; Yurchenko, A. Fast Object Detection Using Dimensional Based Features for Public Street Environments. Smart Cities 2020, 3, 93–111. [Google Scholar] [CrossRef]
  11. He, B.; Yang, Y.; Zheng, S.; Fan, G. YOLOv8 for Adverse Weather: Traffic Sign Detection in Autonomous Driving. In Proceedings of the Fourth International Conference on Advanced Algorithms and Signal Image Processing (AASIP 2024), Kuala Lumpur, Malaysia, 28–30 June 2024; pp. 311–316. [Google Scholar]
  12. Liu, J.; Zhang, J.; Ni, Y.; Chi, W.; Qi, Z. Small-Object Detection in Remote Sensing Images with Super Resolution Perception. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 15721–15734. [Google Scholar] [CrossRef]
  13. Majhi, S.K.; Gupta, R.K.; Ojha, B.; Sapkota, A.; Muduli, D. Deep Learning Fusion Ensemble for Enhanced Traffic Sign Detection Using the ICTS Dataset. In Proceedings of the 2024 IEEE 4th International Conference on Applied Electromagnetics, Signal Processing, & Communication (AESPC), Bhubaneswar, India, 29–30 November 2024; pp. 1–5. [Google Scholar]
  14. Muzammul, M.; Li, X. Comprehensive Review of Deep Learning-Based Tiny Object Detection: Challenges, Strategies, and Future Directions. Knowl. Inf. Syst. 2025, 67, 3825–3913. [Google Scholar] [CrossRef]
  15. Bai, L.; Cao, J.; Zhang, M.; Li, B. Collaborative Edge Intelligence for Autonomous Vehicles: Opportunities and Challenges. IEEE Netw. 2025, 39, 12–19. [Google Scholar] [CrossRef]
  16. Kim, K.; Jang, S.J.; Park, J.; Lee, E.; Lee, S.S. Lightweight and Energy-Efficient Deep Learning Accelerator for Real-Time Object Detection on Edge Devices. Sensors 2023, 23, 1185. [Google Scholar] [CrossRef]
  17. Kim, S.-W.; Kook, H.-K.; Sun, J.-Y.; Kang, M.-C.; Ko, S.-J. Parallel Feature Pyramid Network for Object Detection. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; Ferrari, V., Sminchisescu, C., Hebert, M., Weiss, Y., Eds.; Springer: Cham, Switzerland, 2018; pp. 239–256. [Google Scholar]
  18. Chen, Y.; Zhu, X.; Li, Y.; Wei, Y.; Ye, L. Enhanced Semantic Feature Pyramid Network for Small Object Detection. Signal Process. Image Commun. 2023, 113, 116919. [Google Scholar] [CrossRef]
  19. Ma, P.; He, X.; Chen, Y.; Liu, Y. ISOD: Improved Small Object Detection Based on Extended Scale Feature Pyramid Network. Vis. Comput. 2025, 41, 465–479. [Google Scholar] [CrossRef]
  20. Lian, J.; Yin, Y.; Li, L.; Wang, Z.; Zhou, Y. Small Object Detection in Traffic Scenes Based on Attention Feature Fusion. Sensors 2021, 21, 3031. [Google Scholar] [CrossRef]
  21. Chen, J.; Li, X.; Ou, Y.; Hu, X.; Peng, T. Graphormer-Based Contextual Reasoning Network for Small Object Detection. In Proceedings of the Chinese Conference on Pattern Recognition and Computer Vision (PRCV), Xiamen, China, 13–15 October 2023; Springer: Singapore, 2023; pp. 294–305. [Google Scholar]
  22. Ma, Z.; Zhou, L.; Wu, D.; Zhang, X. A Small Object Detection Method with Context Information for High Altitude Images. Pattern Recognit. Lett. 2025, 188, 22–28. [Google Scholar] [CrossRef]
  23. Han, K.; Wang, Y.; Tian, Q.; Guo, J.; Xu, C.; Xu, C. GhostNet: More Features from Cheap Operations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020; pp. 1580–1589. [Google Scholar]
  24. Mu, J.; Su, Q.; Wang, X.; Liang, W.; Xu, S.; Wan, K. A Small Object Detection Architecture with Concatenated Detection Heads and Multi-Head Mixed Self-Attention Mechanism. J. Real-Time Image Process. 2024, 21, 184. [Google Scholar] [CrossRef]
  25. Gao, Y.; Li, Z.; Wang, Y.; Zhu, S. A Novel YOLOv5_ES Based on Lightweight Small Object Detection Head for PCB Surface Defect Detection. Sci. Rep. 2024, 14, 23650. [Google Scholar] [CrossRef]
  26. Maji, D.; Nagori, S.; Mathew, M.; Poddar, D. YOLO-Pose: Enhancing YOLO for Multi Person Pose Estimation Using Object Keypoint Similarity Loss. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, New Orleans, LA, USA, 19–20 June 2022; pp. 2637–2646. [Google Scholar]
  27. Ghiasi, G.; Cui, Y.; Srinivas, A.; Qian, R.; Lin, T.-Y.; Cubuk, E.D.; Le, Q.V.; Zoph, B. Simple Copy-Paste Is a Strong Data Augmentation Method for Instance Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 2918–2928. [Google Scholar]
  28. Yang, Z.; Lan, X.; Wang, H. Comparative Analysis of YOLO Series Algorithms for UAV-Based Highway Distress Inspection: Performance and Application Insights. Sensors 2025, 25, 1475. [Google Scholar] [CrossRef] [PubMed]
  29. Wei, J.; Li, Y.; Zhang, B. EDCNet: A Lightweight Object Detection Method Based on Encoding Feature Sharing for Drug Driving Detection. In Proceedings of the 2022 IEEE 24th International Conference on High Performance Computing & Communications; 8th International Conference on Data Science & Systems; 20th International Conference on Smart City; 8th International Conference on Dependability in Sensor, Cloud & Big Data Systems & Application (HPCC/DSS/SmartCity/DependSys), Hainan, China, 18–20 December 2022; pp. 1006–1013. [Google Scholar]
  30. Senapati, R.K.; Satvika, R.; Anmandla, A.; Ashesh Reddy, G.; Anil Kumar, C. Image-to-Image Translation Using Pix2Pix GAN and Cycle GAN. In Proceedings of the International Conference on Data Intelligence and Cognitive Informatics, Tirunelveli, India, 27–28 June 2023; Springer: Singapore, 2023; pp. 573–586. [Google Scholar]
  31. Wang, Y.; Yao, Q.; Kwok, J.T.; Ni, L.M. Generalizing from a Few Examples: A Survey on Few-Shot Learning. ACM Comput. Surv. 2020, 53, 1–34. [Google Scholar] [CrossRef]
  32. Chong, T.; Zhang, Y.; Ma, C.; Liu, T. Design and Analysis of Video Desensitization Algorithm Based on Lightweight Model PP PicoDet. In Proceedings of the 2023 International Conference on Artificial Intelligence and Automation Control (AIAC), Xiamen, China, 17–19 November 2023; pp. 121–124. [Google Scholar]
  33. Xiong, Y.; Liu, H.; Gupta, S.; Akin, B.; Bender, G.; Wang, Y.; Kindermans, P.-J.; Tan, M.; Singh, V.; Chen, B. MobileDets: Searching for Object Detection Architectures for Mobile Accelerators. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 3825–3834. [Google Scholar]
  34. Chang, Q.; Peng, J.; Xie, L.; Sun, J.; Yin, H.; Tian, Q.; Zhang, Z. DATA: Domain-Aware and Task-Aware Self-Supervised Learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 9841–9850. [Google Scholar]
  35. Liang, T.; Xie, H.; Yu, K.; Xia, Z.; Lin, Z.; Wang, Y.; Tang, T.; Wang, B.; Tang, Z. BEVFusion: A Simple and Robust LiDAR-Camera Fusion Framework. Adv. Neural Inf. Process. Syst. 2022, 35, 10421–10434. [Google Scholar]
  36. Li, Y.; Yu, A.W.; Meng, T.; Caine, B.; Ngiam, J.; Peng, D.; Shen, J.; Lu, Y.; Zhou, D.; Le, Q.V.; et al. DeepFusion: LiDAR-Camera Deep Fusion for Multi-Modal 3D Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 17182–17191. [Google Scholar]
  37. Wen, Z.; Zheng, L.; Zeng, T. Extended Object Tracking Using an Orientation Vector Based on Constrained Filtering. Remote Sens. 2025, 17, 1419. [Google Scholar] [CrossRef]
  38. Gao, Y.; Yang, M.; Zang, X.; Deng, L.; Li, M.; Xu, Y.; Sun, M. Adaptive Distributed Student’s T Extended Kalman Filter Employing Allan Variance for UWB Localization. Sensors 2025, 25, 1883. [Google Scholar] [CrossRef]
  39. Gu, P.; Meng, Z. S-VIO: Exploiting Structural Constraints for RGB-D Visual Inertial Odometry. IEEE Robot. Autom. Lett. 2023, 8, 3542–3549. [Google Scholar] [CrossRef]
  40. Luo, N.; Hu, Z.; Ding, Y.; Li, J.; Zhao, H.; Liu, G.; Wang, Q. DFF-VIO: A General Dynamic Feature Fused Monocular Visual-Inertial Odometry. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 465–479. [Google Scholar] [CrossRef]
  41. Senel, N.; Kefferpütz, K.; Doycheva, K.; Elger, G. Multi-Sensor Data Fusion for Real-Time Multi-Object Tracking. Processes 2023, 11, 501. [Google Scholar] [CrossRef]
  42. Wei, Z.; Zhang, F.; Chang, S.; Liu, Y.; Wu, H.; Feng, Z. MmWave Radar and Vision Fusion for Object Detection in Autonomous Driving: A Review. Sensors 2022, 22, 2542. [Google Scholar] [CrossRef]
  43. Qian, K.; Zhu, S.; Zhang, X.; Li, L.E. Robust Multimodal Vehicle Detection in Foggy Weather Using Complementary Lidar and Radar Signals. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 444–453. [Google Scholar]
  44. Palladin, E.; Dietze, R.; Narayanan, P.; Bijelic, M.; Heide, F. SAMFusion: Sensor-Adaptive Multimodal Fusion for 3D Object Detection in Adverse Weather. In Proceedings of the European Conference on Computer Vision (ECCV), Milan, Italy, 29 September–4 October 2024; Springer: Cham, Switzerland, 2024; pp. 484–503. [Google Scholar]
  45. Jeong, E.; Kim, J.; Ha, S. TensorRT-Based Framework and Optimization Methodology for Deep Learning Inference on Jetson Boards. ACM Trans. Embed. Comput. Syst. 2022, 21, 3508391. [Google Scholar] [CrossRef]
  46. Yao, Z.; Yazdani Aminabadi, R.; Zhang, M.; Wu, X.; Li, C.; He, Y. ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers. Adv. Neural Inf. Process. Syst. 2022, 35, 27168–27183. [Google Scholar]
  47. Xie, S.; Deng, G.; Lin, B.; Jing, W.; Li, Y.; Zhao, X. Real-Time Object Detection from UAV Inspection Videos by Combining YOLOv5s and DeepStream. Sensors 2024, 24, 3862. [Google Scholar] [CrossRef] [PubMed]
  48. Sekar, K.; Dheepa, T.; Sheethal, R.; Devi, K.S.; Baskaran, M.V.; Smita, R.S.; Dixit, T.V.; Dutta Borah, M. Efficient Object Detection on Low-Resource Devices Using Lightweight MobileNet-SSD. In Proceedings of the 2025 International Conference on Intelligent Systems and Computational Networks (ICISCN), Bidar, India, 24–25 January 2025; pp. 1–6. [Google Scholar]
  49. Macenski, S.; Foote, T.; Gerkey, B.; Lalancette, C.; Woodall, W. Robot Operating System 2: Design, Architecture, and Uses in the Wild. Sci. Robot. 2022, 7, eabm6074. [Google Scholar] [CrossRef] [PubMed]
  50. Liu, R.; Zheng, J.; Luan, T.H.; Gao, L.; Hui, Y.; Xiang, Y.; Dong, M. ROS-Based Collaborative Driving Framework in Autonomous Vehicular Networks. IEEE Trans. Veh. Technol. 2023, 72, 6987–6999. [Google Scholar] [CrossRef]
  51. Duan, F.; Li, W.; Tan, Y. ROS Debugging. In Intelligent Robot: Implementation and Applications; Springer: Singapore, 2023; pp. 71–92. [Google Scholar]
  52. Wu, Z.; Zhen, H.; Zhang, X.; Bai, X.; Liu, X. SEMA-YOLO: Lightweight Small Object Detection in Remote Sensing Image via Shallow-Layer Enhancement and Multi-Scale Adaptation. Remote Sens. 2025, 17, 1917. [Google Scholar] [CrossRef]
Figure 1. YOLOv11n detection results under two adverse conditions: (left) occlusion during daylight; (right) glare at night. The results reflect the detector's robustness under varied environmental challenges.
Figure 2. Comparison of detection results between YOLOv5s, MS_CAB_YOLOv5s, and AFFB_YOLOv5s [20].
Table 1. Comparison of representative small object detection methods on different datasets.
Method | Backbone/Key Modules | Dataset | Accuracy (APsmall/mAP) | FPS
SSD + PFPNet | VGG-16 | MS-COCO | +7.8 APsmall ↑ vs. SSD | 34 (RTX-2070)
RetinaNet + ES-FPN | ResNet-50 | MS-COCO | 23.7/40.5 | 18 (RTX-2080)
EfficientDet-D0 (with BiFPN) | EfficientNet-B0 | MS-COCO | 12.0/33.8 | 23 (Jetson Xavier)
YOLOv5-s + GhostNet | Ghost-CSP | VisDrone | —/+3.1 mAP ↑ (−28% params) | 40 (Jetson Nano)
ISOD (Ext. Scale FPN) | CSPDark-53 | TT100K | —/0.635 | 28 (RTX-3060)
FPS results are based on author-reported hardware platforms. “↑” denotes improvement over baseline.
Table 2. Comparison of representative small object detection datasets in traffic scenes.
Dataset Name | Release Year | Typical Scenario | Number of Images | Number of Categories | Small Object Ratio (Area < 32²) | Annotation Type
KITTI | 2012 | Urban streets | ∼15 K | 8 | 17.3% | 2D bounding boxes
BDD100K | 2018 | Urban, motorways | 100 K | 10 | 25.7% | 2D bounding boxes + temporal data
COCO | 2015 | General scenarios | 118 K | 80 | 41.4% | 2D bounding boxes/segmentation
VisDrone | 2019 | Aerial view | 263 K | 10 | 53.3% | 2D bounding boxes
TT100K | 2016 | Chinese roads | 100 K | 221 | 32.8% | 2D bounding boxes
DOTA | 2018 | Satellite aerial photography | 280 K | 15 | 60%+ | Rotated bounding boxes
The small object ratio refers to the proportion of annotated objects whose area is less than 32 × 32 pixels.