1. Introduction
Urban garbage management constitutes a critical component of municipal environmental governance and public health maintenance [
1,
2]. With the acceleration of urbanization, the generation of daily garbage from residential, commercial, and industrial sources has exhibited sustained growth patterns [
3]. This surge poses formidable challenges to conventional manual cleaning methods, particularly on urban public roads, on internal roads within industrial zones, and on institutional campuses such as schools, government buildings, and corporate offices. The persistent accumulation of garbage at these sites not only degrades urban aesthetics but also increases the risks of environmental contamination and disease transmission [
4]. Current municipal sanitation operations are mainly based on scheduled manual inspections [
5], an approach increasingly constrained by two limitations: rising labor costs and inherent human inefficiency. Field observations indicate that a substantial proportion of sudden garbage overflow incidents escape timely detection, while manual recording methods are limited in how accurately they can identify pollution sources [
6]. These systemic deficiencies impede the realization of smart city management. The emergence of artificial intelligence and robotic technologies presents transformative potential for urban sanitation systems. Autonomous garbage collection vehicles equipped with multimodal sensors and machine vision systems have demonstrated measurable improvements in operational efficiency and emergency response capabilities compared with traditional methods [
7]. These advancements are primarily driven by the deployment of object detection and autonomous navigation technologies in self-driving waste collection vehicles, which allow for real-time identification of road litter and adaptive route planning in complex urban environments, thereby enhancing waste management efficiency. This technological evolution positions automatic garbage detection as the foundational capability for modern urban environmental management [
8]. Accurate real-time detection of garbage distribution patterns across diverse urban roads has consequently emerged as the primary research frontier in intelligent municipal governance.
Early research on deep learning-based garbage detection primarily focused on single-object detection in individual images, where scenes typically contained either a solitary target or sparse instances, and research priorities emphasized optimizing feature representation with convolutional neural networks. Liu et al. [
9] enhanced cross-channel feature interaction by embedding a lightweight attention mechanism into YOLOv3, proposing the ECA-YOLO framework. Jiang et al. [
10] improved rural garbage detection in YOLOv5 by integrating an attention fusion mechanism, adding a dedicated small object detection layer, and adopting the CIoU loss function with Adam optimization. While these methods achieved notable progress in controlled environments, they revealed three critical challenges in real-world applications: (1) drastic scale variations in garbage targets caused by varying sensing distances, leading to severe accuracy degradation for small, distant objects; (2) category distribution shifts due to geographic and container-type dependencies in complex scenes; and (3) insufficient labeled samples in specialized scenarios such as remote areas.
Researchers have explored attention mechanisms, hybrid architectures, and data-efficient strategies to address the aforementioned challenges. Guo et al. [
11] incorporated CBAM attention modules and focal loss into YOLOv4 to enhance feature discriminability and alleviate class imbalance. Zhao et al. [
12] designed Skip-YOLO, with large convolutional kernels and dense blocks, to expand receptive fields and strengthen shallow–deep feature sharing, improving sensitivity to subtle intra-class variations. Tang et al. [
13] proposed the RGB–Thermal Salient Object Detection (RGB-T SOD) task for accurately identifying salient objects from aligned RGB and thermal image pairs. They introduced ConTriNet, a robust, confluent, triple-flow architecture that employed a “Divide-and-Conquer” strategy to extract both modality-specific and complementary features through a unified encoder and specialized decoders, thereby enhancing saliency prediction robustness and accuracy. Quan et al. [
14] proposed the Centralized Feature Pyramid (CFP) network, which enhanced intra-layer feature representation through a globally explicit visual center mechanism. This approach effectively captured local information in crucial corner regions, addressing the limitation of existing visual feature pyramids that overly focus on inter-layer interactions while neglecting intra-layer regulation. Li et al. [
15] proposed DN-DETR, which accelerated DETR convergence and improved accuracy by injecting noisy ground-truth boxes into the Transformer decoder during training to alleviate the instability of bipartite graph matching in the early training stages. Meng et al. [
16] proposed Conditional DETR, introducing a conditional spatial query to the cross-attention mechanism to reduce dependence on content embeddings. This design accelerated DETR training convergence. Wang et al. [
17] proposed YOLOv9, a next-generation real-time object detection algorithm, in 2024. It introduced a learnable re-parameterization module (RepDet) and improved feature extraction structures, achieving a better balance between accuracy and efficiency. However, its training architecture was more complex than that of traditional YOLO models, and further optimization was needed for deployment on resource-constrained devices. In 2021, Sun et al. [
18] proposed Sparse R-CNN, a two-stage object detection algorithm based on the concept of sparse learning. It replaced traditional dense candidate box generation with a fixed number of learnable proposals, simplifying the detection pipeline. However, its inference speed was relatively slow, and the use of a fixed number of proposals may be insufficient for complex scenes with many objects, leading to potential missed detections. Zhang et al. [
19] proposed DINO, an advanced end-to-end object detector that improves detection accuracy and convergence speed through denoising training, hybrid query initialization, and a dual prediction strategy, demonstrating excellent performance on the COCO dataset. However, its relatively slow inference speed limited its widespread use in applications with high real-time requirements. Regarding data scarcity, Qin et al. [
20] proposed AL-DETR, integrating active learning into DETR for efficient training with limited annotations. Inspired by Transformer’s success in NLP, Zhao et al. [
21] developed a dual-branch network combining the CNN’s local feature extraction with Transformer’s global modeling, while Sun et al. [
22] enhanced YOLOv5-OCDS using omnidimensional dynamic convolutions and decoupled heads for task-specific adaptability. Notably, balancing accuracy and efficiency remains an open issue. Tan et al. [
23] improved multi-scale detection through spatial pyramid attention at the cost of increased computational complexity. Sun et al. [
24] achieved a lightweight–accuracy trade-off in YOLOv8n via GhostNet modules and an asymptotic feature pyramid design.
The introduction of DETR (Detection Transformer) by Carion et al. [
25] marked a paradigm shift in object detection by reformulating the task as a sequence-to-sequence problem using Transformer architectures [
6]. This end-to-end framework employed a Transformer encoder–decoder architecture, treating object detection as a set prediction task. It used a CNN to extract image features, leveraged self-attention mechanisms to establish global dependencies, and directly output bounding boxes and categories in parallel. Its core innovation lay in the Hungarian loss based on bipartite matching, which enforced a unique alignment between predictions and ground truth, eliminating traditional anchor designs and non-maximum suppression (NMS) post-processing. DETR computed pixel-wise correlations across the entire feature map through self-attention, whereas CNNs relied on localized receptive fields. This demonstrated that Transformers inherently captured broader contextual relationships than CNNs, as self-attention mechanisms enabled global interaction between all spatial positions in a single layer.
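The one-to-one alignment described above can be illustrated with a minimal sketch. It assumes a simplified matching cost (negative class probability plus a weighted L1 box distance) and omits the generalized IoU term used in DETR. The weight `l1_weight`, the toy predictions, and the exhaustive search that stands in for the Hungarian algorithm are all illustrative assumptions, not the original implementation.

```python
from itertools import permutations


def l1_distance(box_a, box_b):
    """L1 distance between two boxes given as (cx, cy, w, h) in [0, 1]."""
    return sum(abs(a - b) for a, b in zip(box_a, box_b))


def match_cost(pred, target, l1_weight=5.0):
    """Simplified cost: reward class probability, penalize box error."""
    class_prob = pred["probs"][target["label"]]
    return -class_prob + l1_weight * l1_distance(pred["box"], target["box"])


def bipartite_match(preds, targets):
    """Return {target index: prediction index} with minimal total cost.

    Exhaustive search stands in for the Hungarian algorithm; both yield an
    optimal one-to-one assignment, but this version only suits tiny inputs.
    """
    best_cost, best_assignment = float("inf"), ()
    for pred_indices in permutations(range(len(preds)), len(targets)):
        cost = sum(
            match_cost(preds[p], targets[t]) for t, p in enumerate(pred_indices)
        )
        if cost < best_cost:
            best_cost, best_assignment = cost, pred_indices
    return dict(enumerate(best_assignment))


# Toy example (hypothetical values): three queries, two ground-truth objects.
preds = [
    {"probs": [0.1, 0.9], "box": (0.50, 0.50, 0.20, 0.20)},
    {"probs": [0.8, 0.2], "box": (0.20, 0.30, 0.10, 0.10)},
    {"probs": [0.5, 0.5], "box": (0.80, 0.80, 0.30, 0.30)},
]
targets = [
    {"label": 0, "box": (0.21, 0.31, 0.10, 0.10)},
    {"label": 1, "box": (0.52, 0.49, 0.20, 0.22)},
]
assignment = bipartite_match(preds, targets)
unmatched = set(range(len(preds))) - set(assignment.values())
```

In this toy case each ground-truth object is assigned exactly one query, and the remaining query is left unmatched, which corresponds to a "no object" prediction during training. Because no two queries share a target, duplicate suppression such as NMS is not required.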
However, the application of the DETR framework to object detection has also revealed its limitations and shortcomings, mainly reflected in the following aspects: (1) slow training convergence due to global dependency learning in self-attention and the unstable optimization of bipartite matching loss during early training stages; (2) suboptimal performance in small object detection, attributed to the lack of explicit multi-scale feature fusion mechanisms (e.g., FPN) and fixed query slots that limit adaptability to dense small objects; (3) quadratic computational complexity with respect to feature map resolution, leading to high memory consumption for high-resolution images; (4) a mismatch between the classification scores and localization accuracy of queries leading to suboptimal localization performance and degraded Average Precision at high IoU thresholds (mAP75); and (5) fixed prediction slots, where the preset number of object queries (e.g., 100) causes missed detections for crowded scenes or redundant “no object” predictions for sparse scenes. Subsequent studies enhanced DETR’s capabilities while addressing its inherent limitations.
Zhu et al. [
7] proposed Deformable DETR, which replaced global attention with deformable attention mechanisms to reduce computational complexity by focusing on sparse key feature points, improving both efficiency and small-object detection accuracy. Building upon this, Liu et al. [
26] introduced DAB-DETR, where dynamic anchor boxes were implemented as learnable parameters iteratively refined in the Transformer decoder, enhancing spatial localization precision. To mitigate query misalignment during training, Li et al. [15] developed DN-DETR by injecting noise into positive queries and integrating a denoising task, forcing the model to recover accurate prediction patterns. In 2023, Pu et al. [
27] proposed Rank-DETR to address the mismatch between classification scores and localization accuracy in DETR. They introduced a rank-oriented architectural design that promoted positive predictions and suppressed negative ones to ensure a lower false positive rate. Additionally, they developed rank-oriented loss functions and matching cost that prioritized predictions with higher localization accuracy during ranking, thereby improving the Average Precision (AP) at high IoU thresholds. In 2023, Lv et al. [
28] further advanced the framework by designing RT-DETR with an Attention-based Intra-scale Feature Interaction (AIFI) module for intra-scale feature refinement and a CNN-based Cross-scale Feature Fusion (CCFM) module for multi-scale representation, complemented by dynamic query–target matching to accelerate convergence and boost small-object detection.
Despite advancements in the end-to-end detection frameworks of DETR and its variants, they still exhibit notable limitations. Small-scale object detection remains a persistent challenge, as the fixed-size feature maps generated by multi-head self-attention mechanisms struggle to capture discriminative features for low-resolution targets. Furthermore, background interference in complex scenes frequently results in localization drift or false positive detections, particularly for objects lacking clear contrast boundaries. These limitations collectively highlight the need for enhanced feature representation learning in complex visual environments. In this paper, we present RGD-DETR, a real-time detection framework that achieves promising performance on challenging road debris datasets. Our work addresses the limitations of DETR-based detectors in road garbage detection through the following four contributions:
(1) An enhanced feature pyramid module utilizing channel attention mechanisms is designed to improve the effectiveness of feature learning.
(2) A state space model is incorporated to capture the long-range dependencies among image pixels, thereby obtaining high-quality feature representations.
(3) A Dynamic Sorting-aware Decoder is adopted to integrate a dynamic scoring module and a query-sorting module between decoder layers, which enables the model to prioritize high-confidence predictions.
(4) A classification- and localization-oriented loss and matching cost are introduced to improve localization precision and achieve superior performance at high IoU thresholds.
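To make the query-sorting idea in the third contribution concrete, the following is a minimal sketch of reordering object queries by a confidence score between two decoder layers. It assumes that each query already has a scalar score. The function name, the toy feature vectors, and the score values are hypothetical and do not reproduce the dynamic scoring module or the actual decoder of RGD-DETR.

```python
def sort_queries_by_score(queries, scores):
    """Reorder queries so that higher-scoring ones come first.

    queries: list of per-query feature vectors (lists of floats).
    scores:  list of confidence scores, one per query.
    Returns the reordered queries and the permutation that was applied.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [queries[i] for i in order], order


# Hypothetical queries and scores produced after one decoder layer.
queries = [[0.1, 0.2], [0.9, 0.4], [0.3, 0.3]]
scores = [0.20, 0.85, 0.55]
sorted_queries, order = sort_queries_by_score(queries, scores)
```

After sorting, the next decoder layer would receive the queries in descending order of confidence, so that the most promising predictions occupy the leading positions.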
3. Experiments
3.1. Dataset
Considering the imbalance in object sizes (large, medium, and small) and the insufficient number of categories in public datasets for road garbage detection, a novel road garbage detection dataset was constructed in this study. The dataset is composed of images from multiple public datasets, AI-generated images, and self-collected images, totaling over 60,000 images. It covers a wide range of urban road scenes, including sandy, grassy, and concrete surfaces, and includes various common types of roadside garbage as well as images captured under different weather conditions, providing a solid foundation for deploying detection models in real-world environments. A large number of images containing small objects and occluded targets were deliberately included to reflect these challenging detection conditions. Each image is accompanied by accurate bounding boxes and corresponding category labels. Furthermore, the dataset follows the standard COCO format, ensuring compatibility with mainstream object detection frameworks. The distribution of object sizes within the dataset is shown in
Table 2. The sample images of the dataset are illustrated in
Figure 10.
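As an illustration of the COCO annotation layout and of how objects are conventionally grouped into small, medium, and large, a minimal sketch follows. It assumes the standard COCO area thresholds of 32² and 96² pixels. The file name, category names, identifiers, and box values are hypothetical and are not taken from the dataset itself.

```python
from collections import Counter

SMALL_MAX_AREA = 32 ** 2    # COCO convention: "small" below 32 x 32 pixels
MEDIUM_MAX_AREA = 96 ** 2   # COCO convention: "medium" below 96 x 96 pixels


def size_bucket(annotation):
    """Classify a COCO-style annotation by its pixel area."""
    area = annotation["area"]
    if area < SMALL_MAX_AREA:
        return "small"
    if area < MEDIUM_MAX_AREA:
        return "medium"
    return "large"


# Hypothetical COCO-format fragment; bbox is [x, y, width, height] in pixels.
dataset = {
    "images": [
        {"id": 1, "file_name": "road_0001.jpg", "width": 1280, "height": 720}
    ],
    "categories": [
        {"id": 1, "name": "cigarette_butt"},
        {"id": 2, "name": "can"},
        {"id": 3, "name": "plastic_bag"},
    ],
    "annotations": [
        {"id": 1, "image_id": 1, "category_id": 1,
         "bbox": [400, 510, 18, 9], "area": 162, "iscrowd": 0},
        {"id": 2, "image_id": 1, "category_id": 2,
         "bbox": [620, 430, 60, 45], "area": 2700, "iscrowd": 0},
        {"id": 3, "image_id": 1, "category_id": 3,
         "bbox": [100, 300, 220, 160], "area": 35200, "iscrowd": 0},
    ],
}

size_counts = Counter(size_bucket(a) for a in dataset["annotations"])
```

Counting annotations per bucket in this way yields the kind of size statistics that a table of object sizes would report, and the same buckets underlie the scale-specific metrics mAPs, mAPm, and mAPl.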
3.2. Parameter Settings
The experiments in this study were conducted on an Ubuntu 18.04 system equipped with four NVIDIA GeForce RTX 4090 graphics cards (manufactured by Taiwan Semiconductor Manufacturing Company Limited, Hsinchu, Taiwan). The deep learning framework was PyTorch (v2.1.0), running on Python v3.8 and accelerated by the CUDA v11.2 and cuDNN v8.1.1 libraries. All training hyperparameters were fixed before training. Based on empirical principles and prior experience from the related literature [
28], we set the key training hyperparameters. Specifically, the learning rate and weight decay were both set to 0.0001 to ensure stable convergence while preventing overfitting, which aligns with common practice in Transformer-based detection models. The AdamW optimizer was configured with momentum coefficients (β1, β2) = (0.9, 0.999), as widely adopted in existing object detection frameworks. Considering the experimental environment and hardware configuration, the batch size was set to 16 images. Furthermore, the learning curves observed during training (as shown in
Figure 11) exhibit smooth convergence around the 72nd epoch with only mild oscillations, suggesting that optimization is stable under these settings. These observations indicate that the model trains reliably in the current training scheme and computational setting.
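For reference, the reported settings can be collected in a small configuration object. This is only a sketch: the class and field names are hypothetical, and values not stated in the text (for example, the learning-rate schedule or the number of epochs) are deliberately left out.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class TrainConfig:
    """Training settings as reported in the text; field names are illustrative."""
    optimizer: str = "AdamW"
    learning_rate: float = 1e-4
    weight_decay: float = 1e-4
    betas: tuple = (0.9, 0.999)
    batch_size: int = 16
    num_gpus: int = 4


config = TrainConfig()
settings = asdict(config)

# In PyTorch, these values would typically be passed as
# torch.optim.AdamW(params, lr=..., betas=..., weight_decay=...).
```

Keeping the settings in one immutable object makes it straightforward to log them alongside each run and to reuse them when retraining the compared models.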
A set of evaluation metrics was employed to assess the model’s performance: mean Average Precision (mAP), mean Average Precision at different target scales (mAPs, mAPm, and mAPl), and mean Average Precision at specific IoU thresholds (mAP50 and mAP75). Additionally, Giga-FLOPs (GFLOPs) was used to quantify the computational cost of the model, and the total number of model parameters (Params) was used to reflect the number of parameters that need to be learned. To further reflect practical applicability, Frames Per Second (FPS) was included as a measure of inference speed and real-time processing capability.
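The role of the IoU threshold in mAP50 and mAP75 can be shown with a short sketch. It assumes axis-aligned boxes in (x1, y1, x2, y2) pixel format; the two boxes are hypothetical values chosen for illustration.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


ground_truth = (10, 10, 50, 50)   # hypothetical ground-truth box
prediction = (15, 15, 55, 55)     # hypothetical prediction, shifted by 5 px

overlap = iou(ground_truth, prediction)
counts_at_50 = overlap >= 0.50    # true positive under the mAP50 criterion
counts_at_75 = overlap >= 0.75    # true positive under the mAP75 criterion
```

Here the overlap is roughly 0.62, so the same prediction counts as a correct detection at the 0.50 threshold but not at the 0.75 threshold. This is why mAP75 is the more sensitive indicator of localization quality.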
3.3. Ablation Experiments
To validate the effectiveness of each component of the proposed method, an ablation study was conducted on the self-built road garbage dataset. The components are as follows: (1) the Improved Feature Pyramid Structure (IFPS), (2) the State-space Feature Interaction Module (SFIM), (3) the Dynamic Sorting-aware Decoder (DS-Decoder), and (4) the classification- and localization-oriented loss and matching cost (CLOLMC). The experimental results are shown in
Table 3.
Experiment 1 in
Table 3 served as the baseline, with a mAP of 44.3% and a small-target detection accuracy (mAPs) of 26.4%. After the introduction of the Improved Feature Pyramid Structure (IFPS) (Experiment 2), the overall mAP increased to 44.7%, and the detection accuracy improved at all target scales (mAPs, mAPm, and mAPl), especially for small targets (mAPs), which indicates the effectiveness of the IFPS in representing multi-scale features. With the addition of the State-space Feature Interaction Module (SFIM) (Experiment 3), the method achieved a mAP of 45.5%, mAPs of 28.5%, and mAP75 (mAP at an IoU threshold of 0.75) of 40.0%. These improvements in detection accuracy reflect the SFIM’s ability to capture relationships between distant pixels in an image. In Experiment 4, the introduction of the Dynamic Sorting-aware Decoder (DS-Decoder) increased the mAP to 45.9%, while also improving the mAP50, mAP75, mAPs, mAPm, and mAPl. Finally, with the addition of the classification- and localization-oriented loss and matching cost (CLOLMC) (Experiment 5), the mAP reached 46.1% and the mAP75 reached 50.3%, indicating improved localization performance. In conclusion, the ablation results support the contribution of each component to detection performance.
The modules are not simply combined. Instead, they mutually promote and synergistically enhance each other. First, the IFPS and the SFIM help each other, enabling the network to accurately capture local details while effectively integrating global information. Next, the DS-Decoder further optimizes the query structure based on this foundation, allowing the model to focus on high-quality queries. Finally, the CLOLMC ensures coordinated improvements in classification and localization performance. It is precisely the close cooperation of these modules at different levels—from feature extraction and prediction decoding to matching and loss design—that empowers the model to achieve stronger detection capabilities in complex scenarios.
3.4. Comparison with the State-of-the-Art Methods
To validate the performance of our proposed RGD-DETR model, we conducted comparative experiments on the self-built road garbage dataset against state-of-the-art methods, including YOLOv5 [
37], YOLOv8 [
38], DETR [
25], Deformable-DETR [
7], DAB-DETR [
26], Rank-DETR [
27], and RT-DETR [
28]. During the experiments, the number of training epochs for each model was not predetermined but was dynamically determined based on its actual convergence behavior. Training was terminated once the model’s performance became stable, ensuring that each model achieved its optimal performance. The experimental results are shown in
Table 4.
Table 4 compares the proposed RGD-DETR with current state-of-the-art object detection methods. Compared with the YOLO series, RGD-DETR shows an advantage in overall detection accuracy: YOLOv5 has a mAP of 42.1%, YOLOv8 improves this to 45.5%, and RGD-DETR reaches 46.1%, gains of 4.0 and 0.6 percentage points, respectively. At the high IoU threshold (mAP75), RGD-DETR reaches 50.3% compared with 49.9% for YOLOv8. For small object detection (mAPs), RGD-DETR achieves 28.7% compared with 25.3% for YOLOv5 and 28.1% for YOLOv8, gains of 3.4 and 0.6 percentage points, respectively. Moreover, RGD-DETR has fewer parameters, lower GFLOPs, and faster inference speed. This efficient use of computational resources enables RGD-DETR to maintain high detection performance while offering stronger real-time processing capabilities and reduced hardware demands, demonstrating an excellent balance between performance and efficiency.
Our RGD-DETR also outperforms existing DETR-based methods. The mAP of the original DETR model was only 35.7%, while Deformable-DETR, DAB-DETR, Rank-DETR, and RT-DETR reached 37.9%, 38.7%, 43.5%, and 44.3%, respectively. Compared with RT-DETR, our RGD-DETR increased mAP from 44.3% to 46.1%, mAP75 from 47.8% to 50.3%, and mAPs from 26.4% to 28.7%, gains of 1.8, 2.5, and 2.3 percentage points, respectively. RGD-DETR is therefore superior to the compared DETR variants, especially for small target detection. Although RGD-DETR has a larger number of parameters than RT-DETR (65 M vs. 42 M), it reduces GFLOPs to 121 G (from RT-DETR’s 136 G) and increases inference speed to 116 FPS, surpassing RT-DETR’s 108 FPS. In practical deployment scenarios, inference speed and detection accuracy are often more critical than an increase in model size. Through effective structural optimization and pruning, RGD-DETR achieves a superior balance by delivering faster inference and higher accuracy while maintaining manageable computational complexity, making it well-suited for resource-constrained environments. Overall, our RGD-DETR demonstrates promising experimental results in road garbage detection, showing potential advantages over the compared state-of-the-art methods.
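The comparison with RT-DETR can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted in the text; the dictionary keys and the helper function are illustrative names. Accuracy gains are expressed in percentage points, whereas cost and speed changes are expressed as relative percentages.

```python
rt_detr = {"mAP": 44.3, "mAP75": 47.8, "mAPs": 26.4,
           "params_M": 42, "gflops": 136, "fps": 108}
rgd_detr = {"mAP": 46.1, "mAP75": 50.3, "mAPs": 28.7,
            "params_M": 65, "gflops": 121, "fps": 116}


def relative_change(new, old):
    """Relative change in percent, rounded to one decimal place."""
    return round(100.0 * (new - old) / old, 1)


# Absolute accuracy gains in percentage points.
accuracy_gain = {
    key: round(rgd_detr[key] - rt_detr[key], 1)
    for key in ("mAP", "mAP75", "mAPs")
}

# Relative changes in cost and speed.
cost_change = {
    key: relative_change(rgd_detr[key], rt_detr[key])
    for key in ("params_M", "gflops", "fps")
}
```

Under these figures, the accuracy gains are 1.8, 2.5, and 2.3 percentage points, GFLOPs fall by about 11.0%, FPS rises by about 7.4%, and the parameter count grows by about 54.8%.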
3.5. Visualization
We also qualitatively compared the detection results generated by our proposed RGD-DETR model with those of several state-of-the-art methods, including YOLOv5, YOLOv8, and various DETR variants. As shown in
Figure 12, our RGD-DETR consistently shows more accurate localization, fewer missed small targets, and more reliable detection of occluded objects across road, sand, and grass scenes.
In the case of the YOLO series, as can be seen from
Figure 12, YOLOv5 exhibits a number of missed detections in all four columns, especially for small and occluded targets. Although YOLOv8 captured more garbage items, it failed to detect small cigarette butts in the first column and incorrectly classified partially occluded cups as cans in the second column. YOLOv8 also produced obvious localization errors in the third column and missed occluded plastic bags in the fourth column.
Similarly, the DETR series also exhibits detection errors. The DETR and Deformable-DETR models produced missed detections. DAB-DETR incorrectly detected the can at the bottom of the first column as a battery and the occluded bottle in the fourth column as a can. Although Rank-DETR exhibits strong localization ability for garbage objects, it still produces missed detections in each column of images. The RT-DETR model, which had the best detection performance among the compared methods, also erroneously detected small targets, i.e., cigarette butts and the pericarp in the first column. The cans and clothes in the second and third columns were also missed. In the occlusion scene, the plastic bag in the fourth column was missed by the RT-DETR model.
In contrast, our proposed RGD-DETR model achieves better detection performance. It accurately detects small-scale and occluded targets and localizes them precisely. We attribute this to the incorporation of the channel attention mechanism, the state-space model, the Dynamic Sorting-aware Decoder, and the classification- and localization-oriented loss and matching cost. These qualitative results indicate that our proposed model detects small and occluded targets better than the compared state-of-the-art methods.
4. Conclusions
This paper proposes the RGD-DETR model to address the challenges in road garbage detection, such as small object detection and occlusion. An Improved Feature Pyramid Structure is designed for efficient multi-scale feature fusion, and a state-space feature interaction module is introduced to enhance global representation. To address the mismatch between classification confidence and localization quality in RT-DETR, a Dynamic Sorting-aware Decoder incorporating dynamic scoring and query sorting for better query selection is proposed. Additionally, the classification- and localization-oriented loss and matching cost are adopted to improve bounding box accuracy. The experimental results show that our RGD-DETR achieves a mAP of 46.1%, mAP75 of 50.3%, and mAPs of 28.7% on the road garbage dataset. Moreover, our RGD-DETR reduces GFLOPs while delivering faster inference speed.
Although the performance of the presented model is improved, there is still room to improve computational efficiency, especially for mobile and real-time detection. RT-DETR uses deformable attention, which introduces extra overhead due to offset computation, limiting deployment on resource-constrained devices. In contrast, RT-DETRv2 adopts fixed sampling points, reducing complexity and improving inference speed. Future work will therefore focus on further streamlining computation based on RT-DETRv2 to enhance real-time deployment on edge devices. We will also conduct comprehensive comparisons between the proposed RGD-DETR model and state-of-the-art object detection models, such as YOLOv10, RT-DETRv2, Sparse R-CNN, and DINO. By performing systematic experiments across multiple benchmark datasets and evaluation metrics, we aim to thoroughly analyze the performance of each model in terms of detection accuracy, computational efficiency, and robustness, thereby further validating the adaptability and effectiveness of RGD-DETR in complex real-world scenarios.