Article

EcoDetect-YOLOv2: A High-Performance Model for Multi-Scale Waste Detection in Complex Surveillance Environments

by Jing Su 1, Ruihan Chen 1,2, Mingzhi Li 1, Shenlin Liu 1,2, Guobao Xu 1 and Zanhong Zheng 1,*

1 School of Mathematics and Computer, Guangdong Ocean University, Zhanjiang 524088, China
2 School of Microelectronics and Communication Engineering, Chongqing University, Chongqing 400044, China
* Author to whom correspondence should be addressed.
Sensors 2025, 25(11), 3451; https://doi.org/10.3390/s25113451
Submission received: 11 April 2025 / Revised: 21 May 2025 / Accepted: 27 May 2025 / Published: 30 May 2025

Abstract

Conventional waste monitoring relies heavily on manual inspection, while most detection models are trained on close-range, simplified datasets, limiting their applicability for real-world surveillance. Even with surveillance imagery, challenges such as cluttered backgrounds, scale variation, and small object sizes often lead to missed detections and reduced robustness. To address these challenges, this study introduces EcoDetect-YOLOv2, a lightweight and high-efficiency object detection model developed using the Intricate Environment Waste Exposure Detection (IEWED) dataset. Building upon the YOLOv8s architecture, EcoDetect-YOLOv2 incorporates a P2 small-object detection layer to enhance sensitivity to small objects. The integration of an efficient multi-scale attention (EMA) mechanism prior to the P2 head further improves the model’s capacity to detect small-scale targets, while bolstering robustness against cluttered backgrounds and environmental noise, as well as generalizability across scale variations. In the feature fusion stage, a Dynamic Upsampling Module (Dysample) replaces traditional nearest-neighbor upsampling to yield higher-quality feature maps, thereby facilitating improved discrimination of overlapping and degraded waste particles. To reduce computational overhead and inference latency without sacrificing detection accuracy, Ghost Convolution (GhostConv) replaces conventional convolution layers within the neck. Based on this, a GhostResBottleneck structure is proposed, along with a novel ResGhostCSP module—designed via a one-shot aggregation strategy—to replace the original C2f module. Experiments conducted on the IEWED dataset, which features multi-object, multi-class, and highly complex real-world scenes, demonstrate that EcoDetect-YOLOv2 outperforms the baseline YOLOv8s by 1.0%, 4.6%, 4.8%, and 3.1% in precision, recall, mAP50, and mAP50:95, respectively, while reducing the parameter count by 19.3%. These results highlight the model’s effectiveness in real-time, multi-object waste detection, providing a scalable and efficient tool for automated urban and digital governance.

1. Introduction

With the rapid advancement of the global economy and accelerating urbanization, municipal solid waste (MSW) generation has continued to rise, posing a significant challenge to ecological sustainability and public health. Projections indicate that annual global waste production will surge from 2.01 billion metric tons in 2016 to 3.4 billion metric tons by 2050 [1], exerting severe environmental and societal pressures. Efficient and scientifically informed waste management is critical for mitigating environmental pollution and reducing the risk of disease transmission [2], while also constituting a fundamental component of sustainable development strategies. In urban environments and public spaces, MSW primarily comprises paper, plastics, and metals [3], among which plastic waste, due to its widespread distribution, persistence, and potential ecological toxicity, presents a particularly severe threat to both the environment and human health [4].
Extensive research has demonstrated that, in the absence of effective waste management, numerous environmental and societal issues may arise, including urban aesthetic degradation, proliferation of pathogenic microorganisms, air and water pollution, and even long-term adverse impacts on tourism and urban economies [5]. Despite increasing public awareness of the hazards associated with waste pollution, efficient MSW management remains a formidable challenge worldwide, particularly in urban areas of developing nations and certain developed countries [6]. The extensive and dynamic distribution of waste renders traditional manual inspection-based collection and cleaning methods costly, susceptible to omissions, and potentially hazardous to sanitation workers’ occupational health. Therefore, the development of a lightweight and highly generalizable algorithm, capable of real-time waste detection and integration with urban surveillance systems, is of paramount importance for reducing manual inspection costs, advancing automation in MSW management, and fostering digital economic growth.
In recent years, the emergence of deep convolutional neural networks has revolutionized object detection by enabling end-to-end learning of discriminative feature representations directly from raw image data. Through hierarchical extraction of semantic information ranging from low- to high-level cues, contemporary detectors have surpassed the limitations of hand-crafted features, delivering substantial improvements in both accuracy and computational efficiency [7,8,9,10,11]. Among these approaches, one-stage architectures—such as the Single Shot MultiBox Detector (SSD) [12] and the You Only Look Once (YOLO) series [13]—have proven particularly effective for real-time applications, simultaneously predicting object classes and bounding boxes in a single forward pass.
Building upon these advancements, a growing body of research has focused on adapting one-stage detectors to the unique challenges posed by waste detection tasks. Liu et al. [14] proposed a YOLOv5-based model capable of effective detection in simple backgrounds (e.g., lawns, pavements). Patel et al. [15] developed various waste detection models; however, their effectiveness remained constrained due to the limited scale of available datasets. Mao et al. [16] designed a single-object waste detector based on YOLOv3 using a dataset of recyclable waste from Taiwan. Li et al. [17] introduced a cascaded detection framework combining SSD, YOLOv4, and Faster R-CNN to enhance detection reliability and reduce false positives.
Despite these advances, existing waste detection models are predominantly trained on relatively simple scenarios, facing significant challenges when deployed in surveillance camera perspectives. These challenges include the difficulty of detecting small objects, interference from complex backgrounds, occlusion of waste items, and high variability in object morphology, all of which contribute to substantial limitations in detection accuracy and real-time performance. Furthermore, many state-of-the-art detection models entail high computational complexity, making them unsuitable for real-time monitoring applications.
To address these challenges, this study introduces EcoDetect-YOLOv2, an optimized waste detection algorithm based on an improved YOLOv8 architecture, leveraging the waste detection dataset constructed by Liu et al. [18] in a surveillance camera environment. The primary contributions of this study are as follows:
(1) A novel Cross-Stage Partial (CSP) module, ResGhostCSP, was designed by integrating the GhostConv operation with a one-shot aggregation strategy. This approach reduces computational complexity and inference time while maintaining detection accuracy.
(2) A new small-object detection layer, P2, was incorporated to retain finer-scale details and positional information, thereby improving the accuracy and efficiency of small-object waste detection.
(3) An EMA mechanism was introduced before the P2 detection head, enhancing model robustness in complex and noisy environments while improving cross-scale object generalization.
(4) In the neck network, traditional nearest-neighbor upsampling was replaced with Dysample upsampling, generating higher-quality feature maps and improving the model’s ability to distinguish overlapping waste objects.
The remainder of this paper is structured as follows: Section 2 provides an overview of the original YOLOv8s architecture, multi-scale feature fusion techniques, and attention mechanisms. Section 3 describes the dataset and the proposed EcoDetect-YOLOv2 algorithm. Experimental procedures and analysis are detailed in Section 4. Finally, conclusions are drawn in Section 5.

2. Related Work

2.1. YOLOv8

As one of the most advanced object detection algorithms to date, YOLOv8 demonstrates exceptional performance in both detection accuracy and computational efficiency, surpassing the majority of existing object detection methods. Therefore, this study selects YOLOv8 as the baseline model for comparative analysis.
As illustrated in Figure 1, YOLOv8 employs Darknet53 as its backbone network and integrates the Path Aggregation Feature Pyramid Network (PAFPN) during the feature fusion stage to enhance multi-scale feature representation. In the detection head design, YOLOv8 adopts an anchor-free mechanism, eliminating the need for predefined anchor boxes. This significantly reduces the number of bounding box predictions, thereby improving the execution speed of Non-Maximum Suppression (NMS). As NMS is computationally intensive in the post-processing stage for object selection, the anchor-free design effectively decreases the computational burden during inference, enhancing real-time detection performance.
Regarding data augmentation, YOLOv8 employs Mosaic augmentation, a technique that enriches training data by randomly cropping and stitching multiple images together, thereby improving the model’s generalization capability across diverse scenarios and objects. However, Mosaic augmentation may lead to overfitting during training. To mitigate this issue, this study disables Mosaic augmentation during the final 10 epochs of training, allowing the model to converge on the original dataset without augmented distortions, thereby alleviating potential drawbacks associated with Mosaic augmentation.
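As a concrete illustration, the Ultralytics training interface exposes this behavior through a close_mosaic argument. The sketch below shows one plausible way to configure it; the dataset path and epoch count are placeholders, not the settings reported in Table 1.

```python
from ultralytics import YOLO

# Hedged training sketch: disable Mosaic augmentation for the final 10 epochs.
# "iewed.yaml" and the epoch count are illustrative placeholders.
model = YOLO("yolov8s.pt")
model.train(
    data="iewed.yaml",   # assumed dataset configuration file
    epochs=300,          # placeholder total epoch count
    imgsz=640,
    close_mosaic=10,     # turn Mosaic off for the last 10 epochs
)
```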
For loss computation, YOLOv8 incorporates the TaskAlignedAssigner from Task-Aligned Object Detection (TOOD) [19] as its dynamic target assignment strategy, optimizing the target matching mechanism. The target assignment in TOOD is formulated as follows:
$$ t = s^{\alpha} \times u^{\beta} $$
where s denotes the predicted score corresponding to the ground truth class, u represents the Intersection over Union (IoU) between the predicted and ground truth bounding boxes, t is the resulting matching quality metric that jointly reflects classification confidence and localization accuracy; the exponent α governs the relative weighting of the classification score, while β modulates the influence of the IoU-based localization term. The loss function is defined as:
$$ \mathrm{loss}(o, t) = -\frac{1}{n} \sum_{i} \left[ t_i \log o_i + (1 - t_i) \log (1 - o_i) \right] $$
where n denotes the total number of samples used in the loss computation, i indexes each sample, o_i represents the predicted probability produced by the model, and t_i is the ground truth label for the i-th sample, which implicitly defines the target probability. This task-aligned strategy enables dynamic adjustment of target assignment, thereby improving the precision of bounding box matching and enhancing the model’s overall detection performance.
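A minimal sketch of these two computations is given below; the alpha and beta values are illustrative assumptions rather than the exact hyperparameters used by YOLOv8.

```python
import torch

def task_aligned_metric(score: torch.Tensor, iou: torch.Tensor,
                        alpha: float = 1.0, beta: float = 6.0) -> torch.Tensor:
    """Matching quality t = s^alpha * u^beta (alpha and beta are illustrative)."""
    return score.pow(alpha) * iou.pow(beta)

def bce_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """loss(o, t) = -(1/n) * sum_i [t_i log o_i + (1 - t_i) log(1 - o_i)]."""
    eps = 1e-7
    pred = pred.clamp(eps, 1 - eps)  # avoid log(0)
    return -(target * pred.log() + (1 - target) * (1 - pred).log()).mean()

# Example: a high classification score paired with a high IoU yields a high matching quality.
t = task_aligned_metric(torch.tensor(0.9), torch.tensor(0.8))
```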
Compared to the widely adopted YOLOv5, YOLOv8 refines several architectural components. The first convolutional layer kernel size transitions from 6 × 6 to 3 × 3, while the C3 module is replaced with the C2f module, which incorporates additional skip connections and feature splitting operations. Additionally, two convolutional layers are removed to simplify the neck module. A significant modification is observed in the head design, which transitions from a coupled head to a decoupled head, replacing the anchor-based approach of YOLOv5 with an anchor-free paradigm.

2.2. Attention Mechanism

The attention mechanism, inspired by human visual perception and neural information processing, has been extensively applied in deep learning [20]. This mechanism dynamically allocates computational resources, allowing models to focus on critical feature information while suppressing irrelevant signals, thereby improving feature representation and task awareness. Furthermore, attention mechanisms effectively reduce model redundancy, enhance robustness against noise, and improve network generalization.
In conventional neural networks, all input features are typically treated equally, leading to computational inefficiency and reduced generalization capability [21]. However, in many practical tasks, only a subset of input features significantly influences the final prediction. Treating all features uniformly can introduce unnecessary computational overhead. Additionally, deep neural networks are often regarded as “black-box” models with limited interpretability. The incorporation of attention mechanisms enables weight visualization, enhancing model interpretability and facilitating network optimization.
Attention mechanisms have demonstrated outstanding performance across various deep learning tasks. For instance, in image classification, attention mechanisms enable models to focus on discriminative regions, improving classification accuracy. In text summarization, these mechanisms help identify key information in source text, generating more precise and concise summaries [22]. By optimizing feature selection and information filtering, attention mechanisms have achieved significant advancements in computer vision, natural language processing, and speech recognition, establishing themselves as a pivotal research direction in deep learning.

2.3. Lightweight Networks

Lightweight neural networks are designed to maintain high task performance while reducing parameter size and computational complexity, making them suitable for resource-constrained environments. These networks leverage various optimization strategies, such as structural simplification, parameter redundancy elimination, and computation graph optimization, ensuring efficient deployment on mobile devices, embedded systems, and edge computing platforms. By enhancing inference speed and reducing storage overhead, lightweight networks strike a balance between accuracy and real-time performance, making them highly applicable in fields such as intelligent surveillance, autonomous driving, and industrial inspection.
To minimize computational cost while maintaining detection efficiency, several lightweight strategies have been proposed. One approach focuses on reducing the precision of network weights to decrease storage requirements and enhance computational efficiency [23]. Another method involves pruning redundant parameters to optimize computation flow. For example, MADNet [24], a compact lightweight network, employs dense connections to enhance multi-scale feature representation and feature correlation learning. Additionally, Liu et al. [25] introduced a network architecture integrating dilated convolutions and attention mechanisms, leveraging multi-scale pooling operations to improve semantic information encoding. However, these methods primarily focus on compressing pre-trained models or training small-scale networks, which may impact overall performance equilibrium.
Addressing these challenges, this study proposes an optimized lightweight detection model, EcoDetect-YOLOv2, by extending the GhostConv structure with GhostResBottleneck and ResGhostCSP modules. This approach significantly reduces computational complexity and inference time while preserving detection accuracy, thereby achieving a superior lightweight object detection framework.

3. Proposed Methodology

3.1. EcoDetect-YOLOv2 Object Detection Model

In the context of waste detection through surveillance cameras, YOLOv8 encounters limitations in detection accuracy and its capacity for detecting small objects. To address these challenges, this study selects YOLOv8s as the baseline model and proposes an efficient, lightweight object detection network—EcoDetect-YOLOv2. The overall architecture of EcoDetect-YOLOv2 is depicted in Figure 2.
Specifically, to retain finer details and positional information, an additional small object detection layer, P2, is incorporated, enhancing both the detection efficiency and accuracy for small-sized waste. Within the Neck structure, an efficient multi-scale attention (EMA) mechanism is introduced before the P2 detection head, significantly improving the capability for detecting small objects, robustness to complex backgrounds and noise, and generalization across varying object scales. Furthermore, lightweight Dysample upsampling replaces the conventional nearest-neighbor upsampling, generating higher-quality feature maps and improving the model’s ability to distinguish overlapping and worn particles during feature fusion.
Additionally, this study introduces the GhostResBottleneck structure based on GhostConv and designs an efficient cross-stage partial (CSP) module, ResGhostCSP, using a one-shot aggregation strategy. These enhancements ensure high detection accuracy while significantly improving computational efficiency, facilitating lightweight feature extraction and fusion.

3.2. Enhancements in P2 Small Object Detection Layer

The original backbone structure of YOLOv8 extracts features through a top-down hierarchical downsampling process. As network depth increases, high-level semantic feature representations are enhanced, yet feature information for smaller objects is significantly reduced. Consequently, YOLOv8 struggles to learn distinctive features for small objects in the Neck layer, leading to potential missed detections in the three detection heads of the Head layer.
The standard YOLOv8 network comprises three detection heads. When processing an input image of size 640 × 640, the multi-stage downsampling in the Backbone and subsequent feature fusion in the Neck produce detection feature maps of 20 × 20, 40 × 40, and 80 × 80, which correspond to detecting objects larger than 32 × 32, 16 × 16, and 8 × 8 pixels, respectively. However, objects smaller than 8 × 8 pixels remain challenging to detect, resulting in missed detections.
From the perspective of surveillance cameras, waste exposure detection frequently involves numerous small objects, particularly paper debris, which occupy only a few pixels within an image. Relying solely on the P3 detection head risks missing a significant number of these small targets. To mitigate this issue, this study incorporates an additional P2 (160 × 160) detection head into the network, improving the model’s ability to recognize fine-grained objects.
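The relationship between head stride, feature map resolution, and the roughly smallest detectable object size can be summarized with a few lines of arithmetic; the sketch below assumes the 640 × 640 input described above.

```python
# Grid size and approximate minimum object size implied by each detection head (640x640 input).
IMG_SIZE = 640
STRIDES = {"P2": 4, "P3": 8, "P4": 16, "P5": 32}

for head, stride in STRIDES.items():
    grid = IMG_SIZE // stride
    print(f"{head}: {grid}x{grid} feature map, targets objects roughly >= {stride}x{stride} px")
# P2: 160x160 feature map, targets objects roughly >= 4x4 px
# P3: 80x80 feature map, targets objects roughly >= 8x8 px
# ...
```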

3.3. Design of GhostResBottleneck and ResGhostCSP Based on GhostConv

3.3.1. GhostConv

Feature extraction in deep neural networks often generates a substantial number of redundant feature maps, leading to high computational costs. Although these redundant features are essential for data representation, they impose significant computational overhead. Inspired by GhostNet [26], which demonstrated the effectiveness of GhostConv in reducing computational costs, this study adopts the GhostConv mechanism to enhance computational efficiency during feature space expansion. GhostConv generates additional feature maps at a reduced computational expense, thereby lowering memory consumption during intermediate feature mapping. The fundamental structure of GhostConv is shown in Figure 3.
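The following PyTorch sketch illustrates the basic ghost-module idea (a primary convolution that produces part of the output channels plus a cheap depthwise operation that generates the remaining "ghost" maps). It is a simplified reading of GhostNet [26], not the exact implementation used in this work; kernel sizes and activations are assumptions.

```python
import torch
import torch.nn as nn

class GhostConv(nn.Module):
    """Sketch of a ghost module: half the output channels come from a standard convolution,
    the other half from a cheap depthwise 'ghost' operation on those primary features."""
    def __init__(self, c_in: int, c_out: int, k: int = 1, s: int = 1):
        super().__init__()
        c_primary = c_out // 2
        self.primary = nn.Sequential(
            nn.Conv2d(c_in, c_primary, k, s, k // 2, bias=False),
            nn.BatchNorm2d(c_primary), nn.SiLU(),
        )
        self.cheap = nn.Sequential(  # depthwise conv generates the ghost feature maps
            nn.Conv2d(c_primary, c_primary, 5, 1, 2, groups=c_primary, bias=False),
            nn.BatchNorm2d(c_primary), nn.SiLU(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.primary(x)
        return torch.cat([y, self.cheap(y)], dim=1)

# Usage example: the output has the requested channel count at the same spatial resolution.
x = torch.randn(1, 64, 80, 80)
print(GhostConv(64, 128)(x).shape)  # torch.Size([1, 128, 80, 80])
```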

3.3.2. GhostResBottleneck and ResGhostCSP Structure Design

GhostConv achieves approximately 50% of the computational cost of standard convolution (Conv) while maintaining comparable feature learning capability. Based on GhostConv, this study proposes the GhostResBottleneck structure, which integrates GhostConv with residual connections. The specific architecture of GhostResBottleneck is illustrated in Figure 4. Furthermore, to enhance the feature learning capability of CNNs, this study incorporates generalized deep learning optimization strategies [27,28,29] and employs a one-shot aggregation strategy to design an efficient, hardware-friendly cross-stage partial (CSP) module—ResGhostCSP. This module maintains high accuracy while effectively reducing computational complexity and inference time, as depicted in Figure 4.
To ensure effective feature extraction and improve network stability, residual connections are incorporated into both the GhostResBottleneck and ResGhostCSP structures. Residual connections mitigate training issues in deep networks, such as overfitting, gradient vanishing, and gradient explosion. The mathematical formulation is as follows:
$$ \mathrm{Output} = F(\mathrm{Input}, \{W_i\}) + \mathrm{Input}, $$
where Input and Output denote the input and output of the residual block, respectively, while F(Input, {W_i}) represents the nonlinear transformation applied to the input features, including convolution and activation functions. Compared to conventional feature concatenation (Concat) operations, residual connections alleviate gradient-related issues during deep network training, thereby enhancing model stability.
By combining the low-cost feature generation advantages of GhostConv with the efficient feature transmission capabilities of residual connections, GhostResBottleneck and ResGhostCSP reduce computational resource consumption while improving feature extraction. These advancements provide a robust foundation for designing lightweight, high-efficiency networks.
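Reusing the GhostConv sketch above, one plausible reading of the two proposed modules is outlined below: a GhostResBottleneck that wraps two GhostConvs in a residual connection, and a ResGhostCSP block that splits the input CSP-style and aggregates all intermediate outputs once before a final fusion convolution. The channel widths and block counts are assumptions, not the exact configuration of EcoDetect-YOLOv2.

```python
import torch
import torch.nn as nn

class GhostResBottleneck(nn.Module):
    """Two GhostConvs (see sketch above) wrapped in a residual connection: Output = F(Input) + Input."""
    def __init__(self, c: int):
        super().__init__()
        self.f = nn.Sequential(GhostConv(c, c, k=3), GhostConv(c, c, k=3))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.f(x)

class ResGhostCSP(nn.Module):
    """CSP-style split with one-shot aggregation: every intermediate output is concatenated
    exactly once and fused by a single 1x1 convolution."""
    def __init__(self, c_in: int, c_out: int, n: int = 2):
        super().__init__()
        c = c_out // 2
        self.cv1 = nn.Conv2d(c_in, c, 1, bias=False)   # transformed branch
        self.cv2 = nn.Conv2d(c_in, c, 1, bias=False)   # shortcut branch
        self.blocks = nn.ModuleList(GhostResBottleneck(c) for _ in range(n))
        self.fuse = nn.Conv2d(c * (n + 2), c_out, 1, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        ys = [self.cv2(x), self.cv1(x)]
        for block in self.blocks:
            ys.append(block(ys[-1]))          # chain of bottlenecks
        return self.fuse(torch.cat(ys, dim=1))  # one-shot aggregation of all branches
```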

3.4. Efficient Multi-Scale Attention (EMA)

Efficient Multi-scale Attention (EMA), proposed by Ouyang et al. [30], is a multi-scale attention module designed for cross-spatial learning. Unlike conventional convolutional downsampling approaches, EMA selectively reshapes certain channels into batch channels for dimensionality reduction while employing cross-spatial learning strategies for multi-scale feature extraction. Convolutional operations are widely utilized in object detection and semantic segmentation due to their superior feature learning capabilities. However, as the depth of convolutional networks increases, memory consumption and computational complexity grow significantly. Furthermore, the translation-invariant nature of convolutions can lead to feature localization, limiting the effective modeling of global information. Introducing attention mechanisms within deep networks can enhance feature discrimination.
To improve efficiency in waste detection from surveillance camera perspectives and to fully integrate multi-scale spatial and channel features, EMA is incorporated into the proposed model. Given an input feature map, the input tensor is defined as follows:
$$ X \in \mathbb{R}^{C \times H \times W}, $$
where C represents the number of channels, and H and W denote the spatial dimensions of the input feature map. The EMA mechanism (Figure 5) first partitions X into G groups of sub-features along the channel dimension:
$$ X = \left[ X_0, X_1, \ldots, X_{G-1} \right], \quad X_i \in \mathbb{R}^{C//G \times H \times W}. $$
The core objective of EMA is to learn feature semantics across multiple scales. To achieve this, EMA employs a multi-scale feature extraction strategy utilizing two 1 × 1 branches and one 3 × 3 convolutional kernel in parallel, enhancing the fusion of multi-scale features. The 1 × 1 branches leverage global average pooling, while the 3 × 3 branch extracts features across multiple paths to capture inter-channel dependencies and reduce computational complexity. This mechanism encodes features and integrates them along the height dimension. Without reducing the number of channels, it applies shared 1 × 1 convolution and partitions the output into two vectors, employing a nonlinear sigmoid function to establish cross-channel interactions. Meanwhile, the 3 × 3 branch facilitates feature interactions through convolution operations, preserving spatial structural information. The spatial information of the three branches is then encoded using two-dimensional global average pooling:
$$ \mathbb{R}_1^{1 \times C//G} \times \mathbb{R}_3^{C//G \times HW}, $$
where the two-dimensional average pooling operation is computed as:
$$ z_c = \frac{1}{H \times W} \sum_{j}^{H} \sum_{i}^{W} x_c(j, i). $$
To further enhance computational efficiency, EMA integrates softmax functions with average pooling operations for linear transformations, ultimately outputting a set of spatial attention weights to enhance feature representation. By effectively modeling multi-scale spatial and channel feature relationships, EMA ensures both accuracy and computational efficiency in waste recognition from surveillance perspectives. The core advantage of EMA lies in its efficient learning of feature semantics while integrating multi-scale information, offering an innovative solution for waste detection in complex environments.
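For reference, a condensed PyTorch sketch of the EMA module, following the structure published by Ouyang et al. [30] (grouped channels folded into the batch dimension, parallel 1 × 1 and 3 × 3 branches, and cross-spatial aggregation via softmax-weighted matrix products), is given below. The group count and normalization choices are assumptions rather than the exact settings used in EcoDetect-YOLOv2.

```python
import torch
import torch.nn as nn

class EMA(nn.Module):
    """Condensed sketch of Efficient Multi-scale Attention with cross-spatial learning."""
    def __init__(self, channels: int, groups: int = 8):
        super().__init__()
        self.g = groups                                   # channels must be divisible by groups
        c = channels // groups
        self.pool_hw = nn.AdaptiveAvgPool2d(1)            # 2D global average pooling
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))     # X Avgpool (per row)
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))     # Y Avgpool (per column)
        self.conv1x1 = nn.Conv2d(c, c, 1)                 # shared 1x1 convolution
        self.conv3x3 = nn.Conv2d(c, c, 3, padding=1)
        self.gn = nn.GroupNorm(c, c)
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, ch, h, w = x.shape
        g = x.reshape(b * self.g, -1, h, w)                # fold G groups into the batch dim
        xh = self.pool_h(g)                                # (b*G, c, h, 1)
        xw = self.pool_w(g).permute(0, 1, 3, 2)            # (b*G, c, w, 1)
        hw = self.conv1x1(torch.cat([xh, xw], dim=2))      # joint encoding along the height axis
        xh, xw = torch.split(hw, [h, w], dim=2)
        g1 = self.gn(g * xh.sigmoid() * xw.permute(0, 1, 3, 2).sigmoid())
        g2 = self.conv3x3(g)                               # 3x3 branch keeps spatial structure
        # cross-spatial learning: each branch's pooled descriptor attends to the other branch
        a1 = self.softmax(self.pool_hw(g1).reshape(b * self.g, 1, -1))   # (b*G, 1, c)
        a2 = self.softmax(self.pool_hw(g2).reshape(b * self.g, 1, -1))
        w1 = torch.bmm(a1, g2.reshape(b * self.g, -1, h * w))            # (b*G, 1, h*w)
        w2 = torch.bmm(a2, g1.reshape(b * self.g, -1, h * w))
        attn = (w1 + w2).reshape(b * self.g, 1, h, w).sigmoid()
        return (g * attn).reshape(b, ch, h, w)

# Usage example: the output keeps the input shape, with re-weighted features.
y = EMA(64)(torch.randn(2, 64, 40, 40))
```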

3.5. DySample Upsampling

DySample enhances computational resource efficiency by avoiding dynamic convolution operations and instead adopting a point-based sampling method for upsampling. In the YOLOv8 model, conventional upsampling methods typically require significant computational resources and parameters, limiting lightweight deployment in surveillance scenarios and thereby affecting small-object waste detection performance. In real-world monitoring applications, small waste objects, such as paper scraps, often occupy minimal pixel areas and are susceptible to pixel distortion, leading to the loss of fine details and posing challenges for feature learning. To address these issues, a lightweight and efficient dynamic upsampling method, DySample, is introduced as a substitute for conventional upsampling techniques. DySample improves small-object detection performance even under low-quality surveillance imaging conditions. The core principle combines point-based sampling strategies with learned sampling methods to achieve efficient upsampling, reducing computational burden while enhancing image resolution and overall model performance.
Figure 6 illustrates the point-based dynamic upsampling mechanism and modular design of DySample. The sampling point generator (Figure 6a) is a crucial component. Given an input feature map X of dimensions C × H_1 × W_1 and a sampling set δ of size 2 × H_2 × W_2, where the first two dimensions represent the x and y coordinates, the grid_sample function resamples X using coordinates from δ. This process employs bilinear interpolation to generate a new feature map X′ of size C × H_2 × W_2:
$$ X' = \mathrm{grid\_sample}(X, \delta). $$
Assuming an upsampling factor of s, the input feature map has dimensions C × H × W. To implement upsampling, a linear transformation layer is first applied, taking C input channels and outputting 2s² channels, resulting in an offset matrix O of size 2s² × H × W. The pixel shuffle algorithm then rearranges O into dimensions 2 × sH × sW, generating the final sampling set δ:
$$ O = \mathrm{linear}(X), $$
$$ \delta = G + O, $$
where G denotes the original (static) sampling grid.
While the description of the reshape operation is omitted, the final upsampled feature map X′, derived from δ and the grid_sample function, has dimensions C × sH × sW, as defined in Equation (9). Figure 6 further illustrates the DySample module’s dynamic upsampling mechanism and overall design.
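A minimal sketch of this point-based sampling pipeline (offset prediction via a 1 × 1 "linear" layer, pixel shuffle to the target resolution, and bilinear resampling with grid_sample) is shown below. The offset scaling factor and coordinate normalization details are assumptions, not the exact DySample configuration used here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DySampleSketch(nn.Module):
    """Minimal point-based dynamic upsampler: delta = G + O, then bilinear grid_sample."""
    def __init__(self, channels: int, scale: int = 2):
        super().__init__()
        self.scale = scale
        # "linear" layer producing 2*s^2 offset channels (x and y for each sub-pixel position)
        self.offset = nn.Conv2d(channels, 2 * scale * scale, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        s = self.scale
        o = self.offset(x) * 0.25                      # offsets O, small scope factor (assumption)
        o = F.pixel_shuffle(o, s)                      # (b, 2, s*h, s*w)
        # base sampling grid G in input-pixel coordinates
        ys = torch.linspace(0.5, h - 0.5, s * h, device=x.device)
        xs = torch.linspace(0.5, w - 0.5, s * w, device=x.device)
        gy, gx = torch.meshgrid(ys, xs, indexing="ij")
        grid = torch.stack([gx, gy], dim=0).unsqueeze(0) + o   # delta = G + O
        # normalise to [-1, 1] as required by grid_sample
        gx_n = grid[:, 0] / w * 2 - 1
        gy_n = grid[:, 1] / h * 2 - 1
        coords = torch.stack([gx_n, gy_n], dim=-1)     # (b, s*h, s*w, 2)
        return F.grid_sample(x, coords, mode="bilinear", align_corners=False)

# Usage example: upsamples a 40x40 map to 80x80 with content-dependent sampling points.
print(DySampleSketch(64)(torch.randn(1, 64, 40, 40)).shape)  # torch.Size([1, 64, 80, 80])
```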

4. Experimental Results and Discussion

4.1. Datasets

To evaluate the effectiveness of the proposed model, this study employs the Intricate Environment Waste Exposure Detection dataset introduced by Liu et al. [18] for performance assessment of EcoDetect-YOLOv2 and other benchmark models. This dataset encompasses nine categories of garbage detection targets, including paper trash, plastic trash, snakeskin bags, packed trash, stone waste, sand waste, cartons, foam trash, and metal waste. Representative samples from established waste detection benchmarks (Figure 7a) are juxtaposed with imagery from the proposed Intricate Environment Waste Exposure Detection (IEWED) dataset (Figure 7b). In contrast to conventional datasets—typically composed of close-range images with homogeneous backgrounds—IEWED comprises surveillance footage acquired from elevated, oblique viewpoints under variable lighting conditions and within dynamically evolving scenes. This real-world setting introduces three principal challenges for automated detection: (1) multi-scale object representation, as waste items often occupy a small, transient fraction of the visual field; (2) substantial background heterogeneity, characterized by occlusions and complex textures; and (3) inherently lower resolution and higher noise levels. These conditions necessitate the development of detection frameworks with enhanced sensitivity, accurate spatial localization capabilities, and robust feature extraction mechanisms resilient to scale variation and noise interference.
Figure 8 illustrates the composition of the dataset, providing a detailed account of both the number of images (“Image Count”) and the number of annotated object instances (“Instance Count”) across categories. To preserve the intrinsic class distribution, a hierarchically stratified sampling strategy was adopted. The dataset was partitioned into training and testing subsets following an 8:2 split, ensuring proportional representation of each category and its subordinate levels during both model training and evaluation.
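Under the simplifying assumption that each image can be assigned a single dominant category label, an 8:2 stratified split can be sketched as follows; the actual dataset uses a hierarchical stratification over categories and their subordinate levels, which this sketch does not reproduce.

```python
from sklearn.model_selection import train_test_split

def stratified_split(image_paths, dominant_labels, seed=0):
    """Hedged sketch: 8:2 train/test split that preserves the per-category image distribution."""
    return train_test_split(
        image_paths,
        dominant_labels,
        test_size=0.2,            # 8:2 partition
        stratify=dominant_labels, # keep class proportions in both subsets
        random_state=seed,
    )

# train_imgs, test_imgs, train_lbls, test_lbls = stratified_split(paths, labels)
```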

4.2. Experimental Evaluation Metrics

4.2.1. Precision and Recall

Precision (P) quantifies the proportion of correctly identified positive samples among all predicted positive instances, computed as follows:
$$ \mathrm{Precision} = \frac{TP}{TP + FP}. $$
Recall (R) measures the proportion of correctly predicted positive samples relative to all actual positive instances, defined as:
$$ \mathrm{Recall} = \frac{TP}{TP + FN}, $$
where TP denotes the number of true positive instances, FP represents false positives, and FN indicates false negatives.

4.2.2. mAP50 and mAP50:95

Mean Average Precision (mAP) evaluates the area under the Precision-Recall curve and is calculated as follows:
$$ AP = \int_0^1 P(R)\, \mathrm{d}R, $$
$$ mAP = \frac{1}{n} \sum_{i=1}^{n} AP_i, $$
where mAP50 corresponds to the mean average precision computed at an Intersection over Union (IoU) threshold of 0.5, while mAP50:95 represents a more stringent evaluation criterion by averaging precision values across IoU thresholds ranging from 0.5 to 0.95.
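The following sketch computes precision, recall, and AP from a ranked list of detections against the number of ground-truth boxes. It uses simple numerical integration of the precision-recall curve rather than any interpolation scheme a specific evaluation toolbox might apply.

```python
import numpy as np

def precision_recall_ap(tp_flags: np.ndarray, num_gt: int):
    """tp_flags: 1/0 per detection, sorted by descending confidence; num_gt: ground-truth count."""
    tp = np.cumsum(tp_flags)
    fp = np.cumsum(1 - tp_flags)
    recall = tp / max(num_gt, 1)
    precision = tp / np.maximum(tp + fp, 1e-9)
    ap = np.trapz(precision, recall)   # area under the precision-recall curve
    return precision, recall, ap

# mAP is then the mean of the per-class AP values, e.g.:
# map50 = np.mean([ap_per_class[c] for c in classes])
```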

4.2.3. Parameter Count, Floating Point Operations (FLOPs) and Frames per Second (FPS)

The total number of parameters refers to the sum of all weights and biases within the deep learning model, directly influencing storage requirements and learning capacity. Generally, larger parameter counts correspond to longer training and inference times.
FLOPs serve as a metric for computational complexity, with lower values indicating faster execution speeds. The calculation for convolutional operations is expressed as:
$$ \mathrm{FLOPs}_{\mathrm{Conv}} = \left( 2 \times C_{in} \times K^2 - 1 \right) \times W_{out} \times H_{out} \times C_{out}, $$
$$ \mathrm{FLOPs}_{\mathrm{FC}} = \left( 2 \times C_{in} - 1 \right) \times C_{out}, $$
where C_in and C_out denote the input and output channels, respectively, while K, H_out, and W_out represent the kernel size, output feature map height, and output feature map width.
Frames per second (FPS)—the reciprocal of average inference latency—provides a concise indicator of real-world throughput, capturing the rate at which the network processes input frames under a given hardware setup.
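The two quantities can be estimated directly, as sketched below: the FLOPs helper evaluates the convolution formula above, and FPS is measured as the reciprocal of the average forward-pass latency on the target GPU. The warm-up count and batch size of 1 are assumptions about the measurement protocol.

```python
import time
import torch

def conv_flops(c_in: int, c_out: int, k: int, h_out: int, w_out: int) -> int:
    """FLOPs of one convolutional layer, following the formula above."""
    return (2 * c_in * k * k - 1) * h_out * w_out * c_out

@torch.no_grad()
def measure_fps(model: torch.nn.Module, imgsz: int = 640, runs: int = 100,
                device: str = "cuda") -> float:
    """FPS as the reciprocal of mean single-image inference latency."""
    model = model.eval().to(device)
    x = torch.randn(1, 3, imgsz, imgsz, device=device)
    for _ in range(10):                 # warm-up iterations
        model(x)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(runs):
        model(x)
    torch.cuda.synchronize()
    return runs / (time.time() - start)
```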

4.3. Experimental Environment and Initial Parameter Settings

The experiments are conducted on a Windows 10 operating system, with the software executed in PyCharm 2024.1 (Professional Edition). The deep learning framework utilized includes CUDA 12.4, Python 3.11.0, and PyTorch 2.5.1. The hardware configuration comprises an NVIDIA RTX 4090D GPU with 24 GB of VRAM and an AMD Ryzen 9 7950X CPU operating at 4.50 GHz. The hyperparameter settings for training are presented in Table 1.

4.4. Performance Evaluation

To comprehensively evaluate the advancements introduced by the EcoDetect-YOLOv2 model, a comparative analysis was conducted against the baseline YOLOv8s model. The experimental results are summarized in Table 2.
The results, validated using the test dataset, demonstrate that EcoDetect-YOLOv2 achieves an improvement of 1.0% in precision, 4.6% in recall, 4.8% in mAP50, and 3.1% in mAP50:95, while simultaneously reducing parameter count by 19.3% compared to YOLOv8s.
Figure 9 illustrates the confusion matrices for the YOLOv8s baseline model and the proposed EcoDetect-YOLOv2 under the mAP50 metric, highlighting a substantial enhancement in classification accuracy across nearly all categories. Notably, EcoDetect-YOLOv2 improves the classification accuracy of the frequently misclassified small-target category paper trash by 13%. Additionally, accuracy improvements of 8% and 6% were observed for the small-object classes plastic trash and packed trash, respectively, indicating a superior ability to detect small-scale objects. Furthermore, larger-object categories such as sand waste and metal waste exhibit accuracy gains of 16% and 14%, respectively, demonstrating that the model enhances detection performance across various object scales. Meanwhile, snakeskin bag and stone waste maintain their already high accuracy, with only a slight reduction in carton classification accuracy. Overall, these findings underscore the substantial improvements in predictive accuracy and generalization capabilities achieved by EcoDetect-YOLOv2 compared to the YOLOv8s baseline.
To further assess EcoDetect-YOLOv2’s performance in garbage exposure detection from surveillance camera viewpoints, qualitative comparisons were conducted using multiple images from the test dataset, as depicted in Figure 10. The experimental results reinforce the limitations of YOLOv8s in complex scenarios.
As illustrated in Figure 10a–f, YOLOv8s exhibits severe misdetection issues in small-target detection scenarios. Specifically, Figure 10a–d shows that due to the minimal size of paper trash objects within the image, YOLOv8s fails to detect almost all instances of this category. In contrast, while EcoDetect-YOLOv2 still exhibits some level of false negatives, it is capable of effectively identifying small paper trash objects. Beyond paper trash, YOLOv8s also exhibits high false negative rates for other distant or small-scale objects. For instance, as shown in Figure 10e, a remotely placed carton is not correctly detected by YOLOv8s, whereas EcoDetect-YOLOv2 successfully identifies it. Similarly, in Figure 10f, stone waste objects with small spatial footprints are missed by YOLOv8s but are successfully detected by EcoDetect-YOLOv2.
Furthermore, YOLOv8s demonstrates significant false-positive detection errors. For example, as shown in Figure 10c, the model incorrectly classifies a faded parking line as paper trash, while in Figure 10g, the foundation of a structure is misclassified as plastic trash. In contrast, EcoDetect-YOLOv2 exhibits superior robustness in such challenging detection environments. Likewise, in Figure 10g, a nearby plastic bag is missed by YOLOv8s yet is recognized with high confidence by EcoDetect-YOLOv2.
Under low-light conditions, the detection performance of YOLOv8s further deteriorates. As shown in Figure 10i, in nighttime scenarios, YOLOv8s fails to detect clearly visible plastic trash objects, whereas EcoDetect-YOLOv2, despite reduced confidence scores, successfully identifies these targets. These experimental findings suggest that EcoDetect-YOLOv2 is well-adapted to variations in target size and feature characteristics, significantly enhancing both detection accuracy and robustness.

4.5. Analysis of Attention Mechanisms

To systematically evaluate the advantages of the Efficient Multi-scale Attention (EMA) mechanism, multiple mainstream attention mechanisms were integrated into the YOLOv8s model for comparative analysis, including Coordinate Attention (CA) [31], Squeeze-and-Excitation (SE) [32], NonlocalBlockND [33], Convolutional Block Attention Module (CBAM) [33], and Spatial-Channel Synergic Attention (SCSA) [34]. These evaluations aim to determine the effectiveness of each attention mechanism within the context of this task.
CA enhances feature spatial encoding by incorporating position sensitivity and directional awareness, thereby improving object localization accuracy. SE dynamically adjusts channel weights based on their importance to mitigate information loss caused by imbalanced channel weight allocation. NonlocalBlockND captures long-range dependencies by computing correlations between arbitrary positions in the feature map, enhancing global contextual understanding. However, its high computational complexity increases processing overhead when handling high-resolution feature maps. CBAM integrates both channel and spatial attention mechanisms, optimizing feature channel information allocation through global average pooling and max pooling, while its spatial attention module highlights critical regions by compressing channel dimensions. Despite its advantages in improving detection performance, CBAM’s computational efficiency is slightly lower than that of lightweight attention mechanisms such as EMA. SCSA overcomes traditional attention mechanisms’ limitations in local and global feature modeling by leveraging spatial-channel synergistic modeling and multi-semantic information fusion.
As shown in Table 3, integrating EMA into YOLOv8s yields the highest overall detection performance. Compared to the original YOLOv8s, EMA achieves precision, recall, mAP50, and mAP50:95 values of 68.9%, 53.5%, 57.3%, and 33.3%, respectively, with both mAP metrics attaining optimal performance. Although EMA does not achieve the highest individual precision or recall values, it provides the best balance between these two metrics, enhancing model robustness in detection tasks. Moreover, the introduction of EMA does not significantly increase computational complexity or parameter size, further validating its effectiveness in improving object detection performance while maintaining a lightweight model structure.

4.6. Results of Ablation Experiments

To evaluate the effectiveness of the proposed modifications in EcoDetect-YOLOv2, a systematic ablation study was conducted (Table 4). Model 0 corresponds to the baseline YOLOv8s, while Models 1 to 15 represent different combinations of modifications, enabling an assessment of each component’s impact on detection performance. In Table 4, a tick (“T”) signifies the inclusion of the respective algorithmic enhancement and a dash (“-”) indicates its omission.
The results in Table 4 highlight several key findings:
(1)
Impact of the P2 Small-Object Detection Layer: Model 1 demonstrates that integrating the P2 small-object detection layer enhances detection performance in surveillance camera scenarios, where small-object detection is crucial. Recall (R), mAP50, and mAP50:95 increased by 2.2%, 1.9%, and 0.4%, respectively. Although precision (P) decreased slightly, this aligns with the precision-recall trade-off, where increased recall and detection coverage may result in some false positives. However, from a holistic perspective, this trade-off is acceptable and contributes to overall model robustness.
(2)
Effectiveness of the EMA Module: Model 2 confirms that the inclusion of the EMA module significantly enhances detection capability. The EMA module leverages cross-spatial learning to reshape part of the channel dimensions into batch dimensions, processing them in groups. By integrating 3×3 convolutions at multiple scales, it enables joint modeling of channel and spatial information. The parallel sub-network structure applies multi-scale convolution to grouped sub-features, suppressing background noise and enhancing target response through a dynamic modulation mechanism while avoiding dimensional reduction issues common in traditional attention mechanisms. Furthermore, the cross-spatial feature aggregation strategy of EMA, coupled with lightweight decomposition and depthwise separable convolution, reduces computational complexity while strengthening the representation of small objects. Experimental results indicate that the EMA module improves P, R, mAP50, and mAP50:95 by 0.7%, 1.4%, 2.2%, and 1.5%, respectively, demonstrating its effectiveness in complex scenarios.
(3)
Contribution of DySample: Model 3 reveals that incorporating DySample results in significant improvements in R, mAP50, and mAP50:95 (increases of 1.0%, 0.8%, and 0.7%, respectively). DySample enhances multi-scale feature representation through dynamic upsampling and lightweight decomposition while introducing minimal additional parameters. Although P slightly decreases from 68.2% to 67.8%, this may be attributed to DySample’s increased feature sensitivity, which lowers the response threshold for edge-blurred regions and background noise. Nevertheless, overall performance gains indicate that DySample effectively balances efficiency and robustness by leveraging cross-scale feature aggregation and dynamic sampling point segmentation.
(4)
Impact of the ResGhostCSP Module: Results from Model 4 demonstrate that the proposed ResGhostCSP, based on GhostConv, significantly reduces computational complexity while maintaining detection performance. This enables simultaneous optimization of model lightweighting and detection accuracy.
(5)
Combinatorial Optimization of Multiple Components: Models 5 through 14 represent experiments combining different enhancement modules. While the inclusion of multiple modifications slightly increases computational complexity and parameter count, the overall detection accuracy improves. Notably, the introduction of ResGhostCSP effectively reduces computational costs while preserving detection performance, validating the efficacy of multi-component collaborative optimization.
(6)
Overall Performance of EcoDetect-YOLOv2: Model 15, the proposed final model in this study, exhibits superior performance in the given task. Compared with the baseline YOLOv8s, it achieves improvements of 0.5% in P, 10.9% in R, 9.0% in mAP50, and 6.1% in mAP50:95. Additionally, the parameter count is reduced by 19.3%, and computational complexity experiences only a slight increase. These results confirm that EcoDetect-YOLOv2 effectively enhances waste detection under surveillance camera perspectives while maintaining a lightweight structure. The findings offer an efficient and reliable solution for intelligent environmental monitoring and digital governance.

4.7. Comparison of Different Methods

To evaluate the performance of EcoDetect-YOLOv2, a comparative analysis was conducted against various state-of-the-art object detection models of comparable scale, including YOLOv3-tiny, YOLOv5s, YOLOv8s, and YOLO11s, as well as the medium (m) and large (l) variants of the YOLOv5, YOLOv8, and YOLO11 series. Additionally, several advanced detection models, such as Waste-YOLO, YOLOv5s-OCDS, EcoDetect-YOLO, and GCC-YOLO, were included in the comparison. All models were trained under identical experimental conditions and hyperparameter settings, with the results presented in Table 5.
As presented in Table 5, EcoDetect-YOLOv2 outperforms existing object detection models in terms of recall (R), mean Average Precision at IoU 0.5 (mAP50), and mean Average Precision at IoU 0.5:0.95 (mAP50:95), despite operating with reduced computational complexity and parameter count. While the precision (P) is not the highest among the models, the overall detection performance remains superior.
Compared to models of similar scale, including YOLOv3-tiny, YOLOv5s, YOLOv8s, and YOLO11s, EcoDetect-YOLOv2 demonstrates enhanced detection capability, likely due to the improved robustness against background interference and superior small-object detection. Conversely, YOLOv8n, despite its minimal computational complexity, exhibits limited detection performance, making it unsuitable for effective waste detection.
As the model size and complexity increase, detection performance improves across the larger variants of the YOLOv5, YOLOv8, and YOLO11 series (including the m and l variants), as well as the SSD and RTDETR series (comprising the l and x variants). However, the performance enhancements become marginal in relation to the additional computational cost incurred. Notably, YOLO11l demonstrates a lower precision compared to YOLO11m, indicating that excessive model complexity may contribute to overfitting.
Compared with state-of-the-art models such as GCC-YOLO, Waste-YOLO, and YOLOv5s-OCDS, EcoDetect-YOLOv2 achieves the highest recall and mAP scores, demonstrating enhanced robustness in challenging scenarios. In summary, EcoDetect-YOLOv2 achieves optimal detection performance while maintaining relatively low computational complexity, highlighting its efficiency and robustness for real-world waste detection applications.

5. Conclusions

This study addresses the challenge of accurately identifying and localizing multi-scale waste objects within complex backgrounds from surveillance camera perspectives by introducing a lightweight and efficient detection model, EcoDetect-YOLOv2. Built upon YOLOv8s, the model integrates a P2 small-object detection layer and an efficient multi-scale attention (EMA) mechanism, significantly enhancing small-object detection capabilities while improving robustness in complex environments and noise conditions. The adoption of the Dysample upsampling technique optimizes feature fusion and improves discrimination of overlapping objects. Furthermore, the incorporation of GhostConv and the one-shot aggregation-based ResGhostCSP module effectively reduces computational complexity and inference time while maintaining detection accuracy.
Experimental results indicate that, despite a 19.3% reduction in parameter count, EcoDetect-YOLOv2 achieves a 1.0%, 4.6%, 4.8%, and 3.1% improvement in precision, recall, mAP50, and mAP50:95, respectively, compared to the baseline YOLOv8s model. These findings validate the model’s capability for real-time multi-object waste detection in surveillance applications. Overall, EcoDetect-YOLOv2 provides a robust and efficient solution for automated urban waste management and the advancement of digital economy applications. Future research will focus on leveraging collaborative learning [38] and transfer learning strategies to enhance the model’s adaptability to larger datasets and more complex environments, as well as optimizing real-time system integration, further driving the progress of intelligent waste management systems.

Author Contributions

Conceptualization, J.S. and R.C.; Data curation, R.C., M.L. and S.L.; Formal analysis, R.C.; Funding acquisition, G.X. and Z.Z.; Investigation, R.C., M.L. and S.L.; Methodology, R.C.; Project administration, J.S., G.X. and Z.Z.; Resources, R.C.; Software, R.C.; Supervision, J.S., G.X. and Z.Z.; Validation, R.C. and S.L.; Visualization, R.C.; Writing—original draft, R.C.; Writing—review and editing, J.S., R.C., G.X. and Z.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by a special grant from the 2022 Guangdong University Quality Assurance Project by the Education Department of Guangdong Province—Course Faculty “Circuits and Electronics Course Faculty” (Guangdong Provincial Department of Education Higher Education Document [2023] No. 4), 2020 Guangdong University Quality Assurance Project by the Education Department of Guangdong Province—Guangdong Ocean University—Guangzhou Shangguan Information Technology Co., Ltd. off-campus practice teaching base for college students (Guangdong Provincial Department of Education Higher Education Document [2020] No. 19), Guangdong Provincial Science and Technology Innovation Strategy under Grant No. pdjh2023b0247, and Guangdong Ocean University Undergraduate Innovation Team Project under Grant No. CXTD2023014.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets used and analyzed during the current study are available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as potential conflicts of interest.

References

  1. Kaza, S.; Yao, L.; Bhada-Tata, P.; Van Woerden, F. What a Waste 2.0: A Global Snapshot of Solid Waste Management to 2050. Urban Development Series; World Bank: Washington, DC, USA, 2018. [Google Scholar]
  2. Torkashvand, J.; Emamjomeh, M.M.; Gholami, M.; Farzadkia, M. Analysis of cost–benefit in life-cycle of plastic solid waste: Combining waste flow analysis and life cycle cost as a decision support tool to the selection of optimum scenario. Environ. Dev. Sustain. 2021, 23, 13242–13260. [Google Scholar] [CrossRef]
  3. Gholami, M.; Torkashvand, J.; Rezaei Kalantari, R.; Godini, K.; Jonidi Jafari, A.; Farzadkia, M. Study of littered wastes in different urban land-uses: An environmental status assessment. J. Environ. Health Sci. Eng. 2020, 18, 915–924. [Google Scholar] [CrossRef] [PubMed]
  4. Araújo, M.C.B.; Costa, M.F. A critical review of the issue of cigarette butt pollution in coastal environments. Environ. Res. 2019, 172, 137–149. [Google Scholar] [CrossRef]
  5. Asensio-Montesinos, F.; Anfuso, G.; Ramírez, M.O.; Smolka, R.; Sanabria, J.G.; Enríquez, A.F.; Arenas, P.; Bedoya, A.M. Beach litter composition and distribution on the Atlantic coast of Cádiz (SW Spain). Reg. Stud. Mar. Sci. 2020, 34, 101050. [Google Scholar] [CrossRef]
  6. Farzadkia, M.; Alinejad, N.; Ghasemi, A.; Rezaei Kalantary, R.; Esrafili, A.; Torkashvand, J. Clean environment index: A new approach for litter assessment. Waste Manag. Res. 2022, 41, 368–375. [Google Scholar] [CrossRef]
  7. Peng, X.; Zou, C.; Qiao, Y.; Peng, Q. Action recognition with stacked fisher vectors. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014; Springer: Berlin/Heidelberg, Germany, 2014; pp. 581–595. [Google Scholar]
  8. Salimi, I.; Dewantara, B.S.B.; Wibowo, I.K. Visual-based trash detection and classification system for smart trash bin robot. In Proceedings of the 2018 International Electronics Symposium on Knowledge Creation and Intelligent Computing (IES-KCIC), Bali, Indonesia, 29–30 October 2018; pp. 378–383. [Google Scholar]
  9. Chen, L.; Wu, X.; Sun, C.; Zou, T.; Meng, K.; Lou, P. An intelligent vision recognition method based on deep learning for pointer meters. Meas. Sci. Technol. 2023, 34, 055410. [Google Scholar] [CrossRef]
  10. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. In Proceedings of the 29th Annual Conference on Neural Information Processing Systems (NIPS 2015), Montreal, QC, Canada, 11–12 December 2015. [Google Scholar]
  11. Nowakowski, P.; Pamuła, T. Application of deep learning object classifier to improve e-waste collection planning. Waste Manag. 2020, 109, 1–9. [Google Scholar] [CrossRef]
  12. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single Shot MultiBox Detector. In Proceedings of the European Conference on Computer Vision (ECCV 2016), Amsterdam, The Netherlands, 11–14 October 2016; Leibe, B., Matas, J., Sebe, N., Welling, M., Eds.; Springer International Publishing: Cham, Switzerland, 2016; pp. 21–37. [Google Scholar]
  13. Redmon, J.; Farhadi, A. YOLO9000: Better, Faster, Stronger. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 6517–6525. [Google Scholar]
  14. Liu, C.; Xie, N.; Yang, X.; Chen, R.; Chang, X.; Zhong, R.Y.; Peng, S.; Liu, X. A Domestic Trash Detection Model Based on Improved YOLOX. Sensors 2022, 22, 6974. [Google Scholar] [CrossRef]
  15. Patel, D.; Patel, F.; Patel, S.; Patel, N.; Shah, D.; Patel, V. Garbage Detection using Advanced Object Detection Techniques. In Proceedings of the 2021 International Conference on Artificial Intelligence and Smart Systems (ICAIS), Coimbatore, India, 25–27 March 2021; pp. 526–531. [Google Scholar]
  16. Mao, W.L.; Chen, W.C.; Fathurrahman, H.I.K.; Lin, Y.H. Deep learning networks for real-time regional domestic waste detection. J. Clean. Prod. 2022, 344, 131096. [Google Scholar] [CrossRef]
  17. Li, J.; Chen, J.; Sheng, B.; Li, P.; Yang, P.; Feng, D.D.; Qi, J. Automatic Detection and Classification System of Domestic Waste via Multimodel Cascaded Convolutional Neural Network. IEEE Trans. Ind. Inform. 2022, 18, 163–173. [Google Scholar] [CrossRef]
  18. Liu, S.; Chen, R.; Ye, M.; Luo, J.; Yang, D.; Dai, M. EcoDetect-YOLO: A Lightweight, High-Generalization Methodology for Real-Time Detection of Domestic Waste Exposure in Intricate Environmental Landscapes. Sensors 2024, 24, 4666. [Google Scholar] [CrossRef] [PubMed]
  19. Feng, C.; Zhong, Y.; Gao, Y.; Scott, M.R.; Huang, W. Tood: Task-Aligned One-Stage Object Detection. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, BC, Canada, 11–17 October 2021; pp. 3490–3499. [Google Scholar]
  20. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is All You Need. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017; pp. 5998–6008. [Google Scholar]
  21. Zhang, Y.F.; Ren, W.; Zhang, Z.; Jia, Z.; Wang, L.; Tan, T. Focal and efficient IOU loss for accurate bounding box regression. Neurocomputing 2022, 506, 146–157. [Google Scholar] [CrossRef]
  22. DeRose, J.F.; Wang, J.; Berger, M. Attention Flows: Analyzing and Comparing Attention Mechanisms in Language Models. IEEE Trans. Vis. Comput. Graph. 2021, 27, 1160–1170. [Google Scholar] [CrossRef] [PubMed]
  23. Ma, S.; Zhang, X.; Jia, C.; Zhao, Z.; Wang, S.; Wang, S. Image and video compression with neural networks: A review. IEEE Trans. Circuits Syst. Video Technol. 2019, 30, 1683–1698. Available online: https://ieeexplore.ieee.org/document/8693636/ (accessed on 1 April 2025). [CrossRef]
  24. Lan, R.; Sun, L.; Liu, Z.; Lu, H.; Pang, C.; Luo, X. MADNet: A fast and lightweight network for single-image super resolution. IEEE Trans. Cybern. 2020, 51, 1443–1453. [Google Scholar] [CrossRef]
  25. Liu, C.; Gao, H.; Chen, A. A real-time semantic segmentation algorithm based on improved lightweight network. In Proceedings of the 2020 International Symposium on Autonomous Systems (ISAS), Guangzhou, China, 6–8 December 2020; pp. 249–253. [Google Scholar]
  26. Han, K.; Wang, Y.; Tian, Q.; Guo, J.; Xu, C.; Xu, C. Ghostnet: More features from cheap operations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 1580–1589. [Google Scholar]
  27. Huang, G.; Liu, Z.; Van der Maaten, L.; Weinberger, K.Q. Densely Connected Convolutional Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 2261–2269. [Google Scholar] [CrossRef]
  28. Lee, Y.; Hwang, J.-W.; Lee, S.; Bae, Y.; Park, J. An Energy and GPU Computation Efficient Backbone Network for Real-Time Object Detection. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–17 June 2019; pp. 752–760. [Google Scholar] [CrossRef]
  29. Wang, C.-Y.; Liao, H.-Y.M.; Wu, Y.-H.; Chen, P.-Y.; Hsieh, J.-W.; Yeh, I.-H. CSPNet: A New Backbone That Can Enhance Learning Capability of CNN. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020; pp. 1571–1580. [Google Scholar] [CrossRef]
  30. Ouyang, D.; He, S.; Zhang, G.; Luo, M.; Guo, H.; Zhan, J.; Huang, Z. Efficient Multi-Scale Attention Module with Cross-Spatial Learning. In Proceedings of the ICASSP 2023—2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; pp. 1–5. [Google Scholar] [CrossRef]
  31. Lu, F.; Li, K.; Nie, Y.; Tao, Y.; Yu, Y.; Huang, L.; Wang, X. Object Detection of UAV Images from Orthographic Perspective Based on Improved YOLOv5s. Sustainability 2023, 15, 14564. [Google Scholar] [CrossRef]
  32. Wang, Z.; Sun, W.; Zhu, Q.; Shi, P. Face Mask-Wearing Detection Model Based on Loss Function and Attention Mechanism. Comput. Intell. Neurosci. 2022, 2022, 2452291. [Google Scholar] [CrossRef] [PubMed]
  33. Wang, X.; Girshick, R.; Gupta, A.; He, K. Non-local Neural Networks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 7794–7803. [Google Scholar] [CrossRef]
  34. Si, Y.; Xu, H.; Zhu, X.; Zhang, W.; Dong, Y.; Chen, Y.; Li, H. SCSA: Exploring the Synergistic Effects between Spatial and Channel Attention. Neurocomputing 2025, 634, 129866. [Google Scholar] [CrossRef]
  35. Xia, K.; Lv, Z.; Liu, K.; Zhang, Q.; Zhu, L.; Ding, L. Global Contextual Attention Augmented YOLO with ConvMixer Prediction Heads for PCB Surface Defect Detection. Sci. Rep. 2023, 13, 9805. [Google Scholar] [CrossRef]
  36. Wang, H.; Wang, L.; Chen, H.; Li, X.; Zhang, X.; Zhou, Y. Waste-YOLO: Towards high accuracy real-time abnormal waste detection in waste-to-energy power plant for production safety. Meas. Sci. Technol. 2024, 35, 016001. [Google Scholar] [CrossRef]
  37. Sun, Q.; Zhang, X.; Li, Y.; Wang, J. YOLOv5-OCDS: An Improved Garbage Detection Model Based on YOLOv5. Electronics 2023, 12, 3403. [Google Scholar] [CrossRef]
  38. Xie, X.; Lang, C.; Miao, S.; Cheng, G.; Li, K.; Han, J. Mutual-Assistance Learning for Object Detection. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 15171–15184. [Google Scholar] [CrossRef] [PubMed]
Figure 1. Overall architecture of YOLOv8.
Figure 2. Overall Network Architecture of EcoDetect-YOLOv2.
Figure 3. Structure of GhostConv.
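For readers who prefer code to a block diagram, the sketch below shows a minimal GhostConv along the lines of the GhostNet formulation [26]: a standard convolution produces half of the output channels, and a cheap depthwise convolution derives the remaining "ghost" channels from them. The 5 × 5 cheap-operation kernel, the SiLU activation, and the 50/50 channel split are assumptions for illustration, not the paper's exact configuration.

```python
import torch
import torch.nn as nn


class GhostConv(nn.Module):
    """Minimal GhostConv sketch: half of the outputs come from a regular convolution,
    the other half from a cheap depthwise convolution applied to those features."""

    def __init__(self, c_in, c_out, k=1, s=1):
        super().__init__()
        c_hidden = c_out // 2  # assumes c_out is even
        # primary (intrinsic) feature maps
        self.primary = nn.Sequential(
            nn.Conv2d(c_in, c_hidden, k, s, k // 2, bias=False),
            nn.BatchNorm2d(c_hidden),
            nn.SiLU(),
        )
        # cheap operation: depthwise 5x5 conv generating the "ghost" feature maps
        self.cheap = nn.Sequential(
            nn.Conv2d(c_hidden, c_hidden, 5, 1, 2, groups=c_hidden, bias=False),
            nn.BatchNorm2d(c_hidden),
            nn.SiLU(),
        )

    def forward(self, x):
        y = self.primary(x)
        return torch.cat((y, self.cheap(y)), dim=1)
```

Because the ghost half costs only a depthwise convolution, the module keeps the output width of a standard convolution at a fraction of its parameters and FLOPs, which is why it is attractive for the neck of a lightweight detector.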
Figure 4. Structural Diagrams of GhostResBottleneck and ResGhostCSP.
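The exact internal layout of GhostResBottleneck and ResGhostCSP is defined by Figure 4; the following is only a speculative sketch of how a Ghost-based residual bottleneck and a CSP-style block with one-shot aggregation (OSA-style, cf. [28,29]) could be composed from the GhostConv sketch above. The block count, channel split, and fusion convolution are illustrative assumptions, not the authors' configuration.

```python
import torch
import torch.nn as nn
# Assumes the GhostConv class from the sketch shown under Figure 3.


class GhostResBottleneck(nn.Module):
    """Speculative sketch: two stacked GhostConvs wrapped by a residual connection."""

    def __init__(self, c, shortcut=True):
        super().__init__()
        self.gc1 = GhostConv(c, c, k=3)
        self.gc2 = GhostConv(c, c, k=3)
        self.add = shortcut

    def forward(self, x):
        y = self.gc2(self.gc1(x))
        return x + y if self.add else y


class ResGhostCSP(nn.Module):
    """Speculative CSP block with one-shot aggregation: the split branch and the output
    of every GhostResBottleneck are concatenated once and fused by a 1x1 convolution."""

    def __init__(self, c_in, c_out, n=2):
        super().__init__()
        c_mid = c_out // 2
        self.split = nn.Conv2d(c_in, 2 * c_mid, 1, bias=False)
        self.blocks = nn.ModuleList(GhostResBottleneck(c_mid) for _ in range(n))
        self.fuse = nn.Conv2d((n + 2) * c_mid, c_out, 1, bias=False)

    def forward(self, x):
        a, b = self.split(x).chunk(2, dim=1)
        outs = [a, b]
        for blk in self.blocks:
            b = blk(b)
            outs.append(b)  # one-shot aggregation: collect, concatenate only at the end
        return self.fuse(torch.cat(outs, dim=1))
```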
Figure 5. Structure of the EMA attention mechanism, where G represents the number of sub-features, X Avgpool indicates one-dimensional horizontal average pooling, and Y Avgpool denotes one-dimensional vertical average pooling.
Figure 6. Architecture of the DySample module, illustrating its point-based dynamic upsampling mechanism.
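As a rough illustration of the point-based idea in Figure 6 (not the official DySample implementation), the sketch below predicts a small 2-D offset for every output pixel, adds it to a bilinear-style base sampling grid, and resamples the feature map with grid_sample. The 1 × 1 offset generator and the 0.25 scope factor follow the commonly used linear-projection variant of DySample; grouping and other details are simplified assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SimpleDySample(nn.Module):
    """Minimal point-based dynamic upsampler in the spirit of DySample."""

    def __init__(self, channels, scale=2):
        super().__init__()
        self.scale = scale
        # predict a 2-D sampling offset for each of the scale*scale sub-positions
        self.offset = nn.Conv2d(channels, 2 * scale * scale, kernel_size=1)
        nn.init.zeros_(self.offset.weight)
        nn.init.zeros_(self.offset.bias)

    def forward(self, x):
        b, c, h, w = x.shape
        s = self.scale
        # offsets in input-pixel units, kept small by a 0.25 scope factor
        offset = 0.25 * self.offset(x)           # (b, 2*s*s, h, w)
        offset = F.pixel_shuffle(offset, s)      # (b, 2, s*h, s*w)

        # base sampling positions of each output pixel, in input-pixel coordinates
        ys = (torch.arange(s * h, device=x.device, dtype=x.dtype) + 0.5) / s - 0.5
        xs = (torch.arange(s * w, device=x.device, dtype=x.dtype) + 0.5) / s - 0.5
        grid_y, grid_x = torch.meshgrid(ys, xs, indexing="ij")
        base = torch.stack((grid_x, grid_y), dim=0).unsqueeze(0)  # (1, 2, s*h, s*w)

        coords = base + offset                   # dynamic, content-aware sampling points
        # normalise to [-1, 1] for grid_sample (align_corners=False convention)
        gx = (coords[:, 0] + 0.5) / w * 2 - 1
        gy = (coords[:, 1] + 0.5) / h * 2 - 1
        grid = torch.stack((gx, gy), dim=-1)     # (b, s*h, s*w, 2)
        return F.grid_sample(x, grid, mode="bilinear",
                             padding_mode="border", align_corners=False)
```

With zero-initialized offsets this reduces to plain bilinear upsampling, so such a module can replace nearest-neighbor upsampling in the neck without destabilizing early training.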
Figure 7. Representative examples from (a) widely adopted waste-detection benchmarks and (b) the Intricate Environment Waste Exposure Detection (IEWED) dataset.
Figure 8. Category distribution of the dataset: the dark blue bars represent the number of images per category, while the dark orange bars indicate the instance count for each garbage category.
Figure 9. Confusion matrices under the mAP@0.5 metric. (a) YOLOv8s; (b) EcoDetect-YOLOv2.
Figure 10. Visual comparison of detection results on the dataset. Row A presents detection outcomes of the baseline YOLOv8 model, while Row B shows results from the proposed EcoDetect-YOLOv2. Bounding box colors indicate different types of detection errors: orange denotes missed detections, purple indicates false positives, and all remaining colors represent correctly identified waste items across different categories. Subpanels (a–f) compare the two models' performance on small-object detection tasks; panels (c,g,h) highlight the baseline model's elevated false-positive tendency; panel (i) illustrates detection efficacy under nighttime conditions.
Table 1. Hyperparameter settings for training.
Parameter Name | Parameter Value
Epochs | 300
Batch size | 16
Image size | 640
Initial learning rate | 0.01
Momentum | 0.937
Weight decay | 0.0005
Optimizer | SGD
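Assuming training is run through the Ultralytics framework that YOLOv8 ships with (a reasonable but unconfirmed assumption), the Table 1 settings map onto a call such as the one below; the model and dataset YAML filenames are placeholders, and only the hyperparameter values are taken from Table 1.

```python
from ultralytics import YOLO

# Placeholder architecture config for the modified network (hypothetical filename).
model = YOLO("ecodetect-yolov2.yaml")

model.train(
    data="IEWED.yaml",      # placeholder dataset config (hypothetical filename)
    epochs=300,             # Table 1: epochs
    batch=16,               # Table 1: batch size
    imgsz=640,              # Table 1: image size
    lr0=0.01,               # Table 1: initial learning rate
    momentum=0.937,         # Table 1: momentum
    weight_decay=0.0005,    # Table 1: weight decay
    optimizer="SGD",        # Table 1: optimizer
)
```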
Table 2. Performance Comparison Between YOLOv8s and EcoDetect-YOLOv2 Models.
Model | P | R | mAP50 | mAP50:95 | FLOPs (G) | Parameters (MB)
YOLOv8s | 68.2 | 52.1 | 55.1 | 31.8 | 23.6 | 9.39
Our EcoDetect-YOLOv2 | 69.2 | 56.7 | 59.9 | 34.9 | 26.2 | 7.58
Table 3. Comparison of Attention Mechanism Results.
Attention | P | R | mAP50 | mAP50:95 | FLOPs (G) | Parameters (MB)
None | 68.2 | 52.1 | 55.1 | 31.8 | 23.6 | 9.39
SE | 68.9 | 51.1 | 55.6 | 31.3 | 23.6 | 9.39
CA | 68.2 | 53.0 | 56.7 | 32.6 | 23.6 | 9.39
NonlocalBlockND | 68.2 | 53.6 | 56.5 | 32.4 | 24.2 | 9.54
CBAM | 69.6 | 50.8 | 56.4 | 32.1 | 23.8 | 9.63
SCSA | 69.1 | 51.7 | 56.9 | 32.5 | 23.6 | 9.39
EMA | 68.9 | 53.5 | 57.3 | 33.3 | 23.7 | 9.39
Table 4. Results of ablation experiments.
Num | P2 | EMA | Dysample | RGC | P | R | mAP50 | mAP50:95 | FLOPs (G) | Parameters (MB)
0 | - | - | - | - | 68.2 | 52.1 | 55.1 | 31.8 | 23.6 | 9.39
1 | T | - | - | - | 67.5 | 54.3 | 57.0 | 32.2 | 31.7 | 9.56
2 | - | T | - | - | 68.9 | 53.5 | 57.3 | 33.3 | 23.7 | 9.39
3 | - | - | T | - | 67.8 | 53.1 | 55.9 | 32.5 | 23.6 | 9.41
4 | - | - | - | T | 68.6 | 51.8 | 55.3 | 31.7 | 19.8 | 7.49
5 | T | T | - | - | 70.2 | 55.6 | 58.9 | 34.3 | 31.8 | 9.57
6 | T | - | T | - | 67.6 | 55.0 | 57.6 | 32.9 | 31.8 | 9.59
7 | T | - | - | T | 67.7 | 54.1 | 57.1 | 32.4 | 26.1 | 7.58
8 | - | T | T | - | 68.4 | 53.7 | 57.8 | 33.5 | 23.7 | 9.41
9 | - | T | - | T | 69.1 | 53.0 | 57.4 | 33.2 | 19.8 | 7.50
10 | - | - | T | T | 68.0 | 52.7 | 56.2 | 32.5 | 23.6 | 9.41
11 | T | T | T | - | 69.6 | 56.6 | 59.6 | 34.9 | 31.8 | 9.59
12 | T | T | - | T | 70.4 | 55.4 | 59.0 | 34.2 | 26.2 | 7.58
13 | T | - | T | T | 68.1 | 54.7 | 57.8 | 33.0 | 26.2 | 7.58
14 | - | T | T | T | 68.7 | 53.6 | 57.9 | 33.3 | 19.8 | 7.50
15 | T | T | T | T | 69.2 | 56.7 | 59.9 | 34.8 | 26.2 | 7.58
Table 5. Performance comparison of different models.
Model | P | R | mAP50 | mAP50:95 | FLOPs (G) | Parameters (MB) | FPS
SSD | 65.1 | 47.7 | 52.5 | 29.8 | 268.5 | 23.8 | 112.44
DETR | - | - | 38.5 | 11.9 | 86.0 | 41.02 | 99.61
RT-DETR-l | 60.7 | 44.9 | 49.0 | 28.0 | 103.9 | 30.54 | 161.12
RT-DETR-x | 65.1 | 42.4 | 46.0 | 25.0 | 223.2 | 62.49 | 134.31
YOLOv3 | 68.9 | 52.1 | 57.3 | 32.1 | 154.7 | 58.7 | 196.77
YOLOv3-tiny | 53.3 | 39.9 | 45.8 | 27.7 | 12.9 | 8.3 | 967.48
YOLOv5s | 68.6 | 50.1 | 52.4 | 30.1 | 15.8 | 6.7 | 638.73
YOLOv5m | 69.1 | 52.6 | 55.2 | 32.6 | 48.0 | 19.9 | 411.48
YOLOv5l | 69.4 | 53.5 | 56.5 | 33.0 | 107.8 | 44.0 | 308.59
YOLOv8n | 55.4 | 44.9 | 48.6 | 28.9 | 6.9 | 2.56 | 758.13
YOLOv8s | 68.2 | 52.1 | 55.1 | 31.8 | 23.6 | 9.39 | 601.56
YOLOv8m | 69.1 | 52.6 | 56.3 | 32.7 | 67.8 | 22.15 | 354.87
YOLOv8l | 67.4 | 55.5 | 57.0 | 33.1 | 145.8 | 37.64 | 294.11
YOLO11s | 68.2 | 52.1 | 55.1 | 31.8 | 21.5 | 8.99 | 627.19
YOLO11m | 69.9 | 53.4 | 56.9 | 33.2 | 68.2 | 19.13 | 407.62
YOLO11l | 69.0 | 53.5 | 57.0 | 33.5 | 87.3 | 24.14 | 306.85
GCC-YOLO [35] | 70.1 | 54.7 | 57.3 | 32.3 | 19.8 | 7.2 | 623.47
Waste-YOLO [36] | 67.1 | 50.5 | 53.7 | 30.3 | 16.7 | 6.71 | 606.47
YOLOv5s-OCDS [37] | 67.5 | 51.1 | 56.2 | 30.9 | 54.2 | 13.81 | 467.33
EcoDetect-YOLO [18] | 73.5 | 52.9 | 58.3 | 31.2 | 20.0 | 7.0 | 589.12
Our EcoDetect-YOLOv2 | 69.2 | 56.7 | 59.9 | 34.8 | 26.2 | 7.58 | 553.97
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
