1. Introduction
With the acceleration of the global urbanization process, the management of urban solid waste has become an increasingly serious issue [
1,
2,
3]. Transforming the solid waste management system from a production-centric framework into an environmentally focused one has become a necessity [
4]. A good waste management system is crucial, not only for the cleanliness of the urban environment and the health of its residents, but also for sustainable development and the recycling of resources. In this process, waste detection is undoubtedly a key step. It can significantly improve the efficiency and quality of waste processing, reduce dependence on landfills, promote the recycling of resources, and decrease environmental pollution. However, traditional methods of waste sorting often rely on manual sorting, which is not only inefficient and costly, but also poses health risks [
5].
The generation of solid waste inevitably results from human activities, impacting both public health and the environment. Consequently, the field of waste management is garnering a heightened focus on intelligent and sustainable practices, particularly across both developed and developing nations. The inappropriate disposal of waste in non-designated areas presents significant challenges [
6], prompting the use of various techniques to identify and sort waste [
7,
8]. However, the prevalent methods for waste classification largely rely on human expertise, which can be both challenging and laborious for precise waste categorization.
In waste detection, challenges arise from the wide variety of waste types and from differences in shape, size, color, and condition. These factors significantly increase the complexity of automated classification. Different types of waste do possess distinguishing characteristics; for instance, plastics typically exhibit a characteristic gloss and color, while metals often have a reflective luster and specific shapes. Early research on waste detection therefore often relied on color- and shape-based processing methods. However, these approaches are sensitive to changes in lighting conditions and to the fading of waste surfaces, which significantly degrades their performance.
The utilization of machine learning algorithms, such as artificial neural networks (ANNs) [
9], has been extensively explored in the domain of waste identification. Several research efforts have been documented, each proposing various methodologies to enhance the efficiency and accuracy of garbage classification systems. For example, Yuan et al. [
10] introduced MAPMobileNet-18, a streamlined residual network aimed at refining the process of waste identification, alongside assessing its accuracy, speed, and compatibility with edge devices. Ying et al. [
11] introduced modifications to the YOLOv2 model by adjusting its parameters and implementing optimization and acceleration techniques. These modifications aimed to optimize the balance between the model’s real-time application capabilities and the accuracy of bounding box clustering, impacting its generalization in new or complex environments. Meanwhile, Ying et al. [
12] developed a system for autonomous garbage detection, leveraging the open-source Faster R-CNN framework, opting for the ResNet network over VGG for the foundational convolutional layers. However, the ResNet model is relatively more complex, which may lead to an increased computational burden. Conversely, Fu et al. [
13] presented an approach utilizing the MobileNetV3 framework; however, while its extensive network architecture contributed to decreased computational speed, it exhibited poor applicability on edge devices. Chen’s [
14] work involved the integration of a MobileNet-v2 as a backbone for model distillation in waste detection, achieving a reduction in parameter count and an uptick in accuracy, albeit without addressing model generalizability. Feng’s [
15] contribution involved a 23-layer CNN to elevate accuracy in waste detection, yet this complexity inadvertently affected the model’s real-time processing capabilities. Kang’s [
16] research leveraged ResNet-34 for amalgamating multiple features from waste imagery, incorporating a novel activation function to enhance the detection of small-sized waste, although it encountered challenges in maintaining real-time detection efficiency due to computational demands. Gupta et al. [
17] conducted a comparative analysis of the efficacy of various pre-trained neural networks for the task of garbage classification, employing supplementary hardware devices such as PiCam, Raspberry Pi, and infrared sensors. However, the goal of real-time garbage sorting remained unattained. Shi et al. [
18] modified the Xception network to mitigate backpropagation issues, securing commendable classification performance at a high computational cost. Despite these innovations, two common limitations remain. First, because garbage appears in highly diverse scenarios, the detection accuracy of the above models degrades in new or complex environments. Second, the high computational demands of these networks are often overlooked, which makes it difficult to embed the garbage detection models into edge devices.
Real-time garbage detection presents a significant challenge. To achieve higher precision in garbage detection, it is crucial to continuously refine deep learning algorithms with the goal of achieving accurate multi-scale object detection without sacrificing speed. This will ensure that the network can adapt to variations in scale. In the contemporary era, marked by the swift advancement of deep learning technologies, the domain of object detection has witnessed significant progress. It has been widely applied in various complex scenarios [
19]. The methodologies employed in object detection can broadly fall into two distinct categories. The first category encompasses two-stage algorithms, notably Fast Region-Based Convolutional Neural Networks [
20], Region-Based Fully Convolutional Networks [
21], and Mask Region-Based Convolutional Neural Networks [
22]. These algorithms primarily focus on proposing candidate regions before performing the classification and bounding box regression assignments. 
However, two-stage algorithms have some flaws, the first of which is that they are usually slow [
20]. This is because these algorithms initially require the generation of region proposals [
23], followed by classification and bounding box regression for each proposal. This two-stage processing makes the algorithm limited in real-time or fast detection scenarios. Additionally, there is a significant computational resource consumption [
22]. Generating high-quality region proposals typically requires complex algorithms, and processing each proposal also necessitates the repeated execution of the same convolution operations, further increasing the computational burden. 
On the other hand, the second category involves one-stage detection algorithms, which streamline the process by simultaneously conducting classification and bounding box regression in a single step. This approach, exemplified by the You Only Look Once (YOLO) [
24,
25,
26,
27,
28,
29,
30] series, Single Shot MultiBox Detector [
31], and RetinaNet [
32], offers the advantage of increased processing speed. Single-stage object detection algorithms offer several significant advantages over two-stage algorithms, especially in terms of speed and simplified processes [
31]. These advantages make single-stage algorithms particularly popular in scenarios requiring real-time processing and when computational resources are limited [
24]. 
However, single-stage algorithms also have their limitations, such as potentially lower detection accuracy compared to that of two-stage algorithms, in some cases. The gap in accuracy mainly stems from the tendency of single-stage algorithms to produce more false positives [
32], as they predict multiple categories and bounding boxes at each location simultaneously. By improving the YOLOv8 algorithm, it is possible to refine the classification of solid waste and reduce the amount of garbage sent to landfills. This can increase the resource recycling rate, thereby reducing environmental pollution and resource waste and helping to protect the ecological environment.
This paper improves the YOLOv8s object detection algorithm and tests it on the “Huawei Cloud” datasets, demonstrating that the proposed algorithm enhances detection efficiency. This document’s primary contributions include the following:
The remainder of this paper is organized as follows: Section 2 reviews the literature relevant to our study. Section 3 details the enhancements made in our methodology. Section 4 presents a series of empirical studies that validate the efficacy of the refined model. Section 5 concludes the paper with a summary.
  3. Methods
YOLOv8 was once regarded as the pinnacle of contemporary object detection models, but it is worth noting that the YOLO series is also evolving continuously, leading to significant advances in this field. This provides direction for our future in-depth research. However, the focus of this study remains on the discussion of YOLOv8. The original YOLOv8 model’s performance is limited in facing the unique challenges of waste classification detection tasks. This limitation is particularly evident when dealing with garbage images captured from various angles. These images often have rich and varied backgrounds and contain a large number of objects of different sizes, making object detection more difficult.
To address this issue, this study innovatively modified the architecture of YOLOv8 to adapt to the task of garbage detection. This study adopted CG-HGNetV2, an improved network structure based on the HGNetV2 [
33] network, as the backbone network of YOLOv8. The new structure has been selected to replace the standard network utilized in the original model. This updated framework is capable of leveraging local features, surrounding context, and global context to enhance the accuracy of semantic segmentation. It efficiently extracts features through a hierarchical approach, leading to a significant reduction in the computational cost of the model. Furthermore, this study has integrated an attention module known as MSE-AKConv, which plays a key role in directing the network’s focus towards the essential components of the target. Through such improvements, the network can more accurately lock onto and locate the positions of large waste objects. Furthermore, this study introduces a new method to replace CIoU [
35]. On the basis of calculating IoU, it also considers the outer boundary of the two rectangles. By calculating the minimum distance between the boundaries, this method handles cases where boundaries are close but not overlapping. It offers a more detailed similarity assessment than does traditional IoU.
This substitution means that our model can learn and adapt faster, and it has also significantly improved the precision of bounding box regression. These strategic adjustments and optimizations have enhanced the performance of the improved YOLOv8 model in the field of garbage detection, offering a more robust and efficient solution for this domain. The improved model is better, not only in terms of object detection accuracy, but also relative to model training efficiency. Through these structural adjustments and enhancements, this study advances the implementation of the YOLOv8 model in the field of waste classification. The three-dimensional structural diagram of this study is shown in 
Figure 2.
  3.1. Lightweight Backbone Network CG-HGNetV2
Accurate and efficient detection is key to garbage detection. However, the YOLOv8 model’s feature extraction relies mainly on 3 × 3 convolution operations. This increases the number of model parameters and the computational cost, making the model unsuitable for rapid detection. Therefore, this paper adopts the CG-HGNetV2 network structure for optimization. This design enables effective operation, even in resource-constrained environments, while maintaining high accuracy and real-time performance. The structural diagram of this architecture is shown in 
Figure 3.
Figure 3 shows the overall architecture of CG-HGNetV2. From the data in the figure, it can be determined that CG-HGNetV2 adopts a multi-stage design, mainly including: (a) the initial HG StemBlock, (b) multiple CG-HG stages, and (c) a context guide block (specifically integrated in HG Stage 1). In the StemBlock, the main function is preliminary feature extraction, which consists of four parts: convolutional layers, batch normalization (BN), activation functions, and pooling layers. The main role of StemBlock is to reduce the input resolution, decrease the subsequent computational load, increase the number of channels, enrich feature representation, and preliminarily extract low-level features (such as edges and textures).
 The function of the CG-HG stage is mainly deep feature extraction and processing. Regarding the connections between the CG-HG stages, the output of each CG-HG stage serves as the input for the next stage, progressively extracting higher-level features. Additionally, the outputs of different HG stages may be used as multi-scale features, which is beneficial for detecting targets of different sizes. Integrating the context guide block in HG stage 1 can enhance the model’s understanding of the global context at an earlier phase, empowering subsequent stages to better handle scale variations and spatial relationships in object detection.
The CG-HGNetV2 network algorithm combines multiple advanced network design concepts, including Ghost convolution, CSP structure, and context guide modules. The detailed algorithm description of the CG-HGNetV2 network is as follows:
- (1) The input image first undergoes preliminary feature extraction and downsampling through the StemBlock.
- (2) The feature map sequentially passes through four HG stages, with each stage containing multiple HG blocks.
- (3) After the first HG stage, the context guide block is applied to enhance context awareness.
- (4) Each HG block uses Ghost convolution and CSP structure to improve efficiency and performance.
- (5) The final feature map generates the final detection results through the detection head.
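The data flow in steps (1)–(5) can be illustrated with a minimal shape-tracing sketch. The strides and channel widths below are hypothetical placeholders chosen only for illustration; they are not the configuration used in this study, and the function name `trace_shapes` is likewise an assumption.

```python
# Hypothetical stage configuration: (name, spatial stride, output channels).
# These values are illustrative assumptions, not the paper's settings.
STAGES = [
    ("StemBlock", 4, 32),            # step (1): early downsampling
    ("HG stage 1", 1, 64),           # step (2): first HG stage
    ("Context guide block", 1, 64),  # step (3): applied after HG stage 1
    ("HG stage 2", 2, 128),
    ("HG stage 3", 2, 256),
    ("HG stage 4", 2, 512),
]


def trace_shapes(channels, height, width, stages=STAGES):
    """Return the (name, C, H, W) feature-map shape after every stage."""
    shapes = []
    for name, stride, out_channels in stages:
        # Each stage divides the spatial size by its stride
        # and sets the channel count to its output width.
        height, width = height // stride, width // stride
        channels = out_channels
        shapes.append((name, channels, height, width))
    return shapes


for name, c, h, w in trace_shapes(3, 640, 640):
    print(f"{name:<20} -> ({c}, {h}, {w})")
```

Under these assumed settings, the outputs of the later stages have progressively lower resolution and more channels, which is what allows them to serve as multi-scale features for the detection head in step (5).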
Through this carefully designed architecture, CG-HGNetV2 can effectively extract rich features from the input images while considering computational efficiency and the importance of context information. The StemBlock lays the foundation for feature extraction, multiple HG stages progressively extract deep features, and the introduction of the context guide block enhances the model’s perception of global information. This combination enables CG-HGNetV2 to achieve excellent performance regarding object detection tasks.
The context guide (CG) block was initially proposed in CGNet [39], with the fundamental principle being to mimic the human visual system’s reliance on contextual information for understanding scenes. The CG block captures local features, surrounding context, and global context information. The CG-HGNetV2 designed in this study therefore aims to fully leverage all three, establishing connections between local and global contexts and thereby enhancing the accuracy and stability of the model. This design also enhances the model’s generalization ability, improving its performance in more complex situations. The structure of the context guided block module is shown in Figure 4.
The primary concept of the architecture in this study is to employ a hierarchical approach for feature extraction. This enables the learning of complex patterns at different scales and levels of abstraction, thereby enhancing the network’s capacity to process intricate image data. This layered and efficient processing is particularly advantageous for demanding tasks such as image classification. Precise prediction is crucial in recognizing complex patterns and features at different scales. The HG-block plays a key role as well, being a core component of the network, designed to process data in a hierarchical manner. Each HG-block may handle different levels of data abstraction, allowing the network to learn from both low-level and high-level features. The structure diagram of the HG-block is shown in 
Figure 5.
Another major feature of CG-HGNetV2, the architecture in this study, is the adoption of lightweight convolution. LightConvBNAct employs a lightweight two-step convolutional structure. First, a 1 × 1 convolution without an activation function reduces or expands the number of feature channels. This step lowers the parameter count while preserving the spatial dimensions of the feature map. Subsequently, a group convolution extracts spatial features. Each output channel is computed from a single corresponding input channel through its own convolution kernel, achieving the effect of depthwise convolution. Using group convolution significantly reduces the computational load and the number of model parameters.
Assuming that the convolution kernel is square, $K$ denotes both its width and its height. $C_{in}$ is the number of input feature map channels, and $C_{out}$ is the number of output feature map channels. $H_{out}$ and $W_{out}$ are the height and width of the output feature map, respectively. $H_{in}$ and $W_{in}$ are the height and width of the input feature map, respectively. The computational complexity of standard convolution is shown as follows:

$$\mathrm{FLOPs}_{std} = K \times K \times C_{in} \times C_{out} \times H_{out} \times W_{out}$$

In the discrete case, two-dimensional convolution can be expressed as:

$$O(i, j) = \sum_{p=0}^{P-1} \sum_{q=0}^{Q-1} I(i + p,\, j + q)\, K(p, q)$$

The input matrix is $I$, and the kernel matrix is $K$. The kernel matrix $K$ has a shape of $P \times Q$, where $P$ and $Q$ are the width and height of the kernel, respectively. To obtain the element at position $(i, j)$ in the output matrix $O$, the kernel matrix $K$ is slid over the input matrix $I$. At each step, the corresponding elements of $I$ and $K$ are multiplied, and then these products are summed together.

For a convolution layer, the output can be expressed as:

$$y^{l} = f\left(W^{l} * x^{l} + b^{l}\right)$$

Here, $y^{l}$ is the output of layer $l$, $x^{l}$ is the input, $W^{l}$ is the convolution kernel, $b^{l}$ is the bias, and $f$ is the activation function.

The computational expense of the lightweight convolution is divided into two parts, with the computation amount for the 1 × 1 convolution as shown in Equation (5):

$$\mathrm{FLOPs}_{1 \times 1} = C_{in} \times C_{out} \times H_{out} \times W_{out} \tag{5}$$

The computational cost of the group convolution, in the depthwise case where each group contains a single channel, is as shown in Equation (6):

$$\mathrm{FLOPs}_{group} = K \times K \times C_{out} \times H_{out} \times W_{out} \tag{6}$$
The integration of lightweight convolution with the context guided block (CGB) optimizes the network by reducing parameters through 1 × 1 and group convolutions, leveraging CGB for enhanced feature expressiveness. The 1 × 1 convolution efficiently merges and reduces feature channels, thereby decreasing computational complexity. Additionally, group convolution further reduces computational costs by independently processing the feature map groups. This combined approach not only lowers computational expenses, but also reduces model parameters by fusing local and global features via CGB. Consequently, this design maintains network expressiveness, enabling adequate performance in resource-constrained environments.
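To make the saving concrete, the following minimal sketch compares the multiply-accumulate count of a standard convolution with that of the two-step lightweight convolution. It assumes that the group convolution is depthwise (one kernel per channel, applied to the output of the 1 × 1 convolution) and ignores batch normalization and activation costs; the function names and the example layer size are illustrative assumptions.

```python
def standard_conv_cost(k, c_in, c_out, h_out, w_out):
    """Multiply-accumulates of a standard k x k convolution."""
    return k * k * c_in * c_out * h_out * w_out


def pointwise_conv_cost(c_in, c_out, h_out, w_out):
    """Multiply-accumulates of a 1 x 1 convolution."""
    return c_in * c_out * h_out * w_out


def depthwise_conv_cost(k, channels, h_out, w_out):
    """Multiply-accumulates of a k x k depthwise convolution
    (assumed form of the group convolution: one kernel per channel)."""
    return k * k * channels * h_out * w_out


def light_conv_cost(k, c_in, c_out, h_out, w_out):
    """Cost of the two-step lightweight convolution: 1 x 1, then depthwise."""
    return (pointwise_conv_cost(c_in, c_out, h_out, w_out)
            + depthwise_conv_cost(k, c_out, h_out, w_out))


# Hypothetical layer: 3 x 3 kernel, 64 -> 128 channels, 80 x 80 output map.
std = standard_conv_cost(3, 64, 128, 80, 80)
light = light_conv_cost(3, 64, 128, 80, 80)
print(std, light, round(light / std, 4))
```

For this assumed layer, the ratio of the two costs equals $1/K^2 + 1/C_{in}$, so the lightweight variant needs roughly one eighth of the operations of the standard 3 × 3 convolution.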
In view of the limitations of YOLOv8s in the detection of small objects, we mainly solve this problem using the following techniques:
- (1) Adding high-resolution feature maps, i.e., adding more upsampling layers to the network structure to generate higher-resolution feature maps. This provides additional fine-grained spatial information, which is conducive to the detection of small objects.
- (2) Introducing attention mechanisms, i.e., spatial and channel attention mechanisms, such as an SE (squeeze-and-excitation) module. This helps the model better focus on the features of small objects.
- (3) Employing data augmentation, i.e., using techniques such as random cropping and magnification, as well as advanced methods such as Mosaic and MixUp, to augment small objects.
- (4) Improving the loss function, i.e., modifying the loss function so that the penalty for small object detection errors is increased.
- (5) Introducing auxiliary tasks, i.e., adding auxiliary tasks such as edge detection or semantic segmentation. This helps the model learn more detailed features, which is conducive to small object detection.
- (6) Incorporating cascade detection, i.e., a two-stage detection strategy in which the second stage focuses on the fine detection of small objects.
- (7) Optimizing post-processing, i.e., improving the non-maximum suppression (NMS) algorithm by using Soft-NMS or DIoU-NMS. This reduces the chance of small objects being deleted by mistake.
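As an illustration of item (7), the following is a minimal sketch of Gaussian Soft-NMS in plain Python. Instead of deleting boxes that overlap a higher-scoring box, it decays their scores according to the overlap. The default values of `sigma` and `score_threshold` are assumptions for illustration, and this is not necessarily the variant used in this study.

```python
import math


def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def soft_nms(boxes, scores, sigma=0.5, score_threshold=0.001):
    """Gaussian Soft-NMS.

    Returns a list of (box index, rescored confidence) in selection order.
    """
    candidates = [(score, idx) for idx, score in enumerate(scores)]
    kept = []
    while candidates:
        best = max(candidates)          # highest remaining score
        candidates.remove(best)
        best_score, best_idx = best
        kept.append((best_idx, best_score))
        rescored = []
        for score, idx in candidates:
            overlap = iou(boxes[best_idx], boxes[idx])
            # Decay the score smoothly instead of discarding the box outright.
            score *= math.exp(-(overlap ** 2) / sigma)
            if score >= score_threshold:
                rescored.append((score, idx))
        candidates = rescored
    return kept
```

With this scheme, a small object that partially overlaps a larger detection keeps a reduced score rather than being removed, so it can still survive a later confidence threshold.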
  3.2. Effective Attention Mechanism
Addressing the garbage detection challenge in the “Huawei Cloud” datasets involves overcoming significant obstacles due to the heterogeneity of waste object sizes and intricate distribution patterns. Conventional convolution operations are inherently flawed in two respects. Firstly, these operations are constrained to local receptive fields, failing to assimilate information from distant areas, with a rigid sampling architecture. Secondly, the invariable sampling configurations and square kernel shapes exhibit limited adaptability to dynamic target variations. To surmount these limitations, our approach incorporates an adaptive kernel convolution mechanism, dubbed AKConv [
34], into the architectural framework. AKConv utilizes a novel algorithm to determine the initial coordinates of the convolution kernels, irrespective of their dimensions. It introduces variable offsets to dynamically modify the sampling contours in response to target alterations. This innovation markedly diminishes both the computational demands and the storage prerequisites of the model, concurrently elevating the precision of garbage entity detection.
This study adopts the Mish [40] activation function as a substitute for the SiLU activation function within AKConv. Additionally, an SEnet [41] module is added after the convolution (conv) sequence operations, so that it recalibrates the channel importance of the feature maps output by the convolution layer before the final result is produced. This ensures that the module can fully utilize the advanced features provided by the convolution layer. The new module is named MSE-AKConv. Figure 6 illustrates the workflow diagram of MSE-AKConv.
The input image exhibits the following dimensions (C, H, W), where C represents the number of channels, and H and W represent the vertical and horizontal extents, respectively. AKConv uniquely provides the initial sampling shape for the convolution kernel. Following the Conv2d operation on the input image, the sampling shape is adjusted using learned displacements. The resulting feature map undergoes resampling, reshaping, re-convolution, and normalization before being output through the Mish activation mechanism and undergoing further processing by the SEnet module.
Next, this research provides a detailed explanation of the derivation process of the attention mechanism. First, the function $g(X)$ is defined (Equation (7)), where $W_1$ and $W_2$ are weight matrices, and $b_1$ and $b_2$ are bias terms.

$L$ represents the loss function. $\partial L / \partial b_n$ indicates the influence of changes in the bias elements on the loss, and $\partial L / \partial W_n$ denotes the partial derivative of the loss function $L$ with respect to the weight matrix $W_n$ at the n-th layer, i.e., the influence of changes in its elements on the overall loss. The above steps are repeated until the model converges. Through this process, $W_1$, $W_2$, $b_1$, and $b_2$ gradually adjust to their optimal values, enabling the attention mechanism to focus effectively on the important features.
The softmax is then calculated, as shown in Equation (14).

$$A = \mathrm{softmax}\left(g(X)\right) \tag{14}$$

Finally, the attention weights are applied to the original features, as shown in Equation (15).

$$X' = A \odot X \tag{15}$$

$X$ represents the input features, typically a multi-dimensional tensor with the shape (batch_size, channels, height, width). $g(X)$ denotes a function, usually a small neural network, used to calculate the attention scores. $A$ represents the computed attention weights. $X'$ denotes the weighted output features. $\odot$ represents the element-wise multiplication (Hadamard product).
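The last two steps, softmax normalization of the attention scores followed by element-wise weighting, can be sketched as follows. For brevity, the sketch operates on a flat list of feature values and takes the scores produced by $g(X)$ as a given input; the function names are illustrative assumptions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a flat list of attention scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def apply_attention(features, scores):
    """Weight each feature by its softmax attention weight (Hadamard product).

    Returns (weighted features, attention weights).
    """
    weights = softmax(scores)
    weighted = [w * x for w, x in zip(weights, features)]
    return weighted, weights


weighted, weights = apply_attention([2.0, 4.0, 6.0], [0.1, 0.1, 2.0])
print(weights, weighted)
```

Because the weights sum to one, positions with higher scores are emphasized at the expense of the others, which is how the mechanism directs the network toward the more informative features.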
Compared to standard convolutions, MSE-AKConv offers more options, with the number of convolution parameters increasing linearly with kernel size, which reduces model parameters and computational costs. In traditional convolution operations, by contrast, the number of parameters grows quadratically with the kernel size. The kernel typically has dimensions of K × K, where K represents both height and width. The total number of parameters within a layer can be determined using Equation (16), where $C_{in}$ and $C_{out}$ are the numbers of input and output channels.

$$\mathrm{Params} = K \times K \times C_{in} \times C_{out} \tag{16}$$

If considering the bias term, with one bias per output channel, then the total number of parameters should be increased by the number of output channels $C_{out}$, as shown in Equation (17):

$$\mathrm{Params} = K \times K \times C_{in} \times C_{out} + C_{out} \tag{17}$$

MSE-AKConv allows for the flexible linear adjustment of convolution kernel parameters, meeting specific needs and effectively managing model complexity and computational demands. The parameter count (N) can be linearly adjusted based on various factors, such as task complexity or computational efficiency optimization. The calculation for the parameter count is shown in Equation (18).
This study introduces the Mish activation function into the AKConv framework. This modification is driven by the superior gradient behavior and smoothness of the Mish function. It significantly enhances the capability of the model to identify waste targets. Such an enhancement mitigates the risks associated with gradient disappearance or excessive accumulation, thereby elevating the precision and robustness of the model. Furthermore, the adoption of the Mish activation function bolsters the generalization capacity of the model, rendering it more efficient in tackling complex environments.
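For reference, the standard definition of Mish is $x \cdot \tanh(\ln(1 + e^{x}))$. The sketch below evaluates it next to SiLU, the activation it replaces, using only the Python standard library; the numerically stable formulations are an implementation choice made for this illustration.

```python
import math


def softplus(x):
    """Numerically stable ln(1 + e^x)."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def mish(x):
    """Mish activation: x * tanh(softplus(x))."""
    return x * math.tanh(softplus(x))


def silu(x):
    """SiLU activation: x * sigmoid(x), with sigmoid written via tanh
    to avoid overflow for large negative inputs."""
    return x * 0.5 * (1.0 + math.tanh(0.5 * x))


for value in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(f"x={value:+.1f}  mish={mish(value):+.4f}  silu={silu(value):+.4f}")
```

Both functions are smooth and allow small negative outputs, but they differ slightly in shape near the origin; Mish approaches the identity for large positive inputs and zero for large negative inputs.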
After the convolution sequence operations, an SEnet module is added. In this way, the SE module can further enhance the model’s ability to capture the dynamic relationship between channels. This is based on the advanced feature representation of AKConv, thus optimizing the model performance. This integrated method utilizes spatial attention to dynamically adjust the position of the convolution kernels. Additionally, it strengthens the model’s ability to process channel dimension information through the SE module. The structure of the SEnet module is illustrated in 
Figure 7.
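The channel recalibration performed by a squeeze-and-excitation module can be sketched in plain Python as follows. The sketch follows the standard SE formulation (global average pooling, a two-layer bottleneck with ReLU and sigmoid, then channel-wise scaling); the nested-list tensor layout and the weight arguments are illustrative assumptions rather than the implementation used in this study.

```python
import math


def squeeze(feature_map):
    """Global average pooling: one scalar per channel of a [C][H][W] map."""
    return [sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
            for channel in feature_map]


def dense(vector, weights, bias):
    """Fully connected layer: weights has one row per output unit."""
    return [sum(w * v for w, v in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def excite(pooled, w1, b1, w2, b2):
    """Bottleneck MLP: ReLU hidden layer, sigmoid gate per channel."""
    hidden = [max(0.0, h) for h in dense(pooled, w1, b1)]
    return [1.0 / (1.0 + math.exp(-s)) for s in dense(hidden, w2, b2)]


def se_block(feature_map, w1, b1, w2, b2):
    """Scale every channel of the feature map by its learned gate."""
    gates = excite(squeeze(feature_map), w1, b1, w2, b2)
    return [[[gate * value for value in row] for row in channel]
            for channel, gate in zip(feature_map, gates)]
```

In a trained network, the weights `w1`, `b1`, `w2`, and `b2` are learned, so the gates become larger for channels that carry useful information and smaller for the rest.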
The feature map formula containing the MSE-AKConv attention mechanism is shown in Equation (20).

$$X_{out} = \mathrm{SE}\left(\mathrm{Mish}\left(\mathrm{Conv}\left(X_{adj}\right)\right)\right) \tag{20}$$

$X$ represents the initial input feature map. $X_{adj}$ represents the adjusted feature map obtained from $X$ through positional offset and bilinear interpolation. $\mathrm{Conv}(X_{adj})$ denotes the feature rearrangement and convolution operation on the adjusted feature map. Mish indicates the application of the Mish activation function to the convolved feature map. SE represents the application of the SEnet module. $X_{out}$ is the final output feature map. The formula for Mish is as shown in Equation (19).

$$\mathrm{Mish}(x) = x \cdot \tanh\left(\ln\left(1 + e^{x}\right)\right) \tag{19}$$
Under this design, the number of parameters can be adjusted linearly, which allows for more flexible control according to actual needs. This ensures model performance while optimizing the model’s computational efficiency and resource usage. In contrast, the number of parameters in traditional convolution operations grows quadratically with the convolution kernel size. This means that as the size of the convolution kernel increases, the number of parameters quickly grows, leading to increased complexity and computational cost of the model. MSE-AKConv optimizes the performance by removing unnecessary parameters, thereby reducing the demand for storage and computational resources. This not only accelerates the model’s training and inference speed, but also decreases energy consumption.
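The contrast between quadratic and linear growth can be illustrated with a short parameter-count sketch. The standard convolution count follows the usual $K \times K \times C_{in} \times C_{out}$ rule with one bias per output channel. The linear count is an assumption made for this illustration: it takes the kernel to consist of $N$ sampled points, each with its own $C_{in} \times C_{out}$ weights, and ignores the parameters of the offset-prediction branch.

```python
def standard_conv_params(k, c_in, c_out, bias=True):
    """Parameters of a standard k x k convolution (quadratic in k)."""
    return k * k * c_in * c_out + (c_out if bias else 0)


def linear_kernel_params(n, c_in, c_out, bias=True):
    """ASSUMED count for a kernel with n sampled points (linear in n).

    The offset-prediction parameters are deliberately left out.
    """
    return n * c_in * c_out + (c_out if bias else 0)


# Hypothetical layer with 64 input and 128 output channels.
for k in (3, 5, 7):
    print(f"{k} x {k} standard kernel:", standard_conv_params(k, 64, 128))
for n in (3, 5, 7, 9):
    print(f"{n} sampled points:", linear_kernel_params(n, 64, 128))
```

Under these assumptions, moving from a 3 × 3 to a 5 × 5 standard kernel nearly triples the parameter count, whereas adding two sampled points increases the linear count by a fixed amount.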
  3.3. The Improved Loss Function MPDIoU
In the foundational YOLOv8 algorithm, the bounding box regression employs the CIoU loss function, which, despite its utility, exhibits several limitations. To begin with, the CIoU loss lacks mechanisms to equitably address the disparities between challenging and simpler samples, a critical aspect for enhancing model robustness. Additionally, it incorporates the aspect ratio as a penalizing component within its formulation. This approach, however, falls short in accurately capturing discrepancies between the predicted and actual bounding boxes that share identical aspect ratios, yet diverge in their width and height dimensions. Moreover, the CIoU calculation intricately involves inverse trigonometric operations, thereby escalating the computational demand on the model’s arithmetic resources. The formula for 
CIoU is shown in Equations (19)–(23):

$$IoU = \frac{|B \cap B^{gt}|}{|B \cup B^{gt}|}$$

$$CIoU = IoU - \frac{\rho^{2}}{c_w^{2} + c_h^{2}} - \alpha v$$

$$v = \frac{4}{\pi^{2}} \left( \arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h} \right)^{2}$$

$$\alpha = \frac{v}{(1 - IoU) + v}$$

$$L_{CIoU} = 1 - CIoU$$

The intersection over union (IoU) is defined as the ratio of the overlapping region to the combined area of the predicted bounding box $B$ and the ground truth box $B^{gt}$. The parameters involved in this formula are further elucidated in Figure 8. Specifically, $\rho$ denotes the Euclidean distance between the centroids of the predicted and ground truth boxes. Moreover, $h$ and $w$ represent the height and width of the predicted box, respectively. On the other hand, $h^{gt}$ and $w^{gt}$ signify the height and width of the ground truth box, respectively. $c_h$ and $c_w$ represent the height and width of the smallest bounding box that encompasses both the predicted and ground truth boxes.
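The following sketch evaluates the standard CIoU formulation for two axis-aligned boxes given as (x1, y1, x2, y2). It is intended only to illustrate the terms described above; the function name and box format are assumptions, and both boxes are assumed to have positive width and height.

```python
import math


def ciou(pred, gt):
    """Standard CIoU between two boxes given as (x1, y1, x2, y2)."""
    w, h = pred[2] - pred[0], pred[3] - pred[1]
    w_gt, h_gt = gt[2] - gt[0], gt[3] - gt[1]

    # Plain IoU.
    inter_w = max(0.0, min(pred[2], gt[2]) - max(pred[0], gt[0]))
    inter_h = max(0.0, min(pred[3], gt[3]) - max(pred[1], gt[1]))
    inter = inter_w * inter_h
    union = w * h + w_gt * h_gt - inter
    iou = inter / union if union > 0 else 0.0

    # Squared distance between the two box centres.
    rho2 = (((pred[0] + pred[2]) - (gt[0] + gt[2])) ** 2
            + ((pred[1] + pred[3]) - (gt[1] + gt[3])) ** 2) / 4.0

    # Squared diagonal of the smallest enclosing box.
    c_w = max(pred[2], gt[2]) - min(pred[0], gt[0])
    c_h = max(pred[3], gt[3]) - min(pred[1], gt[1])
    c2 = c_w ** 2 + c_h ** 2

    # Aspect-ratio consistency term (uses inverse trigonometric functions).
    v = (4.0 / math.pi ** 2) * (math.atan(w_gt / h_gt) - math.atan(w / h)) ** 2
    denom = (1.0 - iou) + v
    alpha = v / denom if denom > 0 else 0.0

    return iou - rho2 / c2 - alpha * v


# Same aspect ratio, different size: the aspect-ratio penalty vanishes.
print(ciou((0, 0, 10, 10), (0, 0, 20, 20)))
```

In the example, the two boxes share an aspect ratio, so $v = 0$ and the size mismatch is penalized only through the IoU and centre-distance terms, which is the limitation noted above.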
Acknowledging the shortcomings of CIoU in the context of waste classification, our investigation proposes the adoption of MPDIoU [
42] as an alternative loss function. The newly applied MPDIoU loss function aims to refine the efficacy and precision of bounding box regression. It addresses both intersecting and non-overlapping bounding box regression challenges. It incorporates considerations for center point distances and discrepancies in dimensions. This is achieved by employing a similarity metric for the bounding boxes grounded on the minimal point distance. The adoption of MPDIoU simplifies the computational framework, thereby accelerating the convergence of the model and enhancing the precision of the regression outcomes. The architecture of the revised loss function is depicted in 
Figure 9.
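Within Figure 9, A and B signify the predicted and ground truth bounding boxes, respectively. The coordinates for the top-left and bottom-right vertices of bounding box A are denoted by $(x_1^{A}, y_1^{A})$ and $(x_2^{A}, y_2^{A})$, while $(x_1^{B}, y_1^{B})$ and $(x_2^{B}, y_2^{B})$ represent the corresponding coordinates for bounding box B. The variables $d_1^{2}$ and $d_2^{2}$ denote the squared distances between the top-left vertices and between the bottom-right vertices of the ground truth and predicted bounding boxes, respectively. They are computed with Equations (24) and (25) as the sums of the squared differences between the x and y coordinates of the corresponding vertices of A and B:

$$d_1^{2} = \left(x_1^{B} - x_1^{A}\right)^{2} + \left(y_1^{B} - y_1^{A}\right)^{2} \tag{24}$$

$$d_2^{2} = \left(x_2^{B} - x_2^{A}\right)^{2} + \left(y_2^{B} - y_2^{A}\right)^{2} \tag{25}$$

Subsequently, MPDIoU and the corresponding loss can be derived from $d_1^{2}$ and $d_2^{2}$, as expressed by Equations (26) and (27), where $w$ and $h$ denote the width and height of the input image:

$$MPDIoU = IoU - \frac{d_1^{2}}{w^{2} + h^{2}} - \frac{d_2^{2}}{w^{2} + h^{2}} \tag{26}$$

$$L_{MPDIoU} = 1 - MPDIoU \tag{27}$$

The calculation of IoU is shown in Equation (28).

$$IoU = \frac{|A \cap B|}{|A \cup B|} \tag{28}$$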
Compared to standard IoU, MPDIoU adds a penalty for differences in box sizes, which helps achieve more precise bounding box regression.
Through the derivation of the parameter values of Equations (24)–(28), the calculation framework of MPDIoU is simplified, the convergence speed of the model is accelerated, and the accuracy of the regression results is improved.
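As an illustration of Equations (24)–(28), the following minimal Python sketch computes the MPDIoU loss for two axis-aligned boxes. It assumes the formulation of [42], in which the corner distances are normalized by the squared width and height of the input image; the function names, the `(x1, y1, x2, y2)` box layout, and the 100 × 100 image size are illustrative assumptions, not the implementation used in the experiments.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def mpdiou(pred, gt, img_w, img_h):
    """MPDIoU: IoU minus the normalized squared corner distances."""
    # Squared distance between the two top-left corners.
    d1_sq = (gt[0] - pred[0]) ** 2 + (gt[1] - pred[1]) ** 2
    # Squared distance between the two bottom-right corners.
    d2_sq = (gt[2] - pred[2]) ** 2 + (gt[3] - pred[3]) ** 2
    # Assumed normalizer: squared diagonal of the input image.
    norm = img_w ** 2 + img_h ** 2
    return iou(pred, gt) - d1_sq / norm - d2_sq / norm


def mpdiou_loss(pred, gt, img_w, img_h):
    """Loss form: 1 - MPDIoU."""
    return 1.0 - mpdiou(pred, gt, img_w, img_h)


# Two predictions that do not overlap the ground truth (IoU = 0 for both).
gt_box = (20, 20, 30, 30)
near_loss = mpdiou_loss((0, 0, 10, 10), gt_box, 100, 100)
far_loss = mpdiou_loss((60, 60, 70, 70), gt_box, 100, 100)
print(near_loss, far_loss)
```

Both predictions have zero overlap with the ground truth, so plain IoU cannot distinguish them, whereas the MPDIoU loss is larger for the prediction whose corners lie farther away. This is the property that allows regression to proceed for non-overlapping boxes.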
For comparison, the calculation of GIoU is shown in Equation (29).
$$GIoU = IoU - \frac{|C \setminus (A \cup B)|}{|C|} \tag{29}$$
In Equation (29), A represents the predicted box, B represents the ground truth box, and C represents the smallest enclosing box covering both A and B.
MPDIoU focuses on the distances between the corresponding corner points of the two boxes, while GIoU considers the smallest enclosing rectangle covering both boxes.
Similarly, the calculation of DIoU is shown in Equation (30).
$$DIoU = IoU - \frac{\rho^2(b, b^{gt})}{c^2} \tag{30}$$
In Equation (30), IoU represents the traditional intersection over union, $\rho(b, b^{gt})$ denotes the Euclidean distance between the center points of the predicted box and the ground truth box, $b$ represents the center point of the predicted box, $b^{gt}$ represents the center point of the ground truth box, and $c$ denotes the diagonal length of the smallest enclosing rectangle that covers both the predicted box and the ground truth box.
MPDIoU employs the minimum point distance between corresponding corners instead of the center point distance, which may make it more sensitive to targets of different shapes.
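To make the comparison concrete, the sketch below evaluates GIoU and DIoU for a pair of non-overlapping boxes, using the standard definitions referred to in Equations (29) and (30). The helper names and the `(x1, y1, x2, y2)` box layout are assumptions made for illustration, and the boxes are assumed to have positive area.

```python
def _inter_union(a, b):
    """Intersection and union areas of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter, union


def _enclosing_size(a, b):
    """Width and height of the smallest box C covering both a and b."""
    return max(a[2], b[2]) - min(a[0], b[0]), max(a[3], b[3]) - min(a[1], b[1])


def giou(a, b):
    """GIoU: IoU minus the fraction of C not covered by the union."""
    inter, union = _inter_union(a, b)
    cw, ch = _enclosing_size(a, b)
    c_area = cw * ch
    return inter / union - (c_area - union) / c_area


def diou(a, b):
    """DIoU: IoU minus squared center distance over squared diagonal of C."""
    inter, union = _inter_union(a, b)
    cw, ch = _enclosing_size(a, b)
    rho_sq = ((a[0] + a[2]) / 2 - (b[0] + b[2]) / 2) ** 2 \
        + ((a[1] + a[3]) / 2 - (b[1] + b[3]) / 2) ** 2
    return inter / union - rho_sq / (cw ** 2 + ch ** 2)


pred_box = (0, 0, 10, 10)
gt_box = (20, 20, 30, 30)
giou_value = giou(pred_box, gt_box)   # 0 - 700/900
diou_value = diou(pred_box, gt_box)   # 0 - 800/1800
print(giou_value, diou_value)
```

For this pair, both measures are negative although IoU is zero: GIoU penalizes the empty area of the enclosing box, while DIoU penalizes the center distance relative to the enclosing diagonal.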
Gradient analysis of MPDIoU is shown in Equation (31):
In Equation (31), $\partial L_{MPDIoU}/\partial P$ represents the partial derivative of the MPDIoU loss with respect to the predicted box P, and $\partial IoU/\partial P$ represents the partial derivative of IoU with respect to the predicted box P. The balancing parameter adjusts the relative importance of the IoU term and the MPD term, where MPD represents the minimum point distance and c represents the distance between the center points of the predicted box and the ground truth box. Finally, $\partial MPD/\partial P$ and $\partial c/\partial P$ represent the partial derivatives of MPD and c with respect to the predicted box P.
This gradient expression shows how MPDIoU simultaneously considers changes in overlap and shape differences. Theoretically, MPDIoU offers several advantages. It is more sensitive to differences in box sizes, enabling fine adjustments. Its computation is relatively simple, potentially offering better optimization efficiency. It combines the benefits of IoU with the considerations of shape.
  4. Experiments
  4.1. Dataset
In this research, the model was trained and assessed using the waste image dataset from the “Huawei Cloud” Garbage Classification Competition. The HUAWEI-40 garbage categorization challenge dataset comprises 14,964 images, annotated with 44 types of labels, including disposable lunch boxes, book paper, power banks, leftover food, bags, trash bins, and so on. All images in the dataset were collected via mobile phones from people’s daily lives. In this experiment, the “Huawei Cloud” dataset was split into 10,474, 2992, and 1498 images for training, testing, and validation, respectively. A sample of images from the dataset is shown in Figure 10.
The samples were divided into training, test, and validation sets as part of the training procedure, following a 7:2:1 ratio, respectively. Figure 11 illustrates the distribution of the training set data across the 44 garbage categories, along with the specifics of the label boxes.
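The 7:2:1 split can be reproduced in a few lines. The sketch below is only an illustration: the function name, the random seed, and the use of integer floor division are assumptions, since the exact splitting procedure is not specified here. With 14,964 images, it yields the subset sizes reported above.

```python
import random


def split_indices(n_images, ratios=(7, 2, 1), seed=0):
    """Shuffle image indices and split them into train/test/val subsets."""
    total = sum(ratios)
    # Floor division keeps sizes integral; the remainder goes to validation.
    n_train = n_images * ratios[0] // total
    n_test = n_images * ratios[1] // total
    indices = list(range(n_images))
    random.Random(seed).shuffle(indices)  # hypothetical fixed seed
    train = indices[:n_train]
    test = indices[n_train:n_train + n_test]
    val = indices[n_train + n_test:]
    return train, test, val


train_idx, test_idx, val_idx = split_indices(14964)
print(len(train_idx), len(test_idx), len(val_idx))  # 10474 2992 1498
```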
  4.2. Experimental Platform and Evaluation Criteria
The experiments were performed on a system running Ubuntu 20.04 with PyTorch 1.11.0, Python 3.8.10, and CUDA 11.3. Model training was carried out on RTX 3090 GPUs. Throughout the experiments, uniform hyperparameters were applied across the training, validation, and testing phases to ensure consistency: 200 training epochs, a batch size of 64, and an image resolution of 640 × 640 pixels. Notably, the model was trained without pre-trained weights.
The experimental outcomes were evaluated using the cross-validation technique. After training and validation on the designated datasets, the model underwent a final performance appraisal on the test dataset. In this evaluation, the network’s performance was gauged using four metrics: precision (P), recall (R), model size, and mean average precision (mAP). Computing these metrics requires the quantities TP (true positive, a correct positive identification), FP (false positive, an erroneous positive identification), and FN (false negative, a positive instance that the model fails to identify). Additionally, the intersection over union (IoU) metric was used to quantify the overlap between the forecasted bounding boxes and the ground truth, expressed as the ratio of their intersection to their union. Precision is the number of correctly identified positive instances divided by the total number of instances flagged as positive by the model, as given in Equation (32).
Precision is obtained from TP and FP, as follows.
$$P = \frac{TP}{TP + FP} \tag{32}$$
Recall quantifies the fraction of actual positive samples that are correctly identified by the model. The computation of recall is presented in Equation (33).
Recall is obtained from TP and FN, as follows.
$$R = \frac{TP}{TP + FN} \tag{33}$$
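The following sketch shows one way TP, FP, and FN can be counted for a single class and then turned into precision and recall as in Equations (32) and (33). The greedy matching rule (highest-confidence predictions first, each ground truth box matched at most once, IoU threshold of 0.5) is a common convention assumed here for illustration; the function names are hypothetical and do not correspond to the evaluation code used in the experiments.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def count_detections(preds, gts, iou_thr=0.5):
    """preds: list of (score, box); gts: list of boxes of the same class."""
    matched = set()
    tp = fp = 0
    # Visit predictions from highest to lowest confidence.
    for _, box in sorted(preds, key=lambda p: p[0], reverse=True):
        candidates = [(iou(box, g), j) for j, g in enumerate(gts) if j not in matched]
        best_iou, best_j = max(candidates, default=(0.0, -1))
        if best_iou >= iou_thr:
            matched.add(best_j)  # each ground truth box is matched once
            tp += 1
        else:
            fp += 1
    fn = len(gts) - len(matched)  # ground truth boxes never detected
    return tp, fp, fn


def precision_recall(tp, fp, fn):
    """Equations (32) and (33), guarding against empty denominators."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


gts = [(0, 0, 10, 10), (20, 20, 30, 30)]
preds = [(0.9, (0, 0, 10, 10)), (0.8, (50, 50, 60, 60))]
counts = count_detections(preds, gts)
print(counts, precision_recall(*counts))  # (1, 1, 1) (0.5, 0.5)
```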
        
Average precision (AP) is the area under the precision–recall curve. Its computation is given in Equation (34), providing a quantitative measure of the network’s performance.
The precision and recall values obtained from Equations (32) and (33) are substituted into Equation (34) to obtain the average precision (AP).
$$AP = \int_0^1 P(R)\,dR \tag{34}$$
The mean average precision (mAP) quantifies the detection efficacy of the model across all categories. It is calculated as the mean of the average precision (AP) values of the individual categories, as given in Equation (35).
The AP values obtained from Equation (34) are averaged to obtain the mean average precision (mAP).
$$mAP = \frac{1}{N}\sum_{i=1}^{N} AP_i \tag{35}$$
Within Equation (35), $AP_i$ denotes the AP value of the category with index $i$, and $N$ signifies the number of sample categories within the training dataset, which is 44 in this study. The notation mAP0.5 describes the mean average precision of the detection model at an intersection over union (IoU) threshold of 0.5. Conversely, mAP0.5:0.95 denotes the mean average precision averaged over IoU thresholds from 0.5 to 0.95, in increments of 0.05.
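A numerical counterpart of Equations (34) and (35) is sketched below. It approximates the area under the precision–recall curve using the monotone precision envelope (all-point interpolation), which is a common convention assumed here rather than a detail stated in this paper; the function names and the toy input values are illustrative.

```python
def average_precision(recalls, precisions):
    """Area under the PR curve; recalls must be sorted in ascending order."""
    r = [0.0] + list(recalls) + [1.0]
    p = [0.0] + list(precisions) + [0.0]
    # Make precision non-increasing from right to left (precision envelope).
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # Sum the rectangles between consecutive recall points.
    return sum((r[i + 1] - r[i]) * p[i + 1] for i in range(len(r) - 1))


def mean_average_precision(ap_per_class):
    """Unweighted mean of the per-class AP values, as in Equation (35)."""
    return sum(ap_per_class) / len(ap_per_class)


# IoU thresholds used for mAP0.5:0.95 (0.5 to 0.95 in steps of 0.05).
iou_thresholds = [round(0.5 + 0.05 * k, 2) for k in range(10)]

ap_example = average_precision([0.5, 1.0], [1.0, 0.5])
map_example = mean_average_precision([ap_example, 0.25])
print(ap_example, map_example)  # 0.75 0.5
```

For mAP0.5:0.95, the same procedure would be repeated at each threshold in `iou_thresholds` and the resulting mAP values averaged.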
To better evaluate how lightweight the model is, we also report GFLOPs and the number of model parameters. GFLOPs measures the computational complexity of a model or algorithm, whereas the parameter count reflects the model’s size.
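To give a sense of how these two quantities arise, the sketch below counts the parameters and floating-point operations of a single convolutional layer. The formulas are the standard ones for 2D convolution; counting one multiply-accumulate as two FLOPs is a convention assumed here, and the layer sizes are arbitrary examples rather than layers of the proposed network.

```python
def conv2d_params(c_in, c_out, k, bias=True, groups=1):
    """Number of learnable parameters of a k x k convolution."""
    weights = (c_in // groups) * k * k * c_out
    return weights + (c_out if bias else 0)


def conv2d_gflops(c_in, c_out, k, h_out, w_out, groups=1):
    """GFLOPs of one forward pass, counting 1 multiply-accumulate as 2 FLOPs."""
    macs = (c_in // groups) * k * k * c_out * h_out * w_out
    return 2 * macs / 1e9


# Standard 3 x 3 convolution versus a channel-wise (depthwise) one, 64 channels.
standard = conv2d_params(64, 64, 3, bias=False)
channel_wise = conv2d_params(64, 64, 3, bias=False, groups=64)
print(standard, channel_wise)  # 36864 576
print(conv2d_gflops(64, 64, 3, 80, 80))
```

The comparison illustrates why channel-wise convolutions are attractive for lightweight designs: with 64 channels, the parameter count of a 3 × 3 layer drops by a factor of 64.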
  4.3. Experimental Result Analysis
  4.3.1. Before and After Improvement
During the experimental phase, the proposed model was assessed on the “Huawei Cloud” datasets and compared with the YOLOv8s model. The results of this comparison are documented in Table 1 and Table 2, which report the performance metrics for the “Huawei Cloud” test and validation datasets, respectively. These metrics are based on the results from 200 epochs of training. The findings indicate that the proposed algorithm outperforms the YOLOv8s model in waste detection.
In the test dataset, the proposed algorithm achieved a 4.80% improvement in precision (P) and a 1.27% improvement in mean average precision at an IoU threshold of 0.5 (mAP@0.5). Additionally, the model size is smaller than that of YOLOv8s. Compared to the original model, there was a reduction of 0.1 GFLOPs and a decrease in the number of parameters by 0.73 million, as shown in 
Table 1.
In the validation dataset, compared to the original model, the proposed algorithm achieved a 4.80% increase in P and a 0.5% increase in mAP@0.5, as shown in 
Table 2. The new model has shown significant performance improvements using the “Huawei Cloud” datasets, indicating its effectiveness in garbage classification detection. Our method has significantly improved performance in terms of model lightness and accuracy.
To evaluate the model’s performance more precisely, we generated the PR curves for the model at an intersection over union (IoU) threshold of 0.5 during the testing phase, both before and after the improvements, as depicted in Figure 12 and Figure 13.
The area under the PR curve (AUC-PR) is a frequently used metric for evaluating model performance: a larger AUC-PR corresponds to superior performance across diverse precision–recall combinations. The enhanced model clearly demonstrates a higher AUC-PR.
  4.3.2. Ablation Experiment
To substantiate the effectiveness of the proposed algorithm, an ablation study was performed on the “Huawei Cloud” datasets. The baseline model was YOLOv8s [36]. The enhancement techniques described herein were incorporated into this model, either singly or in combination, to assess the contribution of each method to object detection performance.
Table 3 presents the results of an ablation study conducted on the “Huawei Cloud” datasets. This experiment involved implementing several enhancements into the foundational YOLOv8s framework. These enhancements included replacing the original backbone with CG-HGNetV2 (as detailed in 
Table 3 (+CG-HG)), integrating the MSE-AKConv attention module (as detailed in 
Table 3 (+MSE-AK)), and implementing MPDIoU (as detailed in 
Table 3 (+MPDIoU)).
 To provide a more intuitive understanding of the significance of each improvement method, we offer the following visual representations, as depicted in 
Figure 14.
Based on the data from 
Table 4 and 
Figure 14, it is evident that integrating the lightweight improvements into the YOLOv8s model reduced the model parameter count by 6.55% and the computational cost by 0.03%, while maintaining good detection performance. The experimental data also show that the CG-HGNetV2 improvement reduced the parameters by 7.63%, with a minimal decrease in accuracy, which is attributed to the hierarchical feature extraction approach utilized by the structure. The CGB structure employs channel-wise convolutions in the local feature extractor and context extractor to reduce inter-channel computational costs and conserve memory, resulting in a substantial reduction in the parameter count and computational cost, while effectively mitigating accuracy loss.
Furthermore, this study introduces the innovative MSE-AKConv design, which reduces traditional convolution operations by adjusting the initial sampling shape through learned displacements, enhancing adaptability to target variations. This operation dynamically adjusts the convolution kernel’s sampling shape during training, effectively improving detection accuracy and robustness, while reducing sensitivity to target variations and enhancing garbage detection capability. These improvements result in a 1.24% increase in mAP compared to that of the original YOLOv8 model [
43].
In 
Table 4, the precision metrics for selected object categories in the “Huawei Cloud” test dataset are detailed. Our examination of the dataset reveals that the proposed algorithm substantially elevates the precision of detecting various targets in previously unencountered scenarios. Specifically, the detection precision for beverage cans, bags, and old clothes improved notably, by 27.20%, 26.40%, and 21.80%, respectively. Although integrating the new modules may slightly reduce the detection precision for certain objects, the cumulative effect on overall detection precision is positive.
To graphically demonstrate the effectiveness of this study, the visualization results of the ablation experiment, conducted across the various models under identical conditions, are shown in Figure 15. From the image comparison, it is clear that the YOLOv8s model exhibits cases of both missed and false detections. Adding or improving modules on the basis of the original YOLOv8s model enhances the image detection performance, improving recall and precision and increasing the confidence of detection for each target.
  4.3.3. Mainstream Model Comparison Experiments
The proposed model was further compared with mainstream detection models on the “Huawei Cloud” datasets. The performance of the proposed model in comparison to that of YOLOv3-tiny [24], YOLOv5s [28], YOLOv6s [44], YOLOv7-tiny [30], and YOLOv8s [36] is documented in Table 5 and Table 6. These tables present the performance metrics for the “Huawei Cloud” test and validation datasets, respectively.
To present the comparison results more intuitively, the specific comparison results are illustrated in 
Figure 16.
Analyzing the data in 
Table 5 and 
Table 6, and from 
Figure 16, it is evident that, among the compared models, YOLOv5s and YOLOv6s exhibit relatively low detection accuracy. They also have larger parameter counts and higher computational costs, requiring more computing resources and time for training and inference. In contrast, YOLOv3-tiny, as a lightweight model, demands less computation, but its detection performance is unsatisfactory, as it sacrifices a certain degree of perceptual capability. The model presented in this study has a lightweight design that enhances its feature fusion and extraction functions. Importantly, the algorithm balances detection speed and accuracy, achieving the highest mean average precision (mAP) while utilizing fewer computational resources.
For a more intuitive comparison of the detection results, the partial dataset detection results are shown in 
Figure 17. The experiments were conducted under the same conditions to compare the enhanced YOLOv8 model with YOLOv3-tiny [
24], YOLOv5s [
28], YOLOv6s [
44], YOLOv7-tiny [
30], and YOLOv8s models.
From the image comparison, it is evident that other mainstream models exhibit instances of missed and false detections. In contrast, our improved model demonstrates enhanced image detection accuracy when compared to that of the YOLOv8s model, with significantly reduced occurrences of missed and false detections. This improvement notably enhances both recall and precision, thereby increasing the certainty of detection for each target.
  5. Conclusions
This study has improved the existing architecture of YOLOv8s, making the revised waste identification model suitable for deployment on edge devices. The results from various experiments indicate that this approach not only achieves higher accuracy, but also incurs lower operational costs. Integrating CG-HGNetV2 as the backbone network significantly reduced the model’s parameters. Additionally, incorporating the MSE-AKConv attention module within the convolutional layers improved the model’s accuracy. To further enhance regression accuracy, we adopted the MPDIoU loss function, which also accelerated the convergence of the network.
The experimental findings reveal improvements over the YOLOv8s model of 4.8% in garbage detection precision, 0.10% in recall, and 1.30% in mAP@0.5. This is achieved while reducing the model parameters by 6.55% and the computational demand (GFLOPs) by 0.03%. These improvements make the model suitable for high-accuracy applications in environments constrained by memory capacity and computational power, such as embedded systems. Additionally, when benchmarked against alternative models, our approach exhibits superior detection capabilities. This provides an important research foundation for urban environmental management and the achievement of sustainable development goals.
This study also has some limitations. While the model performs well in resource-constrained environments, its robustness in handling complex or changing scenarios has not been fully validated. Additionally, the model’s performance in identifying different types of garbage may vary, especially for objects with diverse shapes or high levels of occlusion. Despite the reduction in model parameters and computational demands, further algorithm optimization to improve efficiency, decrease energy consumption, and increase processing speed remains the key direction for future research.