A Novel YOLOv3 Algorithm-Based Deep Learning Approach for Waste Segregation: Towards Smart Waste Management

Abstract: The sharp increase in environmental pollution and degradation, which results in ecological imbalance, is a pressing concern in the contemporary era. Moreover, the proliferation of smart cities across the globe necessitates a robust smart waste management system that segregates waste properly according to its biodegradability. The present work investigates a novel deep learning approach to waste segregation for effective recycling and disposal. The YOLOv3 algorithm has been utilized in the Darknet neural network framework to train on a self-made dataset. The network has been trained for 6 object classes (namely: cardboard, glass, metal, paper, plastic and organic waste). Moreover, for comparative assessment, the detection task has also been performed using YOLOv3-tiny to validate the competence of the YOLOv3 algorithm. The experimental results demonstrate that the proposed YOLOv3 methodology yields satisfactory generalization capability for all the classes with a variety of waste items.


Introduction
The rapid growth in industrialization, urbanization, and global population is a serious concern with respect to environmental degradation. With the global population expanding at an alarming rate, the environment has degraded severely. According to a report published in 2019, India annually generates more than 62 million tons (MT) of solid waste, of which only 43 MT is collected, 11.9 MT is treated and almost 31 MT is dumped in landfill sites [1]. Owing to the existing environmental concerns and improper management of waste, the world faces severe deleterious effects on the economy, public health and, essentially, the environment. This has shifted the focus towards the worldwide development of smart cities to guarantee effective and smart urban waste management. Moreover, the recycling of waste opens the gateway for research and development and introduces the waste-to-wealth business model for sustainable development. However, concerns have been raised about the need to segregate waste based on its biodegradable and non-biodegradable behavior.
Usually, in the Indian context, wastes consist of paper, plastic, rubber, metal, glass, textiles, organics, sanitary products, electricals and electronics, hazardous substances (paint, spray and chemical) and infectious materials (hospital and clinical), which can be broadly classified as biodegradable (BD) and nonbiodegradable (NBD) waste with their respective share of 52% and 48% [2]. Further, according to the recent Indian government reports, the

To achieve this goal, the work presented in this paper makes the following key contributions:
• The main contribution of the present investigation is to demonstrate the effectiveness of machine learning and/or deep learning techniques (particularly, the YOLO family) for waste segregation based on the broad biodegradable properties of garbage. To the best of the authors' knowledge, these techniques have not previously been applied to this task.
• The second contribution of the present investigation is the development of a garbage image dataset consisting of 6437 images distributed among six classes (cardboard, glass, plastic, paper, metal, and organic waste) that are commonly found in household garbage.
The remainder of this paper is organized as follows. Section 3 describes the dataset utilized in the present investigation. A brief sketch of the YOLOv3 algorithm is presented in Section 4. Details of the system specifications and parametric settings used to train the model are presented in Section 5. In Section 6, the experimental results are presented and discussed. Finally, Section 7 gives the concluding remarks of the present work.

Dataset
The present investigation focuses on urban waste found in the vicinity of public areas, which is frequently disposed of by commuters and pedestrians, and occasionally during commercial events. Here, we examine a number of waste items commonly encountered in the surroundings, including BD and NBD items. Because this was the first attempt to segregate these items based on the biodegradable property of the material, a garbage dataset consisting of the most commonly seen waste items needed to be developed. For this purpose, 7826 images were acquired in JPEG format using the camera of an Apple iPhone XR (64GB) with 1280 × 960 pixel resolution. After the preprocessing and cleaning of the collected data, 6437 (82%) images were utilized to form a self-made real-time dataset wherein each image was labeled with the name of the class to which it belonged and its type (BD/NBD). In the present investigation, these cleaned sample images were grouped into six classes, namely cardboard, glass, plastic, paper, metal and organic waste, as illustrated in Table 1. Furthermore, a detailed description of the waste items assigned to these defined classes, along with their volume size, is provided in Table 2, and their class distribution is represented in Figure 1.

Table 1. Illustration of sample images with their respective class.
In general, most of the waste items presented in Table 2 belong to only one class. However, some items, such as pizza boxes, are made by combining cardboard with a very thin plastic coating, and distinguishing boxes with and without this coating is very challenging even for the human eye. Moreover, incorporating these special cases would significantly increase the complexity and computational overhead. Therefore, in the present analysis, such objects have been treated as objects of the parent class.
Another issue that arises with the development of the present garbage dataset is that, in some of the acquired images, an item of one class contains items of one or more other classes; for example, polybags may contain items of other classes within them. Because of the limitations of visual object recognition, such cases have been resolved by considering only the visible object and classifying it accordingly.

Methodology: YOLOv3 Algorithm
YOLO (You Only Look Once) is one of the most prominent state-of-the-art deep learning techniques [22] which enables simultaneous object detection and classification.
To accomplish the object-detection task, earlier techniques (R-CNN and its variations) employed a pipeline architecture involving multiple steps. Because each component of the pipeline must be trained separately, these techniques are slow and difficult to optimize. YOLO overcomes these drawbacks by transforming object detection into a single regression problem, simultaneously predicting multiple bounding boxes and their class probabilities. Unlike sliding-window and region-proposal-based techniques, YOLO is trained on full images, thereby directly optimizing detection performance. Moreover, the real-time speed, end-to-end training capability, high average precision and generalization capability of YOLOv3 substantiate its efficiency in performing complex object detection tasks, including the detection of significantly small objects.
In general, the YOLOv3 algorithm (as illustrated in Figure 2) takes an input image and passes it through a neural network (similar to a CNN) to produce an output vector of bounding boxes and class predictions. Each image is resized to 416 × 416 before it serves as the input to the YOLOv3 neural network. The architecture of the YOLOv3 neural network, which is based on Darknet-53, is illustrated in Figure 3. It consists of convolutional layers, residual layers, upsampling layers, and skip (shortcut) connections. Comprehensive details about the architecture of YOLOv3 are available in the literature [24].
The YOLOv3 neural network takes an input image and returns an output vector (Figure 4). The output vector consists of the following parameters:
• Prediction Probability (Pc): The probability that a bounding box contains a detectable object.
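As an illustration of the size of this output, the following minimal sketch computes the shape of the prediction tensor at each detection scale. It assumes the standard YOLOv3 configuration, which is not stated in the text above: three detection scales with strides of 32, 16 and 8, and three anchor boxes per scale, where each box carries four coordinates, one objectness score (Pc) and one score per class. The function name `yolo_output_shapes` is hypothetical.

```python
def yolo_output_shapes(input_size=416, num_classes=6,
                       anchors_per_scale=3, strides=(32, 16, 8)):
    """Return (grid_h, grid_w, channels) for each detection scale.

    Assumption: standard YOLOv3 layout with three scales and three
    anchors per scale; these values are not taken from the paper.
    """
    # Each box predicts 4 coordinates + 1 objectness score + class scores.
    values_per_box = 4 + 1 + num_classes
    channels = anchors_per_scale * values_per_box
    return [(input_size // s, input_size // s, channels) for s in strides]


shapes = yolo_output_shapes()
for grid_h, grid_w, channels in shapes:
    print(f"{grid_h} x {grid_w} grid, {channels} values per cell")
```

Under these assumptions, a 416 × 416 input with six classes gives 13 × 13, 26 × 26 and 52 × 52 grids, each with 33 values per cell.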
In YOLOv3, bounding boxes are predicted by utilizing dimension clusters as anchor boxes. Four coordinates ($t_x$, $t_y$, $t_w$, $t_h$) are predicted for each bounding box by the YOLOv3 neural network. If the cell is offset from the top left corner of the image by ($C_x$, $C_y$) and the width and height of the bounding-box prior are ($P_w$, $P_h$), then the corresponding predictions are expressed as $B_x$, $B_y$, $B_w$ and $B_h$, respectively, as demonstrated in Figure 4. Moreover, if $\hat{t}_*$ denotes the ground truth corresponding to a certain coordinate prediction $t_*$, then the gradient is the difference between the ground truth value (calculated via the ground truth box) and the estimated prediction, i.e., $\hat{t}_* - t_*$. The ground truth value can be calculated by inverting the equations given in Figure 5. In YOLOv3, the objectness score for each bounding box is predicted by utilizing logistic regression. The score is 1 if the bounding-box prior overlaps the ground truth object more than any other bounding-box prior does. Bounding-box priors other than the best one are ignored in the prediction, even if their overlap is greater than the threshold (0.5, in this case). Only one bounding box is assigned to each ground truth object in YOLOv3.
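The coordinate transformation described above can be sketched in code. Since Figure 5 is not reproduced here, the sketch assumes the standard YOLOv3 formulation: $B_x = \sigma(t_x) + C_x$, $B_y = \sigma(t_y) + C_y$, $B_w = P_w e^{t_w}$ and $B_h = P_h e^{t_h}$, with all quantities expressed in grid-cell units. The function names `decode_box` and `encode_box` are hypothetical; `encode_box` illustrates how the ground truth values can be obtained by inverting these equations.

```python
import math


def sigmoid(x):
    """Logistic function used to keep the box centre inside its cell."""
    return 1.0 / (1.0 + math.exp(-x))


def decode_box(t, cell, prior):
    """Map raw predictions (t_x, t_y, t_w, t_h) to a box (B_x, B_y, B_w, B_h).

    Assumption: standard YOLOv3 equations, in grid-cell units.
    """
    tx, ty, tw, th = t
    cx, cy = cell          # offset of the cell from the top left corner
    pw, ph = prior         # width and height of the bounding-box prior
    bx = sigmoid(tx) + cx
    by = sigmoid(ty) + cy
    bw = pw * math.exp(tw)
    bh = ph * math.exp(th)
    return bx, by, bw, bh


def encode_box(b, cell, prior):
    """Invert decode_box: recover target values from a ground truth box.

    The box centre must lie strictly inside the given cell.
    """
    bx, by, bw, bh = b
    cx, cy = cell
    pw, ph = prior
    sx, sy = bx - cx, by - cy            # fractional position within the cell
    tx = math.log(sx / (1.0 - sx))       # inverse of the sigmoid
    ty = math.log(sy / (1.0 - sy))
    tw = math.log(bw / pw)
    th = math.log(bh / ph)
    return tx, ty, tw, th


# Zero raw predictions place the box at the cell centre with the prior's size.
centre_box = decode_box((0.0, 0.0, 0.0, 0.0), cell=(3, 4), prior=(2.0, 1.5))
print(centre_box)
```

The gradient mentioned in the text is then simply the element-wise difference between the output of `encode_box` for the ground truth box and the raw network predictions.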

Electronics 2021, 10, x FOR PEER REVIEW 6 of 20

Performance Parameter Indices
The present investigation examines several key metrics [24] throughout the training phase to assess the performance of YOLOv3 in waste segregation. These metrics are as follows.

Precision
Precision is defined as the ratio of the number of objects detected correctly to the total number of objects detected. Mathematically, precision can be computed as expressed by Equation (1):
$$\text{Precision} = \frac{N_{TP}}{N_{TP} + N_{FP}} \qquad (1)$$

Recall
Recall is the ratio of the number of correctly detected objects to the number of ground truth objects, as expressed by Equation (2):

$$\text{Recall} = \frac{N_{TP}}{N_{TP} + N_{FN}} \quad (2)$$

where $N_{TP}$ is the number of true positives, i.e., the number of objects detected correctly; $N_{FP}$ is the number of false positives, i.e., the number of detected objects that do not correspond to any ground truth object; and $N_{FN}$ is the number of false negatives, i.e., the number of ground truth objects that could not be detected.
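As a small worked illustration of these two definitions, the sketch below computes precision and recall from detection counts. The counts in the example are hypothetical and are not results from this work.

```python
def precision(n_tp: int, n_fp: int) -> float:
    """Fraction of detections that correspond to a ground truth object."""
    detected = n_tp + n_fp
    return n_tp / detected if detected else 0.0


def recall(n_tp: int, n_fn: int) -> float:
    """Fraction of ground truth objects that were detected."""
    ground_truth = n_tp + n_fn
    return n_tp / ground_truth if ground_truth else 0.0


if __name__ == "__main__":
    # Hypothetical counts: 8 correct detections, 2 spurious, 4 missed.
    print(precision(8, 2), recall(8, 4))
```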

Intersection Over Union (IoU)
IoU is a well-known evaluation metric in object detection tasks, which is mathematically represented by Equation (3) and illustrated in Figure 6:

$$\text{IoU} = \frac{\text{area}(A \cap B)}{\text{area}(A \cup B)} \quad (3)$$

Here, $A$ and $B$ represent the bounding boxes of the prediction and the ground truth, respectively.
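The IoU of two axis-aligned boxes can be computed as in the following sketch. The corner format `(x1, y1, x2, y2)` is an assumption made for illustration; Darknet itself stores boxes as centre coordinates with width and height.

```python
def iou(box_a, box_b) -> float:
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlap; zero when the boxes are disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


if __name__ == "__main__":
    # Two 2 x 2 boxes sharing a 1 x 1 corner: IoU = 1 / 7.
    print(iou((0, 0, 2, 2), (1, 1, 3, 3)))
```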

Average Precision (AP)
For a specified IoU threshold, a precision-recall curve can be drawn once the values of precision and recall have been determined. The area under the precision-recall curve is referred to as the Average Precision (AP), which can be expressed by Equation (4):

$$AP = \int_0^1 P(r)\,dr \quad (4)$$

where $P(r)$ is the precision at recall $r$.

Mean Average Precision (mAP)
This is the mean of the average precisions of all classes defined in the test model and is expressed by Equation (5) for $N$ classes:

$$mAP = \frac{1}{N}\sum_{i=1}^{N} AP_i \quad (5)$$
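The two definitions above can be illustrated numerically. The sketch below approximates the area under a precision-recall curve with the trapezoidal rule and then averages the per-class values; this is a simplification chosen for clarity, since detection toolkits commonly use interpolated variants of AP, and the sample points are hypothetical.

```python
def average_precision(recalls, precisions) -> float:
    """Trapezoidal approximation of the area under a precision-recall curve."""
    points = sorted(zip(recalls, precisions))  # order the points by recall
    area = 0.0
    for (r0, p0), (r1, p1) in zip(points, points[1:]):
        area += (r1 - r0) * (p0 + p1) / 2.0
    return area


def mean_average_precision(ap_values) -> float:
    """Mean of the per-class AP values."""
    return sum(ap_values) / len(ap_values)


if __name__ == "__main__":
    # Hypothetical curve: perfect precision up to 50% recall, then a decline.
    ap = average_precision([0.0, 0.5, 1.0], [1.0, 1.0, 0.5])
    print(ap, mean_average_precision([ap, 1.0]))
```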

Loss Function
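In the course of training, the sum of squared error loss [22] is used. The value of the loss function is one of the important criteria in evaluating the performance of YOLOv3 on the test model. Usually, the loss function is defined by Equation (6):

$$\text{Loss} = \text{Error}_{coord} + \text{Error}_{IoU} + \text{Error}_{cls} \quad (6)$$

Here, $\text{Error}_{coord}$ is the coordinate prediction error, which is expressed by Equation (7):

$$\text{Error}_{coord} = \lambda_{coord}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\left[(x_i-\hat{x}_i)^2+(y_i-\hat{y}_i)^2\right] + \lambda_{coord}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\left[(w_i-\hat{w}_i)^2+(h_i-\hat{h}_i)^2\right] \quad (7)$$

Here, $\hat{x}_i$, $\hat{y}_i$, $\hat{w}_i$ and $\hat{h}_i$ denote the coordinate position, width and height of the predicted bounding box, and $x_i$, $y_i$, $w_i$ and $h_i$ denote the true values. Further, $\lambda_{coord}$, $S^2$ and $B$ represent the coordinate error weight, the number of grid cells in the input image, and the number of bounding boxes generated by each grid cell, respectively. The value of $\mathbb{1}_{ij}^{obj}$ is 1 if the object falls into the $j$th bounding box in grid $i$ and 0 otherwise. In Equation (6), $\text{Error}_{IoU}$ refers to the IoU error expressed by Equation (8):

$$\text{Error}_{IoU} = \sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\left(C_i-\hat{C}_i\right)^2 + \lambda_{noobj}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{noobj}\left(C_i-\hat{C}_i\right)^2 \quad (8)$$

where $\lambda_{noobj}$ is the IoU error weight, $C_i$ and $\hat{C}_i$ are the predicted and true confidence, respectively, and $\mathbb{1}_{ij}^{noobj}$ is the complement of $\mathbb{1}_{ij}^{obj}$. Additionally, $\text{Error}_{cls}$ denotes the classification error and is usually expressed by Equation (9):

$$\text{Error}_{cls} = \sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\sum_{c \in classes}\left(p_i(c)-\hat{p}_i(c)\right)^2 \quad (9)$$

The $\text{Error}_{cls}$ corresponding to the $i$th grid is the sum of the classification errors associated with all the objects within that grid. The notation used in Equation (9) is as follows: $c$ is the class to which the detected object belongs; $p_i(c)$ is the true probability that an object belonging to class $c$ is in grid $i$; and $\hat{p}_i(c)$ is the predicted value.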
In the YOLOv3 algorithm, the input image is divided into a grid of S × S cells, where each grid cell can predict three bounding boxes. YOLOv3 predicts the bounding boxes at three different scales. The bounding-box priors are determined using k-means clustering. In the present investigation, nine clusters and three scales were selected. These clusters were divided evenly across the scales and were distributed as (27 × 34), (56 × 66), (118 × 177), (132 × 332), (225 × 234), (220 × 354), (349 × 285), (302 × 356), (376 × 367).
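The even division of the nine clusters across the three scales can be sketched as below. The anchor sizes are those listed in this work; the grouping rule (sort by area, smallest anchors to the finest scale) is an assumption that follows the usual Darknet configuration convention, and `split_anchors_by_scale` is a hypothetical helper name.

```python
# Nine anchor priors (width, height) obtained by k-means in this work.
ANCHORS = [
    (27, 34), (56, 66), (118, 177),
    (132, 332), (225, 234), (220, 354),
    (349, 285), (302, 356), (376, 367),
]


def split_anchors_by_scale(anchors, num_scales=3):
    """Sort anchors by area and divide them evenly across detection scales.

    Group 0 holds the smallest anchors (assumed to serve the finest grid)
    and the last group holds the largest (assumed to serve the coarsest).
    """
    ordered = sorted(anchors, key=lambda wh: wh[0] * wh[1])
    per_scale = len(ordered) // num_scales
    return [ordered[i * per_scale:(i + 1) * per_scale] for i in range(num_scales)]


if __name__ == "__main__":
    for scale, group in enumerate(split_anchors_by_scale(ANCHORS)):
        print(scale, group)
```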

Anchor Boxes
If the midpoints of multiple objects fall in the same grid cell, a single prediction per cell cannot detect all of them. To avoid this issue, each object in the same grid cell is assigned to an anchor box. For instance, with three anchor boxes, three predictions can be associated with a single grid cell. Each object is assigned to the anchor box with which it has the highest IoU (Intersection over Union). If the IoU is less than the threshold value (here, set to 0.5), that object is not considered for detection. Thus, anchor boxes make it possible to detect multiple objects in a single grid cell, as illustrated in Figure 7.
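The assignment rule described above can be sketched as follows. The sketch assumes that the IoU between an object and an anchor is computed on width and height only (both boxes aligned at a common corner), which is a common simplification; `shape_iou` and `assign_anchor` are hypothetical names, and the example object sizes are illustrative.

```python
ANCHORS = [
    (27, 34), (56, 66), (118, 177),
    (132, 332), (225, 234), (220, 354),
    (349, 285), (302, 356), (376, 367),
]


def shape_iou(wh_a, wh_b) -> float:
    """IoU of two boxes that share a corner, so only width/height matter."""
    intersection = min(wh_a[0], wh_b[0]) * min(wh_a[1], wh_b[1])
    union = wh_a[0] * wh_a[1] + wh_b[0] * wh_b[1] - intersection
    return intersection / union


def assign_anchor(object_wh, anchors, threshold=0.5):
    """Return the index of the best-matching anchor, or None if the best
    IoU is below the threshold (the object is then not considered)."""
    ious = [shape_iou(object_wh, anchor) for anchor in anchors]
    best = max(range(len(anchors)), key=lambda i: ious[i])
    return best if ious[best] >= threshold else None


if __name__ == "__main__":
    print(assign_anchor((120, 180), ANCHORS))  # close to the (118, 177) prior
    print(assign_anchor((10, 10), ANCHORS))    # far smaller than every prior
```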

Non-Max Suppression (NMS)
Another problem encountered in object detection is multiple detections of the same object. Non-max suppression ensures that each object is detected only once. The NMS algorithm sequentially compares the bounding box with the maximum $P_c$ against all other bounding boxes that intersect it. All bounding boxes associated with the same object that have a comparatively low $P_c$ are suppressed, as demonstrated in Figure 8.
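A greedy form of non-max suppression can be sketched as below. This is an illustrative implementation rather than the Darknet routine; the corner box format `(x1, y1, x2, y2)` and the suppression threshold of 0.5 are assumptions made for the example.

```python
def iou(box_a, box_b) -> float:
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Return the indices of the boxes kept after greedy suppression.

    The box with the highest score (Pc) is kept, every remaining box that
    overlaps it by at least iou_threshold is discarded, and the process
    repeats on what is left.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep


if __name__ == "__main__":
    boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
    scores = [0.9, 0.8, 0.7]
    print(non_max_suppression(boxes, scores))
```

In the example, the second box overlaps the first with an IoU of about 0.68 and is suppressed, while the third box does not intersect either and is kept.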

The Training
The experimental platform configuration used for training and evaluating the YOLOv3 neural network is presented in Table 3. From the cleaned dataset of 6437 images, 80% of the images (5150 images) were used for training and the remaining 20% (1287 images) were used for testing and validation. The performance of a deep learning model is highly influenced by the size of the dataset. Generally, training with a small dataset leads to overfitting, and a transfer learning approach is used to handle this problem [25]. In this approach, a pretrained model is repurposed to accomplish a similar detection task. A base network is first trained on a base dataset and task, and the learned features are then transferred to a second target network, which is trained on a target dataset and task. This process tends to work if the features are suitable for both the base and target tasks, rather than being specific to the base task. Considering the performance of the YOLO algorithms when trained on the large-scale COCO image dataset, this work transfers the YOLOv3 and YOLOv3-tiny networks pretrained on COCO.
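The 80/20 division can be reproduced with a sketch such as the following. The shuffling, the fixed seed and the helper name `split_dataset` are assumptions for illustration; the text does not state how the images were selected for each subset.

```python
import random


def split_dataset(items, train_fraction=0.8, seed=0):
    """Shuffle the items and split them into training and test subsets."""
    items = list(items)
    random.Random(seed).shuffle(items)  # seeded so the split is repeatable
    n_train = round(len(items) * train_fraction)
    return items[:n_train], items[n_train:]


if __name__ == "__main__":
    # Image indices stand in for the 6437 image files of the cleaned dataset.
    train, test = split_dataset(range(6437))
    print(len(train), len(test))
```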
At the start of training, the weights of the convolutional layers were initialized using the pretrained weights darknet53.conv.74. As discussed earlier, the YOLOv3 neural network is trained to detect waste in six classes of objects (cardboard, glass, metal, paper, plastic and organic waste), as listed in Table 2. Moreover, the performance of the YOLOv3 algorithm has been compared with that of YOLOv3-tiny. For this purpose, the parametric settings used to train the model with the YOLOv3 and YOLOv3-tiny algorithms are tabulated in Table 4. The whole investigation environment uses Visual Studio 2017 for the compilation of the entire script.

Note to Table 4: * represents the parameters modified in the original YOLOv3 CFG and YOLOv3-tiny CFG, respectively. The number of filters depends on the number of classes, bounding box properties, prediction probability, and the number of masks, i.e., filters = {number of bounding box properties (4) + prediction probability $P_c$ (1) + total number of classes (6)} × number of masks, where mask denotes the indices of anchors (3). For both YOLOv3 and YOLOv3-tiny, this gives filters = (4 + 1 + 6) × 3 = 33.
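The filter count in the note above follows directly from the formula, as the short sketch below shows. The function name `yolo_filters` is hypothetical; the 80-class case is included only as a familiar point of comparison with the COCO configuration.

```python
def yolo_filters(num_classes: int, num_box_properties: int = 4, num_masks: int = 3) -> int:
    """Filters in the convolutional layer that feeds each YOLO detection layer.

    Each of the num_masks anchors predicts the box properties, one
    prediction probability (Pc) and one score per class.
    """
    return (num_box_properties + 1 + num_classes) * num_masks


if __name__ == "__main__":
    print(yolo_filters(6))   # six waste classes, as in this work
    print(yolo_filters(80))  # 80 classes, as in COCO
```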

Performance Evaluation, Results and Discussion
In the present work, training was run for 12,000 iterations, and the total training time for the YOLOv3 algorithm was about 48 h on the platform described in Table 3. During training, the abovementioned performance parameter indices (including the AP of each class, recall, mAP and average IoU) were examined at regular intervals. Table 5 presents the results obtained during the training phase for YOLOv3 and YOLOv3-tiny. As is evident from Table 5, after the 5000th iteration, the mAP of YOLOv3 reaches around 94%, whereas YOLOv3-tiny yields only 45.96%. Thereafter, the mAP of YOLOv3 settles at approximately the same value (best value: 94.99%). However, the mAP of YOLOv3-tiny does not settle in the present experiment even after 12,000 iterations and attains a best value of 51.95%. The variations in average loss and mAP w.r.t. the number of iterations during training with YOLOv3 and YOLOv3-tiny are illustrated in Figure 9a,b, respectively. For comparative analysis, training was also carried out using the YOLOv3-tiny algorithm on the same dataset, which took approximately 14 h on a similar system configuration (Table 3). As is evident from Figure 9a,b, the average loss values for YOLOv3 and YOLOv3-tiny are 0.6806 and 0.1525, respectively, after the completion of training (12,000 iterations). Conclusively, our training results indicate that the best mAP of YOLOv3 is 82.85% higher, in relative terms, than that of the YOLOv3-tiny model (94.99% versus 51.95%), which strengthens and validates our hypothesis. From Figure 10, it is observed that YOLOv3 offers enhanced AP for each class during training.
The trends obtained during training for all the classes (Table 2) indicate that YOLOv3 again outperforms YOLOv3-tiny in terms of AP, as illustrated in Figure 10. For instance, the best AP values (Table 5) attained by YOLOv3 for the classes cardboard, glass, plastic, paper, metal and organic waste are 97.27%, 97.40%, 99.87%, 85.28%, 91.16% and 98.93%, respectively, whereas YOLOv3-tiny attains 62.16%, 61.79%, 31.98%, 48.32%, 26.15% and 81.29%, respectively, for the same classes. To compare the two approaches in terms of mAP (%), Figure 11 charts the variation in mAP with the increasing number of iterations during training with YOLOv3 and YOLOv3-tiny.

Furthermore, a statistical analysis of the present work in terms of AP, mAP and detection speed (frames per second, i.e., FPS) is presented in Table 6. The data reveal the effectiveness of YOLOv3 over the YOLOv3-tiny algorithm.

After training, the trained models were evaluated on the test images. The test images correspond to the garbage image test set developed in this paper, which has 1287 images, including 165 cardboard, 163 glass, 146 metal, 312 paper, 317 plastic and 184 organic waste samples. The experimental results demonstrate that the detection capability and prediction probability of YOLOv3 are significantly higher than those of YOLOv3-tiny, as visualized in Figure 12 and presented in Table 7. Most of the test images were accurately detected with acceptable prediction probability by YOLOv3.
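As a consistency check on the figures quoted above, the sketch below averages the reported per-class AP values and sums the per-class test-set counts. All numbers are taken from the text; only the helper name `mean_ap` is introduced for illustration. The means agree with the reported best mAP values (94.99% and 51.95%) to within rounding, and the counts add up to the 1287 test images.

```python
# Best per-class AP values (%) reported for each model.
YOLOV3_AP = {
    "cardboard": 97.27, "glass": 97.40, "plastic": 99.87,
    "paper": 85.28, "metal": 91.16, "organic waste": 98.93,
}
YOLOV3_TINY_AP = {
    "cardboard": 62.16, "glass": 61.79, "plastic": 31.98,
    "paper": 48.32, "metal": 26.15, "organic waste": 81.29,
}

# Number of test images per class.
TEST_SET_COUNTS = {
    "cardboard": 165, "glass": 163, "metal": 146,
    "paper": 312, "plastic": 317, "organic waste": 184,
}


def mean_ap(ap_by_class) -> float:
    """Mean of the per-class average precisions, in percent."""
    return sum(ap_by_class.values()) / len(ap_by_class)


if __name__ == "__main__":
    print(mean_ap(YOLOV3_AP), mean_ap(YOLOV3_TINY_AP))
    print(sum(TEST_SET_COUNTS.values()))
```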
YOLOv3 gives correct predictions for all the test images and is capable of detecting even small objects. However, YOLOv3-tiny gives false predictions for test images 1, 2, 6, 8 and 9 and produces no prediction at all for test image 7.

Table 8 presents the missed and false detection rates for both algorithms. Evidently, these rates are lower for YOLOv3 than for the YOLOv3-tiny algorithm. Furthermore, the test results show that the prediction time needed to classify the objects with YOLOv3-tiny is approximately four times lower than that of YOLOv3, as shown in Table 7. This means that the computation speed of YOLOv3-tiny is significantly higher than that of YOLOv3. Conclusively, although the computational performance of YOLOv3-tiny is remarkable, its detection capability and prediction probabilities are not acceptable.

Furthermore, to quantify the obtained results, the detection capability on the test images has been compared between the models developed with YOLOv3 and YOLOv3-tiny. As depicted in Table 9, the trained YOLOv3 model achieved superior detection capability compared to YOLOv3-tiny. It was able to detect most of the objects in the test images with significant prediction probability. Specifically, YOLOv3 achieved 100% detection accuracy for test images 2, 3, 4, 5, 6 and 7, whereas YOLOv3-tiny achieved this only for the very simple test images (3, 4 and 5).
Furthermore, the average accuracy over all the test images was 85.29% for YOLOv3 and 26.47% for YOLOv3-tiny. Therefore, YOLOv3 outperforms YOLOv3-tiny by a notable margin of 58.82 percentage points. However, YOLOv3 also struggles to detect objects accurately under occlusion and complex environmental conditions (test images 1, 8 and 9), as illustrated in Figure 12 and Table 9. This might be because of very small visual appearances and cluttered backgrounds.
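The two comparisons reported in this section use different kinds of difference, which the sketch below makes explicit: the accuracy margin is an absolute gap in percentage points, whereas the mAP improvement quoted for the training results is a gain relative to the YOLOv3-tiny value. The numbers are those reported in the text; the function names are hypothetical.

```python
def absolute_gap(value_a: float, value_b: float) -> float:
    """Difference between two percentages, in percentage points."""
    return value_a - value_b


def relative_gain(value_a: float, value_b: float) -> float:
    """Improvement of value_a over value_b, as a percentage of value_b."""
    return (value_a - value_b) / value_b * 100.0


if __name__ == "__main__":
    # Average test accuracy: YOLOv3 vs. YOLOv3-tiny.
    print(absolute_gap(85.29, 26.47))
    # Best training mAP: YOLOv3 vs. YOLOv3-tiny.
    print(relative_gain(94.99, 51.95))
```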

Conclusions
This paper presented a novel application of the YOLOv3 algorithm for waste segregation as an aid to strengthening the smart urban waste segregation and management framework. The neural network was trained on a self-made dataset of 6437 images of urban waste products for the detection of six classes of waste items. The experimental results demonstrate the efficiency of the proposed work in segregating waste into two categories: biodegradable and non-biodegradable. Near real-time detection of waste was accomplished in this work. The quantitative comparison of the results obtained by YOLOv3 and YOLOv3-tiny endorses the efficacy of YOLOv3 in waste segregation. Furthermore, the improved prediction probability of YOLOv3 demonstrates its effectiveness over YOLOv3-tiny. Conclusively, the comparative analysis between YOLOv3-tiny and YOLOv3 quantified the improvement in speed and the accompanying reduction in accuracy (due to the simplified architecture of YOLOv3-tiny), which, in turn, helped in understanding the accuracy-speed trade-off. Furthermore, garbage image detection involves many complexities, such as objects that are made of more than one type of material or that contain objects of other classes. To deal with such real-world complexities, only objects belonging to parent classes and objects of a visible category were considered; however, this opens the window for further research to classify garbage more exactly, depending upon the properties of the material(s). Additionally, the object detection strategy for waste segregation used in this work opens the gateway for the effective recycling and disposal of waste. However, reducing the detection time while maintaining exceptionally high prediction probability provides scope for further research. Future work will focus on the optimization of results, along with the prediction probability for other waste items in the real world.