Solid Waste Image Classification Using Deep Convolutional Neural Network

Abstract: Separating household waste into categories such as organic and recyclable is a critical part of waste management systems to make sure that valuable materials are recycled and utilised. This is beneficial to human health and the environment because less risky treatments are used at landfill and/or incineration, ultimately leading


Introduction
Solid waste management typically relies on community residents to manually separate household solid waste into two broad categories, namely organic and recyclable [1]. Organic wastes (typically derived from plants and animals) are biodegradable [2] and have enormous economic benefits because they can be treated to produce soil additives and methane [3,4]. Recyclable waste, on the other hand, includes reusable materials such as glass, metal, paper, and electronics, which can be transformed into new materials [5]. However, waste separation by residents is not rigorous due to various factors, such as low subjective consciousness and limited knowledge of waste classification [6]. As such, further (manual) classification is usually undertaken by operators working at local waste management depots. This is inefficient and expensive, and unsorted solid waste often ends up in landfill or openly dumped, presenting a huge burden to global public health as a result of high infection rates among people exposed to solid waste dumping sites [7].
Recent estimates suggest that only 13.5% of global waste is recycled, while 33% is dumped openly without classification [8]. Common hazards associated with dumping unsorted waste openly include soil contamination, surface and ground water pollution, greenhouse gas emissions, and reduced crop yield [9]. In fact, only 17.4% of global electronic waste is collected/recycled, a loss estimated at around 57 billion United States Dollars (USD) [10]. This was corroborated by The Ellen MacArthur Foundation [11], who argued that 32% of plastic packaging is not collected and estimated the economic loss at between 80 billion and 120 billion USD. As global waste growth is expected to exceed population growth by 2050 [8], this will not only have serious implications for ecological balance, but will also threaten global sustainable development and human well-being. This calls for the development of tools that improve the automation of waste management.
Automatic recognition and detection of waste from images has become a popular alternative to manual waste sorting, thanks to rapid advances in computer vision and artificial intelligence. Many machine learning algorithms have been proposed to improve the accuracy of automatic waste classification [12][13][14]. In recent years, however, deep neural networks [15], especially convolutional neural networks (CNNs), have proven to be very effective in learning from existing data, achieving remarkable results in image classification [14,[16][17][18]. Thus, by taking images of solid waste as input data, CNNs can automatically classify waste into the relevant categories.
Various standard CNN architectures have recently been proposed to perform image classification tasks with high accuracy, such as VGGNet [19], AlexNet [20], ResNet [21], and DenseNet [22]. However, efficiency in terms of model size and development time is a major challenge posed by these standard models, because they are often pre-trained for more than one purpose. For example, VGGNet is trained on 1000 different categories and consists of 16 layers with 138 million parameters. This is generally appealing, but inefficient in cases where fewer layers are required to perform a specific task. A user without the advanced knowledge needed to modify the architecture would normally have to train the whole model, resulting in large model sizes with high and unnecessary computational cost. Even when the architecture is modified (e.g., layer reduction/freezing), the resulting model is unlikely to be small, as evidenced by Hang et al. [23], who evaluated the efficiency of nine standard CNN architectures with layer reduction to suit their leaf disease classification task; the resulting model sizes ranged from 45.1 MB to 558.4 MB. Therefore, building and training a CNN from scratch is desirable due to the flexibility it offers to implement an architecture that fits specific task requirements without excessive use of system resources.
Another major challenge in CNN research is data paucity, because a large amount of data is required for CNN training. Although experimental data are becoming easier to access from public repositories, there is still a shortage of waste image datasets for model training. Among the publicly available waste classification datasets is Sekar's [24], available on the Kaggle (www.kaggle.com accessed on 1 January 2022) public data repository, which consists of 25,077 images of organic (13,966) and recyclable (11,111) waste materials. However, this training dataset is rather modest and unable to capture the characteristics of all solid waste categories accurately. There is still a lack of large-scale databases for waste classification on the scale of ImageNet (www.image-net.org/ accessed on 1 January 2022), which consists of 14,197,122 images organised into 21,841 categories. Moreover, CNNs typically have a large number of parameters, so the training process takes considerable time and resources, ultimately leading to large model sizes and long computational times.
In this paper, we present a bespoke CNN architecture developed for waste image classification, consisting of five convolutional 2D layers of various neuron sizes, followed by a number of fully connected layers. Experiments were based on Sekar's [24] waste classification dataset available on Kaggle. To overcome the drawback of insufficient data, augmentation methods [25] were applied to increase the amount of data available for training, validation, and testing. To investigate the possibility of training an efficient, lightweight model with high performance and low computational demand, we trained the bespoke CNN architecture described in Section 3.3 with two different image resolutions (80 × 45 and 225 × 264 pixels) of the augmented version of Sekar's [24] waste classification dataset and compared performance in terms of accuracy, development time, and model size. As background, the image resolution of the original dataset is predominantly 225 × 264 pixels, so we considered downsizing to 80 × 45 pixels to demonstrate how the bespoke CNN architecture can be used for different target applications. For example, web applications using high-resolution cameras with no memory size constraints will likely benefit from the model with larger image pixels, while an embedded application using a low-cost device with a low-resolution camera and/or reduced memory size would benefit from the smaller model.
We initially considered performance comparison between our bespoke CNN architecture and other published studies that evaluated their approach with the same waste classification dataset [26][27][28]. However, fundamental flaws observed in the methods and validation approaches used in these studies raised questions about the reliability of their results. Unfortunately, information gaps in their experimental setups mean that we could not reproduce their methods, as no source code was provided. Other relevant studies either experimented with a fraction of the dataset (approximately 20% or less) [29][30][31], or merged it with other similar datasets to increase the training data [32,33]. Thus, in the absence of a 'reliable' and/or 'reproducible' baseline approach, we trained a random guess classifier, which forms the baseline against which the performance of our approach was compared. Performance evaluation was based on accuracy and cross-entropy loss metrics. The accuracy metric calculates how often predictions equal class labels [34,35], while cross-entropy loss evaluates the divergence of predicted probabilities from actual class labels [36]. To encourage transparency and allow the reproducibility of our experiments, details regarding where to find the data and code supporting the results reported in this paper are available at [37] (Data and code are available at: www.data.mendeley.com/datasets/n3gtgm9jxj/2 accessed on 1 January 2022), including the dataset generated during the study after initial cleanup, a Jupyter notebook (.ipynb) file useful for applying data augmentation to the dataset, and a Jupyter notebook (.ipynb) file to replicate the data split and experimental method/setup. This paper makes the following contributions:
1. The provision of a reconstructed and re-presented version of an existing dataset for solid waste classification [24] (including source code) such that it can be used by other researchers to reproduce the experiments, improve results, and compare performance;
2. The proposal of a bespoke, lightweight CNN framework based on image size reduction for waste classification, with low time and computation requirements and relatively high accuracy.
Other researchers who evaluated a variety of standard CNN architectures include Wang et al. [47], who achieved 86.19% accuracy with a fine-tuned VGGNet-19 tested on a self-composed waste dataset of 69,737 images. Castellano et al. [48] evaluated VGGNet-16 with a waste dataset of 2527 images and achieved 85% accuracy. Radhika [49] found MobileNetV2 [50] more accurate than ResNet, VGGNet, and InceptionNet, with an accuracy of 98%, although the evaluation dataset was not specified. Rahman et al. [51] achieved a best accuracy of 95% with ResNet-34, evaluated on a dataset consisting of 2527 images. Buelaevanzalina [52] achieved 83% accuracy with VGGNet-16, while Kusrini [53] achieved an F-score between 69% and 82% with a YOLOv4 CNN evaluated on a multi-class dataset containing 3870 images of glass, metal, paper, and plastic.
Some researchers have developed bespoke CNN models for waste classification. Among them, Junjie et al. [54] implemented a hybrid CNN-ELM model and evaluated its performance with two public datasets (including TrashNet). The model was compared to a wide variety of standard CNN models, of which VGGNet-19 produced the best accuracy, between 91% and 93%. The proposed CNN-ELM model only achieved 90% accuracy, but was 720 s faster. Alonso et al. [55] evaluated an unspecified CNN architecture with 3600 self-obtained waste images spanning four class labels (paper, plastic, organic, and glass) and achieved per-class F-scores between 59% and 75%. Mollá [32] generated ∼12K waste images from a combination of various sources and achieved between 65% and 85% accuracy with an unspecified CNN architecture. Other researchers who evaluated bespoke/unspecified CNNs include Liang [18], with 95% accuracy.
A problem commonly associated with CNN research is the shortage of training data; therefore, many researchers have merged various open source datasets in their studies. For example, Majchrowska et al. [56] merged 10 different datasets and achieved 75% accuracy with EfficientDet-D2. Sivakumar et al. [57] used a combination of four datasets (including TrashNet) to achieve up to 98% accuracy with a bespoke eight-layer CNN architecture. Faria et al. [33] created a new 'OrgalidWaste' dataset containing around 5600 images in four classes (organic, glass, metal, and plastic). Of the five CNN architectures evaluated, VGGNet produced the best result with an accuracy of 88.42%.
Mulim et al. [27], Togaçar et al. [26], and Mallikarjuna et al. [28] are the only studies found to have evaluated their methods on Sekar's [24] waste classification dataset used in our experiments. Togaçar et al. [26] achieved a best accuracy of 99.95% with an autoencoder network that simultaneously transformed the data from the image space to the feature space and used a CNN model to extract features; Support Vector Machine (SVM) was used as the classifier in all their experiments. Mallikarjuna et al. [28] achieved 90% accuracy with a four-layer CNN after transforming the data with the ImageDataGenerator class provided by the Keras deep learning neural network library [34,58]. Mulim et al. [27] performed a similar transformation on the same dataset and used it to train a 'modified' version of the EfficientNet-B0 CNN model [59]; their best accuracy was 96%.
Despite these achievements, there are several problems with existing CNN research and experimental practices. First, the data size (sometimes extremely modest for training a CNN), data preprocessing technique, and validation approach (including the training, validation, and testing split) vary between research studies undertaken with the same dataset. Second, and most importantly, there are often information gaps in the methodology and experimental setup which mean that the experiments cannot be reproduced, as no source code is provided. Specifically, there are fundamental flaws and a lack of methodological transparency within the three studies [26][27][28] that evaluated their CNN architectures on Sekar's [24] waste classification dataset; none of them provided source code to reproduce their experiments.
For example, Togaçar et al. [26] used Irving's AutoEncoder [60] to reconstruct the original dataset, but failed to supply all the parameters necessary to replicate the data preprocessing steps. The original and reconstructed datasets were then combined to train three CNN architectures (AlexNet, GoogLeNet, and ResNet-50) with a transfer learning approach [61] to extract features. The features were subsequently reduced with the Ridge Regression (RR) feature selection method [62] and used as input for training several SVM classifiers. Although the theoretical underpinnings of RR and the CNN architectures were explained, the specific parameters used to implement them in the study were not specified. Other transparency issues that make the experiments irreproducible include the lack of specific parameters for training the SVM classifiers, as well as for the data pre-processing steps applied to the experimental data. Specifically, the study reduced the original waste classification dataset [24] from 25,077 to 22,222 images to balance the classes (i.e., 11,111 images per class). The selection was achieved by random sampling from the original dataset, but without the actual pre-processed data, the experimental results are impossible to replicate because different subsets of the data will likely lead to different results. In addition, the accuracy reported in the paper is arguably unreliable due to fundamental flaws in the validation approach used in the study. For example, the authors reported that an 80:20 training and test split was applied to the experimental data during the feature extraction experiments, but it is not clear how, and on what subset of the data, model validation was performed during training. It seems that the test dataset was used for validation/parameter tuning during training, as well as for testing after training. Additionally, the reported 'final' accuracy was based on k-fold cross validation (k = 10) applied to the SVM classifier (perhaps with the same test dataset used for the feature extraction experiments). Cross validation is more appropriate during training for parameter tuning; final testing should 'ideally' be conducted on a dataset unseen by the classifier during training.
A similar error was observed with Mulim et al. [27], who also used a transfer learning approach to extract features from the same dataset before training a 'modified' EfficientNet-B0 CNN architecture [59], reporting a best accuracy of 96%. Firstly, the modifications made to EfficientNet-B0 were not explicitly stated, which hinders reproducibility. Additionally, the study made the same fundamental error of not using a separate validation dataset during training. Although an 80:20 training and test split was reported, it seems that the validation/parameter tuning performed during training was based on the test dataset, which makes the results unreliable.
Mallikarjuna et al. [28] achieved 90% accuracy with (what appears to be) Sekar's waste classification dataset [24] by training a four-layer CNN architecture after performing image augmentation. However, there are significant inconsistencies, a lack of transparency, and numerous fundamental errors in the method and experimental setup. For example, conflicting information about the original data size makes the experiment impossible to replicate, such as '. . . 20,000 images in data set which consists of 502 Organics and 1502 recycle. . . '. Elsewhere in the paper, the authors report the data as consisting of '. . . 22,564 images in all belonging to two classes namely, 'Organic' and 'Recyclable' with 2513 images each'. An 80:20 training and testing split was specified, but there is no indication of a validation set. In addition, the parameters used in the four-layer CNN architecture were not specified, making the experiment irreproducible.
In view of the methodological transparency issues surrounding existing studies that utilised Sekar's waste classification dataset [24], the research reported in this paper is timely: it provides detailed information about the experimental data and setup in a way that encourages methodological transparency and allows results to be reproduced. This practice will facilitate cross-comparison of methods, ultimately leading to a clear pathway for identifying and improving on the state of the art.

Materials and Methods
This section presents the experimental method and materials, including details of the experimental dataset, the data preprocessing steps undertaken to set up the experiments, and the method adopted to address the study aims.

Image Data Augmentation
Image augmentation is a useful technique for increasing the diversity of a training dataset: realistic but random copies of the original images are generated through simple transformations such as geometric and colour space changes, image cropping, noise injection, and random erasing. By expanding limited datasets, this procedure takes advantage of the capabilities of big data and is known to improve model performance [25]. The augmentation presented in this paper is based on the ImageDataGenerator class provided by the Keras deep learning neural network library [34,58]. Specifically, we used the ImageDataGenerator class to perform 13 transformations on the images: 6 geometric transformations (i.e., image rotation, height and width shift, and horizontal and vertical flip); 5 colour transformations (i.e., contrast, brightness, hue, saturation, and gamma); and 2 additional image manipulations (zoom and blur). The geometric transformations allow for images captured at different angles. The colour transformations were deemed necessary to simulate different exposure and luminosity conditions.
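To make the pipeline concrete, the following is a minimal sketch of how such a configuration might look with Keras's ImageDataGenerator. The parameter values, directory path, and the colour_jitter helper are illustrative assumptions rather than the study's exact settings; note that hue, saturation, gamma, and blur are not built-in ImageDataGenerator arguments and would have to be supplied through preprocessing_function.

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

def colour_jitter(image):
    """Placeholder for the colour transformations (contrast, hue, saturation,
    gamma) and blur described in the text; implement with e.g. OpenCV/PIL."""
    return image

augmenter = ImageDataGenerator(
    rotation_range=20,            # image rotation (degrees) -- assumed value
    width_shift_range=0.1,        # horizontal shift (fraction of width)
    height_shift_range=0.1,       # vertical shift (fraction of height)
    horizontal_flip=True,         # geometric flips
    vertical_flip=True,
    brightness_range=(0.8, 1.2),  # brightness jitter -- assumed range
    zoom_range=0.2,               # zoom -- assumed range
    preprocessing_function=colour_jitter,
)

# Stream augmented batches from a directory with one sub-folder per class
# (hypothetical path; target_size is (height, width)).
train_flow = augmenter.flow_from_directory(
    "waste_dataset/train", target_size=(225, 264),
    class_mode="categorical", batch_size=32)
```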

Method
We performed image classification tasks on the experimental data with the aim of finding the class (i.e., organic or recyclable) to which a new 'unseen' observation belongs. As noted in Section 2, standard machine learning methods such as SVM, decision trees, and K-Nearest Neighbour (k-NN) have been applied to classify images with varying levels of success. However, more recent methods such as deep neural networks, especially CNNs, have proven more successful for image classification [12,14]. In particular, CNN architectures such as VGGNet [19], AlexNet [20], ResNet [21], and DenseNet [22] have achieved high accuracy in image classification. However, models trained with these standard architectures consume a large amount of system resources because they are often pre-trained for more than one purpose, which makes them inefficient in terms of model size and development time when dealing with specific requirements such as the waste image classification task presented in this paper. Thus, we developed the bespoke 5-layer CNN architecture presented in Figure 2.
To investigate the possibility of training an efficient, lightweight model with high performance and less computational/resource demand, the CNN architecture was trained with two different image resolutions (80 × 45 and 225 × 264 pixels) of the augmented version of Sekar's [24] waste classification dataset described in Section 3.2. As background, the predominant image resolution in the original dataset is 225 × 264 pixels, hence its selection; the smaller resolution of 80 × 45 pixels was chosen arbitrarily. We considered downsizing the original images for the following reasons:

1. To show how image resizing can be used to address the requirements of different applications, namely a lightweight application for a low-cost device with limited memory capacity and a low-resolution camera, and a robust application using a high-resolution camera without memory restrictions.
2. To investigate the variation in performance between the two models. The idea is to determine whether a smaller image resolution can achieve relatively high performance, thus avoiding unnecessary waste of system resources in terms of model size and computational time.

We also trained a random guess classifier, which forms the baseline against which the performance of our bespoke CNN models was compared. This was deemed necessary due to the absence of 'reliable' and 'reproducible' existing work against which to perform a direct comparison. All experiments were conducted over 50 epochs to obtain the best parameter sets. Performance evaluation was based on accuracy and cross-entropy loss observed during model training, validation, and testing.
In classification tasks, the accuracy metric calculates how often predictions equal class labels [34,35]. Its value can be represented mathematically as Equation (1):

$$\mathrm{accuracy} = \frac{t_p + t_n}{t_p + t_n + f_p + f_n} \quad (1)$$

where $t_p$ is the number of positive instances predicted correctly; $t_n$ is the number of negative instances predicted correctly; $f_p$ is the number of negative instances incorrectly predicted as positive; and $f_n$ is the number of positive instances incorrectly predicted as negative. Cross-entropy loss is a common loss function used to optimise and evaluate classification models because its value reveals the magnitude of the divergence between the predicted probabilities and the actual class labels. It relies on the Softmax activation function, usually placed at the end of CNN architectures to convert output logits (i.e., unnormalised predictions) into classification probabilities. For binary classification tasks, cross-entropy loss is defined mathematically as Equation (2):

$$\mathrm{CE} = -\sum_{i=1}^{2} t_i \log(p_i) \quad (2)$$

where $t_i$ is the truth value, taking a value of 0 or 1, and $p_i$ is the Softmax probability for the $i$th class.
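As a worked illustration of Equations (1) and (2), the following NumPy sketch computes both metrics; the counts and probabilities are made-up examples.

```python
import numpy as np

def accuracy(tp, tn, fp, fn):
    # Equation (1): proportion of correct predictions.
    return (tp + tn) / (tp + tn + fp + fn)

def cross_entropy(t, p, eps=1e-12):
    # Equation (2): t is a one-hot truth vector, p the Softmax probabilities.
    return -np.sum(t * np.log(p + eps))

# Hypothetical example: 200 images, and one organic image (class index 0)
# predicted with 90% confidence.
print(accuracy(tp=90, tn=85, fp=15, fn=10))                    # 0.875
print(cross_entropy(np.array([1, 0]), np.array([0.9, 0.1])))   # ~0.105
```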

Experimental Setup
The bespoke CNN architecture is presented in Figure 2, comprising a series of convolution, activation, and pooling operations, followed by a number of fully connected layers. Specifically, the CNN consists of 5 convolutional 2D layers of various neuron sizes, each with a ReLU activation function and 2D max pooling with a 2 × 2 window. The output of the convolution plus pooling operations is flattened and fed into 2 dense (fully connected) layers, with ReLU and Softmax activation functions, respectively, to classify a given input image into one of the 2 classes. A dropout layer with a value of 0.5 is inserted between the dense layers. Dropout is by far the most popular regularisation technique for deep neural networks [65] and is known to add a fairly substantial gain to model accuracy. It also prevents over-fitting, because a neuron is temporarily 'dropped' or disabled with probability p at each iteration during training. This means that all the inputs and outputs to this neuron are disabled at the current iteration and resampled with probability p at every training step; in other words, a neuron dropped at one iteration can be active at the next. The hyperparameter p, commonly called the dropout rate, is typically a number between 0.0 (no dropout) and 1.0 (no outputs from the layer). Dropout values between 0.5 and 0.8 are recommended for a hidden layer [66], so we used 0.5 for our CNN architecture. This corresponds to 50% of the neurons being dropped out during training.
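Since Figure 2 is not reproduced here, the following Keras sketch shows one plausible realisation of the described architecture. The filter counts, kernel size, and dense-layer width are assumptions (the text only specifies 'various neuron sizes'); the five convolution-plus-pooling blocks, ReLU activations, 2 × 2 max pooling, 0.5 dropout, and two-way Softmax output follow the description above.

```python
from tensorflow.keras import layers, models

def build_waste_cnn(input_shape=(225, 264, 3),
                    filters=(32, 64, 128, 128, 256)):  # assumed filter counts
    """Sketch of the bespoke architecture: 5 Conv2D layers (ReLU), each
    followed by 2x2 max pooling, then Flatten, a ReLU dense layer, 0.5
    dropout, and a 2-way Softmax output (organic vs. recyclable)."""
    model = models.Sequential()
    model.add(layers.Input(shape=input_shape))
    for f in filters:
        model.add(layers.Conv2D(f, (3, 3), activation="relu", padding="same"))
        model.add(layers.MaxPooling2D(pool_size=(2, 2)))
    model.add(layers.Flatten())
    model.add(layers.Dense(64, activation="relu"))    # assumed width
    model.add(layers.Dropout(0.5))                    # 50% dropout as stated
    model.add(layers.Dense(2, activation="softmax"))  # 2 classes
    return model
```

The same function covers both models: passing input_shape=(80, 45, 3) yields the smaller variant described in the text.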
The initial input size shown in Figure 2 refers to the larger input resolution (i.e., 225 × 264) of the augmented dataset, used to train the larger model. The same architecture, with the smaller image resolution (i.e., 80 × 45), was used to train the smaller model. The number of epochs used to train the network was 50 for all experiments (including the baseline). The training, validation, and testing experiments were performed on the augmented dataset, split into 60% training, 15% validation, and 25% testing, as shown in Table 2. Deep neural networks are trained with the stochastic gradient descent optimisation algorithm, in which the error for the current state of the network is repeatedly estimated. This means that an error function (known as the loss function) must be defined to estimate the loss of the model at each training iteration, so that the weights can be updated to reduce the loss on the next evaluation. More importantly, the chosen loss function must be appropriate for the modelling task, in our case classification, and the output layer configuration must match the chosen loss function [67].
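A minimal sketch of the 60/15/25 split follows, using scikit-learn's train_test_split; the images and labels arrays, the stratification, and the random seed are assumptions for illustration.

```python
from sklearn.model_selection import train_test_split

# `images` is a hypothetical array of augmented samples and `labels` holds
# integer class labels (e.g., 0 = organic, 1 = recyclable). First carve off
# the 25% test set, then split the remaining 75% into 60% training and
# 15% validation (0.15 / 0.75 = 0.2 of the remainder).
X_rest, X_test, y_rest, y_test = train_test_split(
    images, labels, test_size=0.25, stratify=labels, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.2, stratify=y_rest, random_state=42)
```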
For the bespoke CNN architecture implemented in this paper, we used Keras built-in methods [35] for evaluation, including the accuracy metric and the loss function. Specifically, we used cross-entropy, the default loss function for classification problems; in Keras, this is specified by compiling the model with the categorical cross-entropy loss. Cross-entropy produces a score that summarises the average difference between the actual and predicted probability distributions over all classes. The score is minimised during training, and a perfect cross-entropy value is 0. For model optimisation, we used the Adadelta algorithm with a learning rate of 1.0 to match the exact form in Zeiler's original paper [68]. We specified accuracy and loss as the performance metrics.
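Continuing the earlier sketches, the compile and fit steps described here might look as follows; the loss, optimiser, learning rate, metrics, and epoch count follow the text, while the batch size is an assumption.

```python
from tensorflow.keras.optimizers import Adadelta
from tensorflow.keras.utils import to_categorical

# Compile as described: categorical cross-entropy loss, Adadelta with a
# learning rate of 1.0 (matching Zeiler's original formulation), and
# accuracy as the reported metric.
model = build_waste_cnn()
model.compile(optimizer=Adadelta(learning_rate=1.0),
              loss="categorical_crossentropy",
              metrics=["accuracy"])

# Labels are one-hot encoded to match the 2-way Softmax output; the batch
# size is assumed. 50 epochs as stated in the text.
history = model.fit(X_train, to_categorical(y_train, 2),
                    validation_data=(X_val, to_categorical(y_val, 2)),
                    epochs=50, batch_size=32)
test_loss, test_acc = model.evaluate(X_test, to_categorical(y_test, 2))
```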
The baseline model used for comparison was implemented by simply replacing the Softmax output probabilities from the CNNs with randomly generated floating point values between 0 and 1. This was repeated 50 times for each experiment to mimic the number of epochs used in the CNN experiments. We did not report development time and model size for the baseline model because such measures are neither practical nor meaningful for a random classifier.
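A sketch of this baseline, under the assumption that the random outputs are thresholded at 0.5 to produce class predictions, could look like this:

```python
import numpy as np

rng = np.random.default_rng()

def random_guess_accuracy(y_true, n_repeats=50):
    """Replace the Softmax outputs with uniform random floats in [0, 1),
    threshold them at 0.5 (an assumption) to obtain class predictions,
    and repeat 50 times to mirror the 50 epochs of the CNN runs."""
    scores = []
    for _ in range(n_repeats):
        y_pred = (rng.random(len(y_true)) >= 0.5).astype(int)
        scores.append(np.mean(y_pred == np.asarray(y_true)))
    return float(np.mean(scores))  # expected near 0.5 (cf. the 50.05% reported)
```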

Results
In this section, we present the results obtained from the baseline model as well as the bespoke CNN architecture trained with small (80 × 45) and large (225 × 264) image resolutions. Aggregate measures obtained from evaluating the compiled models, namely training, validation, and testing accuracy and loss, were used for evaluation. For simplicity, the performance of the three models is reported together in Table 3, although the results are visualised separately for each model. As shown in Table 3, the small CNN model with 80 × 45 image resolution is lighter than the large model by 1.27 MB. The training time is also better for the small model (6.40 h), compared to the large model, which took 65.46 h to train. This is particularly important when considering the type of application to which the model will be deployed. For example, the small model would be suitable for embedded applications on low-cost devices with a low-resolution camera and/or limited memory size. On the other hand, the large model is memory demanding and would suit applications with a high-resolution camera and no memory size constraints. Computational cost calculation for the baseline model is impractical, so we did not compare the baseline with our approach in terms of development time and model size. Direct comparison with standard CNN architectures such as VGGNet [19], AlexNet [20], ResNet [21], and DenseNet [22] was also deemed unnecessary for the research presented in this paper for the following reasons:
1. The standard CNN architectures were developed as part of the annual ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [69], in which researchers compete to correctly detect and/or classify objects and scenes in a large database consisting of 14,197,122 images organised into 21,841 categories. As such, the pre-trained models are inherently very large.
2. Self-reported model size and development time vary among research studies due to differences in computer system specification, purpose of experiments, data size, etc. For example, Hang et al. [23] used nine standard CNN architectures (with some modifications, such as layer freezing) to classify plant leaf diseases and compared model size and training time. InceptionNet-v2 [46] produced the smallest model size of 45.1 MB within 2187.3 s. This is very fast when compared to the 6.40 h used to train our smaller bespoke CNN model, which is only 1.08 MB; however, their experiment was faster largely due to a higher system specification (i.e., an i7-8700k processor and 32 GB RAM, accelerated by two NVIDIA GTX 1080TI GPUs).

For these reasons, we believe that models trained with the standard CNN architectures are unlikely to result in lower model size and computational cost than our bespoke CNN models, even if they are modified and self-implemented with the experimental dataset used in this paper.
In terms of accuracy, our approach performed better than the baseline model, which produced 50.05% accuracy during training, validation, and testing. Thus, the emphasis in this section is on the comparative performance between the CNN models. The smaller CNN model is generally more accurate than the larger one during training, validation, and testing. Specifically, the smaller model is 6.72% more accurate during training and 4.69% more accurate during testing, although both models produced the same accuracy (79.21%) during validation. An important variation to note is the accuracy margin between training, validation, and testing per model, as large differences may indicate how generalisable (or not) a model is. The variation between validation and testing accuracy is minimal for both CNN models: the small model is 1.67% more accurate during testing than validation, while the large model degraded by 3.02% from validation to testing. The 'training to validation' and 'training to testing' variations are much higher for both models, which provokes an interesting discussion. For the large model, accuracy reduced by 10.35% from training to validation and 13.37% from training to testing. The small model exhibited a similar pattern, but with an even larger accuracy reduction from training to validation (17.07%) and training to testing (14.40%). These variations can be seen clearly in Figures 3 and 4 for the small and large models, respectively.
There are many reasons why CNN models exhibit this behaviour, ranging from over-fitting to model complexity; it could even be due to unrepresentative training, validation, and/or testing data. A model's ability to adapt properly to new data (i.e., generalisation) is highly influenced by how similar or dissimilar the unseen data, drawn from the same distribution, are from the data used to train the model. The loss function usually helps to unravel the reasons for fluctuating accuracy values from training to validation to testing. For example, the loss observed in the small model increased sharply from training (0.1073) to validation (2.1885) to testing (5.4401). In contrast, the loss observed in the large model increased from training (0.2954) to validation (0.7083), but decreased to 0.5692 during testing. These observations are certainly not the classic case of 'loss decreases while accuracy increases', and the various reasons for this are discussed further in Section 5.

Discussion
In order to understand the fluctuations in loss and accuracy values observed in Section 4, it is important to explain the relationship between loss and accuracy. Intuitively, loss and accuracy are believed to be inversely correlated, where lower loss and higher accuracy should lead to better predictions. However, this is not the case in Table 3, where the smaller model, with higher loss, produced higher accuracy and better predictions than the larger model. Although surprising, this is not unheard of, as loss and accuracy are not necessarily exactly inversely correlated. While loss measures the divergence between class labels (0 or 1) and the raw prediction (typically a float), accuracy measures the difference between class labels and the thresholded prediction (also represented as 0 or 1). Therefore, loss changes continuously with the raw prediction, but accuracy is more 'resilient' because a raw prediction has to cross a certain threshold to actually affect the accuracy value.
To put this into context, consider the binary classification performed in the experiment, where the task is to predict whether an image is organic or recyclable waste. The raw prediction of the CNN is a sigmoid output (a float between 0 and 1), and the CNN is trained to output 1 if the image belongs to organic waste and 0 otherwise. In the results shown in Figure 3, two phenomena are happening at the same time, i.e., the classic 'loss decreases while accuracy increases' and the less classic 'loss increases while accuracy stays the same'. In the former, some images with borderline predictions are predicted better, and so the output class changes (e.g., an organic image whose prediction was 0.4 becomes 0.6). This is the classic 'loss decreases while accuracy increases' behaviour that is usually expected. However, some images with very poor predictions may continue to worsen (e.g., an organic image whose prediction was 0.3 becomes 0.2). This leads to the less classic 'loss increases while accuracy stays the same' phenomenon. It is also important to note that when cross-entropy loss is used for classification (as is the case with the experiments presented in this paper), bad predictions are penalised much more strongly than good predictions are rewarded. Thus, for an organic image, the loss is −log(prediction), which means that even if many organic images are correctly predicted (low loss), a single misclassified organic image will have a high loss, disproportionately increasing the mean loss. This phenomenon has been illustrated by other researchers [70] to show that increasing loss with stable accuracy can also be caused by good predictions being classified a little worse.
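To make this asymmetry concrete, the following small sketch (assuming natural-log cross-entropy) shows how the per-image penalty −log(prediction) grows as an organic image's prediction worsens:

```python
import numpy as np

# For a true 'organic' image (label 1), the per-image loss is -log(prediction).
# Slightly-wrong predictions cost little; one confidently-wrong prediction
# dominates the mean loss while leaving thresholded accuracy unchanged.
for p in (0.9, 0.6, 0.4, 0.05):
    print(f"prediction={p:.2f}  loss={-np.log(p):.3f}")
# prediction=0.90  loss=0.105
# prediction=0.60  loss=0.511
# prediction=0.40  loss=0.916
# prediction=0.05  loss=2.996
```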
The second phenomenon (loss increases while accuracy stays the same) is more likely the case with the large model and less likely with the small model, due to the level of 'loss to accuracy' asymmetry observed in Figure 3 vs. Figure 4. For the small CNN model shown in Figure 3, both accuracy and loss are increasing, which may indicate that the CNN is starting to over-fit, especially because both phenomena are happening at the same time. Specifically, the CNN seems to be learning patterns only relevant to the training set that are not great for generalisation (the less classic phenomenon), where some images from the validation set are being predicted very wrongly, which has an amplified effect on the 'loss asymmetry'. At the same time, the CNN is still learning some patterns which are useful for generalisation (the classic phenomenon), as more images are being correctly classified. In such cases, dropout usually helps in generalising the model. The bespoke CNN architecture presented in this paper uses a dropout of 0.5 (for both large and small models), which means that 50% of the network neurons are dropped during training, whereas all neurons are used for validation. A less aggressive dropout below 0.5 may bring the training and validation loss much closer, thus making the model more accurate during testing. That said, the asymmetry observed in the small model (Figure 3) may also be due to other factors, such as model complexity and unrepresentative validation data. In the former case, the model may be too complicated for the task, and a reduction in the depth (number) of layers may resolve the problem. In the latter case, the training data may be unrepresentative of the validation data. The recommended solutions to these problems are, respectively, to randomise the training, validation, and testing data split, and to increase the experimental dataset. These are unlikely to be the cause of the problem in our experiments, because both recommendations have already been applied: the split (60% training, 15% validation, and 25% testing) was randomly drawn from the experimental data, and augmentation was used to increase the original dataset from 24,705 to 345,870 images, as reported in Table 1. A possible explanation from the literature (not taken into consideration in our experiments) is that some researchers [18] have discredited the original dataset as relatively confusing, with some images either mislabelled and/or carrying ambiguous labels between organic and recyclable. This, and other factors related to reducing model complexity and tuning the dropout rate, will be investigated in future research.
The larger model (although less accurate) seems more generalisable, based on the observations in Figure 4. This CNN exhibited only the less classic 'loss increases while accuracy stays the same' behaviour. Specifically, the CNN peaked at epoch 9, with a training loss of 0.1591 and accuracy of 0.9432, and a validation loss of 0.7830 and accuracy of 0.8314. It is very common for researchers to apply early stopping at this point, leading to higher reported performance. However, both training and validation loss appear to be rising at this point, which means that the model was probably over-fitting. The model seemed to stabilise around epoch 19 (training loss: 0.3558, training accuracy: 0.8712, validation loss: 0.6905, and validation accuracy: 0.8074), with occasional spikes in loss and accuracy. Indeed, the validation loss and accuracy values obtained at this point are much closer to the overall results obtained during testing, as shown in Table 3. This indicates that the large model is more robust and generalisable than the small model. Therefore, future research and experimentation are required to improve the effectiveness and generalisability of the small model in order to achieve a balance between loss and accuracy.

Conclusions
We have investigated the automation of waste classification by evaluating the performance of a bespoke CNN architecture trained on two different image resolutions of Sekar's [24] dataset, available on Kaggle. We acknowledge that several research works in this area have utilised the same dataset, but none, to our knowledge, has been explicit about its experimental setup or provided source code that allows the results to be replicated. As such, we implemented a random guess classifier, which forms the baseline against which the performance of our approach was compared. As the task is a binary classification problem involving a modest dataset of 24,705 images, we used augmentation to increase the dataset to 345,870 images. We investigated the performance of two image resolutions of the dataset (large model: 225 × 264; small model: 80 × 45) and compared the results. Our experiments show that the bespoke CNN performed better than the baseline classifier. The small model is lighter than the large model by 1.27 MB, and its training time is also better (6.40 h) compared to the large model, which took 65.46 h to train. This means that the small model would be suitable for embedded applications on low-cost devices with a low-resolution camera and/or limited memory size. On the other hand, the large model is memory demanding and would suit applications with a high-resolution camera and no memory size constraints.
In terms of accuracy, the small model also performed better than the large one, but the large model seems more generalisable. The results obtained with the small model might simply signal issues with model complexity and/or the veracity of the original data. For example, the model might be too complex for the classification task, and literature evidence suggests that some images in the original data are either mislabelled and/or ambiguously labelled between organic and recyclable. In future work, a deeper analysis will be performed on the experimental data to identify and remove any mislabelled images before re-training the model. We will also consider various parameter tuning options, such as increasing the number of epochs and using a less aggressive dropout below the 50% currently in place.
Author Contributions: Conceptualisation, formal analysis, investigation, resources, writing-original draft preparation, review and editing, supervision, and project administration, N.N.; methodology, software, validation, data curation, and visualisation, J.B. and J.P. All authors have read and agreed to the published version of the manuscript.