Article

Image-Based Waste Classification Using a Hybrid Deep Learning Architecture with Transfer Learning and Edge AI Deployment

1 Faculty of Electrical Engineering and Computer Science, University of Maribor, SI-2000 Maribor, Slovenia
2 MIPS Programska Oprema d.o.o., SI-2000 Maribor, Slovenia
* Author to whom correspondence should be addressed.
Mathematics 2026, 14(7), 1176; https://doi.org/10.3390/math14071176
Submission received: 2 February 2026 / Revised: 16 March 2026 / Accepted: 28 March 2026 / Published: 1 April 2026
(This article belongs to the Special Issue The Application of Deep Neural Networks in Image Processing)

Abstract

Growing amounts of municipal waste and the need for efficient recycling demand automated and accurate classification systems. This paper investigates deep learning approaches for multi-class waste sorting based on image data, comparing three widely used convolutional neural network architectures (ResNet-50, EfficientNet-B0, and MobileNet V3) with a custom hybrid model (CustomNet). The dataset comprises 13,933 RGB images across 10 waste categories, combining publicly available samples from the Kaggle Garbage Classification dataset (61.1%) with images collected in-house (38.9%). The three glass sub-categories (brown, green, and white glass) were merged into a single glass class to ensure consistent class representation across all dataset splits. Preprocessing steps include normalization, resizing, and extensive data augmentation to improve robustness and mitigate class imbalance. Transfer learning is applied to pretrained models, while CustomNet integrates feature representations from multiple backbones using projection layers and attention mechanisms. Performance is evaluated using accuracy, macro-F1, and ROC–AUC on a held-out test set. Statistical significance was assessed using paired t-tests and Wilcoxon signed-rank tests with Bonferroni correction across five-fold cross-validation runs. The results show that CustomNet achieves 97.79% accuracy, a macro-F1 score of 0.973, and a ROC–AUC of 0.992. CustomNet significantly outperforms EfficientNet-B0 and MobileNet V3 (p < 0.001, Bonferroni corrected), and it achieves performance parity with ResNet-50 (p = 0.383) at a substantially lower parameter count in the classification head (9.7 M vs. 25.6 M). These findings indicate that combining multiple feature extractors with attention mechanisms improves classification performance, supports qualitative model explainability via saliency visualization (Grad-CAM), and enables practical deployment on heterogeneous Edge AI platforms. Inference benchmarking on an NVIDIA Jetson Orin Nano demonstrated real-world deployment feasibility at 86.70 ms per image (11.5 FPS).

1. Introduction

Global waste generation exceeded 2.24 billion tons in 2020 and is projected to increase by more than 70% by 2050, reaching 3.9 billion tons annually, posing severe environmental and economic challenges [1]. Improper waste management contributes to resource depletion, pollution, and greenhouse-gas emissions, highlighting the need for automated data-driven solutions aligned with the principles of a circular economy [2]. Traditional classification systems remain largely manual, slow, and error-prone, while the visual complexity of waste (high intra-class variability and inter-class similarity) makes reliable automation difficult [3,4].
Recent advances in computer vision and deep learning have enabled robust image-based classification. Convolutional neural networks (CNNs) such as AlexNet [5], VGGNet [6], Inception [7], ResNet [8], DenseNet [9], and later EfficientNet [10] and MobileNetV3 [11] have demonstrated strong performance in general image recognition and environmental applications [12,13,14]. However, waste classification introduces unique challenges: limited labeled datasets, strong texture overlap among materials, and the need for real-time inference on resource-constrained devices [15,16].
Recent progress in attention mechanisms and hybrid CNN–Transformer architectures has reshaped computer vision [17,18,19]. Self-attention enables global contextual reasoning beyond local receptive fields, improving explainability and robustness [20,21]. Despite this, applications of attention-based fusion for waste classification remain scarce, and existing studies seldom report statistical significance, ablation analysis, or deployment metrics, which are critical for sustainable industrial practice [3,4,15].
This paper addresses these gaps by proposing a hybrid attention-fusion architecture (CustomNet) that integrates multiple CNN backbones through projection layers and multi-head self-attention. Using a dataset of 13,933 images across 10 waste categories, we compare CustomNet with ResNet-50, EfficientNet-B0, and MobileNet V3 under transfer-learning settings. Our contributions include (i) a hybrid attention-fusion model that combines three complementary CNN backbones through projection layers and multi-head self-attention, optimized for both accuracy and deployment efficiency; (ii) a comprehensive evaluation with statistical significance testing and Bonferroni-corrected pairwise comparisons; and (iii) an explainability analysis via Grad-CAM saliency visualization across all backbone branches. The findings demonstrate that attention-based fusion significantly improves classification performance over lightweight baselines (EfficientNet-B0, MobileNet V3) and achieves performance parity with ResNet-50 at a lower task-specific parameter count, while supporting practical deployment on the NVIDIA Jetson Orin Nano at 11.5 FPS.
The main contributions of this paper are as follows:
  • We propose a deep-learning-based image classification framework tailored for the considered task, combining systematic preprocessing, model training, and evaluation within a unified pipeline. The framework is evaluated on a cleaned and re-split dataset of 13,933 RGB images across 10 waste categories, combining public and in-house collected data.
  • We perform a comprehensive quantitative evaluation using accuracy, macro-F1, and ROC–AUC, which is supported by rigorous statistical validation: five-fold cross-validation, paired hypothesis testing with Bonferroni correction, corrected Cohen’s d effect sizes, and bootstrap 95% confidence intervals.
  • We incorporate model explainability through Grad-CAM saliency visualization across all backbone branches, enabling a qualitative analysis of discriminative regions and supporting transparent model behavior in sustainability-critical applications.
  • We provide an in-depth comparative analysis of the proposed hybrid model against multiple baseline architectures, discussing strengths, limitations, and practical implications. The ablation study reveals that the multi-head attention module is a fundamental rather than optional component, as its removal causes a collapse to near-random performance.
  • We demonstrate deployment feasibility on an NVIDIA Jetson Orin Nano, reporting latency and throughput benchmarks for all models, and evaluate robustness under Gaussian noise and brightness perturbations.
Methodologically, this paper employs the following mathematical and statistical tools. On the model side, we use (i) multi-head scaled dot-product self-attention, $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}(QK^\top/\sqrt{d_k})\,V$, to aggregate inter-backbone feature correlations; (ii) linear projection with batch normalization and GELU activation to map the concatenated 4288-dimensional backbone representation to a compact 1024-dimensional embedding; and (iii) Focal Loss [22] to down-weight easy examples and address class-frequency imbalance during training. On the evaluation side, we use (iv) five-fold cross-validation with five independent random seeds (25 macro-F1 observations per model) to estimate generalization variance; (v) Shapiro–Wilk normality testing to select between parametric (paired t-test) and non-parametric (Wilcoxon signed-rank) pairwise comparisons; (vi) Bonferroni correction ($\alpha = 0.05/n_{\text{comparisons}}$) for multiplicity control across all six model pairs; (vii) corrected Cohen's d (pooled standard deviation) as a standardized effect-size measure; and (viii) bootstrap resampling ($N = 10{,}000$) to construct 95% confidence intervals for macro-F1 on the held-out test set. The macro-averaged F1 is defined as $F1_{\text{macro}} = \frac{1}{C}\sum_{i=1}^{C} \frac{2\,\mathrm{TP}_i}{2\,\mathrm{TP}_i + \mathrm{FP}_i + \mathrm{FN}_i}$, providing balanced evaluation across all $C = 10$ waste categories. Full derivations and numerical results are presented in Section 3 and Section 4.
The remainder of this paper is organized as follows. Section 2 reviews the related work. Section 3 describes the materials and methods, including the data preparation, model architecture, and experimental setup. Section 4 presents the results and discussion, incorporating both quantitative evaluation and explainability analysis. Finally, Section 5 concludes the paper and outlines directions for future work.

2. Related Work

The automation of waste sorting has long been studied within computer vision. Before the advent of deep learning, systems relied on hand-crafted features such as color histograms, Local Binary Patterns (LBPs), Gabor filters, Scale-Invariant Feature Transform (SIFT), and Histogram of Oriented Gradients (HOG) combined with classifiers like Support Vector Machines (SVMs), Random Forests, or k-Nearest Neighbor (k-NN) [23,24,25]. While these approaches achieved moderate accuracy under controlled conditions, they were highly sensitive to illumination, viewpoint, and background clutter, limiting generalization to real-world waste streams.
The breakthrough in image recognition came with deep convolutional neural networks (CNNs). AlexNet demonstrated the power of multi-layer convolutions trained on GPUs [5], followed by VGGNet [6], GoogLeNet/Inception [7], and ResNet [8], which introduced residual connections to enable very deep networks. DenseNet further improved gradient flow through dense connectivity [9]. Efficiency-focused architectures such as EfficientNet [10] and MobileNet V3 [11] optimized the accuracy–efficiency trade-off, making CNNs suitable for edge-AI applications [16]. These models have been widely adopted in environmental informatics, including data-driven Earth system modelling [12], environmental remote sensing [13], and agricultural monitoring [14].
Despite their success, CNNs primarily capture local spatial correlations and struggle with global context modeling. Attention mechanisms address this limitation by re-weighting feature responses based on relevance. Lightweight modules such as Squeeze-and-Excitation (SE) [26] and CBAM [27] enhance representational power with minimal overhead. More recently, Vision Transformers (ViT) introduced self-attention for global reasoning [19], inspiring hybrid CNN–Transformer architectures like Swin Transformer [17], ConvNeXt [18], and designs that marry convolution with attention at every network stage [21,28]. These hybrids combine local inductive bias with global attention, improving performance on complex visual tasks, though their application to waste classification remains limited [20].
Early benchmarks for recyclable waste recognition include TrashNet [29], which contained 2527 images across six classes. Subsequent datasets such as TACO [30], RecycleNet [31], and OpenLitterMap [32] introduced more realistic conditions and diverse materials. Most contemporary studies employ transfer learning from ImageNet-trained CNNs. Adedeji and Wang [4] used ResNet-50 as a feature extractor combined with SVM, achieving 87% accuracy on the TrashNet dataset. Fu et al. [15] demonstrated that deep learning models deployed on an embedded Linux platform can achieve over 97% accuracy with real-time throughput, enabling practical deployment on resource-constrained devices. However, these works typically rely on single-backbone CNNs and rarely report statistical significance, ablation studies, or energy-efficiency metrics, which are critical for sustainable industrial practice [2,16].
Recent research extends waste recognition to object detection and segmentation for real-time operation. Improved YOLOv5 variants with attention modules have been applied to construction waste sorting [33], and RGB–near-infrared fusion with YOLOv5 has been used for on-shore plastic waste detection [34], while SSD [35] and Mask R-CNN [36] provide lightweight detection and pixel-level segmentation. Robust multi-class recyclable waste classification models using customized deep architectures have also been proposed [37]. Persistent challenges include small, imbalanced datasets and a lack of standardized evaluation protocols. Techniques such as data augmentation [38], Focal Loss [22], and synthetic data generation via GANs [39] are commonly used to mitigate these issues, but reproducibility and cross-dataset generalization remain limited.
A detailed quantitative comparison of CustomNet against representative prior works, including deployment context, statistical validation, and pruning/quantization information, is provided in the Discussion (Section 4).
Hybrid and ensemble deep-learning systems have gained traction in waste recognition due to the visual complexity of recyclable materials and heterogeneous acquisition conditions. In contrast to single-backbone approaches, hybrid designs aim to combine complementary inductive biases (local texture sensitivity from convolutions and broader semantic context via attention), while ensembles aggregate diverse predictors to improve reliability in cluttered scenes and under variable lighting.

2.1. Hybrid and Ensemble Models in Waste Recognition

Several studies demonstrate the benefits of combining multiple CNN backbones or introducing attention-based fusion. Ahmad et al. [40] proposed an intelligent fusion of deep features from multiple CNN architectures for waste classification, demonstrating that feature-level fusion outperforms individual backbones on the TrashNet dataset. Chu et al. [20] introduced a multilayer hybrid deep-learning method for waste classification and recycling that combines multiple network layers with attention-like feature re-weighting, improving recognition across diverse waste categories. Beyond pure CNN designs, Dai et al. [21] proposed CoAtNet, which systematically combines convolution and self-attention at every stage of the network, merging convolutional locality with transformer-style global reasoning and achieving strong performance across data scales. Knowledge distillation provides an additional axis for efficiency: a large ResNet “teacher” can guide a compact MobileNet “student,” approaching teacher-level accuracy with lower compute [41]. Despite promising accuracy gains, many ensemble studies report a limited analysis of computational trade-offs (FLOPs, parameters, latency, energy), leaving open questions about embedded viability.

2.2. Edge AI and Sustainable Deployment

Environmental sustainability concerns extend to the energy footprint of AI systems themselves. Large networks incur substantial training and inference costs, motivating edge AI: the deployment of lightweight, quantized, or pruned models directly on embedded hardware. Wang et al. [16] developed a smart municipal waste management system combining deep learning with Internet of Things infrastructure, demonstrating feasibility for automated waste sorting in urban environments. Cai et al. [42] proposed ProxylessNAS, a hardware-aware neural architecture search method that directly optimizes architectures on target hardware, enabling efficient model compression while retaining a high fraction of baseline accuracy, and Wu et al. [43] introduced FBNet, which integrates latency and power constraints into differentiable neural architecture search to discover hardware-optimal CNNs for resource-constrained devices. From a systems perspective, edge deployment reduces bandwidth by processing locally, improves privacy because images remain on-device, and increases reliability under intermittent connectivity. Nevertheless, few waste-sorting papers provide comprehensive deployment metrics (accuracy, latency, throughput, power), which are critical for aligning AI development with sustainable-computing goals [2,44].

2.3. Synthesis and Research Gap

The literature shows that deep CNNs and hybrid CNN–Transformer paradigms can substantially improve waste-classification accuracy. However, four gaps persist:
  • Limited exploration of multi-backbone fusion tailored to material classification in realistic sorting contexts; most works rely on single CNNs.
  • Scarce deployment metrics (FLOPs, latency, power), hindering the assessment of embedded viability in edge scenarios [16].
  • Insufficient statistical rigor: many studies report point estimates without hypothesis testing or confidence intervals.
  • Weak explainability practices: saliency analyses (e.g., Grad-CAM) are seldom used to substantiate predictions in sustainability-critical domains [44].
To address these gaps, we design and evaluate a hybrid attention-fusion architecture (CustomNet) that integrates three complementary backbones (ResNet-50, EfficientNet-B0, MobileNet V3) through a projection block and multi-head self-attention. Our methodology includes comprehensive statistical validation and computational profiling to balance accuracy, explainability, and deployment efficiency, forming a reproducible baseline for smart waste-sorting systems.

3. Materials and Methods

This section describes the dataset composition, preprocessing pipeline, model architectures, and training protocol used in this paper. The methodology was designed to ensure reproducibility and fair comparison across baseline CNNs and the proposed hybrid attention-fusion model. We also outline evaluation metrics and statistical validation procedures to assess both predictive performance and computational efficiency.

3.1. Dataset Description

The experimental analysis was conducted on a dataset of 13,933 RGB images divided into 10 classes: battery, biological, cardboard, clothes, glass, metal, paper, plastic, shoes, and trash. The corpus is based on the Kaggle Garbage Classification dataset [45] (8501 images, 61.1%), which was supplemented with 5415 images collected in-house (38.9%) to augment underrepresented categories and increase environmental diversity. The three original glass sub-categories (brown glass, green glass, and white glass) were merged into a single glass class to ensure consistent class representation across all dataset splits. Corrupted and duplicate images were removed using SHA-1 hash deduplication prior to splitting.
In-house images were captured using a digital camera under controlled LED illumination with diffusers to minimize shadowing. Objects were positioned on neutral, non-reflective backgrounds. The image resolution averaged 4032 × 3024 pixels and was standardized to 224 × 224 pixels during preprocessing. The public portion contributes outdoor, cluttered scenes with variable lighting, ensuring that both controlled and natural acquisition conditions are represented.
The dataset was partitioned into training (80%), validation (10%), and test (10%) subsets using stratified sampling with a fixed random seed (1337) to ensure reproducibility and preserve class proportions across all splits. Table 1 reports the per-class image counts for each split. The largest class (cardboard, 2,258 images) is 3.3 times larger than the smallest (metal, 679 images), representing a moderate class imbalance (illustrated in Figure 1) that motivates the use of Focal Loss during training.
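The stratified 80/10/10 partition with a fixed seed can be sketched as follows. This is an illustrative reconstruction using scikit-learn's `train_test_split` with synthetic labels, not the authors' released script:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels: 10 classes, mimicking the 13,933-image corpus
rng = np.random.default_rng(1337)
labels = rng.integers(0, 10, size=13933)
indices = np.arange(len(labels))

# First carve out the 80% training split, stratified by class
train_idx, rest_idx = train_test_split(
    indices, test_size=0.2, stratify=labels, random_state=1337)

# Split the remaining 20% evenly into validation and test (10% each overall)
val_idx, test_idx = train_test_split(
    rest_idx, test_size=0.5, stratify=labels[rest_idx], random_state=1337)
```

Stratifying both splits keeps the per-class proportions consistent across all three subsets, as required for the per-class counts reported in Table 1.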

3.2. Preprocessing and Data Augmentation

Each image was resized to 224 × 224 px and normalized to the [0, 1] range using ImageNet channel statistics (mean = [0.485, 0.456, 0.406]; std = [0.229, 0.224, 0.225]). The preprocessing included contrast-limited adaptive histogram equalization (CLAHE) for illumination correction and Gaussian filtering (σ = 1.0) for noise reduction. CLAHE was applied to reduce the illumination gap between the outdoor, variable-lighting Kaggle images and the controlled in-house captures; Gaussian filtering suppresses high-frequency sensor noise present in the lower-resolution in-house images. Both steps were applied uniformly to the full corpus before splitting. The no-augmentation ablation (ΔF1 = −0.002, not significant) confirms that these preprocessing steps do not adversely affect texture-based discrimination for the waste classes in our corpus. An extensive on-the-fly augmentation pipeline was implemented to improve generalization (Table 2).
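The normalization step can be illustrated with a minimal NumPy sketch (resizing, CLAHE, and Gaussian filtering, typically done with OpenCV, are omitted here):

```python
import numpy as np

IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def normalize(image_uint8: np.ndarray) -> np.ndarray:
    """Scale an HxWx3 uint8 image to [0, 1], then apply per-channel
    ImageNet statistics (subtract mean, divide by std)."""
    x = image_uint8.astype(np.float32) / 255.0
    return (x - IMAGENET_MEAN) / IMAGENET_STD

img = np.full((224, 224, 3), 128, dtype=np.uint8)  # dummy mid-gray image
out = normalize(img)
```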
Augmentation was applied only to the training set. An ablation study demonstrated that combined geometric + photometric transformations improved the macro-F1 score by around 12%, and advanced methods (CutMix + MixUp) added another 2–3%. These results are consistent with the findings of Mao et al. [46] and Shorten and Khoshgoftaar [38].
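As an illustration of the advanced augmentations, the following is a minimal MixUp sketch in PyTorch; the helper and the value α = 0.2 are illustrative assumptions, and the paper's exact CutMix/MixUp settings may differ:

```python
import torch

def mixup(images, labels_onehot, alpha=0.2):
    """MixUp: a convex combination of random image pairs and their
    one-hot labels, with the mixing weight drawn from Beta(alpha, alpha)."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(images.size(0))
    mixed_x = lam * images + (1.0 - lam) * images[perm]
    mixed_y = lam * labels_onehot + (1.0 - lam) * labels_onehot[perm]
    return mixed_x, mixed_y, lam

x = torch.rand(8, 3, 224, 224)  # dummy batch in [0, 1]
y = torch.nn.functional.one_hot(torch.randint(0, 10, (8,)), 10).float()
mx, my, lam = mixup(x, y)
```

Because the mixed labels remain convex combinations of one-hot vectors, each row still sums to 1 and can be trained against directly with a soft-target loss.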

3.3. Model Architectures

We evaluated three baseline CNNs and a custom hybrid attention-fusion model.

3.3.1. Baseline

Three convolutional backbones served as baselines and feature-extractor components (Table 3).
The weights were initialized from ImageNet and fine-tuned end to end. Lower convolutional blocks were frozen for the first five epochs to stabilize training before full unfreezing.
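The warm-up freezing schedule amounts to toggling `requires_grad` on the backbone parameters; the backbone below is a tiny stand-in for illustration, not the actual pretrained networks:

```python
import torch.nn as nn

# Stand-in for lower convolutional blocks (frozen during warm-up)
backbone = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
)
head = nn.Linear(32, 10)  # classification head: always trainable

def set_backbone_trainable(trainable: bool):
    for p in backbone.parameters():
        p.requires_grad = trainable

# Freeze the backbone for the first five epochs, then fully unfreeze
for epoch in range(20):
    set_backbone_trainable(epoch >= 5)
    # ... one training epoch would run here ...
```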

3.3.2. Hybrid Attention Fusion Model (CustomNet)

To exploit complementary representational biases, a hybrid architecture named CustomNet was designed. Feature vectors extracted from the three pretrained backbones (ResNet-50 ($D_1 = 2048$), EfficientNet-B0 ($D_2 = 1280$), and MobileNetV3 ($D_3 = 960$)) were concatenated to form a combined vector of size $D_1 + D_2 + D_3 = 4288$. This combined feature representation was then projected into a 1024-dimensional latent space using a linear transformation followed by batch normalization, GELU activation, and dropout:
$$H = \mathrm{GELU}\big(\mathrm{BN}(W_p\,[f_1; f_2; f_3])\big),$$
where $f_1$, $f_2$, and $f_3$ denote the feature vectors extracted from ResNet-50, EfficientNet-B0, and MobileNetV3, respectively, and $[f_1; f_2; f_3]$ denotes their concatenation. The projection weight matrix $W_p \in \mathbb{R}^{1024 \times (D_1 + D_2 + D_3)}$ maps the concatenated feature vector to a 1024-dimensional embedding. $\mathrm{BN}(\cdot)$ denotes batch normalization applied to the projected features, and $\mathrm{GELU}(\cdot)$ is the Gaussian Error Linear Unit activation function. The resulting 1024-dimensional latent representation is denoted by $H$.
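In PyTorch, this projection block can be sketched as follows (the Linear → BN → GELU ordering follows the formula, with dropout at rate 0.3 as stated later in this section):

```python
import torch
import torch.nn as nn

D1, D2, D3 = 2048, 1280, 960            # backbone feature dimensions
projection = nn.Sequential(
    nn.Linear(D1 + D2 + D3, 1024),      # W_p: 4288 -> 1024
    nn.BatchNorm1d(1024),               # BN over the projected embedding
    nn.GELU(),                          # Gaussian Error Linear Unit
    nn.Dropout(0.3),
)

# Dummy backbone outputs for a batch of 4 images
f1, f2, f3 = torch.randn(4, D1), torch.randn(4, D2), torch.randn(4, D3)
H = projection(torch.cat([f1, f2, f3], dim=1))  # [f1; f2; f3] -> H
```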
A subsequent multi-head self-attention (MHA) module with eight heads captured inter-backbone correlations:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right) V,$$
where $Q, K, V \in \mathbb{R}^{n \times d_k}$ denote the query, key, and value matrices derived from $H$, and $d_k$ is the dimensionality of the key vectors used for scaling. The $\mathrm{softmax}(\cdot)$ function ensures normalized attention weights across all positions.
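A minimal NumPy implementation of the scaled dot-product attention equation (a single head, for illustration; the model itself uses eight heads):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, as in the equation above."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 64))
K = rng.normal(size=(3, 64))
V = rng.normal(size=(3, 64))
out, w = scaled_dot_product_attention(Q, K, V)
```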
The attended representation was passed through a classifier composed of three fully connected layers with dimensions $1024 \to 768 \to 512 \to 10$, using GELU activations and dropout (rate = 0.3). The final output corresponds to the 10 waste categories. Training employed Focal Loss [22] to down-weight easy examples and emphasize minority classes. The total parameter count of the deployed CustomNet model is approximately 40.3 million (ResNet-50: 23.5 M, EfficientNet-B0: 4.0 M, MobileNetV3-Large: 3.0 M backbones + 9.7 M classification head), all of which were jointly optimized during training via full fine-tuning from ImageNet-pretrained weights. The backbone weights were frozen only during the first five warm-up epochs to stabilize training; then, they were fully unfrozen. The 9.7 M classification head (projection layer, multi-head attention, and MLP classifier) provides a meaningful architectural comparison with the 25.6 M total parameters of the standalone ResNet-50 baseline. Total FLOPs are approximately 1.2 G for the fusion head only and approximately 5.7 G including all backbone forward passes.
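Focal Loss can be sketched as follows; this is the standard formulation from the original paper without per-class α weighting, which may differ from the exact variant used here:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0):
    """Focal Loss: cross-entropy scaled by (1 - p_t)^gamma, which
    down-weights well-classified (easy) examples. gamma=0 recovers
    plain cross-entropy."""
    ce = F.cross_entropy(logits, targets, reduction="none")
    p_t = torch.exp(-ce)                     # probability of the true class
    return ((1.0 - p_t) ** gamma * ce).mean()

logits = torch.randn(16, 10)
targets = torch.randint(0, 10, (16,))
fl = focal_loss(logits, targets, gamma=2.0)
```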
Figure 2 illustrates the architecture of the proposed CustomNet model. The design integrates three pretrained CNN backbones to capture complementary feature representations. After concatenation and projection to a 1024-dimensional embedding, the multi-head attention module models global dependencies across fused features. Finally, the classifier outputs predictions for the 10 waste categories. This hybrid design combines local texture sensitivity with global contextual reasoning, improving classification accuracy and robustness compared to lightweight single-backbone baselines (EfficientNet-B0 and MobileNet V3) while achieving statistical performance parity with ResNet-50.

3.4. Training and Evaluation

All experiments were executed on the VEGA supercomputer (IZUM, Maribor, Slovenia), using one NVIDIA A100 40 GB GPU per job, 8 CPU cores, and 32 GB RAM (PyTorch 2.1.2, CUDA 12.1). Each model configuration was trained five times using different random seeds (0–4) to estimate variance and support statistical comparison. Five-fold cross-validation was additionally performed to assess generalization stability. Local development and script testing were performed on a workstation equipped with an Intel Core i7-4790 CPU and an NVIDIA GeForce GPU (4 GB VRAM) running Linux Mint 20.
Optimization parameters:
  • Optimizer: Adam ($\beta_1 = 0.9$, $\beta_2 = 0.999$);
  • Initial learning rate: $1 \times 10^{-4}$ with a cosine-annealing schedule;
  • Batch size = 32; epochs = 20; early-stopping patience = 5;
  • Weight decay = $1 \times 10^{-5}$; dropout = 0.3.
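These settings translate directly into PyTorch (the linear layer is a stand-in model; the scheduler anneals the learning rate over the 20 epochs):

```python
import torch

model = torch.nn.Linear(1024, 10)  # stand-in for the classifier head
optimizer = torch.optim.Adam(
    model.parameters(), lr=1e-4, betas=(0.9, 0.999), weight_decay=1e-5)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=20)

lrs = []
for epoch in range(20):
    # ... one training epoch would run here ...
    optimizer.step()   # placeholder step (no gradients in this sketch)
    scheduler.step()
    lrs.append(optimizer.param_groups[0]["lr"])
```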
Evaluation metrics included accuracy, macro-F1, macro ROC–AUC, and per-class precision–recall. Macro-averaging ensures balanced weighting across uneven classes:
$$F1_{\text{macro}} = \frac{1}{C} \sum_{i=1}^{C} \frac{2\,\mathrm{TP}_i}{2\,\mathrm{TP}_i + \mathrm{FP}_i + \mathrm{FN}_i},$$
where $C$ is the number of classes, and $\mathrm{TP}_i$, $\mathrm{FP}_i$, and $\mathrm{FN}_i$ denote the true positives, false positives, and false negatives for class $i$, respectively.
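The macro-F1 definition translates directly into code; a minimal NumPy sketch:

```python
import numpy as np

def macro_f1(y_true, y_pred, num_classes):
    """Macro-averaged F1 over C classes, per the definition above."""
    scores = []
    for c in range(num_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom > 0 else 0.0)
    return float(np.mean(scores))

# Small worked example: class 0 -> F1 = 2/3, class 1 -> F1 = 4/5
y_true = np.array([0, 0, 1, 1])
y_pred = np.array([0, 1, 1, 1])
score = macro_f1(y_true, y_pred, num_classes=2)
```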

3.5. Statistical Validation and Computational Profiling

To support rigorous statistical comparison, five-fold cross-validation was performed for all models, with five independent random seeds per fold, yielding 25 macro-F1 scores per model for paired hypothesis testing. The normality of the score distributions was assessed using the Shapiro–Wilk test. Where normality held ($p > 0.05$), paired t-tests were applied; otherwise, the non-parametric Wilcoxon signed-rank test was used. All pairwise comparisons were corrected for multiple testing using the Bonferroni method ($\alpha = 0.05/n_{\text{comparisons}}$). Effect sizes were computed as Cohen's d using the pooled standard deviation of both samples. Bootstrap resampling ($N = 10{,}000$) was applied to the test-set predictions to compute 95% confidence intervals for macro-F1.
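This validation pipeline can be sketched with SciPy. The scores below are synthetic stand-ins for the 25 observations per model, and for brevity normality is tested on the paired differences (the paper tests each model's distribution):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical 25 paired macro-F1 observations (5 folds x 5 seeds)
scores_a = rng.normal(0.973, 0.003, size=25)   # e.g., CustomNet
scores_b = rng.normal(0.965, 0.003, size=25)   # e.g., a baseline

# Shapiro-Wilk normality check selects the test family
diff = scores_a - scores_b
_, p_normal = stats.shapiro(diff)
if p_normal > 0.05:
    _, p_value = stats.ttest_rel(scores_a, scores_b)   # parametric
else:
    _, p_value = stats.wilcoxon(scores_a, scores_b)    # non-parametric

alpha_bonferroni = 0.05 / 6    # six pairwise comparisons among four models

# Cohen's d with the pooled standard deviation of both samples
pooled_sd = np.sqrt((scores_a.var(ddof=1) + scores_b.var(ddof=1)) / 2)
cohens_d = diff.mean() / pooled_sd

# Bootstrap (N = 10,000) 95% CI for the mean macro-F1 of model A
boot = rng.choice(scores_a, size=(10_000, 25), replace=True).mean(axis=1)
ci_low, ci_high = np.percentile(boot, [2.5, 97.5])
```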

3.6. Edge Deployment Considerations

To assess deployment feasibility in resource-constrained environments, all models were benchmarked on an NVIDIA Jetson Orin Nano Developer Kit Super (JetPack 36.4.7, PyTorch 2.8.0, CUDA 12.6, 7.4 GB RAM). This platform represents a current-generation edge AI device representative of practical smart-sorting deployments. Inference latency and throughput were measured using CUDA events over 200 timed iterations following 30 warmup passes at a batch size of 1 to reflect single-image real-time operation. All models were benchmarked at FP32 precision (standard single-precision floating point) without quantization or pruning, providing a fair baseline for edge performance comparison. INT8 and FP16 optimizations via TensorRT are identified as a direction for future work.
The full CustomNet architecture was deployed without modification on the Jetson Orin Nano, as the platform supports a recent PyTorch release with full torchvision compatibility.
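The benchmarking protocol (warmup passes, CUDA-event timing at batch size 1) can be sketched as follows; the CPU wall-clock fallback is added so the sketch runs anywhere, and the tiny stand-in model is hypothetical:

```python
import time
import torch

def benchmark(model, input_shape=(1, 3, 224, 224), warmup=30, iters=200):
    """Single-image latency benchmark: CUDA events on GPU (as on the
    Jetson), time.perf_counter fallback on CPU. Returns (ms, FPS)."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device).eval()
    x = torch.randn(*input_shape, device=device)
    with torch.no_grad():
        for _ in range(warmup):                 # warmup passes
            model(x)
        if device == "cuda":
            start = torch.cuda.Event(enable_timing=True)
            end = torch.cuda.Event(enable_timing=True)
            torch.cuda.synchronize()
            start.record()
            for _ in range(iters):
                model(x)
            end.record()
            torch.cuda.synchronize()
            total_ms = start.elapsed_time(end)
        else:
            t0 = time.perf_counter()
            for _ in range(iters):
                model(x)
            total_ms = (time.perf_counter() - t0) * 1000.0
    latency_ms = total_ms / iters
    return latency_ms, 1000.0 / latency_ms

# Tiny stand-in model, small iteration counts for illustration
tiny = torch.nn.Sequential(torch.nn.Flatten(),
                           torch.nn.Linear(3 * 224 * 224, 10))
latency, fps = benchmark(tiny, warmup=2, iters=10)
```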

4. Results and Discussion

This section presents quantitative and qualitative findings from the experimental evaluation. We report on the baseline performance, analyze the contribution of individual components through ablation studies, assess statistical significance, and discuss explainability, robustness, and practical deployment implications.

4.1. Overall Performance

Table 4 summarizes the average test results across five independent runs for ResNet-50, EfficientNet-B0, MobileNet V3-Small, and the proposed CustomNet. Metrics include accuracy, macro-F1, and macro ROC–AUC along with parameter and FLOP counts.
All models achieved high accuracy (>96%), confirming the effectiveness of transfer learning for waste imagery. Among all of the evaluated models, ResNet-50 achieves the numerically highest accuracy (97.93%) and macro-F1 (0.975). EfficientNet-B0 outperformed MobileNet V3 despite its smaller size, validating compound scaling for efficiency. MobileNet V3 delivered competitive results at minimal computational cost, making it suitable for edge deployment. CustomNet surpassed EfficientNet-B0 and MobileNet V3 by 0.5–1.2% in macro-F1, demonstrating the benefit of multi-backbone fusion and attention for complex material discrimination. CustomNet achieved performance comparable to ResNet-50 (macro-F1 difference of 0.002, p = 0.383, not significant after Bonferroni correction) while requiring substantially fewer task-specific parameters in the classification head (9.7 M vs. 25.6 M), demonstrating a favorable accuracy–efficiency trade-off.
Figure 3 and Figure 4 present ROC and precision–recall curves for all four models on the held-out test set. All models achieve macro AUC scores above 0.98, confirming strong discriminative power. PR curves provide a more informative view under class imbalance: CustomNet and ResNet-50 maintain high precision at high recall for minority classes (Metal, Battery), while MobileNet V3 shows the largest precision drop at high recall thresholds for these categories, which is consistent with its lower macro-F1 score.

4.2. Ablation Studies

Table 5 reports the impact of removing key components from CustomNet. Each factor (attention, feature fusion, augmentation, and Focal Loss) contributed measurably to performance.
The ablation results reveal a striking finding: removing the multi-head attention module causes a catastrophic performance collapse to near-random accuracy (10.82%), indicating that the attention mechanism is not an optional enhancement but a fundamental component of the fusion architecture. Without attention, the CLS token has no mechanism to aggregate information from the backbone feature tokens, rendering the classifier unable to learn from the image signal. This result was confirmed across all five seeds with identical hyperparameters, ruling out a training anomaly. Removing individual backbone branches has a smaller but statistically significant effect for the MobileNet and EfficientNet branches, while the ResNet-only variant achieves performance comparable to the full CustomNet (p > 0.05), which is consistent with ResNet-50 being the strongest individual backbone. Augmentation and Focal Loss show smaller individual contributions, reflecting the improved class balance after dataset cleaning.

4.3. Statistical Validation

Table 6 reports Shapiro–Wilk normality test results for all models. All distributions were found to be normal at $\alpha = 0.05$, with the exception of CustomNet without attention ($W = 0.896$, $p = 0.014$), for which the Wilcoxon signed-rank test was used in pairwise comparisons. For all other models, paired t-tests were applied. All p-values were corrected using the Bonferroni method ($\alpha_{\text{corrected}} = 0.05/6 = 0.0083$ for six pairwise comparisons among four models). Table 7 reports all six pairwise comparisons, including effect sizes and bootstrap confidence intervals.
Bootstrapped 95% confidence intervals for CustomNet macro-F1 were [0.969, 0.977], confirming stable performance across test-set resamples. CustomNet significantly outperforms EfficientNet-B0 ( d = 3.203 ) and MobileNet V3 ( d = 4.013 ) after Bonferroni correction. The comparison with ResNet-50 is not statistically significant ( p = 0.383 , d = 0.218 ), indicating performance parity. Given that CustomNet achieves this parity with a classification head of only 9.7 M task-specific parameters compared to ResNet-50’s 25.6 M total parameters, it represents a favorable accuracy–efficiency trade-off.
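The testing protocol above can be sketched as follows. The per-fold macro-F1 values in the snippet are illustrative placeholders, not the paper's measurements; the SciPy calls mirror the reported procedure (Shapiro–Wilk normality check, paired t-test with a Wilcoxon fallback, Bonferroni correction, and a percentile bootstrap confidence interval).

```python
import numpy as np
from scipy import stats

# Illustrative per-fold macro-F1 scores (5-fold CV) -- NOT the paper's data.
custom = np.array([0.972, 0.974, 0.971, 0.975, 0.973])
effnet = np.array([0.946, 0.949, 0.944, 0.948, 0.947])

def compare(a, b, alpha=0.05):
    """Paired t-test when both score distributions pass Shapiro-Wilk,
    otherwise fall back to the Wilcoxon signed-rank test."""
    normal = min(stats.shapiro(a).pvalue, stats.shapiro(b).pvalue) >= alpha
    test = stats.ttest_rel if normal else stats.wilcoxon
    return test(a, b).pvalue

n_comparisons = 6                      # all pairs among four models
alpha_corr = 0.05 / n_comparisons      # Bonferroni-corrected threshold (0.0083)
p = compare(custom, effnet)

# Percentile bootstrap 95% CI for the mean macro-F1.
rng = np.random.default_rng(0)
boots = [rng.choice(custom, size=custom.size, replace=True).mean()
         for _ in range(10_000)]
ci = np.percentile(boots, [2.5, 97.5])
```

Note that with only five folds the Wilcoxon test cannot reach p-values below 0.0625, which is one reason the paired t-test is preferred whenever normality holds.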

4.4. Qualitative Analysis

Figure 5 presents Grad-CAM saliency visualizations for all four models across all 10 waste classes. Heatmaps are computed with respect to the ground-truth class label rather than the predicted class, ensuring consistent and comparable explanations across models regardless of prediction outcome [47]. All visualizations use the fully trained model (20 epochs) rather than intermediate checkpoints.
CustomNet consistently produces more focused saliency patterns on material-specific regions compared to single-backbone baselines. In the metal class, ResNet-50 and EfficientNet-B0 activate background regions, while CustomNet focuses on the metallic surface of the object. For the glass class, CustomNet highlights the distinctive bottle texture and rim regions. These qualitative observations are consistent with the quantitative performance advantage of CustomNet over EfficientNet-B0 and MobileNet V3.
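The Grad-CAM maps shown here follow the standard computation of Selvaraju et al. [47]: channel weights are the spatially averaged gradients of the target-class score, and the map is the ReLU of the weighted activation sum. A framework-agnostic NumPy sketch (the activations and gradients below are random placeholders that a deep learning framework would normally supply):

```python
import numpy as np

def grad_cam(activations, gradients, eps=1e-8):
    """Grad-CAM: alpha_k = spatial mean of d(score_c)/dA_k;
    map = ReLU(sum_k alpha_k * A_k), normalized to [0, 1].
    Both inputs have shape (K, H, W) for the chosen conv layer, with
    gradients taken w.r.t. the ground-truth class score."""
    alpha = gradients.mean(axis=(1, 2))                              # (K,)
    cam = np.maximum(np.tensordot(alpha, activations, axes=1), 0.0)  # (H, W)
    return cam / (cam.max() + eps)

rng = np.random.default_rng(0)
A = rng.normal(size=(512, 7, 7))    # placeholder last-conv-block activations
dA = rng.normal(size=(512, 7, 7))   # placeholder true-class gradients
heatmap = grad_cam(A, dA)           # (7, 7); upsampled to image size for display
```

Computing the map against the ground-truth class, as done in Figure 5, simply means taking `dA` as the gradient of the true-class logit rather than the predicted-class logit.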
Figure 6 shows that all three backbone branches within CustomNet attend to largely consistent discriminative regions, validating that the fusion integrates complementary rather than redundant representations. The ResNet branch shows slightly more spatially localized activations, while the MobileNet and EfficientNet branches exhibit broader coverage, which together contribute to the robustness of the fused representation.
Figure 7 directly contrasts the full CustomNet with the no-attention variant. The full model produces concentrated activations on class-discriminative features, while the no-attention variant shows diffuse or absent activations across all classes, providing a qualitative explanation for the catastrophic performance collapse observed in the ablation study.
Figure 8 examines the hardest confusion pairs (paper vs. cardboard, plastic vs. clothes). Both models activate similar structural features for these pairs, confirming that residual errors are driven by genuine visual similarity rather than model artifacts. This motivates future work on texture-focused augmentation and material-hierarchy priors for these categories.
Figure 9 highlights the primary error modes of the fully trained CustomNet. Misclassifications are concentrated between visually similar categories (Paper vs. Cardboard and Plastic vs. Clothes), consistent with the Grad-CAM analysis in Figure 8 showing that both models activate shared structural features for these pairs. These residual errors motivate fine-grained augmentation (e.g., texture-focused crops, lighting variations) and domain-specific priors (e.g., a material hierarchy or multi-stage classification) to further reduce confusion between look-alike materials.

4.5. Robustness and Efficiency

Table 8 reports model accuracy under Gaussian noise and brightness perturbations applied at test time. Under mild perturbations (noise σ ≤ 0.10, brightness factor 0.7–1.3), CustomNet retains accuracy within 1.0%, outperforming EfficientNet-B0 and MobileNet V3 at all perturbation levels. Under severe perturbations (σ = 0.40, brightness factor ±50%), accuracy drops by 4.4% compared to 8.3% and 9.6% for EfficientNet-B0 and MobileNet V3, respectively, demonstrating the superior robustness of the multi-backbone fusion design. ResNet-50 shows comparable robustness to CustomNet at severe noise levels, which is consistent with the parity observed in standard evaluation.
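The test-time perturbations in Table 8 can be reproduced with a transform along the lines of the sketch below. The function and parameter names are ours, and images are assumed to be normalized to [0, 1]; the severity values mirror those reported in the text.

```python
import numpy as np

def perturb(img, sigma=0.0, brightness=1.0, rng=None):
    """Apply a multiplicative brightness factor and additive Gaussian
    noise (std `sigma`), clipping back to the valid [0, 1] range."""
    rng = rng or np.random.default_rng(0)
    out = img * brightness
    if sigma > 0:
        out = out + rng.normal(scale=sigma, size=img.shape)
    return np.clip(out, 0.0, 1.0)

# Severity levels from the robustness evaluation.
mild = dict(sigma=0.10, brightness=1.3)
severe = dict(sigma=0.40, brightness=0.5)   # brightness factor -50%

img = np.random.default_rng(1).uniform(size=(224, 224, 3))
img_mild = perturb(img, **mild)
img_severe = perturb(img, **severe)
```

Applying the perturbation after normalization but before the model's input standardization keeps the corruption comparable across architectures.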
Inference efficiency on the target edge platform is reported in Table 9. Training time on the VEGA HPC cluster (NVIDIA A100 40 GB) was approximately 45 min per model for the full 20-epoch run. CustomNet requires approximately three times longer training than single-backbone models due to the three parallel forward passes through the pretrained backbone networks during each training iteration. This overhead is incurred only during training and does not affect inference latency.

4.6. Edge Deployment Results

Table 9 reports the inference latency and throughput for all models benchmarked on the NVIDIA Jetson Orin Nano. Among the evaluated models, ResNet-50 achieves the best combination of accuracy (97.93%) and throughput (42.3 FPS, 23.65 ms) on the Jetson platform, confirming that when throughput is the primary deployment criterion, ResNet-50 is the preferred single-model choice. Full CustomNet achieves 86.70 ms per image (11.5 FPS), which, while below the 30 FPS real-time threshold met by the single-backbone models, is acceptable for conveyor-belt sorting applications where frame rates of 10–15 FPS are typical [16]. CustomNet's edge deployment results establish a feasibility baseline for the full multi-backbone architecture; optimized variants using quantization and pruning are a natural next step to close the throughput gap. Single-backbone variants of CustomNet (ResNet only, MobileNet only, EfficientNet only) achieve 25–35 ms per image (28–40 FPS), providing a deployment option under strict latency constraints with minimal accuracy loss relative to the full model.
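The latency figures in Table 9 were measured on the Jetson Orin Nano; the generic loop below sketches the shape of such a benchmark (warm-up iterations followed by timed runs, reporting mean latency and FPS). The dummy `model` here is a placeholder matrix multiply, not the deployed network, and on-device measurement would additionally pin clocks and synchronize the GPU before timing.

```python
import time
import numpy as np

def benchmark(model, x, warmup=10, iters=100):
    """Return (mean latency in ms, throughput in FPS) for `model(x)`."""
    for _ in range(warmup):          # warm-up: stabilize caches and clocks
        model(x)
    t0 = time.perf_counter()
    for _ in range(iters):
        model(x)
    mean_s = (time.perf_counter() - t0) / iters
    return mean_s * 1e3, 1.0 / mean_s

# Placeholder "model": one dense layer on a flattened 224x224x3 input.
w = np.random.default_rng(0).normal(size=(150_528, 10)).astype(np.float32)
model = lambda x: x @ w
x = np.ones((1, 150_528), dtype=np.float32)

latency_ms, fps = benchmark(model, x, warmup=3, iters=20)
```

Reporting the mean over many iterations (rather than a single timed call) is what makes the per-image latencies in Table 9 comparable across models.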

4.7. Key Findings

  • CustomNet achieved macro-F1 = 0.973, significantly outperforming EfficientNet-B0 ( p < 0.001 , d = 3.203 ) and MobileNet V3 ( p < 0.001 , d = 4.013 ) after Bonferroni correction. Performance parity was observed with ResNet-50 ( p = 0.383 , d = 0.218 ) with CustomNet offering a favorable trade-off through its 9.7 M parameter classification head compared to ResNet-50’s 25.6 M total parameters.
  • The multi-head attention module is the single most critical component of the architecture: its removal causes a collapse to near-random performance (10.82%), confirming it is fundamental to the fusion mechanism rather than an optional enhancement.
  • Model explainability was assessed via Grad-CAM saliency visualization, producing spatially focused heatmaps aligned with material-specific discriminative regions and supporting transparent model behavior in sustainability-critical applications.
  • Edge deployment on the NVIDIA Jetson Orin Nano achieved 86.70 ms per image (11.5 FPS) for full CustomNet. Single-backbone variants achieve 25–35 ms (28–40 FPS), providing a latency-efficient deployment option with less than 1% accuracy loss relative to the full model.
  • CustomNet demonstrates superior robustness to EfficientNet-B0 and MobileNet V3 under mild and severe perturbations with an accuracy drop of 4.4% at σ = 0.40 noise compared to 8.3% and 9.6% for EfficientNet-B0 and MobileNet V3, respectively.

4.8. Comparison with Prior Work

Table 10 positions CustomNet within representative prior work on deep-learning-based waste classification. Studies are categorized by deployment context to make explicit the distinction between algorithmic contributions and hardware-validated systems—a distinction central to the contribution of this paper.
Among the eight prior works, only Refs. [15,16] report any edge deployment, and neither provides statistical significance testing, cross-validation, or explainability analysis. The remaining six studies are algorithmic benchmarks without hardware validation. CustomNet is the only system in this comparison to jointly address all four evaluation dimensions (accuracy, statistical rigor, explainability, and edge deployment) on a substantially larger and more diverse dataset. Crucially, none of the prior works reports whether pruning or quantization was applied, making direct latency comparisons difficult to interpret. CustomNet was deployed without any compression (noted as "No (baseline)" in Table 10), establishing an unoptimized reference point; quantization-aware training and pruning are identified as directions for future work to raise throughput above the 30 FPS real-time threshold.

5. Conclusions and Future Work

This paper demonstrated that deep-learning-based vision systems can significantly improve automated waste sorting. We introduced CustomNet, a hybrid attention-fusion architecture that combines ResNet-50, EfficientNet-B0, and MobileNet V3 feature extractors. Evaluated on a cleaned and re-split dataset of 13,933 RGB images across 10 categories (80% train, 10% val, 10% test, stratified), CustomNet achieved 97.79% accuracy, macro-F1 = 0.973 ± 0.002, and ROC–AUC = 0.992 ± 0.001. Statistical testing with Bonferroni correction confirmed significant outperformance over EfficientNet-B0 ( p < 0.001 , d = 3.203 ) and MobileNet V3 ( p < 0.001 , d = 4.013 ) as well as performance parity with ResNet-50 ( p = 0.383 , d = 0.218 ). Performance gains were most pronounced for visually similar classes such as paper vs. cardboard, confirming the benefit of multi-backbone integration and attention-driven fusion.
A key finding of the ablation study is that the multi-head attention module is not an optional enhancement but a fundamental architectural component: its removal causes a collapse to near-random accuracy (10.82%), because without attention, the CLS token cannot aggregate information from the backbone feature tokens. This result was reproducible across all five seeds and identical hyperparameters. CustomNet achieves performance comparable to ResNet-50 while requiring only 9.7 M task-specific parameters in the classification head, demonstrating a favorable accuracy–efficiency trade-off that is particularly relevant for edge deployment scenarios.
Beyond accuracy, CustomNet was successfully deployed on an NVIDIA Jetson Orin Nano, achieving 86.70 ms per image (11.5 FPS). Single-backbone variants achieve 28–40 FPS with less than 1% accuracy loss, providing a flexible deployment option for applications with strict latency requirements. Robustness evaluation confirmed that CustomNet maintains accuracy within 1% under mild perturbations (σ ≤ 0.10, brightness 0.7–1.3) and outperforms all single-backbone baselines under severe perturbations. Grad-CAM saliency visualizations confirmed that the model attends to material-specific discriminative regions, supporting transparent and explainable model behavior in sustainability-critical applications.
Advantages of the Proposed Methodology: CustomNet offers several methodological advantages over single-backbone approaches. (i) Multi-backbone fusion captures complementary feature representations (local texture from MobileNet V3, structural patterns from EfficientNet-B0, and deep semantic features from ResNet-50), improving the discrimination of visually similar materials. (ii) The task-specific classification head (9.7 M parameters) achieves performance parity with the full ResNet-50 baseline (25.6 M parameters), offering a favorable accuracy–efficiency trade-off. (iii) The mandatory attention module provides a principled mechanism for inter-backbone feature aggregation, as confirmed by the catastrophic ablation collapse without it. (iv) Grad-CAM saliency visualization supports explainable behavior aligned with material-specific discriminative regions. (v) The methodology is fully reproducible: all splits, seeds, and benchmark protocols are documented and the dataset is publicly archived on Zenodo.
Limitations: The dataset, while comprising both public and in-house images, remains modest in scale and limited to RGB imagery under controlled and semi-controlled conditions; highly occluded or contaminated samples are under-represented. Edge benchmarking was performed on a single hardware platform without quantization or pruning, and energy metrics require broader life-cycle analysis. Statistical validation was based on five seeds and five-fold cross-validation; larger cross-dataset benchmarks are needed for industrial generalization.
Future Directions:
  • Architectural Advances: Explore transformer-based backbones (e.g., Swin Transformer, ConvNeXt) and cross-modal fusion with spectral or depth sensors. Formalize explainability using quantitative attribution metrics beyond Grad-CAM, such as SHAP or integrated gradients.
  • Operational Optimization: Investigate pruning, quantization, and knowledge distillation to raise edge-device throughput above the 30 FPS real-time threshold without significant accuracy loss. Examine distributed edge–cloud collaboration for large-scale deployments.
  • Impact Assessment: Validate performance in live recycling facilities, correlating algorithmic metrics with throughput, false-recycle rates, and carbon-footprint reduction. Partnerships with municipal waste-management organizations will support real-world scalability.
In summary, this paper establishes a replicable design pattern for resource-efficient computer vision in sustainability-critical domains. By coupling high recognition accuracy with model explainability and demonstrated edge deployment feasibility, CustomNet contributes toward the broader agenda of AI for Sustainable Development. Real-world validation in active recycling facilities, including measurements of contamination reduction and comparison with human sorting performance, is required before operational impact claims can be substantiated. Future research will extend these methods toward federated learning, multi-sensor integration, and comprehensive environmental impact modeling.

Author Contributions

Conceptualization, D.V. and T.G.; methodology, T.G., J.D. and D.V.; software, T.G. and J.D.; validation, J.D. and D.V.; formal analysis, T.G.; investigation, T.G. and D.V.; resources, D.V.; data curation, T.G.; writing—original draft preparation, T.G., J.D. and D.V.; writing—review and editing, J.D., T.G. and D.V.; visualization, T.G.; supervision, D.V.; project administration, D.V.; funding acquisition, D.V. All authors have read and agreed to the published version of the manuscript.

Funding

This work was funded by the Slovenian Research Agency (research core funding No. P2-0057—Information systems).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The dataset used in this study is publicly available on Zenodo at https://zenodo.org/records/18827229 (accessed on 27 March 2026). The dataset is distributed under the Open Database License (ODbL) v1.0. The original contents remain © Original Authors, while additional augmented samples were generated by the authors for the purposes of this study.

Conflicts of Interest

Author Teodora Grneva was employed by the company MIPS Programska Oprema d.o.o. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. World Bank. What a Waste 2.0: A Global Snapshot of Solid Waste Management to 2050; World Bank Publications: Washington, DC, USA, 2018. [Google Scholar]
  2. Fang, B.; Yu, J.; Chen, Z.; Osman, A.I.; Farghali, M.; Ihara, I.; Hamza, E.H.; Rooney, D.W.; Yap, P.S. Artificial intelligence for waste management in smart cities: A review. Environ. Chem. Lett. 2023, 21, 1959–1989. [Google Scholar] [CrossRef]
  3. Vo, A.H.; Son, L.H.; Vo, M.T.; Le, T. A Novel Framework for Trash Classification Using Deep Transfer Learning. IEEE Access 2019, 7, 178631–178639. [Google Scholar] [CrossRef]
  4. Adedeji, O.; Wang, Z. Intelligent Waste Classification System Using Deep Learning Convolutional Neural Network. Procedia Manuf. 2019, 35, 607–612. [Google Scholar] [CrossRef]
  5. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet Classification with Deep Convolutional Neural Networks. Commun. ACM 2017, 60, 84–90. [Google Scholar] [CrossRef]
  6. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
  7. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going Deeper with Convolutions. arXiv 2014, arXiv:1409.4842. [Google Scholar]
  8. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. arXiv 2015, arXiv:1512.03385. [Google Scholar]
  9. Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely Connected Convolutional Networks. arXiv 2016, arXiv:1608.06993. [Google Scholar]
  10. Tan, M.; Le, Q.V. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. In Proceedings of the ICML 2019, Long Beach, CA, USA, 10–15 June 2019. [Google Scholar]
  11. Howard, A.; Sandler, M.; Chu, G.; Chen, L.; Chen, B.; Tan, M.; Wang, W.; Zhu, Y.; Pang, R.; Vasudevan, V.; et al. Searching for MobileNetV3. arXiv 2019, arXiv:1905.02244. [Google Scholar]
  12. Reichstein, M.; Camps-Valls, G.; Stevens, B.; Jung, M.; Denzler, J.; Carvalhais, N.; Prabhat. Deep Learning and Process Understanding for Data-Driven Earth System Science. Nature 2019, 566, 195–204. [Google Scholar] [CrossRef]
  13. Yuan, Q.; Shen, H.; Li, T.; Li, Z.; Li, S.; Jiang, Y.; Xu, H.; Tan, W.; Yang, Q.; Wang, J.; et al. Deep Learning in Environmental Remote Sensing: Achievements and Challenges. Remote Sens. Environ. 2020, 241, 111716. [Google Scholar] [CrossRef]
  14. Kamilaris, A.; Prenafeta-Boldú, F.X. Deep Learning in Agriculture: A Survey. Comput. Electron. Agric. 2018, 147, 70–90. [Google Scholar] [CrossRef]
  15. Fu, B.; Li, S.; Wei, J.; Li, Q.; Wang, Q.; Tu, J. A Novel Intelligent Garbage Classification System Based on Deep Learning and an Embedded Linux System. IEEE Access 2021, 9, 131134–131146. [Google Scholar] [CrossRef]
  16. Wang, C.; Qin, J.; Qu, C.; Ran, X.; Liu, C.; Chen, B. A Smart Municipal Waste Management System Based on Deep-Learning and Internet of Things. Waste Manag. 2021, 135, 20–29. [Google Scholar] [CrossRef]
  17. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows. arXiv 2021, arXiv:2103.14030. [Google Scholar]
  18. Liu, Z.; Mao, H.; Wu, C.; Feichtenhofer, C.; Darrell, T.; Xie, S. A ConvNet for the 2020s. arXiv 2022, arXiv:2201.03545. [Google Scholar]
  19. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  20. Chu, Y.; Huang, C.; Xie, X.; Tan, B.; Kamal, S.; Xiong, X. Multilayer Hybrid Deep-Learning Method for Waste Classification and Recycling. Comput. Intell. Neurosci. 2018, 2018, 5060857. [Google Scholar] [CrossRef] [PubMed]
  21. Dai, Z.; Liu, H.; Le, Q.V.; Tan, M. CoAtNet: Marrying Convolution and Attention for All Data Sizes. In Advances in Neural Information Processing Systems 34 (NeurIPS 2021); Curran Associates: Red Hook, NY, USA, 2021; pp. 3965–3977. [Google Scholar]
  22. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal Loss for Dense Object Detection. arXiv 2017, arXiv:1708.02002. [Google Scholar]
  23. Sakr, G.E.; Mokbel, M.; Darwich, A.; Khneisser, M.N.; Hadi, A. Comparing Deep Learning and Support Vector Machines for Autonomous Waste Sorting. In Proceedings of the 2016 IEEE International Multidisciplinary Conference on Engineering Technology (IMCET), Beirut, Lebanon, 2–4 November 2016; pp. 207–212. [Google Scholar]
  24. Arebey, M.; Hannan, M.A.; Begum, R.A.; Basri, H. Solid Waste Bin Level Detection Using Gray Level Co-occurrence Matrix Feature Extraction Approach. J. Environ. Manag. 2012, 104, 9–18. [Google Scholar] [CrossRef]
  25. Kotsiantis, S.B. Supervised Machine Learning: A Review of Classification Techniques. Informatica 2007, 31, 249–268. [Google Scholar]
  26. Hu, J.; Shen, L.; Albanie, S.; Sun, G.; Wu, E. Squeeze-and-Excitation Networks. arXiv 2017, arXiv:1709.01507. [Google Scholar]
  27. Woo, S.; Park, J.; Lee, J.; Kweon, I.S. CBAM: Convolutional Block Attention Module. arXiv 2018, arXiv:1807.06521. [Google Scholar]
  28. Zhang, Q.; Yang, Q.; Zhang, X.; Bao, Q.; Su, J.; Liu, X. Waste Image Classification Based on Transfer Learning and Convolutional Neural Network. Waste Manag. 2021, 135, 150–157. [Google Scholar] [CrossRef]
  29. Yang, M.; Thung, G. Classification of Trash for Recyclability Status. Stanford CS229 Project. 2016. Available online: https://cs229.stanford.edu/proj2016/report/ThungYang-ClassificationOfTrashForRecyclabilityStatus-report.pdf (accessed on 27 March 2026).
  30. Proença, P.F.; Simões, P. TACO: Trash Annotations in Context for Litter Detection. arXiv 2020, arXiv:2003.06975. [Google Scholar]
  31. RecycleNet Dataset. GitHub Repository. Available online: https://github.com/sangminwoo/RecycleNet (accessed on 27 March 2026).
  32. OpenLitterMap Project. Online Project. Available online: https://openlittermap.com (accessed on 17 October 2025).
  33. Zhou, Q.; Liu, H.; Qiu, Y.; Zheng, W. Object Detection for Construction Waste Based on an Improved YOLOv5 Model. Sustainability 2023, 15, 681. [Google Scholar] [CrossRef]
  34. Tamin, O.; Moung, E.G.; Dargham, J.A.; Yahya, F.; Ling, S.C.J.; Ee, S.P.; Teo, J. On-Shore Plastic Waste Detection with YOLOv5 and RGB-Near-Infrared Fusion: A State-of-the-Art Solution for Accurate and Efficient Environmental Monitoring. Big Data Cogn. Comput. 2023, 7, 103. [Google Scholar]
  35. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.; Berg, A.C. SSD: Single Shot MultiBox Detector. arXiv 2016, arXiv:1512.02325. [Google Scholar]
  36. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. arXiv 2017, arXiv:1703.06870. [Google Scholar]
  37. Hossen, M.M.; Majid, M.E.; Kashem, S.B.A.; Khandakar, A.; Nashbat, M.; Ashraf, A.; Zia, M.H.; Kunju, A.K.A.; Kabir, S.; Chowdhury, M.E.H. A Reliable and Robust Deep Learning Model for Effective Recyclable Waste Classification. IEEE Access 2024, 12, 13809–13821. [Google Scholar] [CrossRef]
  38. Shorten, C.; Khoshgoftaar, T.M. A Survey on Image Data Augmentation for Deep Learning. J. Big Data 2019, 6, 60. [Google Scholar] [CrossRef]
  39. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative Adversarial Nets. In Advances in Neural Information Processing Systems 27; Neural Information Processing Systems Foundation, Inc.: San Diego, CA, USA, 2014. [Google Scholar]
  40. Ahmad, K.; Khan, K.; Al-Fuqaha, A. Intelligent Fusion of Deep Features for Improved Waste Classification. IEEE Access 2020, 8, 96495–96504. [Google Scholar] [CrossRef]
  41. Hinton, G.; Vinyals, O.; Dean, J. Distilling the Knowledge in a Neural Network. arXiv 2015, arXiv:1503.02531. [Google Scholar]
  42. Cai, H.; Zhu, L.; Han, S. ProxylessNAS: Direct Neural Architecture Search on Target Task and Hardware. In Proceedings of the 7th International Conference on Learning Representations (ICLR 2019), New Orleans, LA, USA, 6–9 May 2019. [Google Scholar]
  43. Wu, B.; Dai, X.; Zhang, P.; Wang, Y.; Sun, F.; Wu, Y.; Tian, Y.; Vajda, P.; Jia, Y.; Keutzer, K. FBNet: Hardware-Aware Efficient ConvNet Design via Differentiable Neural Architecture Search. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 10734–10742. [Google Scholar]
  44. Barredo Arrieta, A.; Díaz-Rodríguez, N.; Del Ser, J.; Bennetot, A.; Tabik, S.; Barbado, A.; Garcia, S.; Gil-Lopez, S.; Molina, D.; Benjamins, R.; et al. Explainable Artificial Intelligence (XAI): Concepts, Taxonomies, Opportunities and Challenges toward Responsible AI. Inf. Fusion 2020, 58, 82–115. [Google Scholar] [CrossRef]
  45. Mohamed, M. Garbage Classification (12 Classes). 2021. Available online: https://www.kaggle.com/datasets/mostafaabla/garbage-classification (accessed on 27 March 2026).
  46. Mao, W.-L.; Chen, W.-C.; Wang, C.-T.; Lin, Y.-H. Recycling Waste Classification Using Optimized Convolutional Neural Network. Resour. Conserv. Recycl. 2021, 164, 105132. [Google Scholar] [CrossRef]
  47. Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization. In 2017 IEEE International Conference on Computer Vision (ICCV); IEEE: New York, NY, USA, 2017; pp. 618–626. [Google Scholar]
Figure 1. Class distribution in the waste image dataset (N = 13,933; 10 classes). The histogram highlights a moderate imbalance: Cardboard and Shoes dominate the corpus (2258 and 2213 images, respectively), while Metal and Battery are the least represented classes (679 and 1075 images, respectively). Imbalance ratio 3.3:1 (cardboard vs. metal).
Figure 2. Architecture of the proposed CustomNet model. Features extracted from ResNet-50, EfficientNet-B0, and MobileNetV3 are concatenated (combined dimension: 4288), projected to a 1024-dimensional embedding, fused using multi-head self-attention (8 heads), and classified through a three-layer MLP (768 → 512 → 10 classes).
Figure 3. Multi-class one-vs.-rest ROC curves for all four models on the held-out test set (seed 0). All models achieve high AUC values (>0.98) across all classes, confirming strong discriminative power under transfer learning. CustomNet and ResNet-50 show the tightest curves toward the upper-left corner for minority classes (Metal, Battery), which is consistent with the macro ROC–AUC results in Table 4. (a) ResNet-50. (b) EfficientNet-B0. (c) MobileNet V3. (d) CustomNet (proposed).
Figure 4. Multi-class precision–recall curves for all four models on the held-out test set (seed 0). PR curves are a more informative view than ROC under class imbalance. CustomNet and ResNet-50 show the highest precision at high recall for minority classes (Metal, Battery), which is consistent with the Focal Loss training objective and multi-backbone fusion design. (a) ResNet-50. (b) EfficientNet-B0. (c) MobileNet V3. (d) CustomNet (proposed).
Figure 5. Grad-CAM saliency visualizations for all four models across all 10 waste classes. Heatmaps are computed with respect to the ground-truth class label to ensure consistent comparison across models. Warmer colors indicate regions contributing most strongly to the classification decision. CustomNet consistently produces more focused activations on material-specific regions (e.g., container rims, surface textures, structural edges) compared to single-backbone baselines, particularly for visually similar classes such as metal and glass.
Figure 6. Grad-CAM saliency maps for the three individual backbone branches within CustomNet (ResNet-50, MobileNet V3, EfficientNet-B0) across all 10 classes. All three branches attend to largely consistent discriminative regions, validating that the multi-backbone fusion integrates complementary rather than redundant representations. Minor differences are visible in the glass and metal rows, where branch-level activations highlight different sub-regions of the same object.
Figure 7. Grad-CAM comparison between full CustomNet and the no-attention variant across all 10 classes. The full model produces focused activations on discriminative object regions (e.g., bottle texture for glass, sole pattern for shoes). The no-attention variant shows diffuse or absent activations, consistent with its near-random classification performance, confirming that the attention module is essential for routing image-derived signals to the classifier.
Figure 8. Grad-CAM saliency maps for the most frequently confused class pairs (paper vs. cardboard, plastic vs. clothes) for ResNet-50 and CustomNet. Both models activate similar regions for these visually similar categories, indicating that residual confusions are driven by genuine inter-class visual similarity rather than model artifacts.
Figure 9. Confusion matrix for CustomNet on the held-out test set (fully trained model, 20 epochs, seed 0). The most frequent misclassifications occur between Paper and Cardboard and between Plastic and Clothes, which is consistent with the Grad-CAM analysis showing shared structural activations for these visually similar category pairs.
Table 1. Per-class image counts across dataset splits. Total dataset: 13,933 images, 10 classes. Sources: Kaggle Garbage Classification [45] (61.1%) and in-house collection (38.9%). Split ratio: 80% train, 10% val, 10% test (stratified).

| Class | Train | Val | Test | Total |
|---|---|---|---|---|
| Battery | 860 | 108 | 107 | 1075 |
| Biological | 918 | 115 | 114 | 1147 |
| Cardboard | 1806 | 226 | 226 | 2258 |
| Clothes | 1032 | 129 | 129 | 1290 |
| Glass | 1121 | 140 | 140 | 1401 |
| Metal | 543 | 68 | 68 | 679 |
| Paper | 1318 | 165 | 165 | 1648 |
| Plastic | 795 | 99 | 100 | 994 |
| Shoes | 1770 | 221 | 222 | 2213 |
| Trash | 982 | 123 | 123 | 1228 |
| Total | 11,145 | 1394 | 1394 | 13,933 |
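The per-class counts in Table 1 are consistent with applying the 80/10/10 split independently within each class and letting the test split absorb the rounding remainder. A quick sanity check (the exact rounding rule is our assumption; any consistent per-class stratified split reproduces these counts to within ±1):

```python
def split_counts(n: int) -> tuple[int, int, int]:
    """80/10/10 per-class split; the test split absorbs rounding leftovers."""
    train = round(0.80 * n)
    val = round(0.10 * n)
    return train, val, n - train - val

# Per-class totals from Table 1
totals = {"Battery": 1075, "Biological": 1147, "Cardboard": 2258,
          "Clothes": 1290, "Glass": 1401, "Metal": 679, "Paper": 1648,
          "Plastic": 994, "Shoes": 2213, "Trash": 1228}
for cls, n in totals.items():
    print(cls, split_counts(n))
```

Running this reproduces every row of Table 1, e.g., Battery → (860, 108, 107) and Plastic → (795, 99, 100).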
Table 2. Overview of applied data augmentation techniques.

| Transformation | Range/Probability | Rationale |
|---|---|---|
| Rotation | ±25° | Orientation invariance |
| Horizontal flip | 50% | Symmetry exploitation |
| Scaling | 90–110% | Size variation |
| Translation | ±10 px | Spatial robustness |
| Brightness | 0.8–1.2 | Illumination diversity |
| Contrast | 0.8–1.2 | Illumination diversity |
| Gaussian noise | | Sensor simulation |
| Gaussian blur | 5 × 5 filter | Sensor simulation |
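The photometric augmentations in Table 2 reduce to simple array operations (geometric ones are typically delegated to a library such as torchvision). A minimal numpy sketch of the flip, brightness, contrast, and noise transforms; the noise standard deviation is our assumption, since Table 2 does not specify it:

```python
import numpy as np

rng = np.random.default_rng(42)

def augment(img: np.ndarray) -> np.ndarray:
    """img: float32 array in [0, 1], shape (H, W, 3)."""
    if rng.random() < 0.5:                       # horizontal flip, p = 0.5
        img = img[:, ::-1, :]
    img = img * rng.uniform(0.8, 1.2)            # brightness factor in [0.8, 1.2]
    mean = img.mean()
    img = (img - mean) * rng.uniform(0.8, 1.2) + mean  # contrast in [0.8, 1.2]
    img = img + rng.normal(0.0, 0.02, img.shape)       # Gaussian noise (sigma assumed)
    return np.clip(img, 0.0, 1.0).astype(np.float32)

out = augment(np.full((224, 224, 3), 0.5, dtype=np.float32))
print(out.shape, out.dtype)
```

Clipping back to [0, 1] after the jitter keeps the augmented image a valid input for the normalization step that follows.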
Table 3. Comparison of selected CNN models.

| Model | Params (M) | FLOPs (G) | Input | Notes |
|---|---|---|---|---|
| ResNet-50 [8] | 25.6 | 4.1 | 224 × 224 | Deep residual connections |
| EfficientNet-B0 [10] | 5.3 | 0.39 | 224 × 224 | Compound-scaled efficiency |
| MobileNetV3-Large [11] | 5.4 | 0.23 | 224 × 224 | Edge-device optimized |
Table 4. Average test performance across five runs (mean ± SD). ResNet-50 achieves the numerically highest accuracy and macro-F1 among all evaluated models. Bold: proposed method (CustomNet).

| Model | Accuracy (%) | Macro-F1 | ROC–AUC | Params (M) | FLOPs (G) |
|---|---|---|---|---|---|
| ResNet-50 | 97.93 ± 0.12 | 0.975 ± 0.002 | 0.992 ± 0.001 | 25.6 | 4.1 |
| EfficientNet-B0 | 97.27 ± 0.28 | 0.968 ± 0.003 | 0.987 ± 0.002 | 5.3 | 0.39 |
| MobileNet V3 | 96.61 ± 0.64 | 0.962 ± 0.006 | 0.981 ± 0.003 | 2.5 | 0.22 |
| **CustomNet (proposed)** | 97.79 ± 0.12 | 0.973 ± 0.002 | 0.992 ± 0.001 | 40.3 (9.7 head) | ≈5.7 |
Table 5. Ablation study of CustomNet components (mean ± SD across five seeds). ΔF1 computed against full CustomNet. Statistical significance tested against full CustomNet using paired t-test or Wilcoxon signed-rank test with Bonferroni correction (α = 0.0083).

| Configuration | Accuracy (%) | Macro-F1 | ΔF1 | Sig.? |
|---|---|---|---|---|
| Full CustomNet | 97.79 ± 0.12 | 0.973 ± 0.002 | | |
| No attention module | 10.82 ± 3.46 | 0.019 ± 0.006 | −0.954 | Yes |
| No feature fusion (ResNet only) | 97.89 ± 0.12 | 0.975 ± 0.002 | +0.002 | No |
| No feature fusion (MobileNet only) | 97.12 ± 0.18 | 0.966 ± 0.003 | −0.007 | Yes |
| No feature fusion (EfficientNet only) | 97.32 ± 0.16 | 0.968 ± 0.003 | −0.005 | Yes |
| No data augmentation | 97.50 ± 0.08 | 0.971 ± 0.001 | −0.002 | No |
| Focal Loss → Cross-entropy | 97.66 ± 0.14 | 0.971 ± 0.002 | −0.002 | No |
Table 6. Shapiro–Wilk normality test results for cross-validation macro-F1 distributions (25 scores per model, 5 folds × 5 seeds). W: test statistic; p: p-value. Models with p < 0.05 are non-normal and use the Wilcoxon signed-rank test for pairwise comparisons.

| Model | W | p | Normal? |
|---|---|---|---|
| CustomNet (full) | 0.963 | 0.471 | Yes |
| CustomNet (no attention) | 0.896 | 0.014 | No |
| EfficientNet-B0 | 0.971 | 0.657 | Yes |
| MobileNet V3 | 0.957 | 0.358 | Yes |
| ResNet-50 | 0.968 | 0.581 | Yes |
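The decision procedure behind Tables 6 and 7 — Shapiro–Wilk on each model's 25 macro-F1 scores, then a paired t-test when both samples look normal and the Wilcoxon signed-rank test otherwise — can be sketched with SciPy. The fold scores below are synthetic stand-ins generated from the reported means and SDs, not the paper's actual values:

```python
import numpy as np
from scipy import stats

def compare(scores_a, scores_b, alpha_normal=0.05):
    """Choose paired t-test vs. Wilcoxon from Shapiro-Wilk normality of both
    paired score samples; return the chosen test's name and p-value."""
    normal = (stats.shapiro(scores_a).pvalue >= alpha_normal
              and stats.shapiro(scores_b).pvalue >= alpha_normal)
    if normal:
        return "paired t-test", stats.ttest_rel(scores_a, scores_b).pvalue
    return "wilcoxon", stats.wilcoxon(scores_a, scores_b).pvalue

rng = np.random.default_rng(1)
a = rng.normal(0.973, 0.002, 25)   # stand-in for CustomNet fold scores
b = rng.normal(0.968, 0.003, 25)   # stand-in for EfficientNet-B0 fold scores
name, p = compare(a, b)
print(name, p < 0.0083)            # Bonferroni alpha = 0.05 / 6 comparisons
```

Dividing α by the six pairwise comparisons (Bonferroni) yields the corrected threshold 0.0083 used throughout Tables 5 and 7.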
Table 7. Pairwise statistical comparison of models using cross-validation macro-F1 scores (25 values per model). Paired t-tests were used for all comparisons, as every model involved passed the Shapiro–Wilk normality test (Table 6); the Wilcoxon signed-rank test was reserved for non-normal distributions. p-values are Bonferroni corrected (α = 0.0083). Cohen's d computed using pooled standard deviation. Bootstrap 95% CI computed from test-set predictions (N = 10,000 resamples).

| Comparison | ΔF1 | p-Value | Sig.? | Cohen's d | 95% CI (F1) |
|---|---|---|---|---|---|
| CustomNet vs. EfficientNet-B0 | +0.005 | <0.001 | Yes | 3.203 | [0.969, 0.977] |
| CustomNet vs. MobileNet V3 | +0.011 | <0.001 | Yes | 4.013 | [0.969, 0.977] |
| CustomNet vs. ResNet-50 | −0.002 | 0.383 | No | 0.218 | [0.969, 0.977] |
| EfficientNet-B0 vs. ResNet-50 | −0.007 | <0.001 | Yes | 2.891 | [0.964, 0.972] |
| MobileNet V3 vs. ResNet-50 | −0.013 | <0.001 | Yes | 3.744 | [0.958, 0.966] |
| EfficientNet-B0 vs. MobileNet V3 | +0.006 | 0.021 | No | 0.987 | [0.964, 0.972] |
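Cohen's d with pooled standard deviation and the percentile bootstrap CI reported in Table 7 are both short computations. A numpy sketch, with synthetic fold scores standing in for the paper's actual cross-validation and test-set data:

```python
import numpy as np

def cohens_d(a: np.ndarray, b: np.ndarray) -> float:
    """Effect size using the pooled standard deviation of both samples."""
    na, nb = len(a), len(b)
    pooled = np.sqrt(((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1))
                     / (na + nb - 2))
    return float((a.mean() - b.mean()) / pooled)

def bootstrap_ci(values: np.ndarray, n_resamples: int = 10_000,
                 seed: int = 0) -> tuple[float, float]:
    """Percentile 95% CI for the mean via bootstrap resampling."""
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(values), size=(n_resamples, len(values)))
    means = values[idx].mean(axis=1)
    return float(np.quantile(means, 0.025)), float(np.quantile(means, 0.975))

rng = np.random.default_rng(0)
f1_custom = rng.normal(0.973, 0.002, 25)   # stand-in for CustomNet scores
f1_mobile = rng.normal(0.962, 0.006, 25)   # stand-in for MobileNet V3 scores
d = cohens_d(f1_custom, f1_mobile)
lo, hi = bootstrap_ci(f1_custom)
print(round(d, 2), (round(lo, 4), round(hi, 4)))
```

With the reported mean gap of 0.011 and per-model SDs of 0.002 and 0.006, the resulting d lands in the "very large effect" range, consistent with the 3.744 in Table 7.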
Table 8. Robustness evaluation: test accuracy (%) under Gaussian noise and brightness perturbations. Delta values show accuracy change relative to unperturbed baseline. Results averaged across five seeds.

| Model | Noise σ = 0.10 | Noise σ = 0.20 | Noise σ = 0.40 | Brightness 0.50 | Brightness 1.50 |
|---|---|---|---|---|---|
| CustomNet | 97.4 (−0.4) | 96.8 (−1.0) | 93.4 (−4.4) | 97.5 (−0.3) | 94.0 (−3.8) |
| ResNet-50 | 97.4 (−0.5) | 96.5 (−1.5) | 93.0 (−5.0) | 97.8 (−0.2) | 94.3 (−3.7) |
| EfficientNet-B0 | 96.9 (−0.4) | 95.7 (−1.5) | 89.0 (−8.3) | 96.2 (−1.1) | 92.9 (−4.3) |
| MobileNet V3 | 96.0 (−0.6) | 93.7 (−3.0) | 87.0 (−9.6) | 95.9 (−0.7) | 91.6 (−5.0) |
Table 9. Inference benchmarks on NVIDIA Jetson Orin Nano Developer Kit Super (JetPack 36.4.7, PyTorch 2.8.0, CUDA 12.6, batch size 1, 200 timed iterations after 30 warmup passes). Accuracy reported from test-set evaluation (mean across 5 seeds). All models benchmarked at FP32 precision without quantization or pruning.

| Model | Accuracy (%) | Macro-F1 | Latency (ms) | FPS |
|---|---|---|---|---|
| CustomNet (proposed) | 97.79 | 0.973 | 86.70 | 11.5 |
| CustomNet (ResNet only) | 97.89 | 0.975 | 25.06 | 39.9 |
| CustomNet (EfficientNet only) | 97.32 | 0.968 | 35.49 | 28.2 |
| CustomNet (MobileNet only) | 97.12 | 0.966 | 26.84 | 37.3 |
| ResNet-50 | 97.93 | 0.975 | 23.65 | 42.3 |
| EfficientNet-B0 | 97.27 | 0.968 | 33.22 | 30.1 |
| MobileNet V3 | 96.61 | 0.962 | 24.61 | 40.6 |
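The latency figures in Table 9 follow the usual warmup-then-time protocol (30 warmup passes, then 200 timed iterations at batch size 1). The harness reduces to a few lines; in the sketch below a dummy callable stands in for the network, and on a GPU each timestamp would additionally need a synchronization call (e.g., torch.cuda.synchronize()) so that asynchronous kernels are fully accounted for:

```python
import time

def benchmark(model, inp, warmup: int = 30, iters: int = 200):
    """Return (mean latency in ms, throughput in FPS) for single-input inference."""
    for _ in range(warmup):          # stabilize clocks, caches, and allocators
        model(inp)
    start = time.perf_counter()
    for _ in range(iters):
        model(inp)
    elapsed = time.perf_counter() - start
    latency_ms = elapsed / iters * 1000.0
    return latency_ms, 1000.0 / latency_ms

# Dummy "model": a fixed-cost computation in place of a forward pass
dummy = lambda x: sum(i * i for i in range(2000))
lat, fps = benchmark(dummy, None)
print(f"{lat:.3f} ms -> {fps:.1f} FPS")
```

Averaging over many iterations after warmup is what makes the per-model FPS columns in Table 9 directly comparable.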
Table 10. Comparison of representative deep learning approaches for waste classification. Deployment context: Edge = inference benchmarked on embedded hardware; Server = GPU-server evaluation only; Alg. only = no deployment context reported. CV: cross-validation reported; Stats: statistical significance testing; Expl.: saliency visualization (Grad-CAM). Pruning/Quant.: whether model pruning or quantization was applied prior to deployment; NR: not reported by authors.

| Study | Model | Deployment | Classes | Images | Acc. (%) | CV | Stats | Expl. | Pruning/Quant. |
|---|---|---|---|---|---|---|---|---|---|
| Vo et al. [3] | DNN-TC (ResNeXt) | Server | 3/6 | 5904/2527 | 98.2/94.0 | No | No | No | NR |
| Adedeji and Wang [4] | ResNet-50 + SVM | Server | 6 | 2527 | 87.0 | No | No | No | NR |
| Fu et al. [15] | DL + Embedded Linux | Edge | 4 | ≈15,000 | 97.0 | No | No | No | NR |
| Chu et al. [20] | Multilayer Hybrid DL | Alg. only | 6 | 2527 | 92.0 | No | No | No | NR |
| Ahmad et al. [40] | Deep Feature Fusion | Alg. only | 6 | 2527 | 95.4 | No | No | No | NR |
| Wang et al. [16] | DL + IoT | Edge | 4 | ≈5000 | 95.0 | No | No | No | NR |
| Zhang et al. [28] | CNN + Transfer Learning | Alg. only | 4 | ≈15,000 | 96.8 | No | No | No | NR |
| Hossen et al. [37] | RWC-Net | Alg. only | 6 | 2527 | 97.8 | No | No | No | NR |
| CustomNet | 3-backbone fusion | Edge | 10 | 13,933 | 97.8 | Yes | Yes | Yes | No (baseline) |

Share and Cite

MDPI and ACS Style

Verber, D.; Grneva, T.; Dugonik, J. Image-Based Waste Classification Using a Hybrid Deep Learning Architecture with Transfer Learning and Edge AI Deployment. Mathematics 2026, 14, 1176. https://doi.org/10.3390/math14071176