Author Contributions
Conceptualization, D.V. and T.G.; methodology, T.G., J.D. and D.V.; software, T.G. and J.D.; validation, J.D. and D.V.; formal analysis, T.G.; investigation, T.G. and D.V.; resources, D.V.; data curation, T.G.; writing—original draft preparation, T.G., J.D. and D.V.; writing—review and editing, J.D., T.G. and D.V.; visualization, T.G.; supervision, D.V.; project administration, D.V.; funding acquisition, D.V. All authors have read and agreed to the published version of the manuscript.
Figure 1.
Class distribution in the waste image dataset (N = 13,933; 10 classes). The histogram highlights a moderate imbalance: Cardboard and Shoes dominate the corpus (2258 and 2213 images, respectively), while Metal and Battery are the least represented classes (679 and 1075 images, respectively). Imbalance ratio 3.3:1 (cardboard vs. metal).
Figure 2.
Architecture of the proposed CustomNet model. Features extracted from ResNet-50, EfficientNet-B0, and MobileNetV3 are concatenated (combined dimension: 4288), projected to a 1024-dimensional embedding, fused using multi-head self-attention (8 heads), and classified through a three-layer MLP (768 → 512 → 10 classes).
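The fusion dimensions in the caption can be cross-checked with a short sketch. The backbone widths below are the standard penultimate-layer sizes for these three architectures (2048, 1280, and 960); the dictionary keys are illustrative names, not identifiers from the authors' code.

```python
# Fusion dimensions from the Figure 2 caption; dict keys are
# illustrative names, not identifiers from the authors' code.
BACKBONE_DIMS = {
    "resnet50": 2048,          # ResNet-50 global-average-pooled features
    "efficientnet_b0": 1280,   # EfficientNet-B0 top features
    "mobilenetv3_large": 960,  # MobileNetV3-Large pre-classifier features
}

concat_dim = sum(BACKBONE_DIMS.values())  # concatenated feature vector
assert concat_dim == 4288                 # matches the caption

embed_dim, n_heads = 1024, 8              # projection width, attention heads
head_dim = embed_dim // n_heads           # per-head width must divide evenly
assert embed_dim % n_heads == 0 and head_dim == 128

mlp_dims = [768, 512, 10]                 # three-layer classification head
print(concat_dim, head_dim, mlp_dims)
```

The divisibility check matters in practice: multi-head attention requires the embedding width to split evenly across heads, which 1024 and 8 satisfy.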
Figure 3.
Multi-class one-vs.-rest ROC curves for all four models on the held-out test set (seed 0). All models achieve high AUC values (>0.98) across all classes, confirming strong discriminative power under transfer learning. CustomNet and ResNet-50 show the tightest curves toward the upper-left corner for minority classes (Metal, Battery), which is consistent with the macro ROC–AUC results in Table 4. (a) ResNet-50. (b) EfficientNet-B0. (c) MobileNet V3. (d) CustomNet (proposed).
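One-vs.-rest evaluation reduces the 10-class problem to ten binary problems, one per class, and each panel's per-class AUC can be computed directly from class scores via the Mann–Whitney pairwise formulation. A minimal sketch on toy scores (not the paper's predictions):

```python
def auc_ovr(scores, labels, positive):
    """One-vs.-rest ROC AUC via the Mann-Whitney pairwise formulation:
    the fraction of (positive, negative) score pairs ranked correctly,
    counting ties as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == positive]
    neg = [s for s, y in zip(scores, labels) if y != positive]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy scores for a "metal" class: perfect separation gives AUC = 1.0.
scores = [0.9, 0.8, 0.3, 0.1]
labels = ["metal", "metal", "glass", "paper"]
print(auc_ovr(scores, labels, "metal"))  # 1.0
```

Macro ROC–AUC, as reported in Table 4, is then the unweighted mean of these ten per-class values.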
Figure 4.
Multi-class precision–recall curves for all four models on the held-out test set (seed 0). PR curves are a more informative view than ROC under class imbalance. CustomNet and ResNet-50 show the highest precision at high recall for minority classes (Metal, Battery), which is consistent with the Focal Loss training objective and multi-backbone fusion design. (a) ResNet-50. (b) EfficientNet-B0. (c) MobileNet V3. (d) CustomNet (proposed).
Figure 5.
Grad-CAM saliency visualizations for all four models across all 10 waste classes. Heatmaps are computed with respect to the ground-truth class label to ensure consistent comparison across models. Warmer colors indicate regions contributing most strongly to the classification decision. CustomNet consistently produces more focused activations on material-specific regions (e.g., container rims, surface textures, structural edges) compared to single-backbone baselines, particularly for visually similar classes such as metal and glass.
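Grad-CAM itself is a simple computation: each feature map is weighted by the spatial mean of the gradient of the target-class score with respect to that map, and the weighted sum is passed through a ReLU. A dependency-free sketch on toy 2 × 2 maps (real use would hook a convolutional layer of each model and use autograd for the gradients):

```python
def grad_cam(activations, gradients):
    """Grad-CAM core: alpha_k = spatial mean of gradient map k,
    CAM = ReLU(sum_k alpha_k * A_k). Inputs are K maps, each an
    HxW list of lists (toy stand-ins for conv feature maps)."""
    alphas = [sum(sum(row) for row in g) / (len(g) * len(g[0]))
              for g in gradients]                       # per-map weights
    h, w = len(activations[0]), len(activations[0][0])
    return [[max(0.0, sum(a_k * act[i][j]
                          for a_k, act in zip(alphas, activations)))
             for j in range(w)] for i in range(h)]      # ReLU of weighted sum

acts = [[[1.0, 0.0], [0.0, 0.0]], [[0.0, 2.0], [0.0, 0.0]]]
grads = [[[1.0, 1.0], [1.0, 1.0]], [[-1.0, -1.0], [-1.0, -1.0]]]
print(grad_cam(acts, grads))  # [[1.0, 0.0], [0.0, 0.0]]
```

Computing the maps against the ground-truth label, as the caption notes, keeps the comparison consistent even when a model mispredicts.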
Figure 6.
Grad-CAM saliency maps for the three individual backbone branches within CustomNet (ResNet-50, MobileNet V3, EfficientNet-B0) across all 10 classes. All three branches attend to largely consistent discriminative regions, validating that the multi-backbone fusion integrates complementary rather than redundant representations. Minor differences are visible in the glass and metal rows, where branch-level activations highlight different sub-regions of the same object.
Figure 7.
Grad-CAM comparison between full CustomNet and the no-attention variant across all 10 classes. The full model produces focused activations on discriminative object regions (e.g., bottle texture for glass, sole pattern for shoes). The no-attention variant shows diffuse or absent activations, consistent with its near-random classification performance, confirming that the attention module is essential for routing image-derived signals to the classifier.
Figure 8.
Grad-CAM saliency maps for the most frequently confused class pairs (paper vs. cardboard, plastic vs. clothes) for ResNet-50 and CustomNet. Both models activate similar regions for these visually similar categories, indicating that residual confusions are driven by genuine inter-class visual similarity rather than model artifacts.
Figure 9.
Confusion matrix for CustomNet on the held-out test set (fully trained model, 20 epochs, seed 0). The most frequent misclassifications occur between Paper and Cardboard and between Plastic and Clothes, which is consistent with the Grad-CAM analysis showing shared structural activations for these visually similar category pairs.
Table 1.
Per-class image counts across dataset splits. Total dataset: 13,933 images, 10 classes. Sources: Kaggle Garbage Classification [45] (61.1%) and in-house collection (38.9%). Split ratio: 80% train, 10% val, 10% test (stratified).
| Class | Train | Val | Test | Total |
|---|---|---|---|---|
| Battery | 860 | 108 | 107 | 1075 |
| Biological | 918 | 115 | 114 | 1147 |
| Cardboard | 1806 | 226 | 226 | 2258 |
| Clothes | 1032 | 129 | 129 | 1290 |
| Glass | 1121 | 140 | 140 | 1401 |
| Metal | 543 | 68 | 68 | 679 |
| Paper | 1318 | 165 | 165 | 1648 |
| Plastic | 795 | 99 | 100 | 994 |
| Shoes | 1770 | 221 | 222 | 2213 |
| Trash | 982 | 123 | 123 | 1228 |
| Total | 11,145 | 1394 | 1394 | 13,933 |
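The split counts in Table 1 can be sanity-checked programmatically: per class, the train partition is 80% of the total (rounded), and validation and test share the remainder to within one image, which is what stratification guarantees. The counts below are copied from the table.

```python
# Per-class (train, val, test) counts copied from Table 1.
counts = {
    "Battery": (860, 108, 107),    "Biological": (918, 115, 114),
    "Cardboard": (1806, 226, 226), "Clothes": (1032, 129, 129),
    "Glass": (1121, 140, 140),     "Metal": (543, 68, 68),
    "Paper": (1318, 165, 165),     "Plastic": (795, 99, 100),
    "Shoes": (1770, 221, 222),     "Trash": (982, 123, 123),
}
for cls, (tr, va, te) in counts.items():
    n = tr + va + te
    assert tr == round(0.8 * n)  # train is 80% of the class total (rounded)
    assert abs(va - te) <= 1     # val/test split the rest to within 1 image

grand = sum(sum(c) for c in counts.values())
print(grand)  # 13,933 images in total
```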
Table 2.
Overview of applied data augmentation techniques.
| Transformation | Range/Probability | Rationale |
|---|---|---|
| Rotation | | Orientation invariance |
| Horizontal flip | | Symmetry exploitation |
| Scaling | 90– | Size variation |
| Translation | px | Spatial robustness |
| Brightness | – | Illumination diversity |
| Contrast | – | Illumination diversity |
| Gaussian noise | / | Sensor simulation |
| Gaussian blur | 5 × 5 filter | Sensor simulation |
Table 3.
Comparison of selected CNN models.
| Model | Params (M) | FLOPs (G) | Input | Notes |
|---|---|---|---|---|
| ResNet-50 [8] | 25.6 | 4.1 | 224 × 224 | Deep residual connections |
| EfficientNet-B0 [10] | 5.3 | 0.39 | 224 × 224 | Compound-scaled efficiency |
| MobileNetV3-Large [11] | 5.4 | 0.23 | 224 × 224 | Edge-device optimized |
Table 4.
Average test performance across five runs (mean ± SD). ResNet-50 achieves the numerically highest accuracy and macro-F1 among all evaluated models. Bold: proposed method (CustomNet).
| Model | Accuracy (%) | Macro-F1 | ROC–AUC | Params (M) | FLOPs (G) |
|---|---|---|---|---|---|
| ResNet-50 | | | | 25.6 | 4.1 |
| EfficientNet-B0 | | | | 5.3 | 0.39 |
| MobileNet V3 | | | | 2.5 | 0.22 |
| **CustomNet (proposed)** | | | | (9.7 head) | ≈5.7 |
Table 5.
Ablation study of CustomNet components (mean ± SD across five seeds). ΔF1 computed against full CustomNet. Statistical significance tested against full CustomNet using a paired t-test or Wilcoxon signed-rank test with Bonferroni correction.
| Configuration | Accuracy (%) | Macro-F1 | ΔF1 | Sig.? |
|---|---|---|---|---|
| Full CustomNet | | | — | — |
| No attention module | | | | Yes |
| No feature fusion (ResNet only) | | | | No |
| No feature fusion (MobileNet only) | | | | Yes |
| No feature fusion (EfficientNet only) | | | | Yes |
| No data augmentation | | | | No |
| Focal Loss → Cross-entropy | | | | No |
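The last ablation row swaps the Focal Loss training objective for plain cross-entropy. The standard focal loss, FL(p_t) = −α(1−p_t)^γ log p_t, down-weights well-classified examples and so shifts gradient mass toward hard minority-class errors such as Metal and Battery. A sketch with the commonly used defaults for α and γ, which are assumptions here since the paper's exact settings are not restated in this table:

```python
import math

def focal_loss(p_t, alpha=0.25, gamma=2.0):
    """Focal loss for true-class probability p_t. alpha and gamma are
    the common defaults, not necessarily the paper's settings."""
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)

def cross_entropy(p_t):
    """Plain cross-entropy on the true-class probability."""
    return -math.log(p_t)

# A well-classified example (p_t = 0.9) is down-weighted far more
# strongly than a hard one (p_t = 0.1).
for p in (0.9, 0.5, 0.1):
    print(p, round(cross_entropy(p), 3), round(focal_loss(p), 3))
```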
Table 6.
Shapiro–Wilk normality test results for cross-validation macro-F1 distributions (25 scores per model, 5 folds × 5 seeds). W: test statistic; p: p-value. Models whose p-value falls below the significance level are treated as non-normal and use the Wilcoxon signed-rank test for pairwise comparisons.
| Model | W | p | Normal? |
|---|---|---|---|
| CustomNet (full) | 0.963 | 0.471 | Yes |
| CustomNet (no attention) | 0.896 | 0.014 | No |
| EfficientNet-B0 | 0.971 | 0.657 | Yes |
| MobileNet V3 | 0.957 | 0.358 | Yes |
| ResNet-50 | 0.968 | 0.581 | Yes |
Table 7.
Pairwise statistical comparison of models using cross-validation macro-F1 scores (25 values per model). Paired t-test used for all comparisons; the Wilcoxon signed-rank test is substituted where normality is rejected (Table 6). p-values are Bonferroni corrected. Cohen’s d computed using pooled standard deviation. Bootstrap 95% CI computed by resampling test-set predictions.
| Comparison | ΔF1 | p-Value | Sig.? | Cohen’s d | 95% CI (F1) |
|---|---|---|---|---|---|
| CustomNet vs. EfficientNet-B0 | | | Yes | 3.203 | [0.969, 0.977] |
| CustomNet vs. MobileNet V3 | | | Yes | 4.013 | [0.969, 0.977] |
| CustomNet vs. ResNet-50 | | | No | 0.218 | [0.969, 0.977] |
| EfficientNet-B0 vs. ResNet-50 | | | Yes | 2.891 | [0.964, 0.972] |
| MobileNet V3 vs. ResNet-50 | | | Yes | 3.744 | [0.958, 0.966] |
| EfficientNet-B0 vs. MobileNet V3 | | | No | 0.987 | [0.964, 0.972] |
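Each comparison pairs the 25 fold-by-seed macro-F1 scores of two models. The paired t statistic and pooled-SD Cohen's d used in this table can be sketched in a few lines; the score vectors below are illustrative toy values, not the study's results:

```python
import math
import statistics as st

def paired_t(a, b):
    """Paired t statistic over matched CV scores (df = n - 1)."""
    d = [x - y for x, y in zip(a, b)]
    return st.mean(d) / (st.stdev(d) / math.sqrt(len(d)))

def cohens_d(a, b):
    """Cohen's d with pooled standard deviation, as in Table 7."""
    pooled = math.sqrt((st.variance(a) + st.variance(b)) / 2)
    return (st.mean(a) - st.mean(b)) / pooled

# Toy macro-F1 vectors standing in for the 25 fold-by-seed values.
# Bonferroni correction would then multiply each raw p-value by the
# number of pairwise comparisons (6 in Table 7).
a = [0.973, 0.975, 0.971, 0.974, 0.972]
b = [0.968, 0.969, 0.966, 0.970, 0.967]
print(round(paired_t(a, b), 2), round(cohens_d(a, b), 2))
```

Note the two statistics answer different questions: the t statistic scales with sample size, while Cohen's d expresses the gap in units of score spread, which is why a non-significant comparison can still report a small d (e.g., 0.218 for CustomNet vs. ResNet-50).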
Table 8.
Robustness evaluation: test accuracy (%) under Gaussian noise and brightness perturbations. Delta values show accuracy change relative to unperturbed baseline. Results averaged across five seeds.
| Model | Noise σ = 0.10 | Noise σ = 0.20 | Noise σ = 0.40 | Brightness 0.50 | Brightness 1.50 |
|---|---|---|---|---|---|
| CustomNet | 97.4 (−0.4) | 96.8 (−1.0) | 93.4 (−4.4) | 97.5 (−0.3) | 94.0 (−3.8) |
| ResNet-50 | 97.4 (−0.5) | 96.5 (−1.5) | 93.0 (−5.0) | 97.8 (−0.2) | 94.3 (−3.7) |
| EfficientNet-B0 | 96.9 (−0.4) | 95.7 (−1.5) | 89.0 (−8.3) | 96.2 (−1.1) | 92.9 (−4.3) |
| MobileNet V3 | 96.0 (−0.6) | 93.7 (−3.0) | 87.0 (−9.6) | 95.9 (−0.7) | 91.6 (−5.0) |
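The two perturbation families in this table are cheap to reproduce: additive zero-mean Gaussian noise with the listed σ, and a multiplicative brightness factor, both followed by clipping to the valid pixel range. A sketch on a toy pixel list in [0, 1] (the ordering of brightness before noise, and the clipping step, are assumptions about the evaluation pipeline):

```python
import random

def perturb(pixels, sigma=0.0, brightness=1.0, seed=0):
    """Apply the Table 8 perturbations to pixels in [0, 1]: scale by a
    brightness factor, add Gaussian noise with std sigma, then clip
    back to the valid range."""
    rng = random.Random(seed)
    return [min(1.0, max(0.0, p * brightness + rng.gauss(0.0, sigma)))
            for p in pixels]

img = [0.2, 0.5, 0.8]
print(perturb(img, sigma=0.2))       # noise only
print(perturb(img, brightness=0.5))  # darkened: [0.1, 0.25, 0.4]
print(perturb(img, brightness=1.5))  # brightened; 0.8 * 1.5 clips to 1.0
```

Clipping matters for the brightness-1.5 condition: bright regions saturate, which is one reason brightening degrades accuracy more than mild darkening in the table.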
Table 9.
Inference benchmarks on NVIDIA Jetson Orin Nano Developer Kit Super (JetPack 36.4.7, PyTorch 2.8.0, CUDA 12.6, batch size 1, 200 timed iterations after 30 warmup passes). Accuracy reported from test-set evaluation (mean across 5 seeds).
All models benchmarked at FP32 precision without quantization or pruning.
| Model | Accuracy (%) | Macro-F1 | Latency (ms) | FPS |
|---|---|---|---|---|
| CustomNet (proposed) | 97.79 | 0.973 | 86.70 | 11.5 |
| CustomNet (ResNet only) | 97.89 | 0.975 | 25.06 | 39.9 |
| CustomNet (EfficientNet only) | 97.32 | 0.968 | 35.49 | 28.2 |
| CustomNet (MobileNet only) | 97.12 | 0.966 | 26.84 | 37.3 |
| ResNet-50 | 97.93 | 0.975 | 23.65 | 42.3 |
| EfficientNet-B0 | 97.27 | 0.968 | 33.22 | 30.1 |
| MobileNet V3 | 96.61 | 0.962 | 24.61 | 40.6 |
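The benchmarking protocol (30 warmup passes, 200 timed iterations, batch size 1) maps directly onto a small timing harness. The lambda below stands in for a single forward pass; on-device timing of a GPU model would additionally require synchronization (e.g., torch.cuda.synchronize()) around the timed region so that asynchronous kernels are counted:

```python
import time

def benchmark(fn, warmup=30, iters=200):
    """Latency benchmark mirroring the Table 9 protocol: discard warmup
    passes, then average wall-clock time over the timed iterations.
    Returns (ms per call, calls per second)."""
    for _ in range(warmup):   # untimed warmup (caches, clocks, JIT)
        fn()
    t0 = time.perf_counter()
    for _ in range(iters):
        fn()
    latency_ms = (time.perf_counter() - t0) / iters * 1e3
    return latency_ms, 1e3 / latency_ms

# Toy stand-in for a batch-1 forward pass.
lat, fps = benchmark(lambda: sum(range(1000)))
print(f"{lat:.3f} ms/call, {fps:.1f} FPS")
```

FPS is simply the reciprocal of batch-1 latency, e.g. 1000 / 86.70 ms ≈ 11.5 FPS for CustomNet, matching the table.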
Table 10.
Comparison of representative deep learning approaches for waste classification. Deployment context: Edge = inference benchmarked on embedded hardware; Server = GPU-server evaluation only; Alg. only = no deployment context reported. CV: cross-validation reported; Stats: statistical significance testing; Expl.: saliency visualization (Grad-CAM).
Pruning/Quant.: whether model pruning or quantization was applied prior to deployment; NR: not reported by authors.
| Study | Model | Deployment | Classes | Images | Acc. (%) | CV | Stats | Expl. | Pruning/Quant. |
|---|---|---|---|---|---|---|---|---|---|
| Vo et al. [3] | DNN-TC (ResNeXt) | Server | 3/6 | 5904/2527 | 98.2/94.0 | No | No | No | NR |
| Adedeji and Wang [4] | ResNet-50 + SVM | Server | 6 | 2527 | 87.0 | No | No | No | NR |
| Fu et al. [15] | DL + Embedded Linux | Edge | 4 | ≈15,000 | 97.0 | No | No | No | NR |
| Chu et al. [20] | Multilayer Hybrid DL | Alg. only | 6 | 2527 | 92.0 | No | No | No | NR |
| Ahmad et al. [40] | Deep Feature Fusion | Alg. only | 6 | 2527 | 95.4 | No | No | No | NR |
| Wang et al. [16] | DL + IoT | Edge | 4 | ≈5000 | 95.0 | No | No | No | NR |
| Zhang et al. [28] | CNN + Transfer Learning | Alg. only | 4 | ≈15,000 | 96.8 | No | No | No | NR |
| Hossen et al. [37] | RWC-Net | Alg. only | 6 | 2527 | 97.8 | No | No | No | NR |
| CustomNet | 3-backbone fusion | Edge | 10 | 13,933 | 97.8 | Yes | Yes | Yes | No (baseline) |