Article

Limitations of Influence-Based Dataset Compression for Waste Classification

1 Chair of Waste Processing Technology and Waste Management, Technical University of Leoben, 8700 Leoben, Austria
2 Siemens Aktiengesellschaft, 1210 Vienna, Austria
3 Pro2Future GmbH, 8010 Graz, Austria
* Author to whom correspondence should be addressed.
Data 2025, 10(8), 127; https://doi.org/10.3390/data10080127
Submission received: 1 July 2025 / Revised: 1 August 2025 / Accepted: 4 August 2025 / Published: 7 August 2025

Abstract

Influence-based data selection methods, such as TracIn, aim to estimate the impact of individual training samples on model predictions and are increasingly used for dataset curation and reduction. This study investigates whether selecting the most positively influential training examples can be used to create compressed yet effective training datasets for transfer learning in plastic waste classification. Using a ResNet-18 model trained on a custom dataset of plastic waste images, TracIn was applied to compute influence scores across multiple training checkpoints. The top 50 influential samples per class were extracted and used to train a new model. Contrary to expectations, models trained on these highly influential subsets significantly underperformed compared to models trained on either the full dataset or an equally sized random sample. Further analysis revealed that many top-ranked influential images originated from different classes, indicating model biases and potential label confusion. These findings highlight the limitations of using influence scores for dataset compression. However, TracIn proved valuable for identifying problematic or ambiguous samples, class imbalance issues, and issues with fuzzy class boundaries. Based on the results, the utilized TracIn approach is recommended as a diagnostic instrument rather than for dataset curation.

1. Introduction

The success of modern deep learning models is critically dependent on the quality and composition of the training data. In the domain of waste management, data quality poses a significant challenge. With adequate datasets, waste classification models can achieve accuracies of up to 98% [1]. Publicly available datasets, however, often lack the necessary diversity, contain few relevant classes, and fail to reflect real-world conditions [2]. Consequently, such datasets are inadequate for training classification models for deployment in operational waste sorting facilities.
Due to the heterogeneity of waste particles, creating large, diverse datasets for model training is necessary. However, such experimentally acquired datasets are unbalanced because of the nature of the waste stream. Specific particle types (e.g., common bottles from large brands) are often overrepresented in the data, which introduces biases into the training set, whereas other objects are underrepresented and in some cases appear only once. These inherent issues promote problems such as overfitting of the classification model [3]. Bockreis et al. highlighted these issues with a few-shot learning experiment in which a ResNet18 model trained on 20 images outperformed a model trained on the complete dataset [4]. Based on these findings and the conditions in waste sorting, the hypothesis of this work is that a small, compressed, balanced dataset reflecting the entire range of possible particles could lead to higher classification accuracies than training on a larger, unbalanced dataset.
While extensive research has been devoted to optimizing network architectures and training procedures, the impact of individual training samples on model predictions remains largely underexplored. Understanding how specific data points or images influence model behavior is vital for improving model robustness, debugging datasets, and enhancing overall data quality.
Recent advances in influence estimation techniques, particularly the Tracking Influence (TracIn) method, provide a scalable and practical way to estimate the contribution of individual training samples to a specific test prediction. TracIn estimates how much a training image influences a model’s prediction by comparing the gradients of the training and test images at various checkpoints throughout training. The method tracks how the loss on the test input would have changed if a particular training example had not been used, allowing it to identify which training points positively or negatively influence the prediction [5]. This approach can be seen as a scalable successor to traditional Leave-One-Out retraining, in which one training image is left out, the model is retrained on the reduced dataset, and the left-out image is then used as a test image; this process is repeated for every image in the training set. While Leave-One-Out methods provide exact influence estimates, they are time-consuming and computationally expensive [6]. In contrast, TracIn leverages gradient similarity to efficiently quantify influence without requiring retraining.
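For reference, the checkpoint-based approximation introduced in the original TracIn work [5] can be written as follows (notation adapted to this manuscript; $w_{t_i}$ denotes the model weights at checkpoint $i$, $\eta_i$ the corresponding learning rate, $\ell$ the loss, $z$ a training example, and $z'$ a test example):

$$\mathrm{TracInCP}(z, z') = \sum_{i=1}^{k} \eta_i \, \nabla \ell(w_{t_i}, z) \cdot \nabla \ell(w_{t_i}, z')$$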
While prior work has explored TracIn’s ability to explain individual predictions, little is known about whether highly influential training samples are also suitable for training models in a data-efficient manner. Therefore, the emerging question is whether a high influence on the model means these images are highly valuable for learning.
To answer this question, an experimental image dataset, named DWRL, of plastic waste particles from a local collection site was recorded in an industry-near environment. The data was processed and prepared for model training. A convolutional neural network (ResNet18) was trained and evaluated. TracIn was employed to analyze the influence of individual training samples on model predictions. Multiple training checkpoints were analyzed to examine how influence evolves over time to capture a more nuanced view of model behavior throughout training. The analysis was performed in a highly parallelized manner to reduce the computational time. The most influential images for each class were identified, visualized, and compiled into a dataset. The presented methodology enables transparent inspection of the training process, provides a reproducible framework for analyzing model–data interactions, and provides insights into the inner workings of the model and its biases.

2. Materials and Methods

An overview of the materials and methods used is presented in Figure 1. The post-consumer plastics were sampled at a local waste collection center. The material was prepared for data recording, which included drum-sieving, sorting the material into predefined classes, and class verification.
In the model training phase, a convolutional neural network was trained on the full dataset using a fixed learning rate of 0.0001, a batch size of 32, and 20 training epochs. All model checkpoints were saved throughout training to enable later analysis.
TracIn influence estimation was then applied to assess the contribution of individual training images to the predictions of the trained model. Influence estimation allows the identification of training examples that positively influenced the model’s predictions. Subsequently, the 50 most influential training images per class were identified; these were used to retrain a classification model and to perform a visual inspection. Finally, the retrained model was subjected to a performance evaluation, and the results were used to derive key findings and recommendations, which are summarized in Sections 3 and 4.
All computational tasks were performed on an MSI Vector 17 HX laptop equipped with an Intel Core i9-13980HX CPU, an Nvidia RTX 4080 mobile GPU with 12 GB of VRAM, and 32 GB of DDR5 RAM.

2.1. Sampling and Data Collection

Four plastic waste samples, each weighing approximately 30 kg, were obtained from a local waste collection center in Leoben, Austria. The samples were further processed in the Digital Waste Research Lab (DWRL), a research facility of the Chair of Waste Processing Technology and Waste Management at the Technical University of Leoben. The DWRL is a near-industry-scale facility specializing in waste sorting and on-time waste characterization with various sensor equipment, including near infrared (NIR) hyperspectral imaging, 3D laser triangulation, RGB cameras, and metal detectors [7]. The sample preparation was based on the processes typically implemented in Austrian plastic sorting plants [8].
First, the samples were drum-sieved using 60 mm screen plates for 180 s at a rotational speed of 5 rpm. The rotating drum was constructed as an equilateral octagonal prism, with each side measuring 454 mm and a drum depth of 1000 mm [9]. Only particles over 60 mm were further processed. The individual plastic particles were sorted into seven material classes: polyethylene terephthalate (PET), polypropylene (PP), polyethylene (PE), beverage cartons or Tetra Pak (Tetra), polyvinyl chloride (PVC), polystyrene (PS), and “other”. These classes represent typical target fractions in plastic sorting plants in Austria. The first sorting step was performed using the DWRL’s NIR sensor-based sorting system (EVK Helios G2, EVK DI Kerschhaggl GmbH, Raaba, Austria). All particles were validated by hand using a Thermo Fisher Scientific Inc.™ (Waltham, MA, USA) microPHAZIR™ NIR handheld spectrometer with the manufacturer-supplied calibration and database. The class “other” comprises everything else, such as non-plastics, material non-classifiable by NIR, or multilayer packaging.
The data acquisition was conducted with 5 MP Basler ace2Pro a2A2448-23gcPRO (Basler AG, Ahrensburg, Germany) industrial cameras paired with Tamron M117FM06-RG 6 mm f2.4 lenses. The cameras were set to auto-acquisition mode with an upper exposure time limit of 12,000 µs and an analog gain of 9 dB. A more detailed list of settings is available in [2]. All recordings were conducted on a running conveyor belt (0.5 m/s, 1200 mm wide) to reflect real-world conditions in a waste sorting plant. The captured data was processed on a Siemens Industrial PC (IPC) 427E using the Basler Pylon Viewer (7.5.0.15479) and subsequently uploaded to the cloud for further processing.

2.2. Two-Stage Approach

This study employed a two-stage approach for plastic waste classification to address the complexity of differentiating between overlapping, visually similar plastic materials under varying environmental and imaging conditions. In the first stage, an object detection model separates individual objects from the background (conveyor belt). In the second stage, these cropped objects are passed to a dedicated classification model trained to identify the different material classes. This modular design offers several advantages: it allows independent optimization of the detection and classification tasks, and prior studies have shown that two-stage architectures outperform end-to-end models in industrial classification tasks where object localization and detailed material recognition are critical [10]. Furthermore, the modular separation of the two tasks allows the reuse of general-purpose detectors and facilitates domain adaptation in the classification stage, making the system more flexible and scalable [11].

2.2.1. Used Object Detector

For object detection, an SSD Lite object detector [12] was used and trained on a synthetic dataset comprising 4000 synthetically generated images. Each image displays monochromatic waste particles of various shapes on a conveyor belt featuring a repeating texture pattern. Early stopping based on the mean Average Precision (mAP) was implemented to prevent overfitting. Data augmentations, including rotations, color shifting, and filtering, were applied to further enhance model robustness, as the objects were invariant to these transformations. The augmentation process expands the synthetic training dataset, enabling the model to learn more generalized features. After training, the SSD Lite detector was used for batch predictions on the DWRL dataset. A portion of the predictions, selected based on the entropy of the predicted class scores [12], was reviewed and corrected where necessary. This active learning approach helped to reduce the manual labeling effort. The detector was then retrained on the corrected portion of the DWRL dataset, and a further batch prediction on all DWRL data showed that this single iteration was sufficient. The resulting object detections were used to crop the objects from the conveyor-belt images and form the classification dataset.
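To illustrate the entropy-based selection step described above, the following minimal Python sketch ranks detector predictions by the Shannon entropy of their class-probability vectors so that the most uncertain images can be prioritized for manual review; the function and variable names are illustrative and not taken from the original pipeline:

```python
import numpy as np

def prediction_entropy(class_probs: np.ndarray) -> float:
    """Shannon entropy of one prediction's class-probability vector."""
    p = np.clip(class_probs, 1e-12, 1.0)  # avoid log(0)
    return float(-(p * np.log(p)).sum())

def select_for_review(image_ids, prob_vectors, fraction=0.1):
    """Return the most uncertain fraction of images, ranked by entropy."""
    scores = [prediction_entropy(p) for p in prob_vectors]
    ranked = sorted(zip(image_ids, scores), key=lambda item: item[1], reverse=True)
    n_review = max(1, int(len(ranked) * fraction))
    return [img_id for img_id, _ in ranked[:n_review]]
```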

2.2.2. Classification Model

For classification, a ResNet-18 model pre-trained on ImageNet was used. ResNet-18 was chosen for its proven effectiveness in feature extraction for transfer learning and its ability to deliver high accuracy at relatively low computational cost, which is particularly important for computationally intensive methods such as TracIn. The final fully connected layer was replaced with a new layer comprising seven output nodes, corresponding to the number of classes in the DWRL dataset. The data, consisting of the objects cropped by the object detection model, was split into training, validation, and test sets. All images were resized to 224 × 224 pixels, converted to tensors, and normalized using standard ImageNet statistics (mean: [0.485, 0.456, 0.406]; standard deviation: [0.229, 0.224, 0.225]) [13]. The model was trained for 20 epochs on the GPU, and after every epoch a snapshot of the model was saved as a checkpoint for the TracIn process. The learning rate was set to 0.0001, the batch size to 32, and the Adam optimizer was used. No augmentation was applied during training. The performance on the test set was evaluated using the checkpoint that performed best on the validation set.
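A minimal PyTorch sketch of this transfer learning setup is shown below; the data loader construction and checkpoint directory are illustrative assumptions, while the hyperparameters follow the values stated above:

```python
import os
import torch
from torch import nn
from torchvision import models, transforms

NUM_CLASSES = 7  # DWRL material classes

# Pre-processing applied to every cropped object (ImageNet statistics)
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def train_classifier(train_loader, epochs=20, lr=1e-4, ckpt_dir="checkpoints"):
    """Fine-tune an ImageNet-pretrained ResNet-18 and save a checkpoint after every epoch."""
    os.makedirs(ckpt_dir, exist_ok=True)
    model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
    model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)  # replace final layer
    model = model.cuda()
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)

    for epoch in range(1, epochs + 1):
        model.train()
        for images, labels in train_loader:  # assumed DataLoader with batch_size=32
            images, labels = images.cuda(), labels.cuda()
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
        # one snapshot per epoch; a subset of these is later reused by TracIn
        torch.save(model.state_dict(), f"{ckpt_dir}/resnet18_epoch_{epoch}.pt")
    return model
```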

2.3. Influence Estimation–TracIn

The goal was to identify the training images with the greatest positive or negative impact on the trained ResNet18 classification model with selected test samples. Additionally, the aim was to identify inherent issues with the dataset and gain insights into the decision-making process of the classification model.
To estimate the influence of individual training samples on model predictions, TracIn was employed, which is particularly suitable for large-scale datasets and constrained computational resources [5]. Alternative methods, such as Influence Functions [14] or Data Shapley [15], were not considered due to their prohibitive computational requirements, instability in deep learning settings, or lack of scalability. While theoretically grounded, Influence Functions require Hessian inversion and are known to be unreliable for the non-convex loss landscapes common in deep networks [16]. Data Shapley and Leave-One-Out retraining [17] provide more exact influence estimates but are computationally infeasible on datasets of this size (14,978 images). Representer point methods [18] were not applicable due to architectural constraints and the non-linearity of the final classification layers. Therefore, TracIn was selected as a practical and effective compromise, enabling influence estimation at scale while maintaining interpretability.
Rather than analyzing a single model state, four checkpoints were evaluated, corresponding to the models after 5, 10, 15, and 20 training epochs. These checkpoints represent different stages of the model’s learning progression. Due to the high computational cost of TracIn, only these four of the twenty saved checkpoints were considered, which reduced the computation time by 80% (roughly 200 h). Evaluating several checkpoints was motivated by the hypothesis that training samples with an initially high positive influence may later hinder model generalization or contribute to overfitting. The TracIn procedure was applied as follows:
  • For each selected test image, a gradient vector was computed using the CrossEntropyLoss function.
  • The same gradient computation was performed for each training image.
  • The influence score for each checkpoint was determined as the dot product between the training and test gradients.
  • This procedure was repeated for each of the four model checkpoints and summed up, resulting in the influence score of one training image. The formula used for the TracIn approach is presented in Figure 1.
  • The influence scores were normalized per class using min–max normalization (see (1)), resulting in values between 0 and 1.
$$\mathrm{Influence}_{\mathrm{norm}} = \frac{\mathrm{Influence} - \min(\mathrm{Influence})_{\mathrm{class}}}{\max(\mathrm{Influence})_{\mathrm{class}} - \min(\mathrm{Influence})_{\mathrm{class}}}\tag{1}$$
Gradient computations and influence scoring were performed individually for each test image, which is computationally expensive. The analysis was therefore parallelized using Python’s multiprocessing module with 16 worker processes. The tqdm library [19] was used to display progress bars for transparency and monitoring during execution. The influence scores computed for each test image and model checkpoint were serialized and saved using Python’s pickle module. After all computations, the stored files were loaded, and the influence scores were aggregated across checkpoints to obtain a more robust estimate of each training image’s overall influence.
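The procedure above can be condensed into the following sketch (the actual script is provided as Supplementary Material; checkpoint paths, sample containers, and helper names here are illustrative). For one test image, it accumulates the per-checkpoint gradient dot products with every training image and applies the min-max normalization of Equation (1):

```python
import torch
from torch import nn
from torchvision import models

criterion = nn.CrossEntropyLoss()

def flat_grad(model, image, label):
    """Loss gradient w.r.t. all model parameters, flattened into a single vector."""
    loss = criterion(model(image.unsqueeze(0)), label.unsqueeze(0))
    grads = torch.autograd.grad(loss, [p for p in model.parameters() if p.requires_grad])
    return torch.cat([g.reshape(-1) for g in grads])

def tracin_scores(checkpoint_paths, train_samples, test_image, test_label, lr=1e-4):
    """Sum of per-checkpoint gradient dot products (formula shown in Figure 1)."""
    scores = torch.zeros(len(train_samples))
    for ckpt in checkpoint_paths:  # e.g., the models after epochs 5, 10, 15, and 20
        model = models.resnet18(num_classes=7)
        model.load_state_dict(torch.load(ckpt))
        model.eval()
        test_grad = flat_grad(model, test_image, test_label)
        for i, (img, lbl) in enumerate(train_samples):
            scores[i] += lr * torch.dot(flat_grad(model, img, lbl), test_grad)
    return scores

def minmax_normalize(scores):
    """Per-class min-max normalization, cf. Equation (1)."""
    lo, hi = scores.min(), scores.max()
    return (scores - lo) / (hi - lo + 1e-12)
```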

2.4. Selection and Visualization of Top 50 Influential Images

The training samples with the highest positive normalized influence scores were identified for each material class. The top 50 images per class were selected based on the few-shot learning experiments conducted in prior work [2], which indicated that models trained on similarly sized subsets could perform equally well or better in waste classification tasks. These images were stored in a structured output directory, with file names reflecting their rank and class, to support subsequent visual inspection and qualitative interpretation. The images were bundled into a class-balanced, influence-based dataset named DWRL_top50, which was later used to train and evaluate a ResNet-18 classification model.
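A short sketch of this selection and export step is given below; the directory layout and data structure are illustrative assumptions:

```python
import shutil
from pathlib import Path

def export_top_k(influence_by_class, k=50, out_dir="DWRL_top50"):
    """influence_by_class maps a class name to a list of (image_path, normalized_score) pairs."""
    for cls, scored in influence_by_class.items():
        ranked = sorted(scored, key=lambda item: item[1], reverse=True)[:k]
        target = Path(out_dir) / cls
        target.mkdir(parents=True, exist_ok=True)
        for rank, (img_path, score) in enumerate(ranked, start=1):
            # the file name encodes rank and score to ease visual inspection
            shutil.copy(img_path, target / f"{rank:02d}_{score:.3f}_{Path(img_path).name}")
```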

2.5. Evaluation and Comparison

All trained ResNet-18 models were evaluated on the same test set from the original DWRL dataset to ensure comparability of results. Both overall accuracy and accuracy per class were calculated to compare the models. Accuracy (2) is one of the most commonly used metrics in classification tasks. It measures the overall proportion of correct predictions made by the model across all classes. It is defined as
$$\mathrm{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}}\tag{2}$$
While accuracy gives a sense of overall model performance, it can be misleading in imbalanced datasets, where the model may appear to perform well by simply predicting the majority class. Per-class accuracy evaluates the classification performance for each class individually by computing accuracy over the samples of a single class (3). It is defined as
$$\text{Per-Class Accuracy}_i = \frac{\text{Correct Predictions for Class } i}{\text{Total Samples of Class } i}\tag{3}$$
The per-class accuracy metric is useful when assessing performance on unbalanced datasets or in multi-class problems, as it shows which classes are underperforming and reveals performance gaps that the overall accuracy might mask [20].
Precision was calculated as the ratio of true positive predictions to all positive predictions made by the model, reflecting the model’s ability to avoid false positives and indicating the reliability of positive class assignments (4).
$$\mathrm{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}\tag{4}$$
Additionally, the recall and the F1-Score were calculated. Recall, also known as sensitivity or true positive rate, is a performance metric that measures the proportion of actual positive instances correctly identified by a classification model (5). It is defined as
$$\mathrm{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}\tag{5}$$
A high recall indicates that the model has a high true positive rate, which is especially important in tasks where missing a positive case is costly.
F1-Score is the harmonic mean of precision and recall, providing a balanced measure that accounts for both false positives and false negatives (6). It is especially useful when the class distribution is imbalanced.
$$\text{F1-Score} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}\tag{6}$$
This metric combines the strengths of precision and recall into a single score, making it the preferred choice for evaluating classification models in practical applications [20].
Confusion matrices were used to visualize the results and identify reasons for misclassification.
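All of these metrics can be derived from a confusion matrix, for example with scikit-learn as in the following sketch (assuming y_true and y_pred hold the test-set labels and model predictions; this is not the original evaluation script):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support

def evaluate(y_true, y_pred, class_names):
    cm = confusion_matrix(y_true, y_pred)            # rows: true class, columns: predicted class
    overall_acc = np.trace(cm) / cm.sum()            # Equation (2)
    per_class_acc = cm.diagonal() / cm.sum(axis=1)   # Equation (3)
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, zero_division=0)             # Equations (4)-(6), per class
    for i, name in enumerate(class_names):
        print(f"{name}: acc={per_class_acc[i]:.4f}  P={precision[i]:.4f}  "
              f"R={recall[i]:.4f}  F1={f1[i]:.4f}")
    print(f"Overall accuracy: {overall_acc:.4f}")
    return cm
```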

3. Results

In this section, all results are presented in detail. First, the data acquisition results are presented, followed by the object detection results and the classification model trained on the full dataset. The results of the TracIn processes are presented, and the top 50 images per class selected using this method are analyzed. A model trained on the DWRL_top50 compressed dataset is evaluated and compared to the model trained on the full dataset.

3.1. Data Acquisition

The preconditioned material was recorded in four batches, each consisting of seven runs corresponding to the material classes. All recordings were conducted on running conveyor belts at a speed of 0.5 m/s. Only one material class was captured in a single run to reduce the labeling effort. The framerate of the recording cameras was tied to the speed of the used conveyor belt. Therefore, every object on the belt was captured only once to avoid redundant images of the same objects. In total, 8297 images were captured. In Table 1, a more detailed overview is presented.

3.2. Detected Objects

All raw images were processed using the SSD Lite object detection model. The detection created annotations in the COCO format. Additionally, images with bounding boxes were saved, and the cropped objects were extracted and stored as individual samples. An example of a recorded image with object detections and the cropped objects is shown in Figure 2. These cropped objects were checked manually, but no adjustments were necessary.
The detected objects from all four data batches were combined into a single classification dataset. Overall, 14,978 objects were detected by the object detector. The detected objects were randomly split into 70% training, 15% validation, and 15% test sets. For reproducibility, the dataset split was stratified by class. The test set was used to evaluate all trained models. Table 2 provides the number of detected objects per class and dataset splits.
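A class-stratified 70/15/15 split of this kind can be obtained, for instance, with scikit-learn; the following sketch assumes lists of cropped-object file paths and their class labels:

```python
from sklearn.model_selection import train_test_split

def stratified_split(image_paths, labels, seed=42):
    """Class-stratified 70/15/15 split into training, validation, and test sets."""
    train_x, rest_x, train_y, rest_y = train_test_split(
        image_paths, labels, test_size=0.30, stratify=labels, random_state=seed)
    val_x, test_x, val_y, test_y = train_test_split(
        rest_x, rest_y, test_size=0.50, stratify=rest_y, random_state=seed)
    return (train_x, train_y), (val_x, val_y), (test_x, test_y)
```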

3.3. Model Training on the DWRL Dataset

The ResNet-18 classification model was trained using the full training and validation sets. Training was conducted with PyTorch 2.1 and accelerated on the GPU using Nvidia’s Compute Unified Device Architecture (CUDA) application programming interface (API). The 20 training epochs took a total of 25 min to compute. Each epoch was saved as a checkpoint for the subsequent analysis with TracIn. The model reached an overall accuracy of 76.17% on the test set. The per-class precision, recall, and F1-Score are shown in Table 3.
Additionally, a confusion matrix (see Figure 3) was calculated to show the percentage of correct classifications and misclassifications per class. The model trained on the full dataset showed weaknesses in classifying PS and PP. The weakest class, PS, showed significant misclassifications to PP and other.

3.4. TracIn

The Python code used for the TracIn process is provided as Supplementary Material. All test images were used to compute the normalized influence scores for each training image. The 50 images with the highest positive impact per class were identified for each class. Figure 4 illustrates the top five most influential images per class, including their normalized scores. Influential images originating from classes other than the target class are highlighted with red boxes. This visualization helps identify whether influential images are class-consistent or originate from different classes and could provide possible explanations.
Analyzing the top 50 images per class reveals that every class contains images originating from another class, indicating that cross-class influence is present throughout the dataset. This finding suggests that the model struggles to learn well-separated feature representations for certain class boundaries. The DWRL dataset is imbalanced, and cross-class influence disproportionately affects underrepresented classes. Classes with more than 2000 training images (PET, PP, PE, other) show significantly less influence from other material classes. In contrast, PS, with only 847 training samples, is notably influenced by “other”. This effect is exemplified by specific training images that exert a strong influence across multiple classes. For instance, the blue bottle shown in Figure 4 receives high influence scores in multiple classes, including PP, Tetra, and PVC. These objects posed a significant issue. Among the 350 influential samples, 89 (25.4%) were identified as being influential across multiple target classes. A breakdown of these problematic samples is as follows:
  • 55%: Thin plastic foils and transparent plastic plates are often visually similar to materials from multiple classes (PE, PP, PET, and other).
  • 15%: Partially captured bottles, where only labels or fragments were cropped, leading to ambiguous classifications (especially between PET and Tetra).
  • 13%: White, overexposed Styrofoam blocks lacking discernible features, heavily affecting the PVC class.
  • 7%: Thin transparent trays resembling PE foils in appearance.
These multi-class influential samples significantly hindered model generalization, as they introduced misleading signals during training. The problem was particularly evident in smaller classes, such as PS and Tetra, which were strongly influenced by samples from larger classes, like PET, PP, and other. Figure 5 presents the complete set of the top 50 positively influential images for each class. The influence becomes more problematic for classes with fewer training samples. Several possible reasons may explain this behavior, including overlapping class features or insufficient training data for the model to generalize effectively.

3.5. Model Training on DWRL_top50 and Randomly Picked Images

The top 50 images per material class, ranked by their positive influence score, were bundled into the DWRL_top50 dataset (350 images). This dataset was subsequently used to train classification models. For this purpose, it was split into a training set (80%) and a validation set (20%). The model was trained using the same parameters as the one trained on the full DWRL dataset. Model performance was evaluated on the same test set used in the previous experiments to ensure comparability. The classification model trained on DWRL_top50 reached an overall accuracy of only 28.24%, which was expected given the high number of ambiguous, cross-class images in the dataset (see Figure 5). In the next step, only images that positively influenced their own class (i.e., the class they originated from) were selected to create a clean dataset called DWRL_top50_clean. Training on this smaller but clean dataset resulted in a test accuracy of 36.57%, outperforming the original DWRL_top50 model despite using only 209 images instead of 350. The images of this dataset are distributed as follows:
  • 47 in PET;
  • 38 in PP;
  • 40 in PE;
  • 3 in Tetra;
  • 29 in PS;
  • 4 in PVC;
  • 48 in other.
To contextualize these test accuracies, multiple ResNet18 models were trained using the same parameters with varying numbers of randomly selected images from the DWRL training set. Because the subsets were drawn randomly, each configuration was trained and evaluated 10 times, and the mean accuracy was calculated. Figure 6 presents the test accuracy as a function of the number of randomly selected images per class. Using 50 randomly selected images per class, the models achieved a mean test accuracy of 34.59% ± 3.64%. These results show that the model trained on the DWRL_top50 dataset performs worse than models trained on randomly selected images. In contrast, the cleaned DWRL_top50 dataset (averaging 29 images per class) yields slightly higher accuracy than the randomly sampled subsets, although this comparison is limited by the severe class imbalance of the cleaned set (4–48 images per class).
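The random-subset baseline can be reproduced along the following lines; train_model and evaluate_accuracy are placeholders for the training and evaluation routines described in Sections 2.2.2 and 2.5:

```python
import random
import statistics

def random_baseline(train_set_by_class, n_per_class=50, repeats=10):
    """Mean and standard deviation of test accuracy over random, class-balanced subsets."""
    accuracies = []
    for _ in range(repeats):
        subset = []
        for cls, images in train_set_by_class.items():
            subset += random.sample(images, min(n_per_class, len(images)))
        model = train_model(subset)                  # assumed: same hyperparameters as Section 2.2.2
        accuracies.append(evaluate_accuracy(model))  # assumed: accuracy on the fixed DWRL test set
    return statistics.mean(accuracies), statistics.stdev(accuracies)
```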

4. Discussion

The results show that TracIn is unsuitable for distilling large datasets into smaller, high-performing subsets for transfer learning. Selecting images based solely on their positive influence score tends to reinforce the biases of the model used in the TracIn computation. As a result, the new model trained on the reduced dataset inherits the same strengths and weaknesses as the model used to compute the TracIn scores.
The most significant advantage of TracIn is its ability to highlight issues inherent in the dataset or the model. The results indicate that models trained on images selected by TracIn will not outperform models trained with the same number of randomly selected images. For example, the visualization made clear that the unbalanced nature of the dataset was a major issue for the smaller classes. Moreover, some objects receive high influence scores in several classes simultaneously.
These results suggest that the model could not extract usable information from these objects. Screening the high-influence images makes the model behavior more interpretable. For example, objects with defined rectangular shapes and sharp edges are important for the class “Tetra”. Additionally, the class boundaries in the dataset show substantial overlap due to the visual similarity of objects across classes; for example, transparent foils made from different materials look very similar. The class “other” consists of multilayer packaging and misthrows, which influences most classes. These findings are also visible in the linear Principal Component Analysis (PCA) in Figure 7, which was computed with the model trained on the full dataset on images from the test set; the overlap of the classes is clearly visible there.
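The projection in Figure 7 can be reproduced by applying PCA to the penultimate-layer features of the trained model, for instance as in this sketch (the trained model and a test_loader are assumed to exist):

```python
import torch
from sklearn.decomposition import PCA

def extract_features(model, loader):
    """Penultimate-layer (512-d) ResNet-18 features for all images in a loader."""
    backbone = torch.nn.Sequential(*list(model.children())[:-1])  # drop the final fc layer
    backbone.eval()
    feats, labels = [], []
    with torch.no_grad():
        for images, targets in loader:
            out = backbone(images.cuda()).flatten(start_dim=1)
            feats.append(out.cpu())
            labels.append(targets)
    return torch.cat(feats).numpy(), torch.cat(labels).numpy()

features, targets = extract_features(model, test_loader)  # assumed: model from Section 3.3
projected = PCA(n_components=2).fit_transform(features)   # 2-D projection for plotting
```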
In addition to these observations, the sensitivity of the TracIn method to its parameters was assessed. The influence scores depend on several factors, including the number of training checkpoints used for scoring. In this study, four checkpoints were selected to represent different stages of training. This choice reflects a balance between computational cost and scoring precision. Preliminary testing showed that adding more checkpoints did not significantly change the ranking of influential samples. Moreover, the method demonstrated moderate sensitivity to the subset size and high sensitivity to the underlying model architecture. However, within the consistent experimental setup used here, the influence score rankings proved to be robust and reproducible across configurations.
The two-stage approach used in this study introduces an additional challenge: Objects are first detected and then cropped from the original images using an object detector. This leads to varying resolutions of the cropped object images, depending on the object’s size in the original frame. Small objects suffer from low-resolution crops, which may not contain sufficient visual information for accurate classification. A mechanical sieving step was introduced before the data collection to remove objects smaller than 60 mm from the material stream to mitigate this. The sieving ensured that all detected objects had a defined minimum size and resolution, allowing the classifier to extract meaningful features.
Furthermore, recommendations can be made based on these results. TracIn should be applied to a small experimental dataset as a diagnostic step prior to collecting a large-scale dataset, because issues with the data can then be identified and corrected before capturing vast amounts of data. Such preliminary testing can help avoid costly data collection errors. In our case, the models struggled to correctly identify white foam and Styrofoam blocks; all of these objects were overexposed, so the model was unable to extract meaningful features from them. Based on these findings, the exposure time or lighting setup of the capturing process should be adapted for such edge cases. Furthermore, TracIn could be used to assess the validity of class boundaries and whether they overlap with each other. If images appear influential across multiple classes, assigning them to a dedicated class might be advisable to reduce their influence on the other classes.
Beyond the application of TracIn for subset selection, related approaches from the field of representation learning provide additional perspectives for identifying and retaining the most informative samples during dataset compression. Recent methods, such as Weighted Correlation Embedding Learning (WCEL) [21], Guided Discrimination and Correlation Subspace Learning (GDCSL) [22], and Adaptive Dispersal and Collaborative Clustering (ADCC) [23], propose strategies to enhance the selection of representative data points by optimizing embedding space structures and maximizing inter-class separability. While these methods were initially developed in the context of domain adaptation, their mechanisms, such as weighted correlation embedding, discriminative subspace learning, and adaptive clustering, could be leveraged for dataset reduction tasks. Integrating such techniques with influence-based scoring like TracIn may improve the quality of compressed datasets by preserving not only influential but also diverse and representative samples, thus maintaining classification performance with fewer training images.

5. Conclusions

In this study, the influence of individual training examples on a ResNet-18 classification model was analyzed using the TracIn method. While TracIn successfully identified influential samples and revealed class imbalances and problematic training images, it proved ineffective for compressing large datasets into smaller, generalizable subsets without further processing. Models trained exclusively on positively ranked influential samples did not outperform those trained on randomly selected subsets of the same size. The results suggest that TracIn tends to reinforce the original model’s biases rather than mitigate them. Moreover, our findings highlight the limitations of TracIn when applied to datasets with significant class overlap and/or class imbalance.
Despite its limitations for dataset compression, TracIn remains a valuable tool for model interpretability and dataset auditing. It can help identify problematic samples, reveal class confusion, and improve data acquisition and labeling strategies. However, its use should be complemented with other influence-based methods, particularly when building compact yet effective training sets.
Future work should explore hybrid influence estimation strategies that combine the strengths of multiple techniques to inform data selection and improve model robustness. Ultimately, influence estimation should be viewed not as a substitute for thoughtful dataset design but as a diagnostic tool to support creating more reliable machine learning systems. Future dataset compression methods could benefit from embedding-driven selection strategies, such as WCEL, GDCSL, or ADCC, which aim to retain representative and discriminative structures in the reduced dataset.

Supplementary Materials

The following supporting information can be downloaded at https://www.mdpi.com/article/10.3390/data10080127/s1, TracIn_MUL.py.

Author Contributions

J.A.: Writing—Original Draft, Visualization, Validation, Methodology, Formal Analysis, Data Curation, Software, and Conceptualization; L.B.: Data Curation, Formal Analysis, and Writing—Review; G.K.: Methodology, Conceptualization, and Writing—Review; B.H.: Software, Data Curation, and Formal Analysis; J.P.: Writing—Review, Project Administration, and Funding Acquisition; R.S.: Writing—Review, Supervision, Project Administration, and Funding Acquisition. All authors have read and agreed to the published version of the manuscript.

Funding

The RecAIcle project (FFG project number: FO999892220) is funded by the FFG as part of the AI for Green 2021 (KP) call.

Data Availability Statement

The data presented in this study are available upon request from the corresponding author.

Conflicts of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Abbreviations

The following abbreviations are used in this manuscript:
ADCC	Adaptive Dispersal and Collaborative Clustering
API	application programming interface
CUDA	Compute Unified Device Architecture
DWRL	Digital Waste Research Lab
GDCSL	Guided Discrimination and Correlation Subspace Learning
IPC	Industrial PC
mAP	mean Average Precision
NIR	near infrared
PCA	Principal Component Analysis
PE	polyethylene
PET	polyethylene terephthalate
PP	polypropylene
PS	polystyrene
PVC	polyvinyl chloride
Tetra	beverage cartons or Tetra Pak
TracIn	Tracking Influence
WCEL	Weighted Correlation Embedding Learning

References

  1. Koinig, G.; Kuhn, N.; Fink, T.; Lorber, B.; Radmann, Y.; Martinelli, W.; Tischberger-Aldrian, A. Deep learning approaches for classification of copper-containing metal scrap in recycling processes. Waste Manag. 2024, 190, 520–530. [Google Scholar] [CrossRef] [PubMed]
  2. Aberger, J.; Shami, S.; Häcker, B.; Pestana, J.; Khodier, K.; Sarc, R. Prototype of AI-powered assistance system for digitalisation of manual waste sorting. Waste Manag. 2025, 194, 366–378. [Google Scholar] [CrossRef] [PubMed]
  3. Mujahid, M.; Kına, E.; Rustam, F.; Villar, M.G.; Alvarado, E.S.; de La Torre Diez, I.; Ashraf, I. Data oversampling and imbalanced datasets: An investigation of performance for machine learning and feature engineering. J. Big Data 2024, 11, 87. [Google Scholar] [CrossRef]
  4. Bockreis, A.; Faulstich, M.; Flamme, S.; Kranert, M.; Mocker, M.; Nelles, M.; Quicker, P.; Rettenberger, G.; Rotter, V.S. Kreislauf- und Ressourcenwirtschaft. In Proceedings of the Wissenschaftskongress, Vienna, Austria, 15–16 February 2024; Innsbruck University Press, Deutsche Gesellschaft für Abfallwirtschaft e.V., Fakultät für Bau- und Umweltingenieurwesen der Technischen Universität Wien, Eds.; Innsbruck University Press: Innsbruck, Austria, 2024. [Google Scholar] [CrossRef]
  5. Pruthi, G.; Liu, F.; Sundararajan, M.; Kale, S. Estimating Training Data Influence by Tracing Gradient Descent. arXiv 2020, arXiv:2002.08484. [Google Scholar] [CrossRef]
  6. Hammoudeh, Z.; Lowd, D. Training data influence analysis and estimation: A survey. Mach. Learn. 2024, 113, 2351–2403. [Google Scholar] [CrossRef]
  7. Kandlbauer, L.; Sarc, R.; Pomberger, R. Großtechnische experimentelle Forschung im Digital Waste Research Lab und Digitale Abfallanalytik und -behandlung. Österr. Wasser Abfallw. 2024, 76, 32–41. [Google Scholar] [CrossRef]
  8. Umweltbundesamt. Sortierung und Recycling von Kunststoffabfällen in Österreich: Stand 2019. Available online: https://www.umweltbundesamt.at/fileadmin/site/publikationen/rep0744_hauptteil.pdf (accessed on 16 December 2022).
  9. Khodier, K.; Viczek, S.A.; Curtis, A.; Aldrian, A.; O’Leary, P.; Lehner, M.; Sarc, R. Sampling and analysis of coarsely shredded mixed commercial waste. Part I: Procedure, particle size and sorting analysis. Int. J. Environ. Sci. Technol. 2020, 17, 959–972. [Google Scholar] [CrossRef]
  10. Zhao, Z.-Q.; Zheng, P.; Xu, S.; Wu, X. Object Detection with Deep Learning: A Review. arXiv 2018, arXiv:1807.05511. [Google Scholar] [CrossRef] [PubMed]
  11. Kellenberger, B.; Marcos, D.; Tuia, D. Detecting Mammals in UAV Images: Best Practices to address a substantially Imbalanced Dataset with Deep Learning. Remote Sens. Environ. 2018, 216, 139–153. [Google Scholar] [CrossRef]
  12. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.-C. MobileNetV2: Inverted Residuals and Linear Bottlenecks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; IEEE: New York, NY, USA, 2018; pp. 4510–4520, ISBN 978-1-5386-6420-9. [Google Scholar]
  13. PyTorch. resnet18 Documentation: PyTorch/Torchvision Models. Available online: https://docs.pytorch.org/vision/main/models/generated/torchvision.models.resnet18.html (accessed on 3 June 2025).
  14. Koh, P.W.; Liang, P. Understanding Black-box Predictions via Influence Functions. arXiv 2017, arXiv:1703.04730. [Google Scholar]
  15. Ghorbani, A.; Zou, J. Data Shapley: Equitable Valuation of Data for Machine Learning. arXiv 2019, arXiv:1904.02868. [Google Scholar] [CrossRef]
  16. Basu, S.; You, X.; Feizi, S. On Second-Order Group Influence Functions for Black-Box Predictions. arXiv 2019, arXiv:1911.00418. [Google Scholar]
  17. Jia, R.; Dao, D.; Wang, B.; Hubis, F.A.; Gurel, N.M.; Li, B.; Zhang, C.; Spanos, C.J.; Song, D. Efficient Task-Specific Data Valuation for Nearest Neighbor Algorithms. arXiv 2019, arXiv:1908.08619. [Google Scholar] [CrossRef]
  18. Yeh, C.-K.; Kim, J.S.; Yen, I.E.H.; Ravikumar, P. Representer Point Selection for Explaining Deep Neural Networks. arXiv 2018, arXiv:1811.09720. [Google Scholar] [CrossRef]
  19. Da Costa-Luis, C.; Larroque, S.K.; Altendorf, K.; Mary, H.; Korobov, M.; Yorav-Raphael, N.; Ivanov, I.; Bargull, M.; Rodrigues, N.; Chen, G.; et al. tqdm: A fast, Extensible Progress Bar for Python and CLI; Zenodo: Brussel, Belgium, 2024. [Google Scholar]
  20. Powers, D.M.W. Evaluation: From precision, recall and F-measure to ROC, informedness, markedness and correlation. arXiv 2020, arXiv:2010.16061. [Google Scholar] [CrossRef]
  21. Lu, Y.; Zhu, Q.; Zhang, B.; Lai, Z.; Li, X. Weighted Correlation Embedding Learning for Domain Adaptation. IEEE Trans. Image Process. 2022, 31, 5303–5316. [Google Scholar] [CrossRef] [PubMed]
  22. Lu, Y.; Wong, W.K.; Zeng, B.; Lai, Z.; Li, X. Guided Discrimination and Correlation Subspace Learning for Domain Adaptation. IEEE Trans. Image Process. 2023, 32, 2017–2032. [Google Scholar] [CrossRef] [PubMed]
  23. Lu, Y.; Huang, H.; Wong, W.K.; Hu, X.; Lai, Z.; Li, X. Adaptive Dispersal and Collaborative Clustering for Few-Shot Unsupervised Domain Adaptation. IEEE Trans. Image Process. 2025, 34, 4273–4285. [Google Scholar] [CrossRef] [PubMed]
Figure 1. Utilized materials and methods.
Figure 2. Example image with object detections and corresponding cropped objects.
Figure 3. The confusion matrix of the model trained on the full DWRL dataset.
Figure 4. Top 5 images according to positive normalized influence score calculated using TracIn. Influential images from non-target classes are marked in red.
Figure 5. Distribution of the top 50 influential images for the trained model on DWRL.
Figure 6. Test accuracy over the number of randomly picked images out of the training set of DWRL.
Figure 7. Linear PCA utilizing the model trained on the full DWRL dataset.
Table 1. Number of images per class and batch.

Class      Total   Batch 1   Batch 2   Batch 3   Batch 4
1_PET      1492    268       497       199       528
2_PP       1805    208       839       0         758
3_PE       1514    274       645       156       439
4_Tetra    305     14        139       35        117
5_PS       347     28        4         154       161
6_PVC      557     158       152       244       3
7_Other    2277    429       760       293       795
Table 2. Number of detected objects per class and dataset split.

Class      Total Objects   Train   Validation   Test
1_PET      2495            1746    375          374
2_PP       3300            2310    495          495
3_PE       2899            2029    436          434
4_Tetra    456             319     69           68
5_PS       847             592     128          127
6_PVC      654             457     99           98
7_Other    4327            3028    650          649
Table 3. Per-class classification performance (precision, recall, F1-Score).

Class      Precision   Recall   F1-Score
1_PET      0.8782      0.8289   0.8528
2_PP       0.7206      0.6929   0.7065
3_PE       0.7810      0.7558   0.7681
4_Tetra    0.7975      0.9265   0.8571
5_PS       0.6522      0.4724   0.5479
6_PVC      0.8137      0.8469   0.8300
7_Other    0.7234      0.8059   0.7624
