1. Introduction
Plastic pollution has become a critical global challenge that threatens marine ecosystems, terrestrial habitats, and economic sustainability. Current projections estimate that up to 250 million tons of plastic could enter the oceans by 2025 if effective interventions are not implemented [
1]. The ecological consequences are profound: marine organisms often ingest or become entangled in plastic debris, resulting in injury, malnutrition, and death [
2,
3]. In addition, microplastics and chemical additives have been detected in seafood destined for human consumption, raising public health concerns [
4]. From an economic perspective, cumulative plastic pollution in marine environments poses globally recognized ecological and economic threats [
5].
Remote sensing technologies for plastic waste monitoring have attracted significant attention as a means of supporting plastic litter management. Among the available technologies, Hyperspectral Imaging (HSI), particularly in the lower portion of the shortwave infrared (SWIR) spectral range, from 900 to 1700 nm, is especially useful for plastic detection because plastic polymers such as polyethylene (PE), polypropylene (PP), polystyrene (PS), polyethylene terephthalate (PET), and polyvinyl chloride (PVC) exhibit specific narrow-band absorption peaks in this region [
6]. This spectral range provides distinct polymer-specific signatures that enable robust discrimination of plastic materials from natural backgrounds such as vegetation, soil, and water, a capability that visible-spectrum imaging cannot reliably achieve.
1.1. Traditional Machine Learning Approaches for Plastic Detection
In our previous works [
6,
7,
8,
9], we confirmed the feasibility of discriminating different plastic polymers from other materials and among themselves using a customized SWIR imaging system deployed in both controlled laboratory and natural outdoor settings. Classification efforts employed traditional supervised Machine Learning methods such as Linear Discriminant Analysis (LDA) and Support Vector Machines (SVM). While these baselines provided valuable insights, they did not exhibit fully satisfactory generalizability under varied illumination conditions and heterogeneous background textures. Consequently, performance can diminish when such models are applied to scenes with environmental characteristics that differ markedly from those in the training data, which may limit their real-world applicability. While individual polymer discrimination is necessary in recycling plants, in this work, we focus on detecting plastic waste in the environment, regardless of polymer type.
Traditional machine learning approaches typically rely on handcrafted spectral features or indices [
10,
11] and pixel-wise classification without capturing spatial context. Spectral indices highlight the absorption patterns of different materials in relative terms, e.g., as ratios of reflectance at characteristic wavelengths or as spectral angles. Such approaches struggle in complex settings and are consequently outperformed by more sophisticated methodologies. LDA projects high-dimensional spectral data onto discriminant axes that maximize class separability, while SVM constructs optimal hyperplanes in a (possibly nonlinearly transformed) feature space. However, both methods process each pixel independently, ignoring the spatial relationships between neighboring pixels that are needed for accurate segmentation of complex environmental scenes. Furthermore, their linear or kernel-based decision boundaries may not adequately capture the nonlinear spectral variations caused by changing illumination, material aging, and background heterogeneity in field conditions.
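To make the pixel-wise, spatially blind nature of such indices concrete, the following sketch computes a normalized-difference ratio between two SWIR bands. The band positions (a ~1100 nm reflectance shoulder and a ~1215 nm absorption) and the function name are illustrative assumptions, not the exact indices used in the cited works.

```python
import numpy as np

def normalized_difference_index(cube, wavelengths, band_a=1100.0, band_b=1215.0):
    """Pixel-wise spectral index on a (H, W, B) reflectance cube.

    band_a / band_b are illustrative wavelengths (nm): a shoulder and a
    hypothetical polymer absorption. Each pixel is scored independently,
    with no spatial context -- exactly the limitation discussed above.
    """
    ia = int(np.argmin(np.abs(wavelengths - band_a)))  # nearest-band lookup
    ib = int(np.argmin(np.abs(wavelengths - band_b)))
    a = cube[..., ia].astype(float)
    b = cube[..., ib].astype(float)
    return (a - b) / (a + b + 1e-9)  # high where band_b is absorbed
```

Thresholding such an index yields a per-pixel mask, but neighboring pixels never influence each other, which is why models with spatial context outperform these approaches in cluttered scenes.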
1.2. Deep Learning Methods
1.2.1. RGB-Based Deep Learning Methods
Several recent studies have applied deep Convolutional Neural Networks (CNNs) to waste and plastic detection (see
Table 1 for a comparative overview), primarily as image-level classifiers on RGB datasets.
Ref. [
12] evaluated multiple architectures (ResNet, Inception, Inception-ResNet, VGG-16/19) on the TrashNet dataset (2527 annotated garbage images spanning six classes, i.e., cardboard, glass, metal, paper, plastic, and trash), reporting an accuracy of approximately 88.6% using an Inception-ResNet network. It is important to note that this refers to whole-image classification, where each image contains a single pre-segmented waste object against a uniform background, which differs fundamentally from pixel-wise segmentation in complex environmental scenes. Similarly, Ref. [
13] compared four CNN classifiers (MobileNetV2, DenseNet-121, ResNet-50, VGG-16) on a larger municipal solid waste image set (∼9200 images of pre-sorted waste items) and found that ResNet-50 performed best, achieving about 94.9% accuracy. Ref. [
14] explored single-object waste classification (images containing one isolated, pre-segmented waste item) using the TrashNet dataset, benchmarking a Support Vector Machine (SVM) with Histogram of Oriented Gradients (HOG) features, a simple CNN, ResNet-50, a plain (no-skip) ResNet variant, and a hybrid HOG + CNN trained with augmentation and the Adadelta optimizer in place of Adam. ResNet-50 achieved 95.4% test accuracy, outperforming the simple CNN and the HOG + CNN, while the plain ResNet lagged behind (~76.0% test accuracy), highlighting the benefit of Residual connections.
Ref. [
15] used a pre-trained ResNet-50 as a feature extractor with an SVM backend on the TrashNet benchmark, reaching around 87.0% overall accuracy. Ref. [
16] introduced a new web-collected garbage image dataset (aligned with Shanghai recycling categories) and achieved ∼96.0% training accuracy using a ResNet-50 model, and Ref. [
17] proposed a ResNet-18-based Self-Monitoring Module (SMM) for TrashNet images, obtaining 95.9% accuracy.
In contrast to image-level classification, a few studies have addressed object detection or localization of waste items in more complex scenes. Ref. [
18] developed a Faster R-CNN detector with a ResNet Backbone and a region proposal network to distinguish waste occurrence in complex urban street scenes (database comprising 816 urban-street images). Their model attained about 89.0% mean Average Precision (mAP) but required ∼6 s per image, underscoring the latency drawback of heavy two-stage detectors for real-time use. Ref. [
19] likewise applied Faster R-CNN to identify litter (three categories: landfill, recycling, paper), reporting an overall mAP of 68.0% with the lowest precision for the “paper” class (AP ≈ 61.0%) owing to background clutter.
To overcome these computational constraints and enable real-time processing, lightweight neural models for embedded deployment have also been explored. Ref. [
20] achieved 87.2% classification accuracy using MobileNet on a six-class waste sorting task, while Ref. [
21] combined MobileNetV3-small and ShuffleNet-V2 (“WasNet”) to reach 96.1% accuracy on TrashNet with significantly reduced complexity. Large-scale architectures such as AlexNet [
22], although influential, required 1.2 million ImageNet images and multi-day training to reach a top-1 error of 37.5% (the standard ImageNet metric measuring how often the model's top prediction misses the correct label), illustrating the impracticality of data-hungry models for field-deployable plastic detection systems.
The studies reviewed above demonstrate that RGB imagery can achieve high accuracy in waste classification in controlled settings, such as recycling facilities, where objects are pre-segmented and backgrounds are uniform. However, for environmental litter detection in complex outdoor scenes with varied backgrounds, spectral confusion between plastics and natural materials (e.g., dry vegetation, sand, rocks) limits the effectiveness of RGB approaches.
1.2.2. Hyperspectral Deep Learning Methods
In recent years, a few studies have demonstrated the application of deep learning techniques to drone- and airborne-HSI for plastic litter detection. For instance, Ref. [
23] developed a 3D-CNN to identify marine plastic targets from two aerial platforms (range investigated: 400 nm to 2500 nm), and later proposed a zero-shot learning approach that achieved high detection accuracy across multiple polymer types [
24]. Another study [
25] demonstrated improved robustness by combining laboratory- and field-collected hyperspectral signatures to detect riverine plastics, highlighting the need for models that generalize detection under different sensors and illumination conditions. Additionally, Ref. [
26] investigated the detection of floating plastics from both satellite and unmanned aerial systems, demonstrating the feasibility of multi-platform approaches for plastic litter monitoring. Ref. [
27] demonstrated qualitatively strong segmentation of typical waste targets on airborne SWIR-HSI with a multi-scale CNN, yet they reported no quantitative cross-scene metrics, limiting evidence of real-world generalization. Ref. [
28] achieved a mean classification accuracy of 97.6% for floating plastics using a lightweight visible-to-shortwave-infrared CNN on UAV HSI; their Squeeze-and-Excitation (SE) block analysis showed that the NIR-SWIR bands contribute the most to plastic classification, confirming the value of SWIR spectral information for polymer discrimination. These efforts underscore both the promise of deep neural networks for hyperspectral plastic detection and the challenge of building models that remain reliable across varying environmental and geographic contexts.
Table 1.
Performance comparison of deep learning waste-recognition models.
| Study | Network | Dataset | Task | Performance |
|---|---|---|---|---|
| [12] | Inception-ResNet | RGB images | Classification | 88.6% accuracy |
| [13] | ResNet-50 | RGB images | Classification | 94.9% accuracy |
| [14] | ResNet-50 + HOG | RGB images | Classification | 95.4% test acc. |
| [15] | ResNet-50 + SVM | RGB images | Classification | 87.0% accuracy |
| [16] | ResNet-50 | RGB images | Classification | 95.9% (train) |
| [17] | ResNet-18 (SMM) | RGB images | Classification | 95.9% accuracy |
| [18,19] | Faster R-CNN | RGB images | Detection | 89.0% / 68.0% mAP |
| [20] | MobileNet CNN | RGB images | Classification | 87.2% accuracy |
| [21] | MobileNetV3 + ShuffleNetV2 | RGB images | Classification | 96.1% accuracy |
| [23] | 3D CNN | Hyperspectral images | Classification | ≈84% accuracy |
| [24] | Zero-shot DL | Hyperspectral images | Classification | 98.7% accuracy |
| [27] | Multi-scale CNN | Hyperspectral images | Detection | Qualitative high detection |
| [28] | Lightweight spatial-spectral CNN | Hyperspectral images | Classification | 97.6% accuracy |
| This Work | Attentional-Residual U-Net | Hyperspectral images | Pixel-wise Segmentation | 96.8% |
1.3. Semantic Segmentation Architectures for Dense Prediction
For pixel-wise semantic segmentation tasks, several deep learning architectures have been developed. Fully Convolutional Networks (FCNs) pioneered end-to-end dense prediction by replacing fully-connected layers with convolutional layers. SegNet introduced an encoder-decoder architecture with pooling indices for efficient upsampling, while DeepLabV3 employs Atrous Spatial Pyramid Pooling (ASPP) to capture multi-scale contextual information without losing resolution.
In this work, we adopt the U-Net architecture [
29] as our baseline for several reasons. First, U-Net’s symmetric encoder-decoder architecture with skip connections is particularly well-suited to preserving fine-grained spatial details, which are essential for accurate plastic boundary delineation. Second, unlike DeepLabV3, which was designed primarily for natural RGB images with pre-trained ImageNet weights, U-Net can be trained from scratch on hyperspectral data without requiring transfer learning from incompatible spectral domains. Third, SegNet’s reliance on pooling indices, while memory-efficient, may lose subtle spectral information critical for distinguishing spectrally similar materials. Fourth, U-Net has demonstrated strong performance in medical imaging and remote sensing applications where precise boundary detection is crucial, making it a natural choice for plastic waste segmentation. We further enhance the U-Net architecture with attention gates [
30] and residual connections [
31] to improve gradient flow and enable the model to focus on discriminative spectral-spatial features.
1.4. Research Gap and Contributions
Although
Table 1 lists impressively high accuracies across both RGB and hyperspectral studies, it also exposes several shortcomings. Most studies based on hyperspectral data provide only qualitative segmentation, lack quantitative, cross-scene validation, and often stop at image-level classification, leaving pixel-wise mapping aside. None of the cited works tests temporal generalization across multiple flight surveys, so robustness to environmental changes remains undocumented. In addition, the research landscape still relies heavily on RGB data for waste-sorting applications or keeps hyperspectral imagery proprietary, limiting reproducibility and underutilizing the potential of SWIR bands for environmental litter detection.
To address these limitations, this study introduces a deep learning framework designed to improve generalization in plastic segmentation tasks specifically for environmental litter monitoring. The key contributions of this work are: (1) a modified U-Net architecture incorporating residual connections and attention gates specifically designed for SWIR hyperspectral plastic segmentation, facilitating robust feature learning while suppressing irrelevant background information; (2) rigorous leave-one-out (LOO) cross-validation across 9 annotated UAV flight campaigns spanning 4 years to evaluate temporal and environmental generalization, ensuring strict separation between training and testing datasets across both space and time; (3) systematic comparison with traditional machine learning methods (LDA, SVM) demonstrating the advantages of deep learning for this task; (4) a publicly available multi-temporal hyperspectral dataset enabling reproducible research; and (5) analysis of acquisition condition impacts (e.g., overexposure) on model performance, providing practical guidelines for operational deployment. Our approach offers a scalable and resilient solution for automated plastic waste detection, bridging the gap between high-resolution hyperspectral acquisition and deployable remote sensing analytics for environmental monitoring applications.
2. Materials
From January 2020 to February 2024, we executed 9 UAV-based hyperspectral imaging campaigns across diverse environmental settings, encompassing variations in ground texture, terrain type, and weather scenarios. The flights detailed in
Table 2 provide a diverse dataset essential for evaluating generalization performance in both machine and deep learning models. Seven of these nine campaigns (Jan20a, Jan20b, Mar20, Apr22, 7Feb24a, 7Feb24b, and 21Feb24a) were also treated in a previous paper of ours [
8], while two additional flights (21Feb24b and Dec22) are introduced here to extend the temporal and environmental coverage. The complete nine-flight dataset is openly available to the research community through Mendeley (DOI: 10.17632/nmpjzrky3r.1).
All data acquisitions were carried out using a DJI Matrice 600 hexacopter (DJI, Shenzhen, China), as shown in
Figure 1a. The sensor payload (
Figure 1b) combines a Xenics (Leuven, Belgium) Bobcat 320 SWIR camera (320 × 256 pixels, 900–1700 nm) with a Specim (Oulu, Finland) ImSpector NIR17 OEM line scan spectrometer. Although the spectrometer records at a native 2.5 nm interval, we undersampled to 10 nm because finer resolution offered little extra information while increasing noise and data volume. A further spectral camera providing information in the VIS/NIR range (400–1000 nm) was mounted but not used in this study. The payload also included a synchronized IDS (Obersulm, Germany) UI-3240-CP-C-HQ RGB camera and a VectorNav (Dallas, TX, USA) VN-200 INS for precise geo-referencing [
7,
8]. During each mission, the UAV flew at altitudes of 7–10 m and ground speeds of 0.5–1.5 m s⁻¹. These flight parameters were selected to optimize the trade-off between spatial resolution, spectral quality, and operational efficiency. Lower altitudes (7–8 m) provide a finer ground sampling distance (GSD) but reduce the coverage area per flight and increase sensitivity to terrain variations. Higher altitudes (10 m) offer greater coverage and more stable flight characteristics, at the cost of coarser spatial resolution. The across-track GSD, determined by altitude, sensor optics, and pixel count, ranged from 3 to 4 cm. The along-track GSD, determined by flight speed and line-acquisition rate (12.5–16 Hz), was typically larger (4–12 cm).
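The along-track sampling relationship stated above can be written explicitly: for a push-broom scanner, the along-track GSD is the flight speed divided by the line-acquisition rate. A minimal sketch (the function name is ours):

```python
def along_track_gsd(speed_m_s: float, line_rate_hz: float) -> float:
    """Ground distance (m) covered between two consecutive line acquisitions
    of a push-broom scanner: flight speed divided by line rate."""
    return speed_m_s / line_rate_hz

# The reported envelope: 0.5 m/s at 12.5 Hz -> 0.04 m; 1.5 m/s at 12.5 Hz -> 0.12 m.
```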
Push-broom hyperspectral image sequences were assembled into 3D spectral cubes through spatial registration. Consecutive line scans were registered using the corresponding RGB frames, which were mosaicked based on correlation; their relative positions and orientations were then used to align the simultaneously acquired hyperspectral data. The individual one-dimensional scan lines are placed into a matrix (slightly oversampled, with a pixel size corresponding to approximately 0.6 cm on the ground), which must be rectangular in space; therefore, along the edges the cube is filled with blank pixels (spectra set to all-zero values). Such blank pixels, hereinafter denoted unassigned pixels, are not included in the processing.
Each raw hyperspectral cube initially contained 81 bands sampled at 10 nm intervals. However, spectral bands between 900 nm and 930 nm, empty due to geometrical constraints of the camera assembly, and bands between 940 nm and 970 nm and between 1690 nm and 1700 nm, almost dark due to low quantum efficiency, were discarded. Moreover, bands between 1350 nm and 1450 nm were discarded due to the significant effect of atmospheric water vapor. The remaining 60 bands, ranging from 980 nm to 1340 nm and 1460 nm to 1680 nm, retained the principal polymer-specific absorption signatures required for accurate plastic material discrimination [
6]. The hyperspectral cubes were used in their raw digital number (DN) format without radiometric calibration, atmospheric correction, or intensity normalization, preserving the original spectral relationships captured by the sensor and testing the model’s ability to learn robust features directly from uncalibrated data. In a previous paper [
8], we showed that calibration does not improve the discrimination performance of classifiers, whereas a simple level adjustment does improve generalization.
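The band-selection step described above can be sketched as a simple wavelength mask over the 81 nominal band centres (900–1700 nm at 10 nm steps); the retained windows reproduce the 60-band layout stated in the text.

```python
import numpy as np

wavelengths = np.arange(900, 1701, 10)  # 81 nominal band centres (nm)

# Keep only the informative windows: 980-1340 nm and 1460-1680 nm.
keep = ((wavelengths >= 980) & (wavelengths <= 1340)) | \
       ((wavelengths >= 1460) & (wavelengths <= 1680))

def select_bands(cube):
    """(H, W, 81) raw DN cube -> (H, W, 60) cube with the retained bands."""
    return cube[..., keep]
```

`keep.sum()` evaluates to 60, matching the band count used throughout the paper.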
Binary annotation masks were manually created using GIMP (GNU Image Manipulation Program, version 2.10). The annotation procedure involved the following steps: first, a single spectral band from each hyperspectral cube (typically at 1000 nm for optimal contrast) was exported as a grayscale reference image. This reference was visually compared with the corresponding RGB mosaic to identify the boundaries of plastic objects accurately, and the regions corresponding to each material type were drawn over the hyperspectral layer image. The plastic fragments were labeled as 1 and represented in red, the background as 0 and represented in green, while the remaining pixels not containing information (unassigned pixels) were represented in black and excluded from the training. The annotation scheme distinguished the following semantic classes grouped into binary categories:
First category (Plastic)—Six plastic subclasses: PET, PE (LDPE & HDPE), PVC, PP, PS, and other plastics (bio-based or composite materials, and mixed or unknown materials).
Second category (Non-Plastic)—White reference objects (commercial floor tiles verified with a point spectroradiometer to be satisfactorily white in the SWIR range); non-plastic objects (wood, metal, glass, and tyres) and background (grass and bare earth).
To standardize model input dimensions, hyperspectral cubes were divided into non-overlapping patches of size 128 × 128, corresponding to ground areas of approximately 0.6 m². The patching procedure was implemented using a sliding-window approach with a stride equal to the patch size (128 pixels), ensuring no overlap between adjacent patches. Starting from the top-left corner of each hyperspectral cube, the algorithm sequentially extracted patches by moving horizontally and then vertically across the image. Patches located at the image boundaries that did not reach the full 128 × 128 dimensions were discarded to maintain a uniform input size. Additionally, patches containing more than 50% unassigned pixels (black regions from the mosaicking process) were excluded from the dataset to ensure sufficient valid spectral information for training. Due to the class imbalance (
Figure 2), a stratified split was performed based on target pixel counts per class, resulting in training and validation sets that maintained representative distributions of Plastic and Non-plastic samples.
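The patching rules above (stride equal to the patch size, boundary remainders discarded, patches with more than 50% unassigned pixels rejected) can be sketched as follows; function and argument names are ours.

```python
import numpy as np

def extract_patches(cube, mask, patch=128, max_unassigned=0.5):
    """Non-overlapping patching as described in the text.

    cube: (H, W, B) hyperspectral cube; mask: (H, W) bool, True where a pixel
    is unassigned (all-zero spectrum). Boundary remainders are dropped, as are
    patches whose unassigned fraction exceeds max_unassigned.
    """
    H, W, _ = cube.shape
    patches = []
    for r in range(0, H - patch + 1, patch):       # vertical sweep
        for c in range(0, W - patch + 1, patch):   # horizontal sweep
            win = mask[r:r + patch, c:c + patch]
            if win.mean() > max_unassigned:        # mostly blank -> skip
                continue
            patches.append(cube[r:r + patch, c:c + patch])
    return patches
```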
To assess generalization performance across spatially and spectrally distinct scenes, each of the 9 UAV flights was held out as an unseen test set, while the remaining 8 flights were split into training (80%) and validation (20%) subsets using a leave-one-out (LOO) cross-validation approach. The 80/20 split was used to preserve the proportion of plastic pixels relative to non-plastic pixels, ensuring the minority class (plastic) maintained adequate representation in both the training and validation sets. This methodology ensures that our hyperspectral plastic detection system demonstrates robust performance across varied real-world conditions, providing reliable automated detection capabilities for environmental monitoring applications.
Table 3 summarizes the patch distribution across all nine UAV flights. A total of 15,738 patches were extracted from the hyperspectral cubes. For each LOO fold, one flight was held out as the unseen test set, while patches from the remaining eight flights were divided into training (80%) and validation (20%) subsets using stratified sampling based on plastic pixel counts.
For each of the nine LOO folds, the test set size ranged from 525 patches (Mar20) to 3306 patches (Dec22), with an average of 1749 test patches per fold. The corresponding training sets contained 10,019 to 12,019 patches (average: 11,226), while validation sets ranged from 2473 to 2989 patches (average: 2788). This distribution ensured that each fold evaluated the model on a complete, unseen flight mission while maintaining sufficient diversity in the training data.
3. Methodology
The proposed segmentation architecture (
Figure 3) employs a U-Net encoder-decoder architecture [
29] augmented with Residual connections and Attention gates feature extraction [
30]. These components enhance gradient flow and allow the model to focus on spatial/spectral regions most indicative of plastic debris in hyperspectral scenes.
The encoder processes 60-channel SWIR hyperspectral inputs through progressive down-sampling stages with channel expansion from 64 to 128 to 256 to 512 feature channels across four encoder stages. This progression captures hierarchical features from low-level spectral signatures to high-level spatial patterns indicative of plastic materials. Each Residual block [
31] preserves critical spectral information through dual convolution paths.
The decoder mirrors the encoder by up-sampling features to the original resolution for pixel-wise classification. At each stage, Attention gates weigh encoder skip connections before concatenation, suppressing irrelevant background features while highlighting plastic signatures:

α = σ( ψ( ReLU( W_g g + W_x x + b_g ) ) + b_ψ ),  x̂ = α ⊙ x

where:
g is the gating signal from the decoder (coarser scale)
x is the encoder skip connection feature map (finer scale)
W_g, W_x are learned projection weights for gating and encoder features, respectively
b_g, b_ψ are bias terms
ψ is a convolution that produces attention coefficients
σ is the sigmoid activation function
α are the attention coefficients (values between 0 and 1)
⊙ denotes element-wise multiplication
x̂ represents the attention-weighted features passed to the decoder
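A minimal NumPy sketch of this additive attention gate, with the 1 × 1 convolutions written as per-pixel matrix products, may help fix the shapes; the weights here are illustrative stand-ins, not trained parameters, and the gating signal is assumed already upsampled to the encoder resolution.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def attention_gate(x, g, W_x, W_g, b_g, psi, b_psi):
    """Additive attention gate, per the equation above.

    x: (H, W, Cx) encoder skip features; g: (H, W, Cg) gating signal.
    W_x: (Cx, Ci), W_g: (Cg, Ci), b_g: (Ci,), psi: (Ci,), b_psi: scalar.
    A 1x1 convolution is just a per-pixel matrix product, written with @.
    """
    q = np.maximum(x @ W_x + g @ W_g + b_g, 0.0)   # ReLU(W_x x + W_g g + b_g)
    alpha = sigmoid(q @ psi + b_psi)               # (H, W) coefficients in (0, 1)
    return x * alpha[..., None], alpha             # element-wise re-weighting
```

Pixels with coefficients near 0 are suppressed before concatenation with the decoder features, which is how irrelevant background is attenuated.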
Residual connections within decoder blocks facilitate gradient propagation during training. The final convolution with Softmax activation produces per-pixel probability distributions across background and plastic classes.
We trained the network using a combined focal and Dice loss [32,33] to both address class imbalance (plastic occupies < 15% of the image area) and optimize segmentation overlap [34]:

L = α L_focal + (1 − α) L_Dice + λ ‖W‖²

where:
α is the weighting factor balancing the focal and Dice loss contributions (set to α = 0.5)
L_focal is the focal loss, which down-weighs easy examples to focus learning on hard cases
L_Dice is the Dice loss, which directly optimizes the overlap between prediction and ground truth
λ is the regularization coefficient
‖W‖² is the squared norm of the model weights, whose penalty prevents overfitting
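The combined objective can be sketched as follows for a binary probability map; the focal exponent γ and the weight-decay value are illustrative assumptions, since the paper does not state them here.

```python
import numpy as np

def focal_loss(p, y, gamma=2.0, eps=1e-7):
    """p: predicted plastic probability per pixel; y: {0, 1} ground truth."""
    pt = np.where(y == 1, p, 1.0 - p)              # probability of the true class
    return float(np.mean(-((1.0 - pt) ** gamma) * np.log(pt + eps)))

def dice_loss(p, y, eps=1e-7):
    """1 minus the soft Dice coefficient between prediction and ground truth."""
    inter = np.sum(p * y)
    return float(1.0 - (2.0 * inter + eps) / (np.sum(p) + np.sum(y) + eps))

def combined_loss(p, y, weights, alpha=0.5, lam=1e-5, gamma=2.0):
    """alpha * focal + (1 - alpha) * dice + lam * ||W||^2, per the equation above.
    lam is an illustrative weight-decay value; weights is a list of weight arrays."""
    l2 = sum(float(np.sum(w ** 2)) for w in weights)
    return alpha * focal_loss(p, y, gamma) + (1.0 - alpha) * dice_loss(p, y) + lam * l2
```

With α = 0.5 the two terms contribute equally: the focal term tames the many easy background pixels, while the Dice term directly rewards overlap on the scarce plastic regions.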
Class weights were dynamically computed per fold based on pixel distributions:

w_c = N / (C · N_c)

where N_c represents the pixel count for class c, N is the total number of pixels, and C = 2 is the number of classes. These class weights are applied during training by multiplying the loss contribution of each pixel by the weight of its class. This amplifies the learning signal from rare plastic pixels (w_plastic > 1) relative to abundant background pixels (w_background < 1), ensuring the model prioritizes correct classification of the minority class. This weighting strategy ensures robust performance across flight missions with varying levels of plastic contamination.
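The inverse-frequency weighting defined above amounts to a one-liner; a sketch with hypothetical pixel counts:

```python
import numpy as np

def class_weights(pixel_counts):
    """w_c = N / (C * N_c): inverse-frequency weights from per-class pixel counts."""
    counts = np.asarray(pixel_counts, dtype=float)
    return counts.sum() / (len(counts) * counts)

# Hypothetical fold with 85% background, 15% plastic:
# class_weights([850, 150]) gives ~0.59 for background and ~3.33 for plastic,
# so the minority class dominates the per-pixel loss weighting.
```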
Training employed the Adam optimizer with a fixed learning rate (10⁻⁴), a batch size of 8 patches, and early stopping (patience = 10 epochs). On an NVIDIA RTX 4080, the model converged in approximately 38 min per fold, with an inference latency of 8.1 ms per tile. Key parameters of the architecture and of the training stage are listed in Table 4 and Table 5.

To evaluate the classification performance, classical validation metrics were introduced. Let True Positives (TP), False Positives (FP), True Negatives (TN), and False Negatives (FN) denote the confusion-matrix entries when plastic is treated as the positive class. The following metrics were computed from the confusion-matrix elements:
Intersection over Union (IoU):

IoU = TP / (TP + FP + FN)

Cohen's Kappa (CK):

CK = (p_o − p_e) / (1 − p_e)

where p_o = (TP + TN) / N is the observed agreement, p_e = [(TP + FP)(TP + FN) + (TN + FN)(TN + FP)] / N² is the chance agreement, and N = TP + FP + TN + FN.
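Both metrics follow directly from the confusion-matrix counts; a sketch with plastic as the positive class:

```python
def segmentation_metrics(tp, fp, tn, fn):
    """IoU and Cohen's Kappa from confusion-matrix counts (plastic = positive)."""
    n = tp + fp + tn + fn
    iou = tp / (tp + fp + fn)
    po = (tp + tn) / n                                            # observed agreement
    pe = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n**2   # chance agreement
    kappa = (po - pe) / (1.0 - pe)
    return iou, kappa
```

Unlike accuracy, both penalize a classifier that simply labels everything as background, which matters given the strong class imbalance in these scenes.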
5. Conclusions
We developed a deep learning framework for pixel-wise plastic segmentation in UAV-based SWIR hyperspectral imagery (980–1680 nm), combining U-Net with Residual blocks and Attention gates. Leave-one-out evaluation across 9 flights spanning diverse conditions yielded a mean Dice coefficient of 68.0%, IoU of 53.6%, accuracy of 90.5%, precision of 93.6%, and recall of 90.5%, a 43% improvement in Cohen's Kappa (0.63 vs. 0.44) over classical LDA, confirming the superior feature learning of deep architectures. The best single-flight performance was obtained on 21Feb24b (Dice 94.7%, IoU 89.8%, κ = 0.866).
The Mar20 case showed degraded LOO performance (Dice 45.2%, IoU 29.2%) with extensive false positives. Brightness analysis revealed overexposure, with brightness levels roughly 2.6× those of the other flights. Training on all nine flights achieved substantially improved performance (Dice 81.58%, +20%; IoU 68.89%, +28%; κ 0.7979), with visual confirmation of dramatically reduced Mar20 false positives. This demonstrates that model capacity is sufficient when the training data represent the expected variations, establishing that acquisition protocol quality, rather than architecture, is the primary operational constraint.
The model handled severe class imbalance (16.9% plastic) using the combined Dice and focal loss with balanced sampling, achieving 86.91% precision on plastics. Architectural ablation yielded a baseline Dice coefficient of 0.61, improved to 0.68 (+11%) by adding residual connections, to 0.72 (+6%) by adding the attention mechanism, and to 0.76 (+25% in total) in the complete architecture. The model is also computationally efficient: with 7.2 M parameters and an inference latency of 8.1 ms per tile (~123 tiles/s), a 100 m² survey is processed in ~5 s, and training converges in ~38 min per fold.
The key contributions of this work are as follows:
Demonstration that attention-augmented deep architectures achieve robust plastic segmentation with performance limited by training diversity rather than model capacity, providing a 43% improvement in Cohen’s Kappa over LDA under rigorous Leave One Out evaluation.
Mar20 case study with brightness analysis providing a diagnostic template for acquisition and establishing that overexposure can be identified through DN distribution analysis and mitigated through diverse training data.
Multi-temporal dataset (nine flights, 4 years, six polymers, openly available via Mendeley) enabling standardized comparison and reproducible research in hyperspectral plastic detection.
Comprehensive ablation study quantifying the contribution of residual connections (+11% Dice) and attention gates (+6% Dice) to segmentation performance, validating architectural design choices for hyperspectral data.
Practical demonstration that raw, uncalibrated hyperspectral data can achieve robust segmentation without radiometric preprocessing, supporting rapid field deployment scenarios.
Future research activity will address limitations of the present study, including the following:
Similarity of the case study locations, all in central Italy and mostly on open ground or grass: more diverse case studies will be taken into account to enhance the generalization capability of the models;
Cost of manual annotation of the training set: partial automation might be achieved by using the already trained models;
Lack of level and dynamic range adjustment of the data: simple equalization methods, such as those we previously exploited, will be tested in this framework.
Priorities for the development of this work are identified as follows:
Using the VNIR hyperspectral data that our sensor already collects, to fuse them with the SWIR band data and enhance classification, as well as spatial definition of results (the resolution of the VNIR camera is higher);
Uncertainty quantification for confidence flagging to identify predictions requiring human verification;
Model compression for edge deployment, including knowledge distillation and pruning techniques to reduce the 7.2 M parameter count while maintaining accuracy;
Extension to airborne platforms for regional coverage;
Use of data augmentation to enhance the robustness of the solution.
Semi-supervised and self-supervised learning approaches could reduce annotation requirements, while active learning strategies could prioritize labeling of informative samples.
The results presented in this work have a potentially broad impact on the monitoring of plastic litter in the environment. In fact, automated detection transforms monitoring from labor-intensive manual surveys to scalable, quantitative assessments, particularly valuable for remote areas (riverbanks, coastal zones, wetlands). Pixel-wise segmentation enables area quantification for policy and cleanup prioritization. Temporal monitoring supports the assessment of pollution dynamics, the evaluation of interventions, and the detection of illegal dumping. By reducing costs, this methodology enables wider adoption by governments, NGOs, and citizen science.
The publicly available dataset and documented methodology provide a foundation for reproducible research and method benchmarking in the emerging field of hyperspectral plastic detection.