1. Introduction
Plastic pollution has become a critical global challenge that threatens marine ecosystems, terrestrial habitats, and economic sustainability. Current projections estimate that up to 250 million tons of plastic could enter the oceans by 2025 if effective interventions are not implemented [
1]. The ecological consequences are profound: marine organisms often ingest or become entangled in plastic debris, resulting in injury, malnutrition, and death [
2,
3]. In addition, microplastics and chemical additives have been detected in seafood destined for human consumption, raising public health concerns [
4]. From an economic perspective, cumulative plastic pollution in marine environments poses globally recognized ecological and economic threats [
5].
Remote sensing technologies for plastic waste monitoring have attracted significant attention as a means of supporting plastic litter management. Among the available technologies, Hyperspectral Imaging (HSI), particularly in the lower portion of the shortwave infrared (SWIR) spectral range, from 900 to 1700 nm, is especially useful for plastic detection because plastic polymers such as polyethylene (PE), polypropylene (PP), polystyrene (PS), polyethylene terephthalate (PET), and polyvinyl chloride (PVC) exhibit specific narrow-band absorption peaks in this region [
6]. This spectral range provides distinct polymer-specific signatures that enable robust discrimination of plastic materials from natural backgrounds such as vegetation, soil, and water, a capability that visible-spectrum imaging cannot reliably achieve.
1.1. Traditional Machine Learning Approaches for Plastic Detection
In our previous works [
6,
7,
8,
9], we confirmed the feasibility of discriminating different plastic polymers from other materials and among themselves using a customized SWIR imaging system deployed in both controlled laboratory and natural outdoor settings. Classification efforts employed traditional supervised Machine Learning methods such as Linear Discriminant Analysis (LDA) and Support Vector Machines (SVM). While these baselines provided valuable insights, they did not exhibit fully satisfactory generalizability under varied illumination conditions and heterogeneous background textures. Consequently, performance can diminish when such models are applied to scenes with environmental characteristics that differ markedly from those in the training data, which may limit their real-world applicability. While individual polymer discrimination is necessary in recycling plants, in this work, we focus on detecting plastic waste in the environment, regardless of polymer type.
Traditional machine learning approaches typically rely on handcrafted spectral features or indices [
10,
11] and pixel-wise classification without capturing spatial context. Spectral indices highlight the absorption patterns of different materials in relative terms, e.g., as ratios of reflectance at characteristic wavelengths or as spectral angles. Such approaches struggle in complex settings and are consequently outperformed by more sophisticated methodologies. LDA projects high-dimensional spectral data onto discriminant axes that maximize class separability, while SVM constructs optimal hyperplanes in a (possibly nonlinearly transformed) feature space. However, both methods process each pixel independently, ignoring the spatial relationships between neighboring pixels that are needed for accurate segmentation of complex environmental scenes. Furthermore, their linear or kernel-based decision boundaries may not adequately capture the nonlinear spectral variations caused by changing illumination, material aging, and background heterogeneity in field conditions.
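To make the pixel-wise, spatially blind nature of such indices concrete, the following sketch computes a normalized-difference ratio between two SWIR bands. The band positions (a ~1100 nm reflectance shoulder and a ~1215 nm absorption) and the function name are illustrative assumptions, not the exact indices used in the cited works.

```python
import numpy as np

def normalized_difference_index(cube, wavelengths, band_a=1100.0, band_b=1215.0):
    """Pixel-wise spectral index on a (H, W, B) reflectance cube.

    band_a / band_b are illustrative wavelengths (nm): a shoulder and a
    hypothetical polymer absorption. Each pixel is scored independently,
    with no spatial context -- exactly the limitation discussed above.
    """
    ia = int(np.argmin(np.abs(wavelengths - band_a)))  # nearest-band lookup
    ib = int(np.argmin(np.abs(wavelengths - band_b)))
    a = cube[..., ia].astype(float)
    b = cube[..., ib].astype(float)
    return (a - b) / (a + b + 1e-9)  # high where band_b is absorbed
```

Thresholding such an index yields a per-pixel mask, but neighboring pixels never influence each other, which is why models with spatial context outperform these approaches in cluttered scenes.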
1.2. Deep Learning Methods
1.2.1. RGB-Based Deep Learning Methods
Several recent studies have applied deep Convolutional Neural Networks (CNNs) to waste and plastic detection (see
Table 1 for a comparative overview), primarily as image-level classifiers on RGB datasets.
Ref. [
12] evaluated multiple architectures (ResNet, Inception, Inception-ResNet, VGG-16/19) on the TrashNet dataset (2527 annotated garbage images spanning six classes, i.e., cardboard, glass, metal, paper, plastic, and trash), reporting an accuracy of approximately 88.6% using an Inception-ResNet network. It is important to note that this refers to whole-image classification, where each image contains a single pre-segmented waste object against a uniform background, which differs fundamentally from pixel-wise segmentation in complex environmental scenes. Similarly, Ref. [
13] compared four CNN classifiers (MobileNetV2, DenseNet-121, ResNet-50, VGG-16) on a larger municipal solid waste image set (∼9200 images of pre-sorted waste items) and found that ResNet-50 performed best, achieving about 94.9% accuracy. Ref. [
14] explored single-object waste classification (images containing one isolated, pre-segmented waste item) using the TrashNet dataset, benchmarking a Support Vector Machine (SVM) with Histogram of Oriented Gradients (HOG) features, a simple CNN, ResNet-50, a plain (no-skip) ResNet variant, and a hybrid HOG + CNN trained with augmentation and the Adadelta optimizer in place of Adam. ResNet-50 achieved 95.4% test accuracy, outperforming the simple CNN and the HOG + CNN, while the plain ResNet lagged behind (~76.0% test accuracy), highlighting the benefit of Residual connections.
Ref. [
15] used a pre-trained ResNet-50 as a feature extractor with an SVM backend on the TrashNet benchmark, reaching around 87.0% overall accuracy. Ref. [
16] introduced a new web-collected garbage image dataset (aligned with Shanghai recycling categories) and achieved ∼96.0% training accuracy using a ResNet-50 model, and Ref. [
17] proposed a ResNet-18-based Self-Monitoring Module (SMM) for TrashNet images, obtaining 95.9% accuracy.
In contrast to image-level classification, a few studies have addressed object detection or localization of waste items in more complex scenes. Ref. [
18] developed a Faster R-CNN detector with a ResNet Backbone and a region proposal network to distinguish waste occurrence in complex urban street scenes (database comprising 816 urban-street images). Their model attained about 89.0% mean Average Precision (mAP) but required ∼6 s per image, underscoring the latency drawback of heavy two-stage detectors for real-time use. Ref. [
19] likewise applied Faster R-CNN to identify litter (three categories: landfill, recycling, paper), reporting an overall mAP of 68.0% with the lowest precision for the “paper” class (AP ≈ 61.0%) owing to background clutter.
To overcome these computational constraints and enable real-time processing, lightweight neural models for embedded deployment have also been explored. Ref. [
20] achieved 87.2% classification accuracy using MobileNet on a six-class waste sorting task, while Ref. [
21] combined MobileNetV3-small and ShuffleNet-V2 (“WasNet”) to reach 96.1% accuracy on TrashNet with significantly reduced complexity. Large-scale architectures such as AlexNet [
22], although influential, required 1.2 million ImageNet images and multi-day training to reach a top-1 error of 37.5% (the standard ImageNet metric measuring how often the model's top prediction misses the correct label), illustrating the impracticality of data-hungry models for field-deployable plastic detection systems.
The studies reviewed above demonstrate that RGB imagery can achieve high accuracy in waste classification in controlled settings, such as recycling facilities, where objects are pre-segmented and backgrounds are uniform. However, for environmental litter detection in complex outdoor scenes with varied backgrounds, spectral confusion between plastics and natural materials (e.g., dry vegetation, sand, rocks) limits the effectiveness of RGB approaches.
1.2.2. Hyperspectral Deep Learning Methods
In recent years, a few studies have demonstrated the application of deep learning techniques to drone- and airborne-HSI for plastic litter detection. For instance, Ref. [
23] developed a 3D-CNN to identify marine plastic targets from two aerial platforms (range investigated: 400 nm to 2500 nm), and later proposed a zero-shot learning approach that achieved high detection accuracy across multiple polymer types [
24]. Another study [
25] demonstrated improved robustness by combining laboratory- and field-collected hyperspectral signatures to detect riverine plastics, highlighting the need for models that generalize detection under different sensors and illumination conditions. Additionally, Ref. [
26] investigated the detection of floating plastics from both satellite and unmanned aerial systems, demonstrating the feasibility of multi-platform approaches for plastic litter monitoring. Ref. [
27] demonstrated qualitatively strong segmentation of typical waste targets on airborne SWIR-HSI with a multi-scale CNN, yet they reported no quantitative cross-scene metrics, limiting evidence of real-world generalization. Ref. [
28] achieved a mean classification accuracy of 97.6% for floating plastics using a lightweight visible-to-shortwave-infrared CNN on UAV HSI; their Squeeze-and-Excitation (SE) block analysis showed that the NIR-SWIR bands contribute the most to plastic classification, confirming the value of SWIR spectral information for polymer discrimination. These efforts underscore both the promise of deep neural networks for hyperspectral plastic detection and the challenge of building models that remain reliable across varying environmental and geographic contexts.
Table 1.
Performance comparison of deep learning waste-recognition models.
| Study | Network | Dataset | Task | Performance |
|---|---|---|---|---|
| [12] | Inception-ResNet | RGB images | Classification | 88.6% accuracy |
| [13] | ResNet-50 | RGB images | Classification | 94.9% accuracy |
| [14] | ResNet-50 + HOG | RGB images | Classification | 95.4% test acc. |
| [15] | ResNet-50 + SVM | RGB images | Classification | 87.0% accuracy |
| [16] | ResNet-50 | RGB images | Classification | 95.9% (train) |
| [17] | ResNet-18 (SMM) | RGB images | Classification | 95.9% accuracy |
| [18,19] | Faster R-CNN | RGB images | Detection | 89.0% / 68.0% mAP |
| [20] | MobileNet CNN | RGB images | Classification | 87.2% accuracy |
| [21] | MobileNetV3 + ShuffleNetV2 | RGB images | Classification | 96.1% accuracy |
| [23] | 3D CNN | Hyperspectral images | Classification | ≈84% accuracy |
| [24] | Zero-shot DL | Hyperspectral images | Classification | 98.7% accuracy |
| [27] | Multi-scale CNN | Hyperspectral images | Detection | Qualitative high detection |
| [28] | Lightweight spatial-spectral CNN | Hyperspectral images | Classification | 97.6% accuracy |
| This Work | Attentional-Residual U-Net | Hyperspectral images | Pixel-wise Segmentation | 96.8% |
1.3. Semantic Segmentation Architectures for Dense Prediction
For pixel-wise semantic segmentation tasks, several deep learning architectures have been developed. Fully Convolutional Networks (FCNs) pioneered end-to-end dense prediction by replacing fully-connected layers with convolutional layers. SegNet introduced an encoder-decoder architecture with pooling indices for efficient upsampling, while DeepLabV3 employs Atrous Spatial Pyramid Pooling (ASPP) to capture multi-scale contextual information without losing resolution.
In this work, we adopt the U-Net architecture [
29] as our baseline for several reasons. First, U-Net’s symmetric encoder-decoder architecture with skip connections is particularly well-suited to preserving fine-grained spatial details, which are essential for accurate plastic boundary delineation. Second, unlike DeepLabV3, which was designed primarily for natural RGB images with pre-trained ImageNet weights, U-Net can be trained from scratch on hyperspectral data without requiring transfer learning from incompatible spectral domains. Third, SegNet’s reliance on pooling indices, while memory-efficient, may lose subtle spectral information critical for distinguishing spectrally similar materials. Fourth, U-Net has demonstrated strong performance in medical imaging and remote sensing applications where precise boundary detection is crucial, making it a natural choice for plastic waste segmentation. We further enhance the U-Net architecture with attention gates [
30] and residual connections [
31] to improve gradient flow and enable the model to focus on discriminative spectral-spatial features.
1.4. Research Gap and Contributions
Although
Table 1 lists impressively high accuracies across both RGB and hyperspectral studies, it also exposes several shortcomings. Most studies based on hyperspectral data provide only qualitative segmentation, lack quantitative, cross-scene validation, and often stop at image-level classification, leaving pixel-wise mapping aside. None of the cited works tests temporal generalization across multiple flight surveys, so robustness to environmental changes remains undocumented. In addition, the research landscape still relies heavily on RGB data for waste-sorting applications or keeps hyperspectral imagery proprietary, limiting reproducibility and underutilizing the potential of SWIR bands for environmental litter detection.
To address these limitations, this study introduces a deep learning framework designed to improve generalization in plastic segmentation tasks specifically for environmental litter monitoring. The key contributions of this work are: (1) a modified U-Net architecture incorporating residual connections and attention gates specifically designed for SWIR hyperspectral plastic segmentation, facilitating robust feature learning while suppressing irrelevant background information; (2) rigorous leave-one-out (LOO) cross-validation across 9 annotated UAV flight campaigns spanning 4 years to evaluate temporal and environmental generalization, ensuring strict separation between training and testing datasets across both space and time; (3) systematic comparison with traditional machine learning methods (LDA, SVM) demonstrating the advantages of deep learning for this task; (4) a publicly available multi-temporal hyperspectral dataset enabling reproducible research; and (5) analysis of acquisition condition impacts (e.g., overexposure) on model performance, providing practical guidelines for operational deployment. Our approach offers a scalable and resilient solution for automated plastic waste detection, bridging the gap between high-resolution hyperspectral acquisition and deployable remote sensing analytics for environmental monitoring applications.
2. Materials
From January 2020 to February 2024, we executed 9 UAV-based hyperspectral imaging campaigns across diverse environmental settings, encompassing variations in ground texture, terrain type, and weather scenarios. The flights detailed in
Table 2 provide a diverse dataset essential for evaluating generalization performance in both machine and deep learning models. Seven of these nine campaigns (Jan20a, Jan20b, Mar20, Apr22, 7Feb24a, 7Feb24b, and 21Feb24a) were also treated in a previous paper of ours [
8], while two additional flights (21Feb24b and Dec22) are introduced here to extend the temporal and environmental coverage. The complete nine-flight dataset is openly available to the research community through Mendeley (DOI: 10.17632/nmpjzrky3r.1).
All data acquisitions were carried out using a DJI Matrice 600 hexacopter (DJI, Shenzhen, China), as shown in
Figure 1a. The sensor payload (
Figure 1b) combines a Xenics (Leuven, Belgium) Bobcat 320 SWIR camera (320 × 256 pixels, 900–1700 nm) with a Specim (Oulu, Finland) ImSpector NIR17 OEM line scan spectrometer. Although the spectrometer records at a native 2.5 nm interval, we undersampled to 10 nm because finer resolution offered little extra information while increasing noise and data volume. A further spectral camera providing information in the VIS/NIR range (400–1000 nm) was mounted but not used in this study. The payload also included a synchronized IDS (Obersulm, Germany) UI-3240-CP-C-HQ RGB camera and a VectorNav (Dallas, TX, USA) VN-200 INS for precise geo-referencing [
7,
8]. During each mission, the UAV flew at altitudes of 7–10 m and ground speeds of 0.5–1.5 m s⁻¹. These flight parameters were selected to optimize the trade-off between spatial resolution, spectral quality, and operational efficiency. Lower altitudes (7–8 m) provide a finer ground sampling distance (GSD) but reduce the coverage area per flight and increase sensitivity to terrain variations. Higher altitudes (10 m) offer greater coverage and more stable flight characteristics, at the cost of coarser spatial resolution. The across-track GSD, determined by altitude, sensor optics, and pixel count, ranged from 3 to 4 cm. The along-track GSD, determined by flight speed and line-acquisition rate (12.5–16 Hz), was typically larger (4–12 cm).
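The along-track sampling relationship stated above can be written explicitly: for a push-broom scanner, the along-track GSD is the flight speed divided by the line-acquisition rate. A minimal sketch (the function name is ours):

```python
def along_track_gsd(speed_m_s: float, line_rate_hz: float) -> float:
    """Ground distance (m) covered between two consecutive line acquisitions
    of a push-broom scanner: flight speed divided by line rate."""
    return speed_m_s / line_rate_hz

# The reported envelope: 0.5 m/s at 12.5 Hz -> 0.04 m; 1.5 m/s at 12.5 Hz -> 0.12 m.
```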
Push-broom hyperspectral image sequences were assembled into 3D spectral cubes through spatial registration. Consecutive line scans were registered using the corresponding RGB frames, which were mosaicked based on correlation; their relative positions and orientations were then used to align the simultaneously acquired hyperspectral data. The individual one-dimensional scan lines are placed into a matrix (slightly oversampled, with a pixel size corresponding to approximately 0.6 cm on the ground), which must be rectangular in space; therefore, along the edges the cube is filled with blank pixels (spectra set to all-zero values). Such blank pixels, hereinafter denoted unassigned pixels, are not included in the processing.
Each raw hyperspectral cube initially contained 81 bands sampled at 10 nm intervals. However, spectral bands between 900 nm and 930 nm, empty due to geometrical constraints of the camera assembly, and bands between 940 nm and 970 nm and between 1690 nm and 1700 nm, almost dark due to low quantum efficiency, were discarded. Moreover, bands between 1350 nm and 1450 nm were discarded due to the significant effect of atmospheric water vapor. The remaining 60 bands, ranging from 980 nm to 1340 nm and 1460 nm to 1680 nm, retained the principal polymer-specific absorption signatures required for accurate plastic material discrimination [
6]. The hyperspectral cubes were used in their raw digital number (DN) format without radiometric calibration, atmospheric correction, or intensity normalization, preserving the original spectral relationships captured by the sensor and testing the model’s ability to learn robust features directly from uncalibrated data. In a previous paper [
8], we showed that calibration does not improve the discrimination performance of classifiers, whereas a simple level adjustment does improve generalization.
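The band-selection step described above can be sketched as a simple wavelength mask over the 81 nominal band centres (900–1700 nm at 10 nm steps); the retained windows reproduce the 60-band layout stated in the text.

```python
import numpy as np

wavelengths = np.arange(900, 1701, 10)  # 81 nominal band centres (nm)

# Keep only the informative windows: 980-1340 nm and 1460-1680 nm.
keep = ((wavelengths >= 980) & (wavelengths <= 1340)) | \
       ((wavelengths >= 1460) & (wavelengths <= 1680))

def select_bands(cube):
    """(H, W, 81) raw DN cube -> (H, W, 60) cube with the retained bands."""
    return cube[..., keep]
```

`keep.sum()` evaluates to 60, matching the band count used throughout the paper.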
Binary annotation masks were manually created using GIMP (GNU Image Manipulation Program, version 2.10). The annotation procedure involved the following steps: first, a single spectral band from each hyperspectral cube (typically at 1000 nm for optimal contrast) was exported as a grayscale reference image. This reference was visually compared with the corresponding RGB mosaic to identify the boundaries of plastic objects accurately, and the regions corresponding to each material type were drawn over the hyperspectral layer image. The plastic fragments were labeled as 1 and represented in red, the background as 0 and represented in green, while the remaining pixels not containing information (unassigned pixels) were represented in black and excluded from the training. The annotation scheme distinguished the following semantic classes grouped into binary categories:
First category (Plastic)—Six plastic subclasses: PET, PE (LDPE & HDPE), PVC, PP, PS, and other plastics (bio-based or composite materials, and mixed or unknown materials).
Second category (Non-Plastic)—White reference objects (commercial floor tiles verified with a point spectroradiometer to be satisfactorily white in the SWIR range); non-plastic objects (wood, metal, glass, and tyres) and background (grass and bare earth).
To standardize model input dimensions, hyperspectral cubes were divided into non-overlapping patches of size 128 × 128, corresponding to ground areas of approximately 0.6 m². The patching procedure was implemented using a sliding-window approach with a stride equal to the patch size (128 pixels), ensuring no overlap between adjacent patches. Starting from the top-left corner of each hyperspectral cube, the algorithm sequentially extracted patches by moving horizontally and then vertically across the image. Patches located at the image boundaries that did not reach the full 128 × 128 dimensions were discarded to maintain a uniform input size. Additionally, patches containing more than 50% unassigned pixels (black regions from the mosaicking process) were excluded from the dataset to ensure sufficient valid spectral information for training. Due to the class imbalance (
Figure 2), a stratified split was performed based on target pixel counts per class, resulting in training and validation sets that maintained representative distributions of Plastic and Non-plastic samples.
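The patching rules above (stride equal to the patch size, boundary remainders discarded, patches with more than 50% unassigned pixels rejected) can be sketched as follows; function and argument names are ours.

```python
import numpy as np

def extract_patches(cube, mask, patch=128, max_unassigned=0.5):
    """Non-overlapping patching as described in the text.

    cube: (H, W, B) hyperspectral cube; mask: (H, W) bool, True where a pixel
    is unassigned (all-zero spectrum). Boundary remainders are dropped, as are
    patches whose unassigned fraction exceeds max_unassigned.
    """
    H, W, _ = cube.shape
    patches = []
    for r in range(0, H - patch + 1, patch):       # vertical sweep
        for c in range(0, W - patch + 1, patch):   # horizontal sweep
            win = mask[r:r + patch, c:c + patch]
            if win.mean() > max_unassigned:        # mostly blank -> skip
                continue
            patches.append(cube[r:r + patch, c:c + patch])
    return patches
```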
To assess generalization performance across spatially and spectrally distinct scenes, each of the 9 UAV flights was held out as an unseen test set, while the remaining 8 flights were split into training (80%) and validation (20%) subsets using a leave-one-out (LOO) cross-validation approach. The 80/20 split was used to preserve the proportion of plastic pixels relative to non-plastic pixels, ensuring the minority class (plastic) maintained adequate representation in both the training and validation sets. This methodology ensures that our hyperspectral plastic detection system demonstrates robust performance across varied real-world conditions, providing reliable automated detection capabilities for environmental monitoring applications.
Table 3 summarizes the patch distribution across all nine UAV flights. A total of 15,738 patches were extracted from the hyperspectral cubes. For each LOO fold, one flight was held out as the unseen test set, while patches from the remaining eight flights were divided into training (80%) and validation (20%) subsets using stratified sampling based on plastic pixel counts.
For each of the nine LOO folds, the test set size ranged from 525 patches (Mar20) to 3306 patches (Dec22), with an average of 1749 test patches per fold. The corresponding training sets contained 10,019 to 12,019 patches (average: 11,226), while validation sets ranged from 2473 to 2989 patches (average: 2788). This distribution ensured that each fold evaluated the model on a complete, unseen flight mission while maintaining sufficient diversity in the training data.
3. Methodology
The proposed segmentation architecture (
Figure 3) employs a U-Net encoder-decoder architecture [
29] augmented with Residual connections and Attention gates feature extraction [
30]. These components enhance gradient flow and allow the model to focus on spatial/spectral regions most indicative of plastic debris in hyperspectral scenes.
The encoder processes 60-channel SWIR hyperspectral inputs through progressive down-sampling stages with channel expansion from 64 to 128 to 256 to 512 feature channels across four encoder stages. This progression captures hierarchical features from low-level spectral signatures to high-level spatial patterns indicative of plastic materials. Each Residual block [
31] preserves critical spectral information through dual convolution paths.
The decoder mirrors the encoder by up-sampling features to the original resolution for pixel-wise classification. At each stage, Attention gates weigh encoder skip connections before concatenation, suppressing irrelevant background features while highlighting plastic signatures:

α = σ( ψ( ReLU( W_g g + W_x x + b_g ) ) + b_ψ ),  x̂ = α ⊙ x

where:
g is the gating signal from the decoder (coarser scale)
x is the encoder skip connection feature map (finer scale)
W_g, W_x are learned projection weights for gating and encoder features, respectively
b_g, b_ψ are bias terms
ψ is a convolution that produces attention coefficients
σ is the sigmoid activation function
α are the attention coefficients (values between 0 and 1)
⊙ denotes element-wise multiplication
x̂ represents the attention-weighted features passed to the decoder
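A minimal NumPy sketch of this additive attention gate, with the 1 × 1 convolutions written as per-pixel matrix products, may help fix the shapes; the weights here are illustrative stand-ins, not trained parameters, and the gating signal is assumed already upsampled to the encoder resolution.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def attention_gate(x, g, W_x, W_g, b_g, psi, b_psi):
    """Additive attention gate, per the equation above.

    x: (H, W, Cx) encoder skip features; g: (H, W, Cg) gating signal.
    W_x: (Cx, Ci), W_g: (Cg, Ci), b_g: (Ci,), psi: (Ci,), b_psi: scalar.
    A 1x1 convolution is just a per-pixel matrix product, written with @.
    """
    q = np.maximum(x @ W_x + g @ W_g + b_g, 0.0)   # ReLU(W_x x + W_g g + b_g)
    alpha = sigmoid(q @ psi + b_psi)               # (H, W) coefficients in (0, 1)
    return x * alpha[..., None], alpha             # element-wise re-weighting
```

Pixels with coefficients near 0 are suppressed before concatenation with the decoder features, which is how irrelevant background is attenuated.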
Residual connections within decoder blocks facilitate gradient propagation during training. The final convolution with Softmax activation produces per-pixel probability distributions across background and plastic classes.
We trained the network using a combined focal and Dice loss [32,33] to both address class imbalance (plastic occupies < 15% of the image area) and optimize segmentation overlap [34]:

L = α L_focal + (1 − α) L_Dice + λ ‖W‖²

where:
α is the weighting factor balancing the focal and Dice loss contributions (set to α = 0.5)
L_focal is the focal loss, which down-weighs easy examples to focus learning on hard cases
L_Dice is the Dice loss, which directly optimizes the overlap between prediction and ground truth
λ is the regularization coefficient
‖W‖² is the squared norm of the model weights, whose penalty prevents overfitting
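The combined objective can be sketched as follows for a binary probability map; the focal exponent γ and the weight-decay value are illustrative assumptions, since the paper does not state them here.

```python
import numpy as np

def focal_loss(p, y, gamma=2.0, eps=1e-7):
    """p: predicted plastic probability per pixel; y: {0, 1} ground truth."""
    pt = np.where(y == 1, p, 1.0 - p)              # probability of the true class
    return float(np.mean(-((1.0 - pt) ** gamma) * np.log(pt + eps)))

def dice_loss(p, y, eps=1e-7):
    """1 minus the soft Dice coefficient between prediction and ground truth."""
    inter = np.sum(p * y)
    return float(1.0 - (2.0 * inter + eps) / (np.sum(p) + np.sum(y) + eps))

def combined_loss(p, y, weights, alpha=0.5, lam=1e-5, gamma=2.0):
    """alpha * focal + (1 - alpha) * dice + lam * ||W||^2, per the equation above.
    lam is an illustrative weight-decay value; weights is a list of weight arrays."""
    l2 = sum(float(np.sum(w ** 2)) for w in weights)
    return alpha * focal_loss(p, y, gamma) + (1.0 - alpha) * dice_loss(p, y) + lam * l2
```

With α = 0.5 the two terms contribute equally: the focal term tames the many easy background pixels, while the Dice term directly rewards overlap on the scarce plastic regions.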
Class weights were dynamically computed per fold based on pixel distributions:

w_c = N / (C · N_c)

where N_c represents the pixel count for class c, N is the total number of pixels, and C = 2 is the number of classes. These class weights are applied during training by multiplying the loss contribution of each pixel by the weight of its class. This amplifies the learning signal from rare plastic pixels (w_plastic > 1) relative to abundant background pixels (w_background < 1), ensuring the model prioritizes correct classification of the minority class. This weighting strategy ensures robust performance across flight missions with varying levels of plastic contamination.
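The inverse-frequency weighting defined above amounts to a one-liner; a sketch with hypothetical pixel counts:

```python
import numpy as np

def class_weights(pixel_counts):
    """w_c = N / (C * N_c): inverse-frequency weights from per-class pixel counts."""
    counts = np.asarray(pixel_counts, dtype=float)
    return counts.sum() / (len(counts) * counts)

# Hypothetical fold with 85% background, 15% plastic:
# class_weights([850, 150]) gives ~0.59 for background and ~3.33 for plastic,
# so the minority class dominates the per-pixel loss weighting.
```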
Training employed the Adam optimizer with a fixed learning rate (10⁻⁴), a batch size of 8 patches, and early stopping (patience = 10 epochs). On an NVIDIA RTX 4080, the model converged in approximately 38 min per fold, with an inference latency of 8.1 ms per tile. Key parameters of the architecture and of the training stage are listed in Table 4 and Table 5.

To evaluate the classification performance, classical validation metrics were introduced. Let True Positives (TP), False Positives (FP), True Negatives (TN), and False Negatives (FN) denote the confusion-matrix entries when plastic is treated as the positive class. The following metrics were computed from the confusion-matrix elements:
Intersection over Union (IoU):

IoU = TP / (TP + FP + FN)

Cohen's Kappa (CK):

CK = (p_o − p_e) / (1 − p_e)

where p_o = (TP + TN) / N is the observed agreement, p_e = [(TP + FP)(TP + FN) + (TN + FN)(TN + FP)] / N² is the chance agreement, and N = TP + FP + TN + FN.
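Both metrics follow directly from the confusion-matrix counts; a sketch with plastic as the positive class:

```python
def segmentation_metrics(tp, fp, tn, fn):
    """IoU and Cohen's Kappa from confusion-matrix counts (plastic = positive)."""
    n = tp + fp + tn + fn
    iou = tp / (tp + fp + fn)
    po = (tp + tn) / n                                            # observed agreement
    pe = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n**2   # chance agreement
    kappa = (po - pe) / (1.0 - pe)
    return iou, kappa
```

Unlike accuracy, both penalize a classifier that simply labels everything as background, which matters given the strong class imbalance in these scenes.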
5. Conclusions
We developed a deep learning framework for pixel-wise plastic segmentation in UAV-based SWIR hyperspectral imagery (980–1680 nm), combining U-Net with Residual blocks and Attention gates. Leave-one-out evaluation across 9 flights spanning diverse conditions yielded a mean Dice coefficient of 68.0%, IoU of 53.6%, accuracy of 90.5%, precision of 93.6%, and recall of 90.5%, a 43% improvement in Cohen's Kappa (0.63 vs. 0.44) over classical LDA, confirming the superior feature learning of deep architectures. The best single-flight performance was obtained on 21Feb24b (Dice 94.7%, IoU 89.8%, κ = 0.866).
The Mar20 case showed degraded LOO performance (Dice 45.2%, IoU 29.2%) with extensive false positives. Brightness analysis revealed overexposure, with brightness levels roughly 2.6× those of the other flights. Training on all nine flights achieved substantially improved performance (Dice 81.58%, +20%; IoU 68.89%, +28%; κ 0.7979), with visual confirmation of dramatically reduced Mar20 false positives. This demonstrates that model capacity is sufficient when the training data represent the expected variations, establishing that acquisition protocol quality, rather than architecture, is the primary operational constraint.
The model handled severe class imbalance (16.9% plastic) using the combined Dice and focal loss with balanced sampling, achieving 86.91% precision on plastics. Architectural ablation yielded a baseline Dice coefficient of 0.61, improved to 0.68 (+11%) by adding residual connections, to 0.72 (+6%) by adding the attention mechanism, and to 0.76 (+25% in total) in the complete architecture. The model is also computationally efficient: with 7.2 M parameters and an inference latency of 8.1 ms per tile (~123 tiles/s), a 100 m² survey is processed in ~5 s, and training converges in ~38 min per fold.
The key contributions of this work are as follows:
Demonstration that attention-augmented deep architectures achieve robust plastic segmentation with performance limited by training diversity rather than model capacity, providing a 43% improvement in Cohen’s Kappa over LDA under rigorous Leave One Out evaluation.
Mar20 case study with brightness analysis providing a diagnostic template for acquisition and establishing that overexposure can be identified through DN distribution analysis and mitigated through diverse training data.
Multi-temporal dataset (nine flights, 4 years, six polymers, openly available via Mendeley) enabling standardized comparison and reproducible research in hyperspectral plastic detection.
Comprehensive ablation study quantifying the contribution of residual connections (+11% Dice) and attention gates (+6% Dice) to segmentation performance, validating architectural design choices for hyperspectral data.
Practical demonstration that raw, uncalibrated hyperspectral data can achieve robust segmentation without radiometric preprocessing, supporting rapid field deployment scenarios.
Future research activity will address limitations of the present study, including the following:
Similarity of the case study locations, all in central Italy and mostly on open ground or grass: more diverse case studies will be taken into account to enhance the generalization capability of the models;
Cost of manual annotation of the training set: partial automation might be achieved by using the already trained models;
Lack of level and dynamic range adjustment of the data: simple equalization methods, such as those we previously exploited, will be tested in this framework.
Priorities for the development of this work are identified as follows:
Using the VNIR hyperspectral data that our sensor already collects, to fuse them with the SWIR band data and enhance classification, as well as spatial definition of results (the resolution of the VNIR camera is higher);
Uncertainty quantification for confidence flagging to identify predictions requiring human verification;
Model compression for edge deployment, including knowledge distillation and pruning techniques to reduce the 7.2 M parameter count while maintaining accuracy;
Extension to airborne platforms for regional coverage;
Use of data augmentation to enhance the robustness of the solution.
Semi-supervised and self-supervised learning approaches could reduce annotation requirements, while active learning strategies could prioritize labeling of informative samples.
The results presented in this work have a potentially broad impact on the monitoring of plastic litter in the environment. In fact, automated detection transforms monitoring from labor-intensive manual surveys to scalable, quantitative assessments, particularly valuable for remote areas (riverbanks, coastal zones, wetlands). Pixel-wise segmentation enables area quantification for policy and cleanup prioritization. Temporal monitoring supports the assessment of pollution dynamics, the evaluation of interventions, and the detection of illegal dumping. By reducing costs, this methodology enables wider adoption by governments, NGOs, and citizen science.
The publicly available dataset and documented methodology provide a foundation for reproducible research and method benchmarking in the emerging field of hyperspectral plastic detection.