Attention-Gated U-Net for Robust Cross-Domain Plastic Waste Segmentation Using a UAV-Based Hyperspectral SWIR Sensor
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
Strengths
- The paper addresses an urgent global issue: plastic waste monitoring in natural ecosystems.
- The use of UAV-based SWIR hyperspectral imagery is innovative and provides high-resolution data.
- The integration of attention gates and residual connections into a U-Net backbone is a solid architectural choice that enhances contextual modeling.
- The multi-flight dataset spanning four years demonstrates robustness and temporal generalization.
Weaknesses
- Acronyms such as LDA and SVM are not defined in the abstract.
- The Materials section lacks details on the number of images captured and the ground coverage area of each image.
- The process of preparing binary mask images is unclear, including the software used.
- The method of dividing large images into non-overlapping patches (128×128 pixels) is not explained.
- The Results section does not specify the number of images used for training, validation, and testing.
- The paper only considers U-Net architecture, without comparison to other semantic segmentation models such as DeepLabV3 or SegNet.
- The use of hyperspectral imagery is not fully leveraged; spectral indices could be applied to separate plastic from non-plastic materials.
- The dataset was prepared by the authors themselves, raising concerns about realism and generalizability.
- The Conclusion section is too brief and does not sufficiently highlight contributions, limitations, or future directions.
Suggestions for Improvement
- In the abstract (line 33), please define the acronyms LDA and SVM when they are first mentioned.
- In the Materials section, specify the number of images captured during the nine UAV flights, and also indicate the ground coverage area of each image. I recommend including this information in Table 2.
- At line 212, you mention the use of a binary mask, but the procedure for preparing the mask image is not clearly explained. Please clarify how the mask was generated and specify the software used.
- At line 224, you state that the patch size for training is 128×128 pixels. However, it is not clear how the large images were divided into non-overlapping patches. Please provide more details on this process.
- Why did you choose to use only the U-Net architecture? Other semantic segmentation models such as DeepLabV3 and SegNet are available. A justification or comparison would strengthen the paper.
- The Results section is insufficient. You should include the number of images used for training, validation, and testing. Additionally, if different backbones were considered to improve semantic segmentation, please compare their performance.
- Since you are using hyperspectral imagery, why did you not apply different spectral indices to separate plastic from non-plastic materials?
Comments for author File:
Comments.pdf
Author Response
We thank Reviewer 1 for their constructive comments and suggestions. We have carefully addressed each point as detailed below. Changes in the revised manuscript are highlighted in red.
Comment 1:
"In the abstract (line 33), please define the acronyms LDA and SVM when they are first mentioned."
Response:
We agree with the reviewer. The acronyms have been defined in the abstract when first mentioned.
Changes in manuscript:
Abstract now reads: "...classical supervised machine learning techniques such as Linear Discriminant Analysis (LDA) and Support Vector Machines (SVM) applied to hyperspectral..."
Comment 2:
"In the Materials section, specify the number of images captured during the nine UAV flights, and also indicate the ground coverage area of each image. I recommend including this information in Table 2."
Response:
We thank the reviewer for this suggestion. We have added two new columns to Table 2 specifying the number of images captured and the ground coverage area (in m²) for each flight. These figures refer only to the portions of the data actually used: each flight normally captured many more images, and often made several passes over the same area, which were then separated into sections.
Changes in manuscript:
Table 2 now includes:
| Flight ID | Date | Altitude (m) | Speed (m/s) | Weather | Number of images | Area (m²) |
|---|---|---|---|---|---|---|
| Jan20a | 03/01/2020 | 8.0 | 1.0 | Sunny | 254 | 340 |
| Jan20b | 03/01/2020 | 8.0 | 1.0 | Sunny | 263 | 360 |
| Mar20 | 25/03/2020 | 10.0 | 1.0 | Cloudy | 159 | 455 |
| Apr22 | 29/04/2022 | 10.0 | 1.0 | Sunny | 261 | 551 |
| Dec22 | 28/12/2022 | 7.0 | 1.0 | Sunny | 369 | 858 |
| 7Feb24a | 07/02/2024 | 10.0 | 1.5 | Cloudy | 929 | 1519 |
| 7Feb24b | 07/02/2024 | 10.0 | 0.5 | Sunny | 499 | 1550 |
| 21Feb24a | 24/02/2021 | 10.0 | 0.7 | Sunny | 1033 | 1080 |
| 21Feb24b | 21/02/2024 | 10.0 | 1.5 | Sunny | 445 | 888 |
Comment 3:
"At line 212, you mention the use of a binary mask, but the procedure for preparing the mask image is not clearly explained. Please clarify how the mask was generated and specify the software used."
Response:
We have expanded the description of the annotation procedure to clarify the mask generation process, including the software used (GIMP) and the step-by-step methodology.
Changes in manuscript:
Section 2 now includes: Binary annotation masks were manually created using GIMP (GNU Image Manipulation Program, version 2.10). The annotation procedure involved the following steps: first, a single spectral band from each hyperspectral cube (typically at 1000 nm for optimal contrast) was exported as a grayscale reference image. This reference was visually compared with the corresponding RGB mosaic to accurately identify plastic object boundaries, and the regions corresponding to each material type were drawn over the hyperspectral layer image. The plastic fragments were labeled as 1 and represented in red, the background as 0 and represented in green, while the remaining pixels not containing information (unassigned pixels) were represented in black and excluded from the training. The annotation scheme distinguished the following semantic classes grouped into binary categories: “…
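The color convention described above (red for plastic, green for background, black for unassigned) could be decoded into a label array roughly as follows. This is a minimal sketch and not the authors' code: it assumes the mask is available as an RGB array with near-pure colors, and the function name `decode_mask`, the thresholds, and the ignore value 255 are hypothetical choices made for illustration.

```python
import numpy as np

IGNORE = 255  # hypothetical label for unassigned (black) pixels

def decode_mask(rgb):
    """Map an (H, W, 3) uint8 RGB annotation mask to an (H, W) label array.

    Assumed convention: red -> 1 (plastic), green -> 0 (background),
    anything else (e.g. black) -> IGNORE, to be excluded from training.
    """
    rgb = np.asarray(rgb)
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    labels = np.full(rgb.shape[:2], IGNORE, dtype=np.uint8)
    labels[(r > 127) & (g < 128) & (b < 128)] = 1  # red: plastic
    labels[(g > 127) & (r < 128) & (b < 128)] = 0  # green: background
    return labels

# Tiny 2x2 example: red, green / black, red
mask = np.array([[[255, 0, 0], [0, 255, 0]],
                 [[0, 0, 0], [255, 0, 0]]], dtype=np.uint8)
decoded = decode_mask(mask)
print(decoded)
```

Keeping unassigned pixels as a separate value, rather than folding them into the background, lets a training loop mask them out of the loss.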
Comment 4:
"At line 224, you state that the patch size for training is 128×128 pixels. However, it is not clear how the large images were divided into non-overlapping patches. Please provide more details on this process."
Response:
We have added an explanation of the patching procedure, including the sliding window approach, boundary handling, and exclusion criteria.
Changes in manuscript:
Section 2 now includes: "The patching procedure was implemented using a sliding window approach with a stride equal to the patch size (128 pixels), ensuring no overlap between adjacent patches. Starting from the top-left corner of each hyperspectral cube, the algorithm sequentially extracted patches by moving horizontally and then vertically across the image. Patches located at the image boundaries that did not contain the full 128×128 dimensions were discarded to maintain uniform input size. Additionally, patches containing more than 50% unassigned pixels (black regions from the mosaicking process) were excluded from the dataset to ensure sufficient valid spectral information for training."
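The patching rules quoted above (stride equal to patch size, incomplete edge tiles discarded, tiles with more than 50% unassigned pixels excluded) can be sketched as follows. This is an illustrative reading of the description, not the authors' implementation; the function name `extract_patches`, the `ignore` label value, and the array layout (height, width, bands) are assumptions.

```python
import numpy as np

def extract_patches(cube, labels, size=128, ignore=255, max_unassigned=0.5):
    """Yield (top, left, cube_patch, label_patch) for non-overlapping tiles.

    cube:   (H, W, bands) hyperspectral array
    labels: (H, W) label array where `ignore` marks unassigned pixels
    """
    h, w = labels.shape
    # Stride equals the patch size, so tiles never overlap; the range
    # bounds drop incomplete tiles at the right and bottom edges.
    # Order: left to right within a row of tiles, then top to bottom.
    for top in range(0, h - size + 1, size):
        for left in range(0, w - size + 1, size):
            lab = labels[top:top + size, left:left + size]
            # Skip tiles where more than `max_unassigned` of pixels are unassigned.
            if np.mean(lab == ignore) > max_unassigned:
                continue
            yield top, left, cube[top:top + size, left:left + size], lab

# Example: a 300x260 cube with 4 bands whose top-left tile is fully unassigned.
cube = np.zeros((300, 260, 4), dtype=np.uint16)
labels = np.zeros((300, 260), dtype=np.uint8)
labels[:128, :128] = 255
tiles = list(extract_patches(cube, labels))
print([(t, l) for t, l, _, _ in tiles])
```

In this example the 300×260 image holds a 2×2 grid of complete tiles; the 44-pixel bottom strip and 4-pixel right strip are dropped, and the fully unassigned top-left tile is skipped, leaving three patches.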
Comment 5:
"The Results section does not specify the number of images used for training, validation, and testing."
Response:
We have added a new table (Table 3) showing the patch distribution per flight and included detailed statistics about training, validation, and test set sizes for the LOO cross-validation protocol.
Changes in manuscript:
New Table 3, added to Section 2 (Materials), shows the patch distribution:
Table 3 summarizes the patch distribution across all nine UAV flights, giving a total of 15,738 patches extracted from the hyperspectral cubes. For each LOO fold, one flight was held out as the unseen test set, while patches from the remaining eight flights were divided into training (80%) and validation (20%) subsets using stratified sampling based on plastic pixel counts.
| Flight ID | Training | Validation | Total |
|---|---|---|---|
| Jan20a | 585 | 145 | 730 |
| Jan20b | 852 | 212 | 1,064 |
| Mar20 | 422 | 103 | 525 |
| Apr22 | 706 | 174 | 880 |
| Dec22 | 2,645 | 661 | 3,306 |
| 7Feb24a | 2,488 | 617 | 3,105 |
| 7Feb24b | 2,124 | 528 | 2,652 |
| 21Feb24a | 1,642 | 410 | 2,052 |
| 21Feb24b | 1,140 | 284 | 1,424 |
| Total | 12,604 | 3,134 | 15,738 |
For each of the nine LOO folds, the test set size ranged from 525 patches (Mar20) to 3,306 patches (Dec22), with an average of 1,749 test patches per fold. The corresponding training sets contained 10,019 to 12,019 patches (average: 11,226), while validation sets ranged from 2,473 to 2,989 patches (average: 2,788). This distribution ensured that each fold evaluated the model on a complete, unseen flight mission while maintaining sufficient training data diversity.
Results Section 4.1 now states: "Under the leave-one-out (LOO) cross-validation protocol, each of the nine flights was held out in turn as the test set, while the remaining eight flights provided training (80%) and validation (20%) data. This resulted in training sets of, on average, 11,226 patches and validation sets of 2,788 patches per fold, with test sets ranging from 525 to 3,306 patches depending on the held-out flight. This rigorous scheme ensured that the model was always evaluated on entirely unseen acquisition conditions, providing a realistic assessment of generalization performance."
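A leave-one-flight-out split with an 80/20 stratified division of the remaining patches could be organized along the following lines. This is a sketch under stated assumptions, not the authors' pipeline: the response does not say how plastic pixel counts were binned for stratification, so quantile bins are used here as one plausible choice, and `loo_folds`, `n_bins`, and `seed` are hypothetical names and parameters.

```python
import numpy as np

def loo_folds(flight_ids, plastic_counts, val_frac=0.2, n_bins=4, seed=0):
    """Yield (held_out, train_idx, val_idx, test_idx), one fold per flight."""
    flight_ids = np.asarray(flight_ids)
    plastic_counts = np.asarray(plastic_counts)
    rng = np.random.default_rng(seed)
    for held_out in np.unique(flight_ids):
        test_idx = np.flatnonzero(flight_ids == held_out)
        rest = np.flatnonzero(flight_ids != held_out)
        # Stratify (assumed scheme): bin the remaining patches by plastic
        # pixel count using quantile edges, then split each bin 80/20.
        quantiles = np.linspace(0.0, 1.0, n_bins + 1)[1:-1]
        edges = np.quantile(plastic_counts[rest], quantiles)
        bins = np.digitize(plastic_counts[rest], edges)
        train, val = [], []
        for b in np.unique(bins):
            members = rng.permutation(rest[bins == b])
            n_val = int(round(val_frac * len(members)))
            val.extend(members[:n_val])
            train.extend(members[n_val:])
        yield held_out, np.sort(train), np.sort(val), test_idx

# Synthetic example: three flights of 100 patches each.
flights = np.repeat(["A", "B", "C"], 100)
counts = np.random.default_rng(1).integers(0, 2000, size=300)
folds = list(loo_folds(flights, counts))
for held_out, tr, va, te in folds:
    print(held_out, len(tr), len(va), len(te))
```

Because the held-out flight never contributes to the training or validation indices, every test patch in a fold comes from an acquisition the model has not seen.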
Comment 6:
"Why did you choose to use only the U-Net architecture? Other semantic segmentation models such as DeepLabV3 and SegNet are available. A justification or comparison would strengthen the paper."
Response:
We have added a new subsection (1.3) to the Introduction, together with new text in the Results and Discussion section, that provides detailed justification for choosing U-Net over alternatives like DeepLabV3 and SegNet. Additionally, the ablation study (Table 8) demonstrates the value of our architectural enhancements.
Changes in manuscript:
New Section 1.3 "Semantic Segmentation Architectures for Dense Prediction" explains:
1.3 Semantic Segmentation Architectures for Dense Prediction
For pixel-wise semantic segmentation tasks, several deep learning architectures have been developed. Fully Convolutional Networks (FCNs) pioneered end-to-end dense prediction by replacing fully connected layers with convolutional layers. SegNet introduced an encoder-decoder architecture with pooling indices for efficient upsampling, while DeepLabV3 employs atrous spatial pyramid pooling (ASPP) to capture multi-scale contextual information without losing resolution.
In this work, we adopt the U-Net architecture [27] as our baseline for several reasons. First, U-Net's symmetric encoder-decoder structure with skip connections is particularly well-suited for preserving fine-grained spatial details essential for accurate plastic boundary delineation. Second, unlike DeepLabV3 which was designed primarily for natural RGB images with pre-trained ImageNet weights, U-Net can be trained from scratch on hyperspectral data without requiring transfer learning from incompatible spectral domains. Third, SegNet's reliance on pooling indices, while memory-efficient, may lose subtle spectral information critical for distinguishing spectrally similar materials. Fourth, U-Net has demonstrated strong performance in medical imaging and remote sensing applications where precise boundary detection is crucial, making it a natural choice for plastic waste segmentation. We further enhance the U-Net architecture with attention gates [28] and residual connections [29] to improve gradient flow and enable the model to focus on discriminative spectral-spatial features.
4.5 Ablation Study and Comparison with Classical Machine Learning
The choice of U-Net as the base architecture, rather than alternatives such as DeepLabV3 or SegNet, was motivated by several considerations specific to hyperspectral plastic segmentation. First, U-Net's symmetric encoder-decoder structure with skip connections excels at preserving fine-grained spatial details essential for accurate boundary delineation. Second, architectures like DeepLabV3 rely on ImageNet pre-trained weights optimized for 3-channel RGB images, which cannot be directly transferred to 60-channel hyperspectral data; U-Net's ability to train effectively from scratch makes it better suited for this domain. Third, the attention mechanism provides interpretable feature selection, highlighting which spatial regions and spectral bands contribute most to classification decisions. The substantial performance gains achieved through residual connections and attention gates (+25% Dice) validate these design choices.
Comment 7:
"Since you are using hyperspectral imagery, why did you not apply different spectral indices to separate plastic from non-plastic materials?"
Response:
We appreciate this observation. Our approach intentionally uses raw spectral bands rather than derived indices for two reasons: (1) the deep learning model learns optimal spectral feature combinations directly from data, which may capture more complex relationships than predefined indices; (2) using raw data without preprocessing tests the model's ability to operate in rapid field deployment scenarios. We have clarified this design choice in Section 4.1.
Changes in manuscript:
Section 4.1 now states: "The hyperspectral cubes were used in their raw digital number (DN) format without radiometric calibration, atmospheric correction, or intensity normalization. This decision was motivated by two factors: (1) preserving the original spectral relationships captured by the sensor, and (2) testing the model's ability to learn robust features directly from uncalibrated data, which is more representative of rapid field deployment scenarios where calibration may not be feasible."
Concerning spectral indices, which one of us (Moroni) has used in the past (cf. new citation [10]), they show limited discrimination ability and are outperformed by more sophisticated algorithms. We have therefore added the following comment to the Introduction: “Traditional machine learning approaches typically rely on handcrafted spectral features or indices [10-11] and pixel-wise classification without capturing spatial context. Spectral indices highlight absorption patterns of different materials in a relative way, being, e.g., ratios of reflectances in specific characteristic wavelengths, or spectral angles. Such approaches show some limitations in complex settings, so that they are outperformed by more sophisticated methodologies.”
Comment 8:
"The dataset was prepared by the authors themselves, raising concerns about realism and generalizability."
Response:
We understand this concern. However, we emphasize that: (1) the dataset spans 9 flights over 4 years with varied weather conditions, backgrounds, and plastic types, representing substantial environmental diversity; (2) the leave-one-out cross-validation ensures strict separation between training and testing across both space and time; (3) the dataset is publicly available (Mendeley DOI: 10.17632/nmpjzrky3r.1) for independent validation. The Mar20 case study (Section 4.4) demonstrates that even within our own dataset, significant acquisition variations challenge the model, confirming realistic evaluation conditions.
In any case, to our knowledge, no public dataset suitable for the kind of analysis performed in this work was available.
Changes in manuscript:
We did not apply changes. The paper addresses dataset diversity through LOO evaluation and public availability. We are extending our dataset to even more diverse cases and will make it available in the future. We are also ready to use public datasets from other authors, but have not yet found any that are suitable for this task. The Mar20 analysis (Section 4.4) demonstrates challenging real-world variations within the dataset.
Comment 9:
"The Conclusion section is too brief and does not sufficiently highlight contributions, limitations, or future directions."
Response:
We have expanded the Conclusion section to more comprehensively summarize contributions, explicitly state limitations, and outline future research directions.
Changes in manuscript:
The Conclusion now includes:
A clear enumeration of 5 key contributions, limitations, and future priorities.
Author Response File:
Author Response.pdf
Reviewer 2 Report
Comments and Suggestions for Authors
This paper proposes a U-Net architecture based on attention gates for pixel-wise plastic waste segmentation in shortwave infrared (SWIR) hyperspectral imagery acquired from an unmanned aerial vehicle (UAV). Although the performance is not satisfactory under certain specific conditions (such as overexposure scenes), the overall performance is superior to traditional linear discriminant analysis (LDA) methods. However, there are still some technical issues and limitations in the experimental design that need to be further improved.
1. The basis for the selection of flight parameters and how these parameters affect the spatial resolution and spectral quality of the hyperspectral data should be supplemented. In addition, it is suggested to add a comparative analysis of data quality under different flight parameters in the experimental design section to prove the rationality of the selected parameters.
2. It is possible to explore how to improve the robustness of the model through data augmentation or preprocessing. For example, exposure correction algorithms can be used to preprocess the data to reduce the impact of lighting conditions on model performance.
3. Focal Loss is mainly used to solve the problem of class imbalance by reducing the weight of easily classified samples to focus on hard-to-classify samples. Dice Loss, on the other hand, is primarily used to optimize the overlap of segmentation. Will the simple combination of the two lead to conflicts in the model's optimization process?
4. The model has a high complexity (7.2 million parameters) and a long training time. It is recommended to consider model compression or lightweighting techniques, such as knowledge distillation or pruning techniques to optimize the model structure.
5. The full terms should be provided when English abbreviations appear for the first time.
6. Words like "where" at line 298 should be indented; attention should be paid to such issues. In addition, the capitalization format of the reference titles is not consistent.
Author Response
We thank Reviewer 2 for their constructive comments and suggestions. We have carefully addressed each point as detailed below. Changes in the revised manuscript are highlighted in red.
Comment 1:
"The basis for the selection of flight parameters and how these parameters affect the spatial resolution and spectral quality of the hyperspectral data should be supplemented. In addition, it is suggested to add a comparative analysis of data quality under different flight parameters in the experimental design section to prove the rationality of the selected parameters."
Response:
We have added an explanation of the flight parameter selection rationale in Section 2. The flight parameters (altitude 7-10m, speed 0.5-1.5 m/s) were chosen to balance spatial resolution (GSD 3-4 cm), data quality, and operational constraints. The brightness analysis in Section 4.4 (Figure 5) now serves as a comparative quality analysis across different acquisition conditions.
Changes in manuscript:
Section 2 now explains: "These flight parameters were selected to optimize the trade-off between spatial resolution, spectral quality, and operational efficiency. Lower altitudes (7–8 m) provide finer ground sampling distance (GSD) but reduce coverage area per flight and increase sensitivity to terrain variations. Higher altitudes (10 m) offer greater coverage and more stable flight characteristics but at the cost of coarser spatial resolution. The across-track GSD, determined by altitude, sensor optics, and pixel count, ranged from 3-4 cm. The along-track GSD, determined by flight speed and line-acquisition rate (12.5-16 Hz), was typically larger (4-12 cm)."
Section 4.4 analyzes brightness variations across flights, demonstrating how different conditions affect data quality.
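The along-track figures quoted above can be checked with a short calculation. The sketch below assumes that the along-track sampling distance of a line-scanning sensor is simply the distance the UAV travels between two consecutive lines (speed divided by line rate), ignoring the instantaneous field of view and motion smear; the function name `along_track_gsd_cm` is hypothetical.

```python
def along_track_gsd_cm(speed_m_s, line_rate_hz):
    """Distance travelled between consecutive scan lines, in centimetres."""
    return 100.0 * speed_m_s / line_rate_hz

# Speeds and line rates taken from the ranges quoted in the response.
for speed in (0.5, 1.0, 1.5):
    for rate in (12.5, 16.0):
        gsd = along_track_gsd_cm(speed, rate)
        print(f"{speed} m/s at {rate} Hz: {gsd:.2f} cm")
```

Under this assumption, a 12.5 Hz line rate gives 4 cm at 0.5 m/s and 12 cm at 1.5 m/s, which is consistent with the quoted 4-12 cm range.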
Comment 2:
"It is possible to explore how to improve the robustness of the model through data augmentation or preprocessing. For example, exposure correction algorithms can be used to preprocess the data to reduce the impact of lighting conditions on model performance."
Response:
Thanks for the suggestion. We deliberately used raw data without preprocessing to test the baseline model capability. The Mar20 case study (Section 4.4) demonstrates that including diverse acquisition conditions in training data is an effective alternative to preprocessing. In our previous paper [8], we examined various normalization or calibration procedures to adjust the brightness levels appropriately, and concluded that a very simple and relatively “rough” adjustment was more than sufficient, and very easy to apply (also considering the practical issue of avoiding performing a proper calibration procedure in the field). We have added this as a future research direction.
Regarding data augmentation, we did not apply geometric or spectral augmentations in this study to establish a baseline performance, and because the multi-flight dataset already provides substantial natural variability across 9 flights spanning 4 years. However, we indicated the intention of doing so in future developments.
Changes in manuscript:
In Section 2 we added a short note: “In a previous paper [8], we proved that calibration does not enhance the discrimination performance of the classifiers, while a very simple level adjustment does help generalization.” Section 4.4 now discusses: "Since no radiometric normalization was applied to the raw data, the model trained on properly exposed flights could not generalize to Mar20's shifted intensity distribution."
The all-flights model results (Table 7) demonstrate that including diverse conditions in training improves robustness without explicit preprocessing.
In the Conclusions we added: "Future research activity will address limitations of the present study, including: […]
- lack of level and dynamic range adjustment of the data: simple equalization methods such as we previously exploited will be tested in this framework."
“use of data augmentation to enhance the robustness of the solution”
Comment 3:
"Focal Loss is mainly used to solve the problem of class imbalance by reducing the weight of easily classified samples to focus on hard-to-classify samples. Dice Loss, on the other hand, is primarily used to optimize the overlap of segmentation. Will the simple combination of the two lead to conflicts in the model's optimization process?"
Response:
We appreciate this insightful question. The combination of Focal and Dice loss is well-established in segmentation literature [32]. Focal loss addresses class imbalance at the pixel level by down-weighting easy examples, while Dice loss optimizes global overlap. These objectives are complementary rather than conflicting: Focal loss ensures learning from hard examples (typically at boundaries), while Dice loss ensures overall segmentation quality. The equal weighting (α=0.5) balances both objectives. Our results (low training-validation gap, stable convergence in Figure 6b) confirm no optimization conflicts.
Changes in manuscript:
Section 3 already states: "We trained the network using a combined focal and Dice loss [30-31] to address both class imbalance (plastic occupies <15% of image area) and optimize segmentation overlap [32]."
Figure 5b shows stable, rapid convergence without oscillation, confirming no optimization conflicts. Reference [32] (Mu et al., 2024) specifically validates this combination for segmentation tasks.
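To make the discussion of the combined objective concrete, one common way to write it is L = α·L_focal + (1 − α)·L_Dice with α = 0.5. The NumPy sketch below illustrates that form for a binary mask; it is not the authors' training code. The focusing parameter `gamma=2.0`, the Dice smoothing constant, and the function names are assumptions, since the response only states the equal weighting.

```python
import numpy as np

def focal_loss(p, y, gamma=2.0, eps=1e-7):
    """Binary focal loss: cross-entropy down-weighted for easy pixels."""
    p = np.clip(p, eps, 1.0 - eps)
    pt = np.where(y == 1, p, 1.0 - p)  # predicted probability of the true class
    return float(np.mean(-((1.0 - pt) ** gamma) * np.log(pt)))

def dice_loss(p, y, smooth=1.0):
    """Soft Dice loss: 1 minus the overlap between prediction and mask."""
    inter = np.sum(p * y)
    return float(1.0 - (2.0 * inter + smooth) / (np.sum(p) + np.sum(y) + smooth))

def combined_loss(p, y, alpha=0.5, gamma=2.0):
    """Weighted sum of focal and Dice terms (alpha = 0.5 weights them equally)."""
    return alpha * focal_loss(p, y, gamma) + (1.0 - alpha) * dice_loss(p, y)

# Imbalanced toy mask: 2 plastic pixels out of 16.
y = np.zeros((4, 4))
y[0, :2] = 1
good = np.where(y == 1, 0.9, 0.1)  # mostly correct probabilities
bad = np.where(y == 1, 0.1, 0.9)   # mostly wrong probabilities
loss_good = combined_loss(good, y)
loss_bad = combined_loss(bad, y)
print(loss_good, loss_bad)
```

Both terms are non-negative and reach their minimum at a perfect prediction, which is one way to see why the two objectives need not pull in opposite directions.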
Comment 4:
"The model has a high complexity (7.2 million parameters) and a long training time. It is recommended to consider model compression or lightweighting techniques, such as knowledge distillation or pruning techniques to optimize the model structure."
Response:
We agree that model compression is valuable for edge deployment. However, we note that: (1) inference latency is already 8.1 ms/tile, enabling real-time processing at ~123 tiles/second; (2) training time of 38 minutes/fold is reasonable for research applications. We have added model compression as an explicit future research direction.
Changes in manuscript:
Table 4 shows efficient inference: 8.1 ms/tile, enabling processing of a 100 m² survey in ~5 seconds.
Future work section now explicitly includes: "model compression for edge deployment, including knowledge distillation and pruning techniques to reduce the 7.2M parameter count while maintaining accuracy."
Comment 5:
"The full terms should be provided when English abbreviations appear for the first time."
Response:
We have reviewed the manuscript and ensured all abbreviations are defined at first use.
Changes in manuscript:
Abbreviations now defined at first use:
- HSI → Hyperspectral Imaging (HSI)
- SWIR → Short-Wave Infrared (SWIR)
- LDA → Linear Discriminant Analysis (LDA)
- SVM → Support Vector Machines (SVM)
- CNN → Convolutional Neural Networks (CNNs)
- UAV → Unmanned Aerial Vehicle (UAV)
- GSD → Ground Sampling Distance (GSD)
- LOO → Leave-One-Out (LOO)
- DN → Digital Number (DN)
Comment 6:
"Words like "where" at line 298 should be indented; attention should be paid to such issues. In addition, the capitalization format of the reference titles is not consistent."
Response:
We thank the reviewer for noting these formatting issues. We have corrected the indentation of equation explanations and standardized the reference title capitalization throughout the manuscript.
Changes in manuscript:
- Equation notation (e.g., "where") now properly indented following MDPI formatting guidelines.
- Reference titles standardized to sentence case as per MDPI style
Author Response File:
Author Response.pdf
Reviewer 3 Report
Comments and Suggestions for Authors
This research focuses on plastic waste segmentation using UAV SWIR hyperspectral imagery, a topic closely aligned with environmental monitoring needs. Its core conclusion that "model generalization is constrained by data diversity rather than architectural design" has practical value, and the publicly released dataset also provides a benchmark for the field.
1. The structure of the introduction section is rather disorganized. It is recommended that the plastic remote sensing recognition algorithms be elaborated separately by traditional methods and deep learning methods, so as to highlight the innovation of using SWIR hyperspectral technology for plastic identification.
2. Given that the experiment involves samples of different plastic types (e.g., PET/PE/PVC), an in-depth analysis on the ability to distinguish among these various plastic types is advised. Simply differentiating between plastics and non-plastics would fail to exploit the full potential of SWIR hyperspectral technology. Meanwhile, the definition of "cross-domain" in the experiment is only limited to differences in time and weather conditions, which lacks sufficient transferability.
3. The innovation lies in integrating the attention mechanism and residual connections into the U-Net architecture, as well as verifying its generalization performance using flight data obtained from 9 missions over 4 years. However, there is insufficient performance comparison with existing similar deep learning architectures; the current comparison only focuses on traditional models such as LDA and SVM. It is recommended to add quantitative comparisons with deep learning models of the same category, so as to more comprehensively highlight the advantages of the proposed architecture.
Author Response
We thank Reviewer 3 for their constructive comments and suggestions. We have carefully addressed each point as detailed below. Changes in the revised manuscript are highlighted using the MS Word reviewing tool.
Comment 1:
"The structure of the introduction section is rather disorganized. It is recommended that the plastic remote sensing recognition algorithms be elaborated separately by traditional methods and deep learning methods, so as to highlight the innovation of using SWIR hyperspectral technology for plastic identification."
Response:
We have restructured the Introduction section into clear subsections, separating traditional methods from deep learning methods, and added a dedicated subsection highlighting the SWIR hyperspectral innovation.
Changes in manuscript:
Introduction now organized as:
1.1. Traditional Machine Learning Approaches for Plastic Detection
- LDA, SVM methods and their limitations
1.2. Deep Learning Approaches for Waste and Plastic Detection
1.2.1. RGB-Based Deep Learning Methods
1.2.2. Hyperspectral Deep Learning Methods
1.3. Semantic Segmentation Architectures for Dense Prediction
- U-Net justification vs DeepLabV3/SegNet
1.4. Research Gap and Contributions
- SWIR hyperspectral innovation highlighted
- 5 explicit contributions listed
Comment 2:
"Given that the experiment involves samples of different plastic types (e.g., PET/PE/PVC), an in-depth analysis on the ability to distinguish among these various plastic types is advised. Simply differentiating between plastics and non-plastics would fail to exploit the full potential of SWIR hyperspectral technology. Meanwhile, the definition of "cross-domain" in the experiment is only limited to differences in time and weather conditions, which lacks sufficient transferability."
Response:
We acknowledge this valuable point. The current study focuses on binary plastic/non-plastic segmentation as a foundational step, which is the primary requirement for environmental monitoring applications (detecting the presence of plastic litter). Multi-class polymer discrimination (PET vs PE vs PVC) is indeed possible with SWIR data and represents an important extension (which we have addressed in our previous papers [6-8]). We have indicated that in this work we concentrate on the detection of plastic litter in the environment, and in this case the distinction of polymers is not necessary. Regarding cross-domain definition, we have clarified that our evaluation encompasses temporal (4 years), meteorological (sunny/cloudy), and acquisition parameter (altitude, speed, exposure) variations.
Changes in manuscript:
In Section 1.1 we added the sentence: “While individual polymer discrimination is necessary in recycling plants, in this work we concentrate on the detection of plastic waste in the environment, regardless of polymer type.”
Section 4.1 clarifies cross-domain scope: "variations in time (2020-2024), weather (sunny/cloudy), background (grass/bare soil), altitude (7-10m), speed (0.5-1.5 m/s), and exposure conditions."
Comment 3:
"The innovation lies in integrating the attention mechanism and residual connections into the U-Net architecture, as well as verifying its generalization performance using flight data obtained from 9 missions over 4 years. However, there is insufficient performance comparison with existing similar deep learning architectures—the current comparison only focuses on traditional models such as LDA and SVM. It is recommended to add quantitative comparisons with deep learning models of the same category, so as to more comprehensively highlight the advantages of the proposed architecture."
Response:
We have addressed this through: (1) a new Introduction subsection (1.3) providing detailed justification for U-Net selection over DeepLabV3 and SegNet; (2) an expanded ablation study (Table 8) demonstrating incremental contributions of each architectural component; (3) discussion of why direct comparison with RGB-pretrained architectures is not appropriate for 60-channel hyperspectral data.
Changes in manuscript:
New Introduction Section 1.3 explains why U-Net is preferred for hyperspectral segmentation:
The choice of U-Net as the base architecture, rather than alternatives such as DeepLabV3 or SegNet, was motivated by several considerations specific to hyperspectral plastic segmentation. First, U-Net's symmetric encoder-decoder structure with skip connections excels at preserving fine-grained spatial details essential for accurate boundary delineation. Second, architectures like DeepLabV3 rely on ImageNet pre-trained weights optimized for 3-channel RGB images, which cannot be directly transferred to 60-channel hyperspectral data; U-Net's ability to train effectively from scratch makes it better suited for this domain. Third, the attention mechanism provides interpretable feature selection, highlighting which spatial regions and spectral bands contribute most to classification decisions. The substantial performance gains achieved through residual connections and attention gates (+25% Dice) validate these design choices.
Table 8 (Ablation Study) provides the architectural comparison.
Section 4.5 discusses: "architectures like DeepLabV3 rely on ImageNet pre-trained weights optimized for 3-channel RGB images, which cannot be directly transferred to 60-channel hyperspectral data; U-Net's ability to train effectively from scratch makes it better suited for this domain."
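For readers unfamiliar with the Dice score cited in the response above, the following is a minimal sketch of how the coefficient is commonly computed for a binary plastic/non-plastic mask. It is an illustration only, not the authors' implementation: the function name `dice_coefficient`, the nested-list mask representation, and the smoothing term `eps` are assumptions introduced here.

```python
def dice_coefficient(pred, target, eps=1e-7):
    """Dice = 2|P ∩ T| / (|P| + |T|) for two binary masks.

    `pred` and `target` are nested lists of 0/1 values (hypothetical
    representation; 1 = plastic, 0 = non-plastic). `eps` is an assumed
    smoothing term that avoids division by zero when both masks are empty.
    """
    flat_pred = [p for row in pred for p in row]
    flat_target = [t for row in target for t in row]
    if len(flat_pred) != len(flat_target):
        raise ValueError("masks must have the same number of pixels")

    # Pixels labelled plastic in both the prediction and the ground truth.
    intersection = sum(p * t for p, t in zip(flat_pred, flat_target))
    # Total plastic pixels across both masks.
    total = sum(flat_pred) + sum(flat_target)
    return (2.0 * intersection + eps) / (total + eps)


# Toy 2x3 masks: 3 predicted plastic pixels, 3 true plastic pixels, 2 shared.
predicted_mask = [[1, 1, 0],
                  [0, 1, 0]]
ground_truth = [[1, 0, 0],
                [0, 1, 1]]
score = dice_coefficient(predicted_mask, ground_truth)
print(f"Dice: {score:.3f}")
```

In this toy case the two masks share 2 of their 3 + 3 plastic pixels, so the score is 2·2/6, roughly 0.667. A score of 1 means the masks overlap perfectly and a score near 0 means they barely overlap.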
Author Response File:
Author Response.pdf
Round 2
Reviewer 2 Report
Comments and Suggestions for Authors
Analysing the paper contents and the revisions done, I realized it is acceptable for publication.
Author Response
Thanks for your comment. High-resolution figures have been provided and Figure 1b was completely redone. The English language has been improved by correcting many instances of punctuation and improper article usage, and by slightly rephrasing many sentences to employ a more straightforward style.
Reviewer 3 Report
Comments and Suggestions for Authors
I appreciate the authors' responses and revisions to the raised comments. The technical methodologies adopted in this manuscript are generally sound and rational. However, the research objective of this study is overly conservative and lacks sufficient academic appeal. Sophisticated sensors and advanced algorithms have been employed herein, yet they are only used for the simple discrimination between plastic and non-plastic materials. A core advantage of hyperspectral remote sensing lies in its capability for compositional identification and characterization of target ground objects. Therefore, it is highly recommended that the authors at a minimum extend the research scope to include the identification of plastic types or their chemical compositions.
Author Response
Thanks for your comments.
Concerning your main objection, we want to emphasize again that we dealt with single-polymer discrimination in the past, publishing results in several papers cited in the text. In this paper, we wanted to focus on plastic litter detection in the environment. In such an application, discriminating polymers among themselves is not required; the focus is instead on optimizing the discrimination between plastics and non-plastics. We slightly changed a sentence in the introduction to further clarify this point: "In our previous works [6-9], we confirmed the feasibility of discriminating different plastic polymers from other materials and among themselves using a customized SWIR imaging system deployed in both controlled laboratory and natural outdoor settings. [...] While individual polymer discrimination is necessary in recycling plants, in this work we focus on detecting plastic waste in the environment, regardless of polymer type."
High-resolution figures have been provided and Figure 1b was completely redone. The English language has been improved by correcting many instances of punctuation and improper article usage, and by slightly rephrasing many sentences to employ a more straightforward style.
