Article
Peer-Review Record

Applying Deep Learning Methods for a Large-Scale Riparian Vegetation Classification from High-Resolution Multimodal Aerial Remote Sensing Data

Remote Sens. 2025, 17(14), 2373; https://doi.org/10.3390/rs17142373
by Marcel Reinhardt 1,2,*, Edvinas Rommel 2, Maike Heuner 2 and Björn Baschek 1
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3:
Reviewer 4:
Submission received: 22 May 2025 / Revised: 25 June 2025 / Accepted: 3 July 2025 / Published: 10 July 2025

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

This paper utilizes high-resolution multimodal aerial remote sensing data (RGB, near-infrared, elevation models) for large-scale riparian vegetation classification. Overall, I think this research has good engineering value and direct practical significance for ecological management. However, the experimental design and analysis do not convincingly support the conclusions. The authors may consider revisions in the following respects:

  1. When comparing the early fusion (EF) and late fusion (LF) strategies, there is a variable-confounding issue: the difference in parameter count between EF1 (78.5M parameters) and LF (128.7M parameters) interferes with the comparison of fusion strategies. Although the supplemental EF2 (128.8M parameters) has a parameter count similar to LF, the paper does not systematically analyze the inherent effect of parameter count on performance (e.g., whether performance saturates as parameters increase) or the independent role of the fusion strategy at a fixed parameter count. As a result, the conclusion that "early fusion is superior" may be confounded by parameter count and lacks rigor (a fusion sketch is given after this list for illustration).

  2. The paper lacks an ablation study in its multimodal contribution analysis. Although it claims that "elevation models and NIR channels enhance classification performance," it does not quantify the independent contribution and synergistic effect of each modality (e.g., the baseline performance of a model trained solely on elevation data, or the performance drop after removing the NIR channel), nor does it verify inter-modal complementarity. As a result, conclusions such as "elevation data helps distinguish vertical vegetation structures" lack data support, weakening the justification for the multimodal design (an ablation sketch is given after this list for illustration).

  3. Regarding model selection, the paper cites only the general observation that "CNNs outperform Transformers on small datasets" as the reason for choosing U-Net variants, without discussing whether this holds for the data in this study, e.g., whether the "small-data advantage" of CNNs applies to the nine-class riparian vegetation segmentation task considered here.

  4. In the analysis of the generalization experiments, the paper merely reports 60% mIoU for "fine-tuning with corrected ground truth data" versus 58.15% mIoU for "training from scratch with corrected ground truth data" on the Rhine results, stating only that pre-training brings an improvement of roughly 2 percentage points, without explaining its practical significance. It also fails to rule out interference from the label correction, i.e., the possibility that the pre-trained model benefits more from the label correction than from its generalization capability itself.

  5. The paper lacks a specific explanation of the loss function design. It mentions using Focal Loss (γ = 2) but does not provide its formula, parameter definitions, or selection rationale, which affects the reproducibility of the results (the standard definition is given after this list for reference).
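
To make point 1 concrete, a minimal PyTorch-style sketch of the two fusion variants being contrasted (hypothetical module names and channel layout, not the authors' actual EF1/EF2/LF architectures):

```python
import torch
import torch.nn as nn

class EarlyFusionNet(nn.Module):
    """Early fusion: all modalities are stacked as input channels of one encoder."""
    def __init__(self, n_channels=6, n_classes=9):
        super().__init__()
        # a single encoder sees RGB + NIR + elevation together
        self.encoder = nn.Sequential(
            nn.Conv2d(n_channels, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
        )
        self.head = nn.Conv2d(64, n_classes, 1)

    def forward(self, rgb_nir, elevation):
        x = torch.cat([rgb_nir, elevation], dim=1)  # fuse at the input
        return self.head(self.encoder(x))

class LateFusionNet(nn.Module):
    """Late fusion: one encoder per modality, features merged before the head."""
    def __init__(self, n_classes=9):
        super().__init__()
        self.enc_img = nn.Sequential(nn.Conv2d(4, 64, 3, padding=1), nn.ReLU())
        self.enc_elev = nn.Sequential(nn.Conv2d(2, 64, 3, padding=1), nn.ReLU())
        self.head = nn.Conv2d(128, n_classes, 1)

    def forward(self, rgb_nir, elevation):
        f = torch.cat([self.enc_img(rgb_nir), self.enc_elev(elevation)], dim=1)
        return self.head(f)
```

The duplicated encoders in the late-fusion variant illustrate why LF carries more parameters than EF1, which is exactly the confound described in point 1.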
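Similarly, for point 2, a rough sketch of the kind of channel-subset ablation being requested (the channel layout and the train/evaluate hooks are assumptions for illustration, not the authors' code):

```python
# Assumed channel layout of each tile: 0-2 RGB, 3 NIR, 4 DEM, 5 nDSM (illustrative only)
ABLATIONS = {
    "rgb_only":       [0, 1, 2],
    "rgb_nir":        [0, 1, 2, 3],
    "rgb_elevation":  [0, 1, 2, 4, 5],
    "elevation_only": [4, 5],
    "all_modalities": [0, 1, 2, 3, 4, 5],
}

def run_modality_ablation(train_fn, evaluate_fn, train_set, val_set):
    """Train one model per channel subset and collect its validation mIoU.

    `train_fn(samples, in_channels)` and `evaluate_fn(model, samples)` stand in
    for the authors' existing training and evaluation routines.
    """
    results = {}
    for name, channels in ABLATIONS.items():
        select = lambda split: [(x[channels], y) for x, y in split]  # keep only the selected bands
        model = train_fn(select(train_set), in_channels=len(channels))
        results[name] = evaluate_fn(model, select(val_set))
    return results
```

Reporting such a table would quantify each modality's independent contribution and the drop caused by removing NIR or elevation.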
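For reference on point 5, the standard focal loss of Lin et al. (2017), with γ = 2 as stated in the manuscript, is

$$\mathrm{FL}(p_t) = -\,\alpha_t\,(1 - p_t)^{\gamma}\,\log(p_t), \qquad \gamma = 2,$$

where p_t is the predicted probability of the true class and α_t is an optional class-weighting factor; the modulating term (1 − p_t)^γ down-weights well-classified pixels so that training focuses on hard examples. Whether the authors use the α-weighted variant would need to be confirmed in the revised text.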

Author Response

Please see the attachment.

Author Response File: Author Response.docx

Reviewer 2 Report

Comments and Suggestions for Authors

The paper addresses an important ecological monitoring task—semantic segmentation of riparian vegetation using high-resolution aerial imagery combined with NIR and elevation data. Riparian zones are critical for biodiversity, bank stability, and ecosystem services, and detailed large-scale mapping can support sustainable management. Leveraging deep learning (DL), particularly a modified U-Net architecture with attention and residual connections (AttResU-Net), and exploring multimodal fusion (RGB, NIR, DEM, nDSM) is timely given advances in remote sensing and DL methods.

I have only one suggestion for the authors to consider.

The Introduction and Related Work extensively discuss transformers as a state-of-the-art alternative to CNNs for semantic segmentation, highlighting their potential for long-range context. However, no transformer architecture (e.g., SegFormer, SETR) is implemented and compared against the proposed AttResU-Net. I therefore suggest that the authors implement at least one relevant transformer-based segmentation model (e.g., a SegFormer variant suitable for remote sensing) and benchmark its performance on the Tidal Elbe dataset using the same input modalities (RGB-NIR + DEM/nDSM) and evaluation protocol.
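
A rough sketch of how such a benchmark could be set up, assuming the HuggingFace transformers implementation of SegFormer and training from scratch so that the six-channel multimodal input does not clash with the three-channel pretrained weights (tile size, batch size, and channel count are illustrative):

```python
import torch
from transformers import SegformerConfig, SegformerForSemanticSegmentation

config = SegformerConfig(
    num_channels=6,  # RGB + NIR + DEM + nDSM stacked as input bands
    num_labels=9,    # the nine riparian vegetation classes
)
model = SegformerForSemanticSegmentation(config)

pixel_values = torch.randn(2, 6, 512, 512)        # dummy multimodal tiles
labels = torch.randint(0, 9, (2, 512, 512))       # dummy ground-truth masks
outputs = model(pixel_values=pixel_values, labels=labels)
outputs.loss.backward()                           # reuse the same training loop and mIoU evaluation as AttResU-Net
```

Using ImageNet-pretrained SegFormer weights instead would additionally require adapting the first patch-embedding layer to accept the extra NIR and elevation channels.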


Author Response

Please see the attachment.

Author Response File: Author Response.docx

Reviewer 3 Report

Comments and Suggestions for Authors

This article explores the use of deep learning methods for large-scale riparian vegetation classification from high-resolution multimodal aerial remote sensing data. Focusing on the ecologically and socio-economically significant issue of riverbank vegetation classification aligns with sustainable management needs. The methodological design of this paper is rigorous and innovative, with comprehensive experimental design and data analysis. However, there are significant formatting issues in the article. It is suggested that the authors modify the following aspects:
1. The titles of the figures and tables are too long. It is recommended to place the descriptive text below the figures and tables rather than in the titles of the figures and tables.
2. There are too many charts and graphs; it is suggested to delete those that are not essential.
3. The readability of the charts and graphs is poor, and it is recommended to modify them according to the requirements of the journal.

Author Response

Please see the attachment.

Author Response File: Author Response.docx

Reviewer 4 Report

Comments and Suggestions for Authors

1.    This study presents a DL-based approach for processing multimodal high-resolution remote sensing data to generate a classification map of the tidal Elbe and a section of the Rhine River.
2.    The tide level of a tidal river results from the relative positions of the Sun, the Moon, and the Earth. Moreover, there are two high and two low tides each day, and two higher and two lower tidal ranges each month, leading to a varying extent of the water zone along a tidal river. The authors need to first provide the tidal range of the Elbe River.
3.    If the high tide level exceeds three meters, the shrub zone could be taken for the water zone during high tide; conversely, the water zone could be taken for natural substrate during low tide. How the nine labelled classes are distinguished across varying tide levels needs to be discussed in depth, especially since the ground truth data of the Elbe River is used as the training dataset.
4.    Misclassification of riparian vegetation might occur when a training dataset from a tidal river is applied to a non-tidal river. It is suggested that the study explore in detail how these two different river regimes, with their different tidal ranges, are handled.

Author Response

Please see the attachment.

Author Response File: Author Response.docx

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

The authors' revisions have resolved the critical methodological concerns raised in the review. The addition of the EF2 experiments conclusively isolates the efficacy of the fusion strategy under matched parameter counts, while the new multimodal ablation studies provide quantitative evidence for the elevation and NIR trade-offs. Although the pre-training gains remain modest, their practical value for minority classes is now contextualized. I think the current version of the manuscript provides reliable experimental results and sufficient discussion to support its innovations. It meets Remote Sensing's standards for methodological rigor and is recommended for acceptance.

Reviewer 4 Report

Comments and Suggestions for Authors

This paper has been revised. I have no more comments.
