by Nikita A. Isaykin, Nataly I. Zaznobina and Basil N. Yakimov

Reviewer 1: Anonymous
Reviewer 2: Paramate Horkaew
Reviewer 3: Anonymous

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

Parameter Optimization for Robust Urban Tree Crown Delineation: Enhancing Accuracy in Raster-Based Segmentation

In this study, the authors investigate the impact of parameter optimization using different raster-based segmentation methods on tree-crown delineation for a UAV orthomosaic. Although there is abundant literature on this topic, this paper presents a comprehensive comparison of methods and the implications of tuning parameters. I do not have major concerns regarding the analysis, but some minor information is required at certain points. Finally, I suggest improving the introduction by reducing the number of references, which seems excessive. Minor comments follow.

Abstract

L15:16. I guess the full name of the university-campus isn’t really necessary

L17. Please specify what type of the images were used, how many were used, and other relevant details. I recommend providing more information about the characteristics of the dataset (images, area?).

L22. What does it mean for a reader not familiar with the terms?

Introduction

L41. I suggest using up to three references

L44. As well as those related to climate change adaptation, such as carbon sequestration, oxygen production, among others

L46. Remove “regulating”

L57. But it also comes with a high monetary cost.

L63. Add 3-D information

L64. Replace the word “reconstruction,” as the method does not reconstruct anything

L65. Use “forest traits” instead of “characteristics.”

L86. Individual trees at a tree-level? Confusing

L128. The readers may not be familiar with that type of vegetation. Please add a brief description of it.

L131. UAV? Multispectral?

L134. What is the main hypothesis of the research? Although it is always recommended to conduct a thorough literature review, I think that having 74 references up to the introduction is excessive. Please include only those that are strictly relevant and necessary

Materials and Methods

Table 1, replace the “,” by “.”

L163. How many images?

L169. Ok

L193. Is there any reference to support this threshold? It can vary depending on the region and type of vegetation. I've seen forests with NDVI ranging from 0.4 to 0.5

L210. Which software was used?

Results

Figure 5. m2

L426:L427. Replace for English

L432. Could you specify the type of computer used? Processing times are likely to vary depending on the computational performance of the system.

Discussion

L545. Silva et al [?]

L669. Hastings et al. [?]

Author Response

Comment 1: Lines 15-16: I guess the full name of the university-campus isn’t really necessary.

Response 1: We have switched to the conventionally accepted short name of the university.

 

Comment 2: Line 17: Please specify what type of the images were used, how many were used, and other relevant details. I recommend providing more information about the characteristics of the dataset (images, area?).

Response 2: We believe that this information is fully described in the "Materials and Methods" section. Section 2.1 gives the area of the studied territory (13.07 hectares), and Section 2.2.2 describes in detail the characteristics of the sensors and the survey itself, including the number of images and the proportion of their overlap.

 

Comment 3: Line 22: What does it mean for a reader not familiar with the terms?

Response 3: We have added a clarification to improve the readability of the abstract.

 

Comment 4: Line 41: I suggest using up to three references.

Response 4: We have reconsidered the number of references in the introduction.

 

Comment 5: L44. As well as those related to climate change adaptation, such as carbon sequestration, oxygen production, among others.

Response 5: We have added a direct reference to climate change adaptation (Line 45).

 

Comment 6: L46. Remove “regulating”.

Response 6: Done.

 

Comment 7: L57. But it also comes with a high monetary cost.

Response 7: We have added a comment on the monetary cost (Line 59).

 

Comment 8: L63. Add 3-D information.

Response 8: We have extended the description of LiDAR data (Lines 65-66).

 

Comment 9: L64. Replace the word “reconstruction,” as the method does not reconstruct anything.

Response 9: The word “reconstruction” has been replaced with “modelling” (Line 67).

 

Comment 10: L65. Use “forest traits” instead of “characteristics.”

Response 10: Done.

 

Comment 11: L86. Individual trees at a tree-level? Confusing.

Response 11: We have removed the direct reference to the study scale to avoid confusion (Lines 86-87).

 

Comment 12: L128. The readers may not be familiar with that type of vegetation. Please add a brief description of it.

Response 12: We have added a brief description of the tree composition (Lines 132-133). The list of species, as well as some key traits, is presented in Table 1.

 

Comment 13: L131. UAV? Multispectral?

Response 13: We have added a clarification that we work with UAV-derived images (Line 135).

 

Comment 14: L134. What is the main hypothesis of the research?

Response 14: We have added our main hypothesis (Lines 138-140).

 

Comment 15: Although it is always recommended to conduct a thorough literature review, I think that having 74 references up to the introduction is excessive. Please include only those that are strictly relevant and necessary

Response 15: We have reconsidered the number of references in the introduction; six references have been removed.

 

Comment 16: Table 1, replace the “,” by “.”

Response 16: Done.

 

Comment 17: L163. How many images?

Response 17: The exact number of images is provided in Line 177.

 

Comment 18: L193. Is there any reference to support this threshold? It can vary depending on the region and type of vegetation. I've seen forests with NDVI ranging from 0.4 to 0.5

Response 18: We fully agree that using an NDVI threshold of 0.5 is a simplification. We chose this value as a broadly averaged indicator of actively growing vegetation so that the optimization procedure would rest on a consistent vegetation mask. This threshold certainly cannot cover all possible situations, but it is sufficient to capture the main pattern. In the present project our focus was on an end-to-end optimization process for all variables; we may address the choice of NDVI threshold in further investigations.
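
For illustration, the masking step could look like the following minimal sketch (assuming a terra-based raster workflow; file names are hypothetical and this is not the authors' actual code):

    # Illustrative only: mask the CHM to pixels whose NDVI exceeds the 0.5 threshold.
    library(terra)
    red  <- rast("orthomosaic_red.tif")    # hypothetical band rasters
    nir  <- rast("orthomosaic_nir.tif")
    chm  <- rast("chm.tif")
    ndvi <- (nir - red) / (nir + red)
    chm_veg <- ifel(ndvi > 0.5, chm, NA)   # keep CHM values only where vegetation is detected
    writeRaster(chm_veg, "chm_vegetation_only.tif", overwrite = TRUE)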

 

Comment 19: L210. Which software was used?

Response 19: QGIS (Quantum GIS) was used for the manual crown delineation; we have added this clarification (Line 212).

 

Comment 20: Figure 5. m2

Response 20: Done.

 

Comment 21: L426:L427. Replace for English

Response 21: Done.

 

Comment 22: L432. Could you specify the type of computer used? Processing times are likely to vary depending on the computational performance of the system.

Response 22: We have added the specifications of the computer (Lines 354-355).

 

Comment 23: L545. Silva et al [?]

Response 23: Done.

 

Comment 24: L669. Hastings et al. [?]

Response 24: Done.

Reviewer 2 Report

Comments and Suggestions for Authors

Parameter Optimization for Robust Urban Tree Crown Delineation: Enhancing Accuracy in Raster-Based Segmentation

This manuscript focuses on the strategic selection and optimization of parameters within vegetation segmentation algorithms, namely Marker-Controlled Watershed Segmentation (MCWS), watershed, Marker-based Regional Growth (by Dalponte), and Improved K-Nearest Neighbor (by Silva). The benchmarking was based on a UAV-derived canopy height model, and the optimization methods employed were random search and differential evolution. The primary objective is to identify the delineation algorithm that achieves the best tree crown delineation performance. The analyses revealed that MCWS demonstrated the most favorable balance between computational efficiency and optimality. As anticipated, more complex algorithms exhibited superior delineation accuracy at the expense of increased computational burden. Although the novelty of this study is limited, its contribution lies in the extensive and rigorous analysis of strategic parameter tuning and its practical impact. Please consider and address the following comments to enhance its academic value.

  1. Since the output of the optimization is data driven, they are very much dependent on the underlying CHM images. While the stated research question (RQ) aims to identify optimal parameters, the generalizability of these parameters remains unclear. For instance, it is not evident whether the parameters derived from the Lobachevsky State University of Nizhny Novgorod campus can be as reliably applied to other comparable environments. Empirical validation across diverse geographical locations would strengthen the study’s claims and applicability. On a related note, the practical utility of the identified parameters following optimization-detection remains unclear. Please highlight how the parameters contribute to subsequent analysis (or decision-making) once segmentation is complete.
  2. Although NDVI thresholding at 0.5 is a widely used approach, more robust vegetation segmentation methods (e.g., those based on ML) may offer improved noise reduction (as clearly noted in the final panel of Fig. 3). It would be beneficial to discuss the relative pros and cons of these alternatives in comparison to the chosen thresholding technique.
  3. While the default values and parameter ranges are listed in Tables 3, 4, and 5, it remains unclear how specific intervals, such as 0–15, 0–0.3, or 0.0001–5, were determined. Please provide justification for the chosen default, minimum, and maximum values (whether based on forestry rationale, mathematical derivation, preliminary trial, or any supporting evidence).
  4. While IoU is an acceptable metric, it would be beneficial to include additional measures, i.e., the Rand Index (RI) and Dice coefficient to provide a more comprehensive assessment of segmentation accuracy. Furthermore, the current definitions of TP, TN, and FN, based on whether a crown is correctly detected (using an IoU threshold of 0.5) are more appropriate for classification tasks such as tree counting (In fact, 0.5 is almost mis-segmentation). In the current segmentation contexts, however, these metrics are typically defined at the pixel level, reflecting whether each pixel is correctly labeled as part of a crown. This discrepancy contributes to notably low performance metrics (below 0.4) as reported in Tables 7 and 8 of the results section. Please consider revising the evaluation protocol accordingly or provide a clear justification for applying such classification accuracy metrics within a segmentation framework.
  5. Appendix A demonstrates that stochastic-evolution (global) optimization outperforms other non-linear search methods (such as BFGS). While this justifies its selection from a performance standpoint, it also suggests that the underlying search (parameter) space may be too complex to generalize. The reliance on manually provided ground truth to reach a viable solution raises concerns about the method’s applicability to 'unseen' landscapes. Please clarify how this strategy can be adapted or validated for broader deployment beyond the current study area (i.e., that without ground truth).
  6. While the statistical analyses of parametric dependencies in Tables 12 and Fig. 8 are commendable, it would be helpful to include a visual assessment of parameter variation and performance sensitivity. For example, clearly showing how slight perturbations in parameter values lead to significant changes in performance or conversely, how performance remains stable despite such variations, would enhance interpretability and practical insight.

Author Response

Comment 1: Since the output of the optimization is data driven, they are very much dependent on the underlying CHM images. While the stated research question (RQ) aims to identify optimal parameters, the generalizability of these parameters remains unclear. For instance, it is not evident whether the parameters derived from the Lobachevsky State University of Nizhny Novgorod campus can be as reliably applied to other comparable environments. Empirical validation across diverse geographical locations would strengthen the study’s claims and applicability. On a related note, the practical utility of the identified parameters following optimization-detection remains unclear. Please highlight how the parameters contribute to subsequent analysis (or decision-making) once segmentation is complete.

Response 1: We have added a brief discussion of the non-generalizability of the particular parameter values obtained in our case study, and we advocate applying optimization as part of the overall methodology (Lines 715-727).

 

Comment 2: Although NDVI thresholding at 0.5 is a widely used approach, more robust vegetation segmentation methods (e.g., those based on ML) may offer improved noise reduction (as clearly noted in the final panel of Fig. 3). It would be beneficial to discuss the relative pros and cons of these alternatives in comparison to the chosen thresholding technique.

Response 2: We have added a discussion of the potential application of more sophisticated techniques for further improvement of the overall tree detection and crown delineation framework (Lines 787-800).

 

Comment 3: While the default values and parameter ranges are listed in Tables 3, 4, and 5, it remains unclear how specific intervals, such as 0–15, 0–0.3, or 0.0001–5, were determined. Please provide justification for the chosen default, minimum, and maximum values (whether based on forestry rationale, mathematical derivation, preliminary trial, or any supporting evidence).

Response 3: We agree that clarity on this aspect is crucial for the reproducibility and understanding of our research. We have internally developed a detailed technical justification for the determination of each parameter's bounds. This justification includes considerations such as:

  • Lower bounds: These were set based on either the theoretical absence of an effect (e.g., a zero value for a smoothing parameter implying no smoothing, or for a regression slope indicating a constant crown width irrespective of height), or the physical and theoretical limits of a meaningful search (e.g., non-negative values for tree heights). For instance, a minimum value of 0.001 for the REGint parameter was chosen to prevent algorithmic looping caused by zero search radii in certain extreme pixel conditions.
  • Upper bounds: These were generally determined where further increases in parameter values led to a steady decrease in segmentation efficiency (e.g., by excluding legitimate trees or creating excessively large search radii, leading to an increase in False Negatives). Another critical factor was computational feasibility; unreasonably large ranges were avoided as they drastically increase computation time, especially for multi-dimensional optimization approaches like Differential Evolution, which proved unfeasible for methods like Grid Search. Thus, a balanced range was specified for many parameters, reflecting both theoretical applicability and practical computational constraints.

While we possess these detailed individual justifications, we have chosen not to include all these technical specifics for each parameter directly within the main text. Instead, to maintain the article's flow and avoid overburdening the reader with excessive detail, we have added a concise, overarching description of the methodology for determining these parameter ranges in Section 2.5 (Lines 338-353). This updated text explains the general principles that guided our choice of default, minimum, and maximum values – combining theoretical limits, practical considerations, and preliminary empirical trials – without delving into the minutiae of each specific parameter.

We believe this approach enhances the readability of the manuscript while ensuring transparency regarding our methodology for defining parameter search spaces.

 

Comment 4: While IoU is an acceptable metric, it would be beneficial to include additional measures, i.e., the Rand Index (RI) and Dice coefficient to provide a more comprehensive assessment of segmentation accuracy. Furthermore, the current definitions of TP, TN, and FN, based on whether a crown is correctly detected (using an IoU threshold of 0.5) are more appropriate for classification tasks such as tree counting (In fact, 0.5 is almost mis-segmentation). In the current segmentation contexts, however, these metrics are typically defined at the pixel level, reflecting whether each pixel is correctly labeled as part of a crown. This discrepancy contributes to notably low performance metrics (below 0.4) as reported in Tables 7 and 8 of the results section. Please consider revising the evaluation protocol accordingly or provide a clear justification for applying such classification accuracy metrics within a segmentation framework.

Response 4: We agree that including additional measures, such as the Rand Index and Dice coefficient, could offer a more comprehensive assessment of segmentation accuracy. Our decision to primarily utilize the Intersection over Union metric stems from our commitment to ensure direct comparability with a significant body of existing literature in individual tree crown delineation. This object-level application of IoU with a 0.5 threshold is a widely accepted and standard practice within our specific research domain. It emphasizes the quality of the object detection and segmentation match rather than a pixel-wise classification accuracy for every single pixel. We acknowledge that this object-centric evaluation strategy may result in lower reported performance metrics compared to purely pixel-level evaluations, as you noted. These values, however, accurately reflect the inherent challenges of achieving high-quality individual tree crown delineation from CHMs in heterogeneous environments. Our objective was to rigorously assess how well entire tree crowns are identified and spatially matched to reference data, which is crucial for accurate urban forest inventories.

While we recognize the benefits of other metrics, incorporating them would introduce an additional layer of complexity and could potentially make our results less directly comparable to those of our colleagues who predominantly use the IoU-based object-matching approach. We believe that a detailed comparison of various quality metrics, including RI and Dice coefficient, in the context of optimizing segmentation algorithms for tree crowns, represents an intriguing and valuable avenue for a separate, dedicated research project in the future.

 

Comment 5: Appendix A demonstrates that stochastic-evolution (global) optimization outperforms other non-linear search methods (such as BFGS). While this justifies its selection from a performance standpoint, it also suggests that the underlying search (parameter) space may be too complex to generalize. The reliance on manually provided ground truth to reach a viable solution raises concerns about the method’s applicability to 'unseen' landscapes. Please clarify how this strategy can be adapted or validated for broader deployment beyond the current study area (i.e., that without ground truth).

Response 5: We acknowledge that our optimization relies on ground truth data to achieve viable solutions. To clarify the broader applicability of this strategy, we have added a clarification to the last paragraph of Section 4.2 (Lines 728-733).

 

Comment 6: While the statistical analyses of parametric dependencies in Tables 12 and Fig. 8 are commendable, it would be helpful to include a visual assessment of parameter variation and performance sensitivity. For example, clearly showing how slight perturbations in parameter values lead to significant changes in performance or conversely, how performance remains stable despite such variations, would enhance interpretability and practical insight.

Response 6: We agree that a deeper understanding of parameter influence and performance sensitivity is crucial. We have addressed this by expanding our results with a dedicated Section 3.3.4 "Ablation Study". This section mathematically isolates the effect of individual optimized parameters by quantifying the F-score degradation when each parameter is systematically reverted to its default value. While not a continuous visual assessment of parameter variation, this analysis provides a direct and quantitative measure of performance sensitivity to each parameter's optimized setting. It clearly demonstrates how significant the impact of each parameter's tuning is, thereby enhancing interpretability and practical insight into their relative importance. This approach provides a clear quantitative assessment of how robust the optimal performance is to changes in individual parameters, thereby fulfilling the request for understanding performance sensitivity.
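
For illustration, the ablation protocol can be summarized by the following sketch (evaluate_fscore() is a hypothetical wrapper around the segmentation and crown-matching pipeline, and optimized_params and default_params are named lists; this is not the code used in the manuscript):

    # Revert each optimized parameter to its default, re-run the pipeline, and
    # record the F-score degradation attributable to that single parameter.
    f_optimal <- evaluate_fscore(optimized_params)
    ablation  <- sapply(names(optimized_params), function(p) {
      params      <- optimized_params
      params[[p]] <- default_params[[p]]     # revert one parameter only
      f_optimal - evaluate_fscore(params)    # degradation caused by reverting p
    })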

Reviewer 3 Report

Comments and Suggestions for Authors

Attached

Comments for author File: Comments.pdf

Author Response

Comment 1:      The mathematical formulation of the segmentation algorithms (Watershed, Marker Controlled Watershed, Dalponte, Silva) should be explicitly provided. Define the cost function or similarity metric f(x,θ) each algorithm minimizes.

Response 1: We appreciate the reviewer's suggestion to explicitly provide the mathematical formulation and cost functions for each segmentation algorithm. We understand the importance of this level of detail for a comprehensive understanding of the algorithms' intrinsic workings.

However, the primary focus of our study is on the systematic optimization of parameters for these established raster-based segmentation algorithms, rather than on their fundamental mathematical derivations. The algorithms used—Watershed, Marker-Controlled Watershed, Dalponte, and Silva—are well-documented in the scientific literature. Their foundational mathematical principles, operational details, and implicit or explicit cost functions or similarity metrics are extensively described in the original publications we cite. Furthermore, our implementation relies on publicly available and thoroughly vetted functions within the lidR and ForestTools packages in R. The source code and comprehensive documentation of these packages offer further insight into their technical workings. Including a full mathematical exposition for each of the four algorithms would significantly increase the length of the manuscript and could potentially distract from our primary contribution: demonstrating the profound impact of systematic parameter optimization on the accuracy and reliability of tree crown delineation. We believe that our current approach, which provides conceptual descriptions and clear references to the foundational works and their software implementations, offers sufficient detail while maintaining the focused scope of our paper.
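
For orientation, the four algorithms are typically invoked on a CHM raster as sketched below (based on the documented lidR and ForestTools interfaces; the parameter values shown are the packages' published defaults, not the values used or optimized in the manuscript):

    # Sketch of standard package usage; see the lidR and ForestTools documentation.
    library(lidR)         # locate_trees(), watershed(), dalponte2016(), silva2016()
    library(ForestTools)  # vwf(), mcws()
    ttops      <- locate_trees(chm, lmf(ws = 5))              # local-maxima treetops
    crowns_ws  <- watershed(chm, th_tree = 2)()               # original watershed (needs EBImage)
    crowns_dal <- dalponte2016(chm, ttops, th_tree = 2, th_seed = 0.45,
                               th_cr = 0.55, max_cr = 10)()
    crowns_sil <- silva2016(chm, ttops, max_cr_factor = 0.6, exclusion = 0.3)()
    ttops_vwf  <- vwf(CHM = chm, winFun = function(x) x * 0.06 + 0.5, minHeight = 2)
    crowns_mcw <- mcws(treetops = ttops_vwf, CHM = chm, minHeight = 1.5)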

 

Comment 2:      Clearly define the optimization objective function. For instance, is the Random Search or Differential Evolution minimizing an F-score error L = 1 − F? Provide the exact form.

Response 2: We have clarified this explicitly in Section 2.5 (lines 325-327) of the revised manuscript to ensure full transparency.

 

Comment 3:      Include a complete equation for the F-score used.

Response 3: We have updated equations 6-8.
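
For readers without access to the revision, the standard object-level definitions, which Equations (6)-(8) presumably express (with TP, FP and FN counted at the crown level using the IoU-based matching criterion discussed in the responses to Reviewer 2), are:

    \mathrm{Precision} = \frac{TP}{TP + FP}, \qquad
    \mathrm{Recall} = \frac{TP}{TP + FN}, \qquad
    F = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}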

 

Comment 4:      Provide normalization details for the Canopy Height Model (CHM). Was the CHM normalized by ground elevation (znorm = z − zground) or using a percentile-based scaling?

Response 4: The CHM was computed as the difference between the Digital Surface Model (DSM) and the Digital Terrain Model (DTM), as specified in Equation (1).
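
In the terms of the reviewer's comment, this corresponds to normalization by ground elevation rather than percentile-based scaling, i.e., per pixel:

    \mathrm{CHM}(x, y) = \mathrm{DSM}(x, y) - \mathrm{DTM}(x, y)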

 

Comment 5:      Clarify the parameter search space. For instance, if parameter pi ∈ [ai,bi], specify bounds and justify them based on empirical or theoretical considerations.

Response 5: We agree that clarity on this aspect is crucial for the reproducibility and understanding of our research. We have internally developed a detailed technical justification for the determination of each parameter's bounds. This justification includes considerations such as:

  • Lower bounds: These were set based on either the theoretical absence of an effect (e.g., a zero value for a smoothing parameter implying no smoothing, or for a regression slope indicating a constant crown width irrespective of height), or the physical and theoretical limits of a meaningful search (e.g., non-negative values for tree heights). For instance, a minimum value of 0.001 for the REGint parameter was chosen to prevent algorithmic looping caused by zero search radii in certain extreme pixel conditions.
  • Upper bounds: These were generally determined where further increases in parameter values led to a steady decrease in segmentation efficiency (e.g., by excluding legitimate trees or creating excessively large search radii, leading to an increase in False Negatives). Another critical factor was computational feasibility; unreasonably large ranges were avoided as they drastically increase computation time, especially for multi-dimensional optimization approaches like Differential Evolution, which proved unfeasible for methods like Grid Search. Thus, a balanced range was specified for many parameters, reflecting both theoretical applicability and practical computational constraints.

While we possess these detailed individual justifications, we have chosen not to include all these technical specifics for each parameter directly within the main text. Instead, to maintain the article's flow and avoid overburdening the reader with excessive detail, we have added a concise, overarching description of the methodology for determining these parameter ranges in Section 2.5 (Lines 338-353). This updated text explains the general principles that guided our choice of default, minimum, and maximum values – combining theoretical limits, practical considerations, and preliminary empirical trials – without delving into the minutiae of each specific parameter.

We believe this approach enhances the readability of the manuscript while ensuring transparency regarding our methodology for defining parameter search spaces.

 

Comment 6:      Differential Evolution requires mutation and crossover coefficients (F,CR). Report the mathematical values and update rule and explain how parameter diversity was maintained.

Response 6: We have expanded the text and added the relevant equation of the mutation procedure in section 2.5.2 (Lines 376-384).
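
For readers without access to the revision, the classical DE/rand/1 mutation with binomial crossover, which the wording elsewhere in our responses (three randomly chosen population members, coefficients F and CR) points to, has the form:

    v_i = x_{r_1} + F \cdot (x_{r_2} - x_{r_3}), \qquad
    u_{i,j} = \begin{cases} v_{i,j}, & \mathrm{rand}_j \le CR \\ x_{i,j}, & \text{otherwise} \end{cases}

where r_1, r_2 and r_3 are distinct, randomly chosen population indices; the exact variant and coefficient values used are those reported in the revised Section 2.5.2.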

 

Comment 7:      Random Search should include the sampling distribution. Specify whether uniform or log-uniform sampling was used for continuous parameters.

Response 7: We clarified the application of uniform sampling in section 2.5.1 (Line 361).
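
A minimal sketch of this sampling scheme (lower and upper are hypothetical named vectors of parameter bounds, and evaluate_fscore() is a hypothetical wrapper around segmentation and crown matching; the actual implementation is provided in Section 2.5.1 and the GitHub repository):

    # Uniform random search over box-constrained parameters (illustrative only).
    set.seed(1)
    n_iter  <- 2400
    samples <- replicate(n_iter, runif(length(lower), min = lower, max = upper))
    fscores <- apply(samples, 2, function(theta) evaluate_fscore(setNames(theta, names(lower))))
    best    <- samples[, which.max(fscores)]   # parameter set with the highest F-score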

 

Comment 8:      Include the convergence criteria for optimization algorithms.

Response 8: For all optimization runs, a maximum of 2400 iterations was set as a computational constraint, at which point the optimization process was terminated. This limit was chosen to ensure a thorough exploration of the parameter space while maintaining computational feasibility. As discussed in Sections 3.3.1, 3.3.2, and 3.3.3, our results show that the algorithms typically reached a relatively stable performance plateau well within this limit (e.g., around 800-1000 iterations for Random Search and 15 generations for Differential Evolution), indicating that 2400 iterations provided ample opportunity for optimization convergence and capturing any subsequent smoother changes towards a higher F-score. We have clarified this point in Section 2.5 (Lines 337-338).

 

Comment 9: The paper mentions “22% improvement” — provide the mathematical definition.

Response 9: We acknowledge that our previous phrasing was ambiguous. We had indeed, in some instances, expressed absolute differences in F-score values (or 'F-score points') using percentage terminology, which could be misinterpreted as relative percentage improvements. To resolve this potential misinterpretation and enhance clarity, we have revised the manuscript to consistently report all improvements and changes in terms of absolute F-score points, rather than using percentage terminology in this context. This ensures an unambiguous representation of the magnitude of performance enhancements.

 

Comment 10: Introduce a statistical test (e.g., paired t-test or Wilcoxon) to compare pre- and post-optimization F-scores and include equations for significance testing.

Response 10: We fully recognize the importance of statistical rigor in scientific research. Following your recommendation, we did indeed conduct statistical analyses (Wilcoxon signed-rank tests) to formally compare the F-scores obtained with default parameters against those achieved after optimization (for the last generation in DE). These analyses consistently demonstrated highly statistically significant differences in all cases (p < 0.001), unequivocally confirming the positive and substantial impact of parameter optimization on segmentation performance.

While these formal tests corroborate the effectiveness of our optimization methods, the magnitude of the observed improvements in F-scores is already exceptionally large and practically significant, as detailed in Sections 3.3.1 and 3.3.2. Given both the overwhelming statistical significance and the clear practical importance of these gains, we believe that incorporating the full formal statistical results (such as detailed tables of p-values) into the main manuscript would be superfluous, as it would reiterate a conclusion that is already robustly supported by the presented F-score improvements.

Should the review board still deem it necessary, we are prepared to include a condensed summary of these statistical findings in a revised version of the manuscript.
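
For completeness, the paired comparison described above corresponds to the standard test in base R (f_default and f_optimized would be vectors of matched F-scores; illustrative only):

    # Paired Wilcoxon signed-rank test: are optimized F-scores greater than defaults?
    wilcox.test(f_optimized, f_default, paired = TRUE, alternative = "greater")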

 

Comment 11: The “3.1% convergence range” should be backed with a standard deviation or confidence interval computation.

Response 11: The corresponding standard deviations have been added to the results of each optimization method (Lines 432-433 and 485).

 

Comment 12: Define the mathematical formulation of the Dalponte and Silva segmentation metrics, especially how they utilize spectral and height data fusion (if applicable).

Response 12: In this study, we exclusively utilized the Canopy Height Model as the primary input for all segmentation algorithms, including those by Dalponte and Silva. As detailed in Section 2.2.2, the CHM was derived from UAV photogrammetric data, where spectral information (specifically, NDVI) was used solely for initial vegetation masking and filtering during the CHM creation process. The segmentation algorithms themselves, as implemented and applied in our study, did not involve direct fusion of spectral and height data within their core segmentation logic.

We employed the standard, established implementations of the Dalponte and Silva algorithms from the lidR and ForestTools R packages, as described in Sections 2.3.3 and 2.3.4 respectively. While we have provided a conceptual overview of their operational principles and input parameters, a detailed, step-by-step mathematical formulation of these complex algorithms was developed by their original authors. As we are users of these well-documented implementations rather than their developers, providing a comprehensive mathematical derivation of every internal component of these algorithms would be beyond the scope of this paper and our direct expertise.

 

Comment 13: The term “parameter tuning” should be represented mathematically with appropriate boundary conditions.

Response 13: We have added the corresponding equation in section 2.5 (Lines 327-328).

 

Comment 14: Clarify whether the optimization was global or local. Provide pseudo-code or algorithmic steps to support reproducibility.

Response 14: We confirm that we performed global optimization for all segmentation algorithms within the pre-defined parameter ranges, as detailed in Section 2.5. Our choice of optimization techniques, Random Search and Differential Evolution, are inherently designed to explore the entire parameter space to locate the global optimum for the objective function, rather than getting trapped in local extrema.

For full reproducibility and detailed algorithmic steps beyond the conceptual overview provided in the manuscript, the complete code, including the implementation of these optimization routines and the specific parameter ranges, is publicly available in our GitHub repository.

 

Comment 15: If multiple UAV datasets were used, specify the data partitioning method — e.g., cross-validation folds (k-fold) — to prevent overfitting of optimized parameters.

Response 15: In this study, we utilized a single, unified UAV dataset for all stages of our research, including the generation of the Canopy Height Model, manual tree crown delineation, and the subsequent algorithmic crown recognition and parameter optimization. As detailed in Section 2.2.2, all remote sensing data were acquired from a single flight mission. This single CHM was then consistently used across all considered segmentation and optimization methods.

Therefore, the concern about data partitioning (e.g., into cross-validation folds) for multiple UAV datasets to prevent overfitting is not applicable to our study design, as we did not employ multiple distinct UAV datasets. The optimization procedures were conducted on this singular, comprehensive dataset.

 

Comment 16: Include a correlation or sensitivity analysis to identify which parameters most influence segmentation accuracy.

Response 16: We agree on the importance of understanding parameter impact. It is crucial to note that many common sensitivity analysis methods, particularly those involving partial derivatives or systematic perturbations around a precise optimum, are inherently gradient-based. In our study, however, we employed non-gradient, heuristic global optimization techniques, which explore the parameter space without relying on gradient information. Given this methodological framework, implementing a formal gradient-based sensitivity analysis would introduce a different analytical paradigm that is not directly aligned with our optimization approach and would significantly extend the scope of the current research.

Nevertheless, we did conduct an analysis to understand the influence of individual parameters. As described in Section 3.3.3, we analyzed how performance metrics depend on specific parameters of the segmentation algorithms by considering all parameter values sampled during the Random Search and Differential Evolution procedures. We fitted linear models relating the F-score to individual parameters, and Table 12 presents the statistical indicators of these models, identifying mostly significant dependencies. This approach provides insights into the general trends and correlations between individual parameters and the F-score within the explored multi-dimensional space, effectively serving a purpose similar to an ablation study by highlighting which parameters exhibit stronger relationships with the final segmentation accuracy. While a comprehensive, formal sensitivity analysis beyond what we have presented would be a valuable direction for future research, we believe our current analysis, combined with the nature of our optimization methods, adequately addresses the parameter influence within the scope of this manuscript.
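
A minimal sketch of these per-parameter linear models (assuming a data frame samples with one column per sampled parameter and a column fscore; the actual models behind Table 12 may differ in detail):

    # Fit fscore ~ parameter for every sampled parameter and inspect the slopes.
    params <- setdiff(names(samples), "fscore")
    fits   <- lapply(params, function(p) lm(reformulate(p, response = "fscore"), data = samples))
    lapply(fits, summary)   # slope estimates, R-squared and p-values, as summarized in Table 12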

 

Comment 17: Provide quantitative comparison plots (e.g., optimization trajectory Ft vs. iteration t) to visualize convergence and variance in parameter search.

Response 17: Figure 7 of the manuscript, directly addresses your request for an "optimization trajectory Ft vs. iteration t" plot. This figure illustrates the dynamics of the maximum F-score achieved at each iteration (for Random Search) or generation (for Differential Evolution) for all four segmentation algorithms.

While Figure 7 focuses on the best performance found at each step, showcasing the trajectory of improvement and convergence, it indirectly reflects aspects of the parameter search. The observed stability of the F-score plateaus after initial rapid gains suggests that the methods effectively explored the parameter space to find robust configurations.

We believe that Figure 7 adequately visualizes the optimization progress and convergence as requested.

 

Comment 18: Include intersection-over-union (IoU) formula.

Response 18: The formula for calculating the Intersection-over-Union metric, specifically for the case of manually delineated crowns ("ground-truth") and automatically segmented crowns, has been included in the manuscript. It is presented as Formula (5) in Section 2.4.
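
For reference, the standard form, which Formula (5) presumably expresses for a manually delineated crown A and an automatically segmented crown B, is:

    \mathrm{IoU}(A, B) = \frac{|A \cap B|}{|A \cup B|}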

 

Comment 19: The paper lacks a loss formulation combining multiple objectives (e.g., precision and recall). Consider introducing a multi-objective fitness function where α balances the two metrics.

Response 19: In our study, we indeed utilized a metric that inherently combines and balances precision and recall: the F-score. As defined in Section 2.4, the F-score is the harmonic mean of precision and recall values, and it serves to signify overall segmentation performance. This metric is widely used in performance evaluation, particularly when there is an uneven class distribution, as it implicitly seeks a balance between achieving high precision (minimizing false positives) and high recall (minimizing false negatives). Therefore, while we did not introduce an explicit parameter to weight precision and recall separately, the F-score effectively acts as a single, combined objective that implicitly balances these two critical metrics for evaluating segmentation accuracy.

 

Comment 20: Provide the computational complexity of each algorithm, either analytically or empirically. For example, Watershed segmentation is typically O(N logN), while marker-controlled variants may scale differently.

Response 20: We acknowledge the importance of understanding the theoretical computational complexity for a comprehensive evaluation. However, we found that the original developers of these methods have not provided formal analytical estimates of their computational complexity. Furthermore, a rigorous analytical derivation of the computational complexity for each algorithm, especially considering their specific implementations and adaptations within the lidR and ForestTools packages, would entail a dedicated theoretical analysis that extends beyond the scope of the current paper.

Instead, we have focused on empirically evaluating the practical computational demands of each segmentation algorithm within our study. We believe that this practical perspective on computational demands directly reflects the 'time spent on their reproduction' and execution within a real-world application setting. As presented in Table 11 (and also in Table 9 for Random Search optimization), which details the average segmentation time per algorithm, the Dalponte method consistently exhibits the highest computational cost, making it the most time-consuming among all the presented algorithms. Conversely, the Watershed and MCWS methods are considerably more efficient in terms of software computing time. This empirical data provides a practical understanding of the algorithms' performance with respect to execution time, which is a critical aspect for application-oriented research.

 

Comment 21: Clarify whether the optimization considered stochastic effects. Were the algorithms repeated n times with random seeds to ensure stability?

Response 21: We agree that understanding the impact of randomness is crucial for robust scientific findings. Our primary optimization methods, Random Search and Differential Evolution, are inherently stochastic algorithms. Random Search, as described in Section 2.5.1, operates by independently and uniformly selecting parameter values for each iteration; in essence, each iteration represents a new random "seed" for exploring a different point in the parameter space. Differential Evolution, as detailed in Section 2.5.2, is a stochastic, population-based method that simulates evolutionary processes. Its search mechanism relies on random initialization of populations and random mutations (e.g., using three randomly chosen members of the population, as shown in its mutation procedure).

Given the inherent randomness and the extensive search conducted by these methods, we opted for a sufficiently large number of iterations/generations in a single run (up to 2400, as stated in Section 2.5). We believe that the comprehensive exploration of the multi-dimensional parameter space afforded by these long runs, driven by the algorithms' internal stochasticity, provides a robust indication of optimal parameter configurations. The objective of these global optimization methods is to broadly sample the search space to find the best possible configurations, rather than to converge to a single, precise local minimum from a specific starting point. Therefore, repeating the entire optimization process multiple times with different initial random seeds would likely yield similar final best F-score values, though potentially with different trajectories leading to those optima.

Furthermore, the core message of our paper, as highlighted in our discussion and conclusions, is the significant improvement in segmentation accuracy achieved through parameter optimization in general, regardless of the specific stochastic properties of the optimizer. Our study aims to demonstrate that optimization itself is critical, and even a random search can provide substantial gains over default settings. We acknowledge the value of repeating stochastic optimizations to assess the robustness of the optimization process itself and will certainly consider implementing such a rigorous analysis of the optimization's stability in our subsequent research on this topic.

 

Comment 22: Discuss parameter interpretability: for each optimized parameter θi, define its physical meaning (e.g., minimum crown radius, height threshold).

Response 22: We agree that understanding the physical significance of parameters is crucial for contextualizing the optimization results. For the Marker-Controlled Watershed, Dalponte, and Silva algorithms, the parameters largely possess direct physical meanings that relate to the empirical patterns of Canopy Height Model formation and tree crown characteristics. Their descriptions, including their physical interpretation and typical ranges, are provided in the manuscript.

Table 4 details the parameters for the MCWS algorithm, such as those related to crown width (CW coefficients, REGint, REGslope), which are derived from field measurements and linear regression, linking directly to physical tree dimensions, as discussed in Section 2.3.2. Table 5 provides the parameters for the Dalponte algorithm, such as th_seed and th_cr, which describe criteria for including neighboring pixels into a crown, essentially defining crown growth rules (as elaborated in Section 4.1 regarding our definition of default values for these parameters). Table 6 outlines the parameters for the Silva algorithm, including max_cr_factor (the expected tree crown diameter as a proportion of height) and exclusion (a height threshold to remove low-lying noise), both of which are directly tied to physical properties of tree crowns, as explained in Section 4.1.

In contrast, the Original Watershed algorithm (parameters in Table 3), as implemented in our study, functions primarily as a general-purpose morphological image processing filter. Its parameters, such as th_tree, hmin, or sigma, control internal algorithmic thresholds for processes such as local maxima detection and image smoothing. While these parameters are essential for the algorithm's operation, their values do not always translate directly into easily interpretable physical attributes of tree crowns (e.g., a specific crown radius or a fixed height threshold across all trees), unlike the more domain-specific parameters of the other three algorithms. This characteristic makes their interpretation within the context of tree crown recognition less intuitive.

 

Comment 23: Introduce an error propagation analysis showing how small perturbations ∆θi affect ∆F.

Response 23: We understand the value of such sensitivity analyses, particularly in contexts where optimization relies on gradient information or precise local minima. However, our optimization methods, Random Search and Differential Evolution (Sections 2.5.1 and 2.5.2), are inherently stochastic and non-gradient-based. Their reliance on direct random search and random mutations means classical gradient-based perturbation analysis is not directly applicable or necessary for our objectives. Such an analysis would be incongruent with our main hypotheses and conclusions, potentially confusing the reader.

Instead, to understand the influence of individual parameters and achieve a similar purpose to an ablation study, we analyzed how performance metrics depend on specific parameters. We also considered all parameter values sampled during the optimization procedures and fitted linear models relating the F-score to individual parameters, with the main statistical indicators presented in Table 12. This approach allowed us to identify significant trends and correlations between individual parameters and the F-score across the explored parameter space, offering insight into their impact within our multi-dimensional optimization framework.

 

Comment 24: Provide a visual mathematical comparison of pre- and post-optimization F-scores using boxplots or cumulative distribution plots.

Response 24: We agree that visualizations are essential for conveying the impact of optimization. In our manuscript, Figure 7 serves this precise purpose. These dynamic plots visually illustrate the improvement in F-score from the initial, unoptimized state (represented by the starting point of each curve) to the final, optimized state (the plateau reached at the end of the trajectory) for each algorithm and optimization method. This demonstrates not only the pre- and post-optimization F-scores but also the dynamic progression and speed of improvement over the course of the optimization process.

 

Comment 25: Add ablation studies to mathematically isolate the effect of each parameter using.

Response 25: We have addressed this by including a new dedicated Section 3.3.4 "Ablation Study", which details the methodology and presents the comprehensive quantitative results.

A concise summary of the ablation study's key findings, specifically identifying the most influential parameters for each algorithm, has also been integrated into the Discussion section (Lines 658-662).

 

Comment 26: Include a discussion on potential overfitting, with a mathematical regularization term.

Response 26: We agree that discussing the potential for overfitting is an important consideration.

We have addressed this point in the Discussion section (Lines 717-720). We acknowledge that our optimization methods do not inherently include explicit mathematical regularization terms. Consequently, the optimized parameters, being intrinsically data-driven and highly tailored to our specific dataset, carry a risk of overfitting. This highlights the necessity for empirical re-validation when applying these precise parameter values to different geographical locations, sensor types, or diverse remote sensing datasets.

 

Comment 27: Lastly, provide reproducible pseudo-code for the optimization process and define all variables clearly to improve the transparency of mathematical procedures.

Response 27: For full reproducibility and detailed algorithmic steps beyond the conceptual overview provided in the manuscript, the complete code, including the implementation of these optimization routines and the specific parameter ranges, is publicly available in our GitHub repository.

Round 2

Reviewer 2 Report

Comments and Suggestions for Authors

Although I do not fully agree with certain aspects of the authors’ responses, their rebuttals are reasonable and scientifically sound. No further revision is necessary. I recommend acceptance of this study to encourage broader discussion within the community.

Reviewer 3 Report

Comments and Suggestions for Authors

No comments