Article
Peer-Review Record

Development of a Visual SLAM-Based Autonomous UAV System for Greenhouse Plant Monitoring

by Jing-Heng Lin and Ta-Te Lin *
Reviewer 1: Anonymous
Reviewer 2:
Reviewer 3: Anonymous
Reviewer 4:
Reviewer 5: Anonymous
Submission received: 27 January 2026 / Revised: 13 March 2026 / Accepted: 13 March 2026 / Published: 15 March 2026
(This article belongs to the Section Drones in Agriculture and Forestry)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

The paper proposes a Visual SLAM driven UAV platform for greenhouse monitoring in GPS-denied corridors, framed as a distributed (dual-link) architecture where flight control and sensing remain lightweight while heavier perception and pose estimation can be executed on a dedicated compute node. 

Before the final greenhouse deployment, the authors report a staged validation process starting with a simulated virtual setup, followed by refinement in a controlled indoor physical environment to capture real platform dynamics prior to field testing. 

In the experimental greenhouse campaign, the evaluation is organized as follows: (i) navigation/localization performance and (ii) plant-monitoring performance.

For navigation, Visual SLAM is used as the UAV’s sole global localization method during flight, and its accuracy is assessed against an external UWB reference.

For plant monitoring, the final stage focuses on plant and flower detection/tracking from side-view imagery: a two-model YOLOv8 based pipeline performs plant detection first, then flower detection with small-object oriented refinement, and concludes with a multi-step counting procedure that combines tracking, optical-flow-based motion compensation, trajectory analysis, and clustering to estimate flower counts per plant over time.

====

Overall

The manuscript was well constructed and has scientific rigor.  The introduction is well motivated and provides a coherent justification for the problem, supported by relevant references. It also has clearly stated objectives and scope.  

The methodology is presented in a logically organized manner, with a clear separation between platform/architecture choices and the perception and navigation components, which improves readability.

In addition, the experimental section is described with a good level of detail. The platform configuration, communication links, and processing pipeline are documented, and the evaluation design follows a structured progression (simulation, controlled testing, and greenhouse deployment), with the sensing and estimation techniques explained at a level that allows the reader to understand the full workflow.

=== Minor
The paper could be improved in a few minor aspects. However, these should be considered optional refinements:

1) The paper could be clearer and more explicit about which functions are executed on-board the UAV versus on external computing resources, as the current description does not fully resolve where the visual pose estimation (V-SLAM) and other latency-sensitive blocks actually run.

Given the presence of multiple processing nodes and communication links, it remains difficult to infer whether the core estimation, planning, and control loops are entirely on-board, partially external, or distributed.

The authors could provide a concise mapping of major software components to the specific hardware platform and briefly characterize how inter-module communication delays are handled or tolerated.

2) The main findings and contributions could be stated more explicitly at the end of the introduction.

3) The limitations of the study should also be discussed more thoroughly in the results.

4) The current navigation setup appears to rely mostly on following pre-defined trajectories/waypoints under a structured operational scenario, rather than supporting more advanced autonomous navigation with local replanning and obstacle avoidance.

Given that the experiments are conducted under a comparatively well-prepared setting (with deliberate measures to stabilize perception and communication), it would strengthen the work to more explicitly characterize these assumptions and to evaluate how relaxing them impacts navigation performance. In particular, an analysis exploring conditions that reduce environmental structure (e.g., fewer visual aids, increased occlusions, higher traffic/dynamics, degraded link quality) would help delineate the operational envelope and clarify expected behavior in more challenging scenarios.

At minimum, detailing these limitations and outlining a concrete evaluation plan as future work would improve the contribution.

Author Response

Comment 1: The paper could be clearer and more explicit about which functions are executed on-board the UAV versus on external computing resources, as the current description does not fully resolve where the visual pose estimation (V-SLAM) and other latency-sensitive blocks actually run. Given the presence of multiple processing nodes and communication links, it remains difficult to infer whether the core estimation, planning, and control loops are entirely on-board, partially external, or distributed. The authors could provide a concise mapping of major software components to the specific hardware platform and briefly characterize how inter-module communication delays are handled or tolerated.

Response 1:

We agree that the original description left ambiguity about where latency-sensitive blocks execute. To resolve this, we added a module-to-hardware mapping table (Table 1) in Section 2.1 that assigns each major functional module (V-SLAM, mission control, detection, offline analytics) to its execution platform, together with update rates and communication interfaces. We also redesigned Figure 2 to show hardware boundaries, communication-channel separation, and key timing labels (22 ms FPV link, 126 ms pose update, 10 Hz control loop). These additions allow the reader to trace execution ownership and delay handling across the full data path (revised in Section 2.1; Table 1; Figure 2).
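As a rough illustration of how the timing labels quoted above relate to the control period, the following Python sketch expresses each delay as a multiple of one 10 Hz control cycle. The three values are taken from the response. Treating the FPV-link and pose-update delays as additive is an assumption made only for this example, because the response does not state whether the 126 ms figure already includes the 22 ms link delay. The function names are hypothetical.

```python
# Timing labels quoted in the response (Figure 2 of the revised manuscript).
FPV_LINK_MS = 22.0
POSE_UPDATE_MS = 126.0
CONTROL_RATE_HZ = 10.0


def control_period_ms(rate_hz: float) -> float:
    """Length of one control cycle in milliseconds."""
    return 1000.0 / rate_hz


def cycles_of_delay(delay_ms: float, rate_hz: float) -> float:
    """Express a delay as a multiple of the control period."""
    return delay_ms / control_period_ms(rate_hz)


# Assumption (not stated in the source): the two delays simply add up.
worst_case_ms = FPV_LINK_MS + POSE_UPDATE_MS

if __name__ == "__main__":
    for name, ms in [("FPV link", FPV_LINK_MS),
                     ("pose update", POSE_UPDATE_MS),
                     ("sum (assumed)", worst_case_ms)]:
        cycles = cycles_of_delay(ms, CONTROL_RATE_HZ)
        print(f"{name}: {ms:.0f} ms = {cycles:.2f} control cycles")
```

Read this way, the pose-update interval alone is longer than one 100 ms control cycle, which is the kind of relationship the mapping table and timing labels are meant to make visible.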

 

Comment 2: The main findings and contributions could be stated more explicitly at the end of the introduction.

Response 2:

We agree. The Introduction now closes with three numbered contribution statements: (1) constraint-driven modular architecture, (2) 27-day operational characterization, and (3) a proof-of-concept flower counting pipeline. Each point is tied to specific results and conclusions language to keep the contribution framing explicit and traceable (revised in Section 1, final paragraph).

 

Comment 3: The limitations of the study should also be discussed more thoroughly in the results.

Response 3:

We fully agree. The original manuscript underreported limitations, and this has been corrected across Methods/Results and then consolidated in Conclusions. Specifically, we now state the handheld-vs-flight caveat in Section 3.1.2, use evidence-constrained attribution language in Section 3.1.1, report UWB reference uncertainty in Section 2.4.1, and include FN/FPR/IDSW decomposition in Section 3.3.2. Section 4 (paragraph 2) then consolidates the principal boundaries: single-site scope, marker-related feature dependence, handheld endurance protocol, flower-tracking undercount bias (MOTA = 49.1%, FN-dominant), and compute-rate degradation (9 -> 6 Hz).
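For readers unfamiliar with the metric, the sketch below shows how a MOTA value and an FN/FP/IDSW decomposition are commonly computed under the standard CLEAR MOT definition, MOTA = 1 - (FN + FP + IDSW) / GT. The counts are hypothetical and were chosen only so that the result matches the 49.1% quoted above with false negatives dominating. They are not taken from the manuscript, and the function names are illustrative.

```python
def mota(fn: int, fp: int, idsw: int, gt: int) -> float:
    """CLEAR MOT accuracy: 1 - (FN + FP + IDSW) / GT."""
    if gt <= 0:
        raise ValueError("gt must be a positive number of ground-truth objects")
    return 1.0 - (fn + fp + idsw) / gt


def error_shares(fn: int, fp: int, idsw: int) -> dict:
    """Fraction of the total error contributed by each error type."""
    total = fn + fp + idsw
    if total == 0:
        return {"FN": 0.0, "FP": 0.0, "IDSW": 0.0}
    return {"FN": fn / total, "FP": fp / total, "IDSW": idsw / total}


if __name__ == "__main__":
    # Hypothetical counts, NOT from the manuscript.
    fn, fp, idsw, gt = 400, 80, 29, 1000
    print(f"MOTA = {mota(fn, fp, idsw, gt):.3f}")
    for name, share in error_shares(fn, fp, idsw).items():
        print(f"{name}: {share:.1%} of all errors")
```

With counts like these, most of the error mass sits in missed detections, which is what an "FN-dominant" undercount bias means in practice.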

 

Comment 4: The current navigation setup appears to rely mostly on following pre-defined trajectories/waypoints under a structured operational scenario, rather than supporting more advanced autonomous navigation with local replanning and obstacle avoidance. Given that the experiments are conducted under a comparatively well-prepared setting (with deliberate measures to stabilize perception and communication), it would strengthen the work to more explicitly characterize these assumptions and to evaluate how relaxing them impacts navigation performance. In particular, an analysis exploring conditions that reduce environmental structure (e.g., fewer visual aids, increased occlusions, higher traffic/dynamics, degraded link quality) would help delineate the operational envelope and clarify expected behavior in more challenging scenarios. At minimum, detailing these limitations and outlining a concrete evaluation plan as future work would improve the contribution.

Response 4:

The reviewer is correct that the current system follows pre-defined waypoints without local replanning or obstacle avoidance. This is a deliberate design choice for greenhouse corridor-following, where geometry is constrained and dynamic obstacles are limited during controlled monitoring sessions. The system was not designed for open-field or obstacle-rich scenarios, and this scope is now stated explicitly. In the revised Conclusions (Section 4, paragraphs 1-2), we summarize operational boundaries and limitations along environmental structure, temporal repeatability, flight dynamics, and feature-density dependence. The detailed boundary characterization used to support this summary is:

  • Fixed greenhouse geometry (7 m × 10 m, three parallel racks). The corridor structure constrains the flight path but also simplifies waypoint planning.
  • 27-day sustained operation with daily fresh SLAM initialization (no map persistence). This demonstrates repeatability but does not test seasonal or structural changes to the greenhouse.
  • Low flight speed (0.13 m/s mean, 0.50 m/s ceiling) with waypoint tolerance of 10 cm. The system has not been tested under higher-speed or more aggressive maneuvering.
  • Visual feature density. Marker boards on the racks contributed stable features. A marker-absent flight remained operational with bounded quality degradation (weakly observed landmarks: 2.75% → 28%), but this is a single-trial observation rather than a systematic sensitivity study.
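To put the flight-speed figures in the list above in context, the following sketch estimates how far the UAV travels between two consecutive pose updates and compares that distance with the 10 cm waypoint tolerance. The speeds and the tolerance come from the list, and the 126 ms pose-update interval is the value quoted in Response 1. Constant speed between updates is an assumption of this example, and the function name is hypothetical.

```python
MEAN_SPEED_MPS = 0.13         # mean flight speed from the response
MAX_SPEED_MPS = 0.50          # speed ceiling from the response
WAYPOINT_TOLERANCE_M = 0.10   # waypoint tolerance from the response
POSE_UPDATE_S = 0.126         # pose-update interval quoted in Response 1


def travel_between_updates(speed_mps: float, interval_s: float) -> float:
    """Distance covered during one pose-update interval.

    Assumes constant speed over the interval (an illustrative simplification).
    """
    return speed_mps * interval_s


if __name__ == "__main__":
    for label, speed in [("mean", MEAN_SPEED_MPS), ("ceiling", MAX_SPEED_MPS)]:
        d = travel_between_updates(speed, POSE_UPDATE_S)
        ratio = d / WAYPOINT_TOLERANCE_M
        print(f"{label}: {d * 100:.1f} cm per update "
              f"({ratio:.0%} of the waypoint tolerance)")
```

Under this simplification, the distance flown between pose updates stays below the waypoint tolerance even at the speed ceiling, which is consistent with the statement that the system has not been tested under faster or more aggressive maneuvering.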

As the reviewer suggests, evaluating relaxed conditions is an important next step. Section 4 now outlines future work on deployment across diverse greenhouse geometries/crops, reduced-structure and varied-lighting evaluation, and reactive control integration for obstacle-rich scenarios. We explicitly frame the current study as validation within a bounded operational envelope rather than a claim of broad autonomy generality.

Reviewer 2 Report

Comments and Suggestions for Authors

The paper has described a V-SLAM-based autonomous operation of a UAV for GPS-denied areas. Additionally, plant monitoring has been incorporated in the datapath. Although the presented work is admirable from a system design and integration point of view, it lacks novelty and research rigor.

  • The novelty of the manuscript is limited since all the component algorithms e.g. V-SLAM, EKF etc. are already built into the ROS framework. Inclusion of RPi computer module and a separate NVIDIA Jetson with communication through MAVROS is routinely done by hobbyists and researchers alike. Same is true for flower detection and tracking algorithms i.e. YOLOv8 and BOT-SORT.
  • Two figures have been numbered “1”. This makes every figure referenced in the manuscript confusing.
  • UWB has been used as the main validation source of ground truth. However, the provided figures show that ArUco markers were densely placed throughout the greenhouse at close range. The manuscript only makes a passing reference to these markers on line 411. At such close range, ArUco markers could give much higher localization precision than UWB, which has the additional issue of hot zones. Thus, it should be clarified why UWB has been favored over ArUco when both options are conveniently available within the controlled environment of a greenhouse.
  • The positioning and detection results (Table 2) could be expanded to include other state-of-the-art methods e.g.
    1. Islam, R., Habibullah, H. & Hossain, T. AGRI-SLAM: a real-time stereo visual SLAM for agricultural environment. Auton Robot 47, 649–668 (2023). https://doi.org/10.1007/s10514-023-10110-y
    2. Zhai, Y.; Zhang, L.; Hu, X.; Yang, F.; Huang, Y. A Dynamic Kalman Filtering Method for Multi-Object Fruit Tracking and Counting in Complex Orchards. Sensors 2025, 25, 4138. https://doi.org/10.3390/s25134138

Within the ROS framework, several other algorithms could be compared as well, e.g. ROVIO and LSD-SLAM.

 

Author Response

Comment 1: The paper has described a V-SLAM-based autonomous operation of a UAV for GPS-denied areas. Additionally, plant monitoring has been incorporated in the datapath. Although the presented work is admirable from a system design and integration standpoint, it lacks novelty and research rigor.

Response 1:

We agree that the original submission did not sufficiently distinguish systems contribution from algorithmic novelty, and did not make the evaluation boundaries explicit enough. In the revised manuscript, we therefore reposition the paper as a systems deployment contribution rather than an algorithm-development paper.

Specifically, the manuscript now defines three contribution axes at the end of Section 1 (Lines 104-118): (1) a constraint-driven UAV architecture for greenhouse monitoring under payload, compute, and communication limits, (2) operational characterization from a 27-day deployment in a commercial greenhouse, and (3) a phenotyping proof-of-concept showing trend-level flower-count recovery rather than absolute counting accuracy. We do not claim novelty in OpenVSLAM, YOLOv8, BOT-SORT, or EKF themselves.

We also strengthened the research-rigor aspects of the manuscript. Section 2.4.1 (Lines 394-398) now reports UWB reference uncertainty; Sections 3.3.2-3.3.3 (Lines 602-643) now provide flower-count error decomposition and bound the phenotyping claim to temporal trend recovery; and Section 4 (Lines 665-681) consolidates the principal limitations and operational boundaries of the study. Accordingly, the revised manuscript argues for contribution through validated system deployment and evidence-bounded analysis under real greenhouse conditions, not through algorithmic novelty.


Comment 2:
The novelty of the manuscript is limited since all the component algorithms e.g. V-SLAM, EKF etc. are already built into the ROS framework. Inclusion of RPi computer module and a separate NVIDIA Jetson with communication through MAVROS is routinely done by hobbyists and researchers alike. Same is true for flower detection and tracking algorithms i.e. YOLOv8 and BOT-SORT.

Response 2:

We agree that the individual algorithmic components (V-SLAM, EKF, YOLOv8, BOT-SORT) are established methods. The contribution of this work is not in novel algorithms but in three specific aspects that, to our knowledge, have not been jointly demonstrated:

  • Constraint-driven architecture design. The system operates under stringent compute, payload, and cost constraints (Raspberry Pi CM4 + Jetson Orin Nano, total system cost under $2,000). Each design choice — monocular rather than stereo SLAM, dual-camera decoupling, edge-offloaded navigation, offline phenotyping — was dictated by these constraints. The Introduction now traces each bottleneck to its corresponding design trade-off (Section 1, Lines 104-118).
  • Sustained operational characterization. Prior V-SLAM greenhouse studies demonstrated single-session feasibility (Krul et al., 2021) or mapping without autonomous flight (Sukvichai et al., 2023). Related agricultural SLAM has also been reported on stereo ground platforms rather than small aerial systems (Islam et al., 2023). Our 27-day campaign with daily autonomous missions demonstrates the practical feasibility of the proposed UAV monitoring framework in a real commercial greenhouse setting (Section 1, Lines 95-103).
  • Integrated phenotyping pipeline. The flower counting proof-of-concept demonstrates that the navigation architecture produces imagery consistent enough for temporal trend recovery (Gompertz growth model, R² = 0.75–0.95) despite per-frame tracking limitations (MOTA = 49.1%; Sections 3.3.2-3.3.3, Lines 602-643).
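As background for the trend-recovery claim in the last item, the sketch below evaluates a Gompertz growth curve and computes the coefficient of determination (R²) between observed counts and the curve. The three-parameter form N(t) = a · exp(-b · exp(-c · t)) is one common parameterization and is an assumption here, since the response does not state which form the manuscript uses. The parameter values and daily counts are hypothetical and serve only to make the example runnable.

```python
import math


def gompertz(t: float, a: float, b: float, c: float) -> float:
    """Gompertz curve: a is the asymptote, b the displacement, c the growth rate."""
    return a * math.exp(-b * math.exp(-c * t))


def r_squared(observed: list, predicted: list) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("observed and predicted must be non-empty and equal length")
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    if ss_tot == 0:
        raise ValueError("observed values are constant; R^2 is undefined")
    return 1.0 - ss_res / ss_tot


if __name__ == "__main__":
    # Hypothetical daily flower counts and curve parameters, NOT from the manuscript.
    days = [0, 3, 6, 9, 12, 15, 18, 21, 24, 27]
    counts = [1, 2, 4, 9, 13, 18, 20, 23, 24, 24]
    fitted = [gompertz(d, a=25.0, b=4.0, c=0.2) for d in days]
    print(f"R^2 = {r_squared(counts, fitted):.3f}")
```

A high R² in this setting indicates that the counts follow the shape of the growth curve over time. It does not by itself show that the absolute counts are accurate, which matches the distinction the response draws between trend recovery and counting accuracy.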

 

Comment 3: Two figures have been numbered "1". This makes every figure referenced in the manuscript confusing.

Response 3:

We apologize for this error. The duplicate Figure 1 numbering has been resolved: the greenhouse/UAV overview retains Figure 1, and the system architecture diagram becomes Figure 2. Subsequent figure numbering and in-text callouts were synchronized across the manuscript. Revised in: Section 2.1 figure references and early figure callouts.

 

Comment 4: UWB has been used as the main validation source of ground truth. However, the provided figures show that ArUco markers were densely placed throughout the greenhouse at close range. The manuscript only makes a passing reference to these markers on line 411. At such close range, ArUco markers could give much higher localization precision than UWB, which has the additional issue of hot zones. Thus, it should be clarified why UWB has been favored over ArUco when both options are conveniently available within the controlled environment of a greenhouse.

Response 4:

We thank the reviewer for raising this important methodological point. UWB was selected as the evaluation reference specifically to ensure metrological independence from V-SLAM. Because V-SLAM relies on visual features for localization, using a vision-based reference (such as ArUco marker pose estimation) would create a circular dependency between the measurement being evaluated and its reference. UWB ranging operates on an independent physical modality (radio time-of-flight), ensuring that errors in the visual processing pipeline do not propagate into the reference.

We have clarified the distinct roles of the three positioning-related elements: V-SLAM provides operational navigation; marker boards provide environmental visual features (processed as generic OpenVSLAM features without ArUco-specific detection); and UWB serves as a metrologically independent evaluation reference. Revised in: Section 2.4.1 (Lines 381-385) and Figure 11 context text (Lines 559-561).
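To illustrate what evaluating one position source against an independent reference can look like in code, the sketch below computes a planar position RMSE between time-matched V-SLAM and UWB samples. It assumes both trajectories are already expressed in the same frame, at the same scale, and paired by timestamp. The alignment a monocular system would need beforehand is not shown, and the sample coordinates are hypothetical. This is a generic error metric, not the authors' evaluation procedure.

```python
import math


def position_rmse(estimate: list, reference: list) -> float:
    """Root-mean-square planar distance between paired (x, y) samples in metres."""
    if len(estimate) != len(reference) or not estimate:
        raise ValueError("trajectories must be non-empty and equal length")
    squared = [(ex - rx) ** 2 + (ey - ry) ** 2
               for (ex, ey), (rx, ry) in zip(estimate, reference)]
    return math.sqrt(sum(squared) / len(squared))


if __name__ == "__main__":
    # Hypothetical, already aligned and time-matched samples (metres).
    slam_xy = [(0.00, 0.00), (1.02, 0.05), (2.01, -0.03), (3.05, 0.02)]
    uwb_xy = [(0.00, 0.00), (1.00, 0.00), (2.00, 0.00), (3.00, 0.00)]
    print(f"planar RMSE = {position_rmse(slam_xy, uwb_xy):.3f} m")
```

Because the reference samples come from a different sensing modality, an error in the visual pipeline changes only the first argument, which is the independence property the response emphasizes.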

 

Comment 5: The positioning and detection results (Table 2) could be expanded to include other state-of-the-art methods e.g.

Islam, R., Habibullah, H. & Hossain, T. AGRI-SLAM: a real-time stereo visual SLAM for agricultural environment. Auton Robot 47, 649–668 (2023). https://doi.org/10.1007/s10514-023-10110-y

Zhai, Y.; Zhang, L.; Hu, X.; Yang, F.; Huang, Y. A Dynamic Kalman Filtering Method for Multi-Object Fruit Tracking and Counting in Complex Orchards. Sensors 2025, 25, 4138. https://doi.org/10.3390/s25134138

Response 5:

We thank the reviewer for suggesting these references, both of which we have now cited. To address the SOTA-comparison request transparently, we now frame comparability along three axes: (i) input modality (monocular vs stereo/visual-inertial), (ii) platform dynamics (small UAV vs ground vehicle), and (iii) evaluation protocol/metrics. This framing follows prior cross-method SLAM comparison studies and recent review guidance on fair SLAM benchmarking (Merzlyakov & Macenski, 2021; Al-Tawil et al., 2024).

Islam et al. (2023) developed AGRI-SLAM, a stereo visual SLAM system for agricultural ground vehicles. This work has been incorporated into the revised Introduction as a key prior study in the problem-driven narrative (Section 1, Lines 95-103). A direct numerical comparison against our localization table (Table 3 in the revised manuscript; Table 2 in the original submission) is not straightforward under the above framework because AGRI-SLAM operates on a different platform class and sensing stack than our monocular aerial system.

Zhai et al. (2025) addresses multi-object fruit tracking in orchards using dynamic Kalman filtering. While the tracking methodology and environmental context differ from our greenhouse flower counting (indoor vs. outdoor, BOT-SORT vs. Kalman filtering, muskmelon flowers vs. tree fruits), this work provides useful context for the broader phenotyping tracking literature. We now cite it in the conclusions with an explicit future-work direction for motion-model-based tracking improvement, consistent with this narrative-integration strategy (Section 4, Lines 682-688).

 

Comment 6: Within the ROS framework, several other algorithms could be compared as well, e.g. ROVIO and LSD-SLAM.

Response 6:

We appreciate the reviewer's suggestion. A systematic comparison across SLAM algorithms would indeed provide valuable context. However, the contribution of this work is not a SLAM benchmark but rather the integration, sustained deployment, and characterization of a complete greenhouse monitoring pipeline. The 27-day campaign with daily autonomous missions represents the operational validation dimension that distinguishes this work from single-session algorithmic comparisons.

That said, we acknowledge the merit of the suggested algorithms. ROVIO incorporates IMU measurements for visual-inertial odometry correction (Bloesch et al., 2015), which can improve robustness under rapid orientation changes. LSD-SLAM uses direct photometric consistency instead of sparse feature matching (Engel et al., 2014), offering potential advantages in feature-sparse greenhouse regions. More broadly, visual-inertial methods require synchronized IMU-camera calibration that adds integration complexity under our payload budget, while direct methods impose higher computational demand on the edge-computing platform. Using the same comparability criteria defined in Response 5 (sensor modality, platform dynamics, and evaluation protocol), we treat these as relevant alternatives and future evaluation targets rather than claiming direct numeric comparability to our current monocular-UAV localization results table. A comparative evaluation under greenhouse conditions remains a valuable future direction.

References:

Krul, S.; Pantos, C.; Frangulea, M.; Valente, J. Visual SLAM for Indoor Livestock and Farming Using a Small Drone with a Monocular Camera: A Feasibility Study. Drones 2021, 5(2), 41. https://doi.org/10.3390/drones5020041

Sukvichai, K.; Thongton, N.; Yajai, K. Implementation of a Monocular ORB SLAM for an Indoor Agricultural Drone. In 2023 3rd ICA-SYMP; pp. 45-48. https://doi.org/10.1109/ICA-SYMP56348.2023.10044953

Merzlyakov, A.; Macenski, S. A Comparison of Modern General-Purpose Visual SLAM Approaches. In IROS 2021; pp. 6459-6466. https://doi.org/10.1109/IROS51168.2021.9636615

Al-Tawil, B.; Hempel, T.; Abdelrahman, A.; Al-Hamadi, A. A review of visual SLAM for robotics: evolution, properties, and future applications. Frontiers in Robotics and AI 2024, 11, 1347985. https://doi.org/10.3389/frobt.2024.1347985

Bloesch, M.; Omari, S.; Hutter, M.; Siegwart, R. Robust Visual Inertial Odometry Using a Direct EKF-Based Approach. In IROS 2015; pp. 298-304. https://doi.org/10.1109/IROS.2015.7353389

Engel, J.; Schoeps, T.; Cremers, D. LSD-SLAM: Large-Scale Direct Monocular SLAM. In ECCV 2014; pp. 834-849. https://doi.org/10.1007/978-3-319-10605-2_54

Reviewer 3 Report

Comments and Suggestions for Authors

The manuscript presented a well-engineered UAV system for greenhouse monitoring. However, its contribution is primarily application-level integration of existing methods rather than a novel research advance. The navigation relied on OpenVSLAM without algorithmic modification, the perception pipeline was based on standard YOLOv8 and BOT-SORT, and the UAV followed predefined paths in a highly structured environment. No new planning, control, SLAM, or learning methods are proposed, nor is there a comparative analysis demonstrating advantages over existing approaches. As such, the work reads as an engineering deployment report rather than a research article, and the novelty relative to prior UAV-based indoor navigation and phenotyping systems is unclear.

Author Response

Comment 1: The manuscript presented a well-engineered UAV system for greenhouse monitoring. However, its contribution is primarily application-level integration of existing methods rather than a novel research advance. The navigation relied on OpenVSLAM without algorithmic modification, the perception pipeline was based on standard YOLOv8 and BOT-SORT, and the UAV followed predefined paths in a highly structured environment. No new planning, control, SLAM, or learning methods are proposed, nor is there a comparative analysis demonstrating advantages over existing approaches. As such, the work reads as an engineering deployment report rather than a research article, and the novelty relative to prior UAV-based indoor navigation and phenotyping systems is unclear.

Response 1:

We thank the reviewer for this direct assessment and agree that the original submission did not clearly distinguish systems contribution from algorithmic novelty. In the revised manuscript, we therefore do not position this work as a new SLAM, planning, control, or learning method. Instead, we position it as a systems deployment contribution supported by field validation under real greenhouse constraints.

Specifically, Section 1 (Lines 95-118) now defines three contribution axes: (1) a constraint-driven modular UAV architecture under payload, compute, and communication limits, (2) operational characterization from a 27-day deployment in a commercial greenhouse, and (3) a phenotyping proof-of-concept whose claim is explicitly bounded to temporal trend recovery rather than absolute counting accuracy. We do not claim algorithmic novelty in OpenVSLAM, YOLOv8, or BOT-SORT, nor do we claim superiority over alternative methods.

We also clarified the manuscript's evidence boundary rather than recasting it as an algorithm-comparison study. Sections 3.3.2-3.3.3 (Lines 602-643) now state the tracking limitations and trend-level interpretation explicitly, and Section 4 (Lines 650-688) consolidates the operational scope, limitations, and future comparative directions. Accordingly, the revised manuscript argues for research contribution through validated deployment evidence and bounded analysis in a real greenhouse setting, rather than through algorithmic innovation.

References:

Krul, S.; Pantos, C.; Frangulea, M.; Valente, J. Visual SLAM for Indoor Livestock and Farming Using a Small Drone with a Monocular Camera: A Feasibility Study. Drones 2021, 5(2), 41. https://doi.org/10.3390/drones5020041

Sukvichai, K.; Thongton, N.; Yajai, K. Implementation of a Monocular ORB SLAM for an Indoor Agricultural Drone. In 2023 3rd ICA-SYMP; pp. 45-48. https://doi.org/10.1109/ICA-SYMP56348.2023.10044953

Islam, R.; Habibullah, H.; Hossain, M. T. AGRI-SLAM: a real-time stereo visual SLAM for agricultural environment. Autonomous Robots 2023, 47, 649-668. https://doi.org/10.1007/s10514-023-10110-y

Reviewer 4 Report

Comments and Suggestions for Authors

The paper proposes a distributed navigation framework for UAVs with significant payload constraints, facilitating autonomous monitoring in greenhouses where GPS signals are obstructed. A lightweight onboard controller manages flight control and sensor acquisition, while deep learning perception is offloaded to an edge computer via a first-person view (FPV) video link using dual-link architecture. Experimental results demonstrate the framework's applicability in real-world environments through long-term missions conducted in a commercial greenhouse.


  1. The manuscript extensively discusses IoT sensors, external positioning, and other sensor types. However, it lacks a clear identification of the primary bottlenecks in autonomous greenhouse UAV operation. The introduction should explicitly state how the proposed dual-link architecture and edge offloading address these critical challenges. Furthermore, the study's focus remains ambiguous, as it is unclear whether it targets the implementation of a greenhouse UAV system or the separation of communication and computation functions.
  2. Monocular simultaneous localization and mapping (SLAM) was chosen as a practical compromise to address visual domain shifts caused by lighting variations and plant growth in greenhouses; however, its inherent limitations persist. The introduction does not adequately explain how the proposed design mitigates these weaknesses. Moreover, the justification for emphasizing operations without external infrastructure is unclear, particularly given the use of infrastructure-based verification methods.

 

  1. The architectural description includes redundant mentions of online processing, offline analysis, and offloading within a single paragraph. This repetition impedes understanding of data flow and execution locations without consulting the system block diagram.
  2. The communication delay budget is presented only at a strategic level, without connecting it to control-loop stability. While values such as 22 ms for a 5.8 GHz link, 126 ms for pose updates, and a 10 Hz external loop frequency are provided, the acceptable end-to-end jitter within the closed loop is not specified. The manuscript should clarify how system stability is maintained under these timing constraints.
  3. A potential conflict exists between the claim of infrastructure independence and experimental design. The paper initially asserts that SLAM provides the sole global positioning during flight yet subsequently describes ultra-wideband (UWB) as the exclusive global positioning method. Moreover, installing a marker board to enhance SLAM performance constitutes adding environmental infrastructure, thereby contradicting the stated independence.
  4. While the paper specifies parameters such as YOLO learning settings and BOT-SORT thresholds, it omits essential details required for reproducibility. In particular, conditions ensuring safe system rotation—including coordinate system transformations, time synchronization, and fallback procedures when SLAM pose data is supplied to PX4—are not described.
  5. The approach of removing numerous samples using DBSCAN, gating, smoothing, Kalman filtering, and hot-zone flagging complicates the separation and interpretation of SLAM and UWB errors. If UWB is treated as an uncertain reference rather than ground truth, the manuscript should provide (i) comparisons before and after filtering, (ii) justification for threshold selection, and (iii) analysis of the impact of excluded data segments on the results.
  6. The authors attribute the increased RMSE in the greenhouse to lighting conditions or reflections; however, they provide no observational evidence, such as identifying affected sections, changes in reprojection errors, or tracking failure rates. Furthermore, the claim that improved rotation accuracy results from the trajectory lacks a quantitative comparison with trajectory characteristics.
  7. The experiment involved five handheld loops rather than in-flight tests. It remains unclear whether intentionally disabling the repositioning module prevents relocalization during operation. Before concluding that the system adequately supports autonomous navigation, experimental results should verify parameter stability under flight conditions that include vibration, motion blur, and altitude variations.
  8. The paper attributes the drop from 9 Hz to 6 Hz to heat buildup and memory issues; however, no supporting evidence is provided, such as temperature logs, memory usage, CPU/GPU utilization, or throttling events. If the "8-minute limit without flight" is also a consequence, the manuscript should clarify whether SLAM tracking constitutes the bottleneck and quantify improvements achieved by adjusting resolution, frame rate, or other parameters.
  9. In the experiment, flower detection demonstrates low mean average precision, and tracking exhibits a significant undercounting bias, with a MOTA of 49%. Treating this bias as corrected by regularization appears arbitrary. A detailed analysis is required to determine whether the undercount results from missed detections, identity switches, or merging errors. Additionally, a correction model should be developed to address heterogeneity caused by increased flower counts, incorporating cross-validation and temporal variation as covariates.
Comments on the Quality of English Language

Although overall comprehension is adequate, several sentences are lengthy and complex, hindering the clear communication of the core argument and contribution. Specifically, the sections addressing the method's prerequisites, model limitations, and result interpretation would benefit from more concise sentence structures to improve technical clarity and persuasiveness.

Author Response

Comment 1:  The manuscript extensively discusses IoT sensors, external positioning, and other sensor types. However, it lacks a clear identification of the primary bottlenecks in autonomous greenhouse UAV operation. The introduction should explicitly state how proposed dual-link architecture and edge offloading address these critical challenges. Furthermore, the study's focus remains ambiguous, as it is unclear whether it targets the implementation of a greenhouse UAV system or the separation of communication and computation functions.

Response 1:

We agree that the original Introduction discussed sensor technologies broadly without identifying the specific bottlenecks that motivated our design. The revised Introduction (Section 1, Lines 64-118) now explicitly identifies four operational bottlenecks: (1) GPS-denied localization in vegetated environments, (2) compute and payload constraints on small aerial platforms, (3) temporal scene variation from plant growth, and (4) the need for perception outputs reliable enough for agronomic decisions. Each subsequent design choice (monocular V-SLAM, dual-camera decoupling, edge offloading, offline phenotyping) is traced to the bottleneck it addresses.

Regarding the study's focus: the paper targets the implementation and validation of a complete greenhouse UAV monitoring system. The dual-link architecture and edge offloading are design responses to the compute/payload bottleneck, not independent research objectives. The revised Introduction now makes this relationship clear in Section 1: the communication and computation separation exists because the platform cannot run V-SLAM and phenotyping onboard simultaneously, not as a standalone contribution.

 

Comment 2: Monocular simultaneous localization and mapping (SLAM) was chosen as a practical compromise to address visual domain shifts caused by lighting variations and plant growth in greenhouses; however, its inherent limitations persist. The introduction does not adequately explain how the proposed design mitigates these weaknesses. Moreover, the justification for emphasizing operations without external infrastructure is unclear, particularly given the use of infrastructure-based verification methods.

Response 2:

We agree that the original Introduction did not explain sufficiently how the design responds to monocular SLAM limitations. The revised Introduction now addresses this at two levels.

First, selection rationale: under the payload and cost limits of this platform, monocular SLAM was selected as the practical trade-off relative to stereo or LiDAR alternatives (Campos et al., 2021; Al-Tawil et al., 2024).

Second, mitigation at system level: the forward camera is dedicated to V-SLAM, V-SLAM is edge-offloaded to preserve runtime headroom, and missions are reinitialized daily to limit cross-day drift accumulation. The tested low flight speed (mean 0.13 m/s) also limits motion blur and supports feature re-observation. We explicitly state that these measures mitigate operational impact but do not remove inherent monocular-SLAM sensitivity.

For infrastructure wording, we now separate operation from evaluation: during flight, global navigation state is provided by V-SLAM; UWB is used only as an independent evaluation reference; and marker boards are disclosed as a present environmental condition in this campaign (Section 1, Lines 83-103; Section 2.4.1, Lines 382-385; and Section 4, Lines 665-688).

References:

Campos, C.; Elvira, R.; Gomez Rodriguez, J. J.; Montiel, J. M. M.; Tardos, J. D. ORB-SLAM3: An Accurate Open-Source Library for Visual, Visual-Inertial, and Multi-Map SLAM. IEEE Transactions on Robotics 2021, 37(6), 1874-1890. https://doi.org/10.1109/TRO.2021.3075644

Al-Tawil, B.; Hempel, T.; Abdelrahman, A.; Al-Hamadi, A. A review of visual SLAM for robotics: evolution, properties, and future applications. Frontiers in Robotics and AI 2024, 11, 1347985. https://doi.org/10.3389/frobt.2024.1347985

 

Comment 3: The architectural description includes redundant mentions of online processing, offline analysis, and offloading within a single paragraph. This repetition impedes understanding of data flow and execution locations without consulting the system block diagram.

Response 3:

Section 2.1 has been restructured to present the data flow clearly through a module-to-hardware mapping table (Table 1, beginning at Line 181), replacing the redundant prose description. The revised text follows a single pass: sensor input -> onboard processing (what runs on RPi CM4) -> edge processing (what runs on Jetson Orin Nano) -> offline analysis (post-mission). Revised in: Section 2.1 (Lines 151-192).

 

Comment 4: The communication delay budget is presented only at a strategic level, without connecting it to control-loop stability. While values such as 22 ms for a 5.8 GHz link, 126 ms for pose updates, and a 10 Hz external loop frequency are provided, the acceptable end-to-end jitter within the closed loop is not specified. The manuscript should clarify how system stability is maintained under these timing constraints.

Response 4:

We agree that the delay budget should be connected to stability requirements. In the revised manuscript, Section 2.1 now includes a module-to-hardware mapping table (Table 1) that explicitly identifies the loop ownership of each module: inner-loop stabilization on PX4, outer-loop mission-execution PID on RPi CM4, and global localization on Jetson Orin Nano. Figure 2 annotates communication channels with nominal timing labels. Together these clarify the control-loop hierarchy and where each timing value applies.

Regarding acceptable jitter and stability margin: the outer-loop PID controller runs at 10 Hz, receiving V-SLAM pose updates at approximately 7.9 Hz (126 ms interval). Between visual updates, PX4 continues high-rate inertial state propagation, so the control chain does not stall. At the mean flight speed of 0.13 m/s, the per-update displacement is 0.13 m/s × 0.126 s ≈ 1.6 cm; at the monitoring-speed ceiling of 0.50 m/s, it rises to 6.3 cm. Both values are below the 10 cm waypoint tolerance. We did not estimate a separate closed-loop jitter distribution in this study; we therefore report this displacement-margin check, together with the stable flight outcomes in Section 3.2, as the available stability evidence.
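The displacement-margin check above is simple arithmetic and can be reproduced with a short sketch. The function and constant names below are illustrative and do not come from the manuscript; the numeric values are the ones quoted in this response.

```python
def per_update_displacement_cm(speed_m_s: float, update_interval_s: float) -> float:
    """Distance travelled between two consecutive pose updates, in centimetres."""
    return speed_m_s * update_interval_s * 100.0


WAYPOINT_TOLERANCE_CM = 10.0  # waypoint tolerance quoted in the response
POSE_INTERVAL_S = 0.126       # approx. 7.9 Hz V-SLAM pose updates

for label, speed in (("mean flight speed", 0.13), ("monitoring-speed ceiling", 0.50)):
    d = per_update_displacement_cm(speed, POSE_INTERVAL_S)
    print(f"{label}: {d:.1f} cm, within tolerance: {d < WAYPOINT_TOLERANCE_CM}")
```

This only bounds how far the platform moves between pose updates. It is not a closed-loop stability analysis, which, as stated above, was not performed.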

Comment 5: A potential conflict exists between the claim of infrastructure independence and experimental design. The paper initially asserts that SLAM provides the sole global positioning during flight yet subsequently describes ultra-wideband (UWB) as the exclusive global positioning method. Moreover, installing a marker board to enhance SLAM performance constitutes adding environmental infrastructure, thereby contradicting the stated independence.

Response 5:

The reviewer identifies a genuine conflict in the original manuscript, and we agree it needed to be resolved.

First, the SLAM vs. UWB conflict. The original text was misleading in calling UWB "the exclusive global positioning method." In practice, V-SLAM is the sole positioning source during flight. UWB was deployed exclusively as a metrologically independent evaluation reference and plays no role in the operational system. Section 2.4.1 now states this distinction explicitly (Lines 381-385).

Second, the infrastructure contradiction. We removed infrastructure-independence claims about this deployment and now state the marker context explicitly. Marker boards were present on greenhouse racks from a prior ArUco-based study in the same site (Wang & Lin, 2023) and remained during this campaign. OpenVSLAM processed these patterns as generic visual features without ArUco-specific detection. The manuscript now treats marker presence as a disclosed experimental condition and a limitation on transferability, not as evidence of infrastructure independence (Section 1, Lines 83-103; Section 2.4.1, Lines 417-427; and Section 4, Lines 665-688). Any generic "infrastructure-free alternatives" wording in Section 1 is literature taxonomy rather than a claim about this campaign.

References:

Wang, J.-Y.; Lin, T.-T. Application of a Visual-Based Autonomous Drone System for Greenhouse Muskmelon Phenotyping. In Proceedings of the 2023 ASABE Annual International Meeting; ASABE: St. Joseph, MI, USA, 2023; Paper No. 2300294. https://doi.org/10.13031/aim.202300294

 

Comment 6: While the paper specifies parameters such as YOLO learning settings and BOT-SORT thresholds, it omits essential details required for reproducibility. In particular, conditions ensuring safe system rotation, including coordinate system transformations, time synchronization, and fallback procedures when SLAM pose data is supplied to PX4, are not described.

Response 6:

We agree and added these reproducibility elements in Section 2.2.2 (Lines 252-259) using a compact manuscript form: (1) coordinate-frame convention - V-SLAM pose estimates are computed in the camera optical frame and published in the ROS 2 ENU convention; MAVROS converts these to the PX4 NED body-frame convention before forwarding to the flight controller, (2) companion-computer/PX4 clock alignment via MAVLink TIMESYNC, and (3) fallback behavior when SLAM pose is unavailable (optical-flow position hold followed by autonomous landing after source timeout). For implementation clarity, source availability is checked every 0.5 s; unresolved source loss beyond 2 s triggers autonomous landing. This keeps the method reproducible in the main text while avoiding excessive implementation detail.
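The fallback behavior in item (3) can be pictured as a small watchdog. The sketch below is illustrative only and is not the flight code: the class and mode names are hypothetical, and treating a pose as stale once it is older than one check period is an assumption. Only the 0.5 s check interval and the 2 s landing timeout come from this response.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    SLAM_NAV = "slam_nav"            # normal operation on V-SLAM pose
    FLOW_HOLD = "optical_flow_hold"  # position hold while pose is missing
    LAND = "autonomous_land"         # terminal fallback


@dataclass
class PoseWatchdog:
    check_period_s: float = 0.5  # availability check interval (from the response)
    timeout_s: float = 2.0       # unresolved loss beyond this triggers landing
    last_pose_t: float = 0.0
    mode: Mode = Mode.SLAM_NAV

    def on_pose(self, t: float) -> None:
        """Record the arrival time of a V-SLAM pose message."""
        self.last_pose_t = t

    def check(self, now: float) -> Mode:
        """Run one periodic availability check and return the resulting mode."""
        if self.mode is Mode.LAND:
            return self.mode  # landing is latched once triggered
        age = now - self.last_pose_t
        if age > self.timeout_s:
            self.mode = Mode.LAND
        elif age > self.check_period_s:  # assumed staleness threshold
            self.mode = Mode.FLOW_HOLD
        else:
            self.mode = Mode.SLAM_NAV
        return self.mode


wd = PoseWatchdog()
wd.on_pose(0.0)
for now in (0.5, 1.0, 1.5, 2.0, 2.5):
    print(now, wd.check(now).value)
```

In this sketch the landing state is latched, so a pose that arrives after the timeout does not resume navigation. Whether the deployed system behaves this way is not stated in the response.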

 

Comment 7: The approach of removing numerous samples using DBSCAN, gating, smoothing, Kalman filtering, and hot-zone flagging complicates the separation and interpretation of SLAM and UWB errors. If UWB is treated as an uncertain reference rather than ground truth, the manuscript should provide (i) comparisons before and after filtering, (ii) justification for threshold selection, and (iii) analysis of the impact of excluded data segments on the results.

Response 7:

We appreciate the reviewer highlighting the need for filtering transparency. We have clarified our methodology with the following updates in Section 2.4.1 (Lines 399-416) and Figure 11 discussion (Lines 553-561). All trajectories use the same fixed filtering parameters (no per-run tuning).

  • Before-and-after comparison: We introduced Figure 11 to overlay raw UWB readings against the filtered reference trajectory. The figure demonstrates that our filtering preserves the raw spatial geometry. Discarded samples exhibit radiating scatter patterns typical of multipath reflections from metal racks. The central trajectory shape is unaffected.
  • Threshold selection: Standard UWB accuracy is roughly 10 cm under dynamic conditions. In this greenhouse, metal racks and foliage increase multipath and NLOS effects, so raw UWB uncertainty can exceed the SLAM error scale, and using raw, unfiltered UWB directly would confound the reference comparison. Our fixed signal-conditioning pipeline reduces UWB reference uncertainty to MAD = 0.26-1.25 cm and p95 = 1.39-8.01 cm before SLAM-UWB error analysis. The low retention rates (21-46%) therefore reflect the difficulty of the measurement environment rather than a failure of the UWB hardware.
  • Impact of excluded data: Figure 11 shows discarded points clustering near rack boundaries and corridor transitions, consistent with multipath-prone regions. The filtered trajectory remains geometrically consistent with the central raw path in those runs. We therefore report retention ratios, hot-zone proportion, and post-filter uncertainty explicitly, and do not treat filtered UWB as error-free.
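The summary statistics named above (retention ratio, MAD, p95) can be computed as in the following sketch. It is not the authors' pipeline: the function names are hypothetical, the residuals are assumed to be distances in centimetres between retained UWB samples and the filtered reference path, and the percentile uses linear interpolation, which may differ from the convention used in the manuscript. The sample values are synthetic.

```python
import math
from statistics import median


def mad(values):
    """Median absolute deviation around the median."""
    m = median(values)
    return median([abs(v - m) for v in values])


def percentile(values, q):
    """q-th percentile (0-100) with linear interpolation between ranks."""
    s = sorted(values)
    pos = (len(s) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


def reference_summary(raw_count, kept_residuals_cm):
    """Retention ratio and post-filter spread of the UWB reference."""
    return {
        "retention": len(kept_residuals_cm) / raw_count,
        "mad_cm": mad(kept_residuals_cm),
        "p95_cm": percentile(kept_residuals_cm, 95),
    }


# Synthetic example: 20 raw samples, 5 retained after filtering.
print(reference_summary(20, [1.0, 2.0, 3.0, 4.0, 5.0]))
```

Reporting these three numbers together makes the trade-off visible: a low retention ratio combined with a small post-filter spread describes a difficult environment and a tight reference, without implying that the reference is error-free.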

These modifications are in Section 2.4.1 (Lines 399-427) and Figure 11 (Lines 553-561).

 

Comment 8: The authors attribute the increased RMSE in the greenhouse to lighting conditions or reflections; however, they provide no observational evidence, such as identifying affected sections, changes in reprojection errors, or tracking failure rates. Furthermore, the claim that improved rotation accuracy results from the trajectory lacks a quantitative comparison with trajectory characteristics.

Response 8:

We agree that the original causal language was too strong for the available evidence. The original manuscript stated that the greenhouse RMSE increase "was primarily due to" variable lighting and sunlight reflections, but no per-section breakdown, reprojection error data, or tracking failure logs were collected to support this claim.

The revised Section 3.1.1 (Lines 480-483) softens the attribution from "was primarily due to" to "is consistent with," citing Al-Tawil et al. (2024) to frame lighting and reflective surfaces as known degradation factors in monocular SLAM rather than a confirmed cause in this dataset. We did not collect reprojection errors or per-frame tracking diagnostics during the campaign, so we cannot provide the direct observational breakdown requested.

Regarding trajectory effects, Section 3.2.2 (Lines 562-591) now reports the statistical comparison explicitly and interprets it with bounded language rather than as confirmed causality: rotation-dominant segments show higher median error than straight segments (0.044 m vs 0.033 m; Mann-Whitney p = 0.022; Cliff's delta = 0.243), and the loop trajectory has the highest RMSE (8.0 cm) while zigzag has the lowest (5.4 cm).
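For readers unfamiliar with the effect-size measure used here, Cliff's delta and its relation to the Mann-Whitney U statistic can be sketched as follows. The code is a generic illustration on synthetic values, not the analysis script. It does not compute the p-value, which requires the sampling distribution of U.

```python
def cliffs_delta(x, y):
    """Cliff's delta: P(x > y) - P(x < y) over all cross-group pairs."""
    greater = sum(1 for a in x for b in y if a > b)
    less = sum(1 for a in x for b in y if a < b)
    return (greater - less) / (len(x) * len(y))


def mann_whitney_u(x, y):
    """U statistic for group x, with ties counted as one half."""
    return sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in x for b in y)


# Synthetic per-segment errors in metres (illustrative values only).
rotation = [0.05, 0.04, 0.06]
straight = [0.03, 0.04, 0.02]

delta = cliffs_delta(rotation, straight)
u = mann_whitney_u(rotation, straight)
# The two quantities are linked by delta = 2U / (n * m) - 1.
print(delta, 2 * u / (len(rotation) * len(straight)) - 1)
```

A delta of 0 means the two groups overlap completely and a delta of 1 means every value in the first group exceeds every value in the second, so the reported 0.243 indicates a modest shift toward higher error in rotation-dominant segments.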

 

Comment 9: The experiment involved five handheld loops rather than in-flight tests. It remains unclear whether intentionally disabling the repositioning module prevents relocalization during operation. Before concluding that the system adequately supports autonomous navigation, experimental results should verify parameter stability under flight conditions that include vibration, motion blur, and altitude variations.

Response 9:

The reviewer raises three valid points. First, the handheld endurance test (Section 3.1.2) was designed to isolate V-SLAM temporal stability under continuous operation, not to replicate flight conditions. The revised manuscript now states this scope explicitly there.

Second, regarding the disabled relocalization module: it was disabled in both the endurance test (Section 3.1.2, Lines 505-508) and the autonomous flight experiments (Section 3.2.2, Lines 545-548). This is deliberate: with relocalization disabled, the system uses only the features established during the initial map-building phase, avoiding contamination from features that may have shifted due to plant growth or human activity between mapping and operation. Section 3.2.2 now states this configuration explicitly. The practical impact is bounded: SLAM pose-update interruptions (gaps > 300 ms) remained below 0.45% of flight time across all trajectories.
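One way to compute an interruption share like the one quoted above is sketched below. The definition is an assumption, since the response does not spell it out: the summed duration of inter-pose gaps longer than 300 ms, divided by the total duration of the pose stream. The function name and the timestamps are hypothetical.

```python
def interruption_fraction(pose_times_s, gap_threshold_s=0.3):
    """Share of the pose-stream duration spent in gaps longer than the threshold."""
    if len(pose_times_s) < 2:
        raise ValueError("need at least two pose timestamps")
    total = pose_times_s[-1] - pose_times_s[0]
    gaps = (b - a for a, b in zip(pose_times_s, pose_times_s[1:]))
    return sum(g for g in gaps if g > gap_threshold_s) / total


# Synthetic timestamps in seconds: one 0.5 s gap within a 1.0 s stream.
print(interruption_fraction([0.0, 0.125, 0.25, 0.75, 1.0]))
```

An alternative definition would count the number of long gaps rather than their duration. The two can differ noticeably when a few long dropouts dominate, so the chosen definition should be stated alongside the figure.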

Third, regarding flight-condition verification: the autonomous flight experiments in Section 3.2 (Lines 543-545) provide the in-flight validation across three trajectory patterns at mean speeds of 0.128-0.136 m/s. The flight-condition RMSE (5.4-8.0 cm) is comparable to the handheld RMSE (6.4 cm), indicating that vibration and motion dynamics at the tested low speeds did not substantially degrade positioning. A dedicated study isolating vibration and motion blur effects individually was not performed, and we note this as a limitation.

 

Comment 10: The paper attributes the drop from 9 Hz to 6 Hz to heat buildup and memory issues; however, no supporting evidence is provided, such as temperature logs, memory usage, CPU/GPU utilization, or throttling events. If the "8-minute limit without flight" is also a consequence, the manuscript should clarify whether SLAM tracking constitutes the bottleneck and quantify improvements achieved by adjusting resolution, frame rate, or other parameters.

Response 10:

We agree that the original causal attribution (heat buildup, memory issues) was speculative and unsupported. The revised manuscript removes this speculation entirely in Section 3.1.2.

Regarding evidence: no CPU/GPU thermal telemetry, memory usage logs, or throttling event records were collected during the experiment. We acknowledge this instrumentation gap transparently and identify it as a concrete improvement for future deployments.

Regarding the 8-minute limit: the maximum continuous operation time of 8 minutes and the tracking rate drop from 9 Hz to 6 Hz were observed together during the same endurance test, but we cannot confirm whether they share a common cause without the instrumentation data described above. The revised manuscript no longer attributes either observation to a specific mechanism.

Regarding bottleneck attribution and tuning: we did not collect per-process profiling or run resolution/frame-rate ablation; therefore, we do not claim a component-level mechanism in this revision. We only report observed behavior under fixed settings (720p, 15 fps) and bound impact operationally in Section 3.1.2 (Lines 493-508): at 6 Hz and 0.3 m/s, per-update displacement is approximately 5 cm, within the 10 cm waypoint tolerance used in evaluation.

 

Comment 11: In the experiment, flower detection demonstrates low mean average precision, and tracking exhibits a significant undercounting bias, with a MOTA of 49%. Treating this bias as corrected by regularization appears arbitrary. A detailed analysis is required to determine whether the undercount results from missed detections, identity switches, or merging errors. Additionally, a correction model should be developed to address heterogeneity caused by increased flower counts, incorporating cross-validation and temporal variation as covariates.

Response 11:

We agree that error decomposition is essential for understanding the MOTA result. The complete breakdown for both tracking levels is:

- Plant Tracking; MOTA: 72.4%; FNR: 17.96%; FPR: 8.6%; IDSW rate: 1.0%

- Flower Tracking; MOTA: 49.1%; FNR: 32.2%; FPR: 10.0%; IDSW rate: 8.7%

The decomposition reveals a FN-dominant failure mode for flower tracking: missed detections (FNR = 32.2%) account for the majority of error, followed by false positives (10.0%) and identity switches (8.7%). The MOTA drop from plant to flower level is primarily driven by increased FNR (+14.2 percentage points) and IDSW (+7.7 pp), consistent with the smaller size and higher visual similarity of individual flowers compared to whole plants. In this evaluator output, merge/split events are not reported as a separate term and are reflected in FN and IDSW components.
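The reported rates are consistent with the standard MOTA definition, MOTA = 1 - (FN + FP + IDSW) / GT, under the assumption that each rate is normalized by the number of ground-truth objects. A minimal check, with an illustrative function name:

```python
def mota(fn_rate: float, fp_rate: float, idsw_rate: float) -> float:
    """MOTA from error rates that are each normalized by the ground-truth count."""
    return 1.0 - (fn_rate + fp_rate + idsw_rate)


plant = mota(fn_rate=0.1796, fp_rate=0.086, idsw_rate=0.010)
flower = mota(fn_rate=0.322, fp_rate=0.100, idsw_rate=0.087)
print(f"plant MOTA:  {plant * 100:.1f}%")   # 72.4%
print(f"flower MOTA: {flower * 100:.1f}%")  # 49.1%
```

Because the three terms simply add, the 32.2% miss rate alone accounts for roughly two thirds of the flower-level error budget.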

This FN-dominant pattern directly explains the systematic undercounting bias (mean = −1.63 flowers/plant): the system misses flowers more frequently than it generates false detections (FNR >> FPR), producing a conservative net undercount. Per-frame absolute counting accuracy therefore remains limited, and this limitation is explicitly stated in the revised manuscript (Section 3.3.2, Lines 602-624).

Regarding the correction model: the reviewer raises a valid point about modeling heterogeneity as counts increase. We did not introduce a post-hoc correction model in this revision because the available dataset size (three rows x five plants over 27 days) is limited for robust cross-validated correction with temporal covariates. Accordingly, we bound our claim to temporal trend recovery, not absolute count correction. The manuscript now states this boundary explicitly and lists bias-aware correction modeling as future work in Section 3.3.3 (Lines 625-632).

Comment 12:Although overall comprehension is adequate, several sentences are lengthy and complex, hindering the clear communication of the core argument and contribution. Specifically, the sections addressing the method's prerequisites, model limitations, and result interpretation would benefit from more concise sentence structures to improve technical clarity and persuasiveness.

Response 12:

We agree that several sentences in the original manuscript were unnecessarily complex. We therefore completed a targeted simplification pass across Sections 2 and 3 to make prerequisites, limitations, and result interpretation more concise and easier to verify.

Two representative examples: in Section 2.1 (Lines 173-181), the communication-role description was separated into shorter function-based statements and accompanied by explicit module mapping; in Section 3.1.2 (Lines 498-508), the 9 Hz -> 6 Hz degradation is now reported as observed behavior with bounded operational impact rather than speculative thermal or memory causality.

Across these revisions, we followed a consistent writing rule: one primary concept per sentence, metrics before interpretation, and explicit separation of observation from inference.

Reviewer 5 Report

Comments and Suggestions for Authors

Dear Authors,

I find your work well written and of interest not only to specialists in agricultural data processing, but also to end users such as agronomists and producers.

However, I believe that several major revisions are necessary to enhance the overall value of the manuscript:

  • It is important that you include a concise analysis of the practical implications of your experimental results and provide clearer guidance regarding the feasibility of implementing the proposed monitoring method in real greenhouse operations. Further details are provided in the attached document, which contains specific remarks on the main text.
  • Similarly, potential adopters of your method would likely appreciate an approximate estimate of the proposed hardware setup costs, particularly in comparison with well-known, affordable commercial solutions, both drones and related hardware, currently available on the agricultural market.
  • Regarding the training of the AI models: fixing the number of epochs a priori is not advisable. Instead, I would expect you to use the validation mAP (e.g., mAP@0.5 or mAP@0.5:0.95) as the stopping criterion, which is the appropriate approach to optimize training while avoiding underfitting and overfitting.
  • The bibliography should be partially revised. Several references are outdated and do not adequately reflect the current state of the art, particularly in rapidly evolving fields such as UAV systems and AI models.

As mentioned above, you will find detailed comments on the manuscript in the attached document.

Comments for author File: Comments.pdf

Author Response

Comment 1: It is important that you include a concise analysis of the practical implications of your experimental results and provide clearer guidance regarding the feasibility of implementing the proposed monitoring method in real greenhouse operations. Further details are provided in the attached document, which contains specific remarks on the main text.

Response 1:

We appreciate this suggestion. The revised manuscript now includes practical deployment guidance at system-operation level: platform cost and COTS component strategy (Section 2.1), mission preparation and runtime envelope (Section 2.4.2), off-board source-loss fallback behavior (Section 2.2.2), and operational boundaries/limitations from field deployment (Section 4).

Specifically, the manuscript now reports: core platform cost of approximately $1,800 USD (Section 2.1), daily mission preparation of approximately 30-60 minutes and flight duration of approximately 4-6 minutes per 2,200 mAh battery (Section 2.4.2), and safety fallback logic when V-SLAM updates are unavailable (optical-flow hold followed by autonomous landing after timeout; Section 2.2.2). To keep claims evidence-bound, we frame feasibility within tested conditions (single greenhouse, marker-present environment, low-speed operation) and avoid claiming broad transferability beyond the demonstrated deployment envelope (Section 4).

 

Comment 2: Similarly, potential adopters of your method would likely appreciate an approximate estimate of the proposed hardware setup costs, particularly in comparison with well-known, affordable commercial solutions, both drones and related hardware, currently available on the agricultural market.

Response 2:

We provide a three-tier itemized cost breakdown. All components are commercially available off-the-shelf (COTS); no custom hardware was developed.
Tier 1 — Core navigation platform ($1,790):

- Component: UAV frame; Cost (USD): $50

- Component: Motors ×4; Cost (USD): $40

- Component: ESC ×4; Cost (USD): $80

- Component: Propellers; Cost (USD): $10

- Component: Pixhawk 6X flight controller; Cost (USD): $450

- Component: RPi CM4 (2 GB) + carrier board; Cost (USD): $30

- Component: Jetson Orin Nano (4 GB) + carrier; Cost (USD): $400

- Component: Walksnail Avatar FPV (VTX + camera + RX); Cost (USD): $410

- Component: ARK Flow (optical flow + ToF rangefinder); Cost (USD): $250

- Component: Battery 2200 mAh LiPo; Cost (USD): $20

- Component: Misc (power distribution, wiring, mounts); Cost (USD): $50

- Component: Subtotal; Cost (USD): $1,790
Tier 2 — + Phenotyping application ($1,890): adds a side-mounted 4K action camera ($100) for perpendicular crop imagery.
Tier 3 — + Evaluation setup ($2,290): adds a UWB kit (5 anchors + 1 tag, $400) used solely for V-SLAM validation.
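The tier totals follow directly from the itemized list. The short sketch below re-adds them; the dictionary keys are shortened labels for the components listed above and the variable names are illustrative.

```python
TIER1_USD = {
    "UAV frame": 50,
    "Motors x4": 40,
    "ESC x4": 80,
    "Propellers": 10,
    "Pixhawk 6X flight controller": 450,
    "RPi CM4 (2 GB) + carrier board": 30,
    "Jetson Orin Nano (4 GB) + carrier": 400,
    "Walksnail Avatar FPV (VTX + camera + RX)": 410,
    "ARK Flow (optical flow + ToF rangefinder)": 250,
    "Battery 2200 mAh LiPo": 20,
    "Misc (power distribution, wiring, mounts)": 50,
}

tier1 = sum(TIER1_USD.values())  # core navigation platform
tier2 = tier1 + 100              # plus side-mounted 4K action camera
tier3 = tier2 + 400              # plus UWB kit used only for validation
print(tier1, tier2, tier3)       # 1790 1890 2290
```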

For context, integrated commercial agricultural drone solutions are typically priced substantially higher than the core Tier 1 setup above, although exact market prices vary by region, payload class, and configuration. The modular COTS design also allows independent component upgrades without vendor lock-in. Revised in: Section 2.1 (platform cost statement); detailed itemized breakdown provided in this response.

 

Comment 3: Regarding the training of the AI models: fixing the number of epochs a priori is not advisable. Instead, I would expect you to use the validation mAP (e.g., mAP@0.5 or mAP@0.5:0.95) as the stopping criterion, which is the appropriate approach to optimize training while avoiding underfitting and overfitting.

Response 3:

We appreciate this methodological point and agree that the original description was insufficiently detailed. Training used a composite fitness metric (0.9 × mAP@0.5:0.95 + 0.1 × mAP@0.5) evaluated after each epoch, and final model weights were selected from the epoch achieving the highest validation composite fitness, not from the last epoch. Specifically:

  • Plant detection model: trained for 150 epochs; best fitness at epoch 91.
  • Flower detection model: trained for 200 epochs; best fitness at epoch 173.

A patience window of 100 epochs was configured but not triggered in either case, as both models reached their epoch ceiling before 100 consecutive non-improving epochs occurred. The best-epoch weight selection ensures that the deployed model reflects peak validation performance regardless of total training duration. The revised Section 2.3.2 now states the fitness metric, best epochs, and weight-selection procedure explicitly.
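The selection rule described above can be illustrated with a generic sketch. It is not the training framework's code: the function names are hypothetical and the validation history is synthetic. Only the fitness weights (0.9 and 0.1) and the 100-epoch patience value come from this response.

```python
def fitness(map50: float, map50_95: float) -> float:
    """Composite validation fitness: 0.9 x mAP@0.5:0.95 + 0.1 x mAP@0.5."""
    return 0.9 * map50_95 + 0.1 * map50


def select_best_epoch(history, patience=100):
    """Return (best_epoch, stopped_early) for per-epoch (mAP@0.5, mAP@0.5:0.95).

    Epochs are numbered from 1. Training stops early once `patience`
    consecutive epochs have passed without a new best fitness.
    """
    best_fit, best_epoch = float("-inf"), 0
    for epoch, (m50, m50_95) in enumerate(history, start=1):
        f = fitness(m50, m50_95)
        if f > best_fit:
            best_fit, best_epoch = f, epoch
        elif epoch - best_epoch >= patience:
            return best_epoch, True
    return best_epoch, False


# Synthetic validation history: fitness peaks at epoch 2, then plateaus lower.
history = [(0.50, 0.30), (0.60, 0.40)] + [(0.55, 0.35)] * 4
print(select_best_epoch(history, patience=3))    # (2, True)
print(select_best_epoch(history, patience=100))  # (2, False)
```

The second call mirrors the situation reported above: with best epochs at 91 of 150 and 173 of 200, at most 59 and 27 non-improving epochs followed the best one, so a 100-epoch patience window could not trigger before the epoch ceiling.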

 

Comment 4: The bibliography should be partially revised. Several references are outdated and do not adequately reflect the current state of the art, particularly in rapidly evolving fields such as UAV systems and AI models.

Response 4:

We have updated the bibliography in the revised manuscript. The changes fall into three categories:

Replaced (4 entries):

- [10] Bellvert 2014 → Ndlovu et al. (2024), UAV thermal remote sensing review

- [22] Roldán 2015 → Al-Najadi et al. (2025), drone thermal sensing in greenhouse

- [26] Gomes 2016 → Iaboni et al. (2022), motion capture quadrotor localization

- [30] Jin 2018 → Elamin et al. (2025), visual-inertial odometry for indoor UAV

Removed (5 entries, no longer cited in revised text):

- [4] Bechar & Vigneault 2016 (agricultural robots review)

- [5] Cheng et al. 2024 (greenhouse robot positioning)

- [12] Shakhatreh et al. 2019 (UAV civil applications survey)

- [35] Schmuck & Chli 2017 (multi-UAV collaborative SLAM)

- [36] Chhikara et al. 2021 (deep neural net UAV navigation)

Added (5 entries):

- Al-Tawil et al. (2024), V-SLAM review (reviewer-suggested, added alongside retained [33,34])

- Krul et al. (2021), visual SLAM for indoor livestock drone

- Sukvichai et al. (2023), monocular ORB SLAM for indoor agricultural drone

- Islam et al. (2023), stereo agricultural SLAM

- Zhai et al. (2025), dynamic Kalman filtering for fruit tracking

References:

Ndlovu, H. S.; Odindi, J.; Sibanda, M.; Mutanga, O. A systematic review on the application of UAV-based thermal remote sensing for assessing and monitoring crop water status in crop farming systems. International Journal of Remote Sensing 2024, 45(15), 4923-4960. https://doi.org/10.1080/01431161.2024.2368933

Al-Najadi, R.; Al-Mulla, Y.; Al-Abri, I.; Al-Sadi, A. M. Effectiveness of drone-based thermal sensors in optimizing controlled environment agriculture performance under arid conditions. Scientific Reports 2025, 15, 9042. https://doi.org/10.1038/s41598-025-94432-0

Iaboni, C.; Lobo, D.; Choi, J.-W.; Abichandani, P. Event-based motion capture system for online multi-quadrotor localization and tracking. Sensors 2022, 22(9), 3240. https://doi.org/10.3390/s22093240

Elamin, A.; El-Rabbany, A.; Jacob, S. Event-based visual/inertial odometry for UAV indoor navigation. Sensors 2025, 25(1), 61. https://doi.org/10.3390/s25010061

Al-Tawil, B.; Hempel, T.; Abdelrahman, A.; Al-Hamadi, A. A review of visual SLAM for robotics: evolution, properties, and future applications. Frontiers in Robotics and AI 2024, 11, 1347985. https://doi.org/10.3389/frobt.2024.1347985

Islam, R.; Habibullah, H.; Hossain, M. T. AGRI-SLAM: a real-time stereo visual SLAM for agricultural environment. Autonomous Robots 2023, 47, 649-668. https://doi.org/10.1007/s10514-023-10110-y

Zhai, Y.; Zhang, L.; Hu, X.; Yang, F.; Huang, Y. A Dynamic Kalman Filtering Method for Multi-Object Fruit Tracking and Counting in Complex Orchards. Sensors 2025, 25, 4138. https://doi.org/10.3390/s25134138

 

Comment 5: As mentioned above, you will find detailed comments on the manuscript in the attached document.

Response 5:

We thank the reviewer for the detailed line-by-line annotations. We converted the attachment into a closure checklist and addressed each item in the corresponding manuscript section:

  • Terminology: "LiDAR" adopted as the consistent spelling throughout the manuscript.
  • Acronyms and phrasing: redundant "unmanned aerial vehicle (UAV)" expanded form removed after first use; "extended Kalman filter (EKF)" acronym introduced at first mention.
  • 2.1 hardware description: the manuscript now focuses on execution-role clarity and explicitly states the two-camera setup in Section 2.1 (execution-role wording at Lines 151-167; two-camera description and COTS/cost statement at Lines 182-192).
  • 2.1 redundancy removal: the three-tier communication description has been consolidated to avoid repetition. The base-station setup is clarified in Section 2.1 (Jetson ground-station role at Lines 160-162; mission-monitoring laptop role at Lines 176-180).
  • 2.4.1 definitions: first-use definitions added for R99 (99th-percentile radial error), NLOS (non-line-of-sight), and p95/p99 percentiles in Section 2.4.1 (Lines 392-416).
  • Figure numbering: all figure numbers have been corrected and cross-checked against in-text references throughout the manuscript. The duplicate "Figure 1" issue has been resolved.
  • Figure typography: caption style inconsistencies identified in the attachment were corrected in the revised manuscript.
  • 3.3 Figure 14: orientation description updated from "left and right" to "top and bottom." Error bars represent across-plant variability within each row (A, B, C); this is now stated in the figure discussion (Lines 628-631).
  • 3.3 and §4 yield prediction: a brief note on the potential use of temporal flower-count trends for yield estimation has been added as a future-work item in Section 4 (Lines 689-690).
  • 3.1.2 hardware performance: the rate degradation (9 Hz -> 6 Hz) is addressed in Section 3.1.2 (Lines 498-508) and in our response to Reviewer #4, Comment 10. Causal attribution is now limited to observed behavior and quantified impact, with no unsupported mechanism claim.
  • Bibliography updates: references were revised under a design-framework criterion: retain/add literature that directly informs architecture constraints, localisation/tracking method choices, and deployment trade-offs in greenhouse UAV operation; remove entries that are outdated or no longer support the revised argument flow. The update set remains 4 replacements, 5 removals, and 5 additions.

Round 2

Reviewer 2 Report

Comments and Suggestions for Authors

The manuscript has been revised appropriately.

Author Response

Comments:

The manuscript has been revised appropriately.

Response:

We thank the reviewer for the careful re-evaluation and for confirming that the previous concerns have been addressed.

Reviewer 3 Report

Comments and Suggestions for Authors

The research novelty of the study remains limited, which was the basis of my initial recommendation for rejection. However, the authors have clarified that their contribution lies in system integration and deployment evidence rather than algorithmic novelty. Given the knowledge gap identified in lines 101-103, I find this an acceptable justification.

Author Response

Comment:

The research novelty of the study remains limited, which was the basis of my initial recommendation for rejection. However, the authors have clarified that their contribution lies in system integration and deployment evidence rather than algorithmic novelty. Given the knowledge gap identified in lines 101-103, I find this an acceptable justification.

Response:

We thank the reviewer for re-evaluating the contribution scope. As clarified in the revision, the manuscript consistently positions this work as a systems contribution (constraint-driven architecture, 27-day deployment evidence, and a phenotyping proof-of-concept) rather than an algorithmic-novelty paper. This framing is maintained in Section 1 and Section 4.

Reviewer 4 Report

Comments and Suggestions for Authors

 

1. Detailed Review by Item

Comment 1.
The reviewer's requests to identify key bottlenecks and clarify the paper's focus were adequately addressed. Specifically, clarifying that the dual-link architecture and edge offloading are design responses to the compute and payload bottlenecks of small UAVs, rather than independent research objectives, improved the manuscript.

Comment 2.
Separating the rationale for selecting monocular SLAM from the mitigation strategy was appropriate. Additionally, distinguishing UWB as an independent evaluation reference rather than for operational use was suitable.

Comment 3.
The reviewer noted that repeated mentions of online processing, offline analysis, and offloading within the same paragraph obscured the execution location and data flow.

Comment 4.
The response differentiates the control hierarchy and offers practical stability evidence by calculating the displacement margin during the pose-update interval. However, this does not fully satisfy the request for acceptable end-to-end jitter and closed-loop stability. Consequently, the claim should be limited to evidence of practical stability under low-speed monitoring conditions.

Comment 5.
Clearly distinguishing SLAM as the sole positioning source during actual flight, with UWB serving as the evaluation reference, is appropriate.

Comment 6.
Including coordinate system transformation, MAVLink TIMESYNC, and fallback procedures for SLAM pose loss significantly enhances reproducibility. Notably, the addition of ENU-NED transformation and timeout-based autonomous landing conditions is valuable.

Comment 7.
The explanation of the necessity of filtering, the filtered overlay versus raw data, the retention ratio, and the hot-zone ratio represents a clear improvement. However, the reviewer requested a before-and-after comparison, a justification for the threshold selection, and a separate analysis of the impact of excluded segments. Therefore, a table specifying fixed parameter values for DBSCAN, gating, smoothing, Kalman, and hot-zone should be provided, along with quantitative retention rates per run, excluded segment lengths or ratios, and changes in error statistics before and after filtering.

Comment 8.
Softening the statement by attributing the cause to lighting/reflection with "consistent with" is appropriate. Additionally, presenting median, p-value, and Cliff's delta for the trajectory effect enhances persuasiveness.

Comment 9.
Limiting the handheld endurance test's purpose to temporal stability verification and stating that the relocalization-disabled setting was applied equally to the flight test is appropriate. Moreover, the logical connection between handheld RMSE and flight RMSE during low-speed flights is sound.

Comment 10.
Eliminating assumptions about causes, such as heat accumulation and memory issues, in favor of reporting only observed phenomena is beneficial. Transparently acknowledging the instrumentation gap is also appropriate for defensive purposes.

Comment 11.
Specifying the FN/FPR/IDSW decomposition of flower tracking is appropriate and explaining that undercount bias arises from FN-dominant error is logical. However, the reviewer requested a correction model, cross-validation, and inclusion of temporal covariates. The response indicates that a correction model was not introduced due to data limitations; thus, this requirement is not fully met. Consequently, sentences implying that regularization corrects bias should be removed, limiting the contribution to "temporal trend recovery."

Comment 12.
The response clearly states the principle of sentence simplification and provides examples of actual corrections, which is satisfactory. The presentation of a writing rule separating observation and inference is particularly noteworthy.

 

Comments on the Quality of English Language

Although overall comprehension is adequate, several sentences are lengthy and complex, hindering the clear communication of the core argument and contribution. Specifically, the sections addressing the method's prerequisites, model limitations, and result interpretation would benefit from more concise sentence structures to improve technical clarity and persuasiveness.

Author Response

 

Comment 1: The reviewer's requests to identify key bottlenecks and clarify the paper's focus were adequately addressed. Specifically, clarifying that the dual-link architecture and edge offloading are design responses to the compute and payload bottlenecks of small UAVs, rather than independent research objectives, improved the manuscript.

Response:

We thank the reviewer for confirming that the revised Introduction now communicates the bottleneck-to-design mapping more clearly. The current manuscript retains this structure throughout.

 

Comment 2: Separating the rationale for selecting monocular SLAM from the mitigation strategy was appropriate. Additionally, distinguishing UWB as an independent evaluation reference rather than for operational use was suitable.

Response:

We appreciate the reviewer's confirmation. The distinction between system operation and evaluation is maintained across Sections 2.1 and 2.4.1.

 

Comment 3: The reviewer noted that repeated mentions of online processing, offline analysis, and offloading within the same paragraph obscured the execution location and data flow.

Response:

We are glad to hear that the restructured Section 2.1, together with the module-to-hardware mapping table (Table 1), resolved the data-flow clarity issue raised previously.

 

Comment 4: The response differentiates the control hierarchy and offers practical stability evidence by calculating the displacement margin during the pose-update interval. However, this does not fully satisfy the request for acceptable end-to-end jitter and closed-loop stability. Consequently, the claim should be limited to evidence of practical stability under low-speed monitoring conditions.

Response:

We agree with this qualification. Accordingly, the stability claim in the revised manuscript is limited to the tested low-speed monitoring conditions, supported by the 10 Hz outer-loop operation, the approximately 126 ms pose-update interval, and the flight results reported in Section 3.2. We do not claim a separately measured end-to-end jitter bound or a general closed-loop stability guarantee beyond the tested operating regime (Section 3.1.2, Lines 499–504).
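The displacement-margin reasoning referred to in this comment can be illustrated with a short calculation. This is a minimal sketch rather than the authors' analysis: the approximately 126 ms pose-update interval and the 10 Hz outer-loop rate come from the response above, while the flight speed of 0.3 m/s and the function names are hypothetical values chosen only for illustration.

```python
# Distance travelled between consecutive pose updates at a given speed.
POSE_UPDATE_INTERVAL_S = 0.126  # approx. 126 ms, from the response
OUTER_LOOP_RATE_HZ = 10.0       # outer-loop rate, from the response
ASSUMED_SPEED_M_S = 0.3         # hypothetical low-speed value, not from the manuscript


def displacement_per_update(speed_m_s: float, interval_s: float) -> float:
    """Return the distance (m) covered during one pose-update interval."""
    return speed_m_s * interval_s


def loop_cycles_per_update(loop_rate_hz: float, interval_s: float) -> float:
    """Return how many outer-loop cycles elapse during one pose-update interval."""
    return loop_rate_hz * interval_s


if __name__ == "__main__":
    d = displacement_per_update(ASSUMED_SPEED_M_S, POSE_UPDATE_INTERVAL_S)
    n = loop_cycles_per_update(OUTER_LOOP_RATE_HZ, POSE_UPDATE_INTERVAL_S)
    print(f"displacement per pose update: {d * 100:.2f} cm")
    print(f"outer-loop cycles per pose update: {n:.2f}")
```

Under the assumed 0.3 m/s, the vehicle moves roughly 3.8 cm between pose updates. A calculation of this kind bounds how stale a pose can be, but it is not a measurement of end-to-end jitter or a closed-loop stability proof.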

Comment 5: Clearly distinguishing SLAM as the sole positioning source during actual flight, with UWB serving as the evaluation reference, is appropriate.

Response:

We thank the reviewer for this confirmation. This operational-versus-evaluation distinction is maintained in the manuscript.

 

Comment 6: Including coordinate system transformation, MAVLink TIMESYNC, and fallback procedures for SLAM pose loss significantly enhances reproducibility. Notably, the addition of ENU-NED transformation and timeout-based autonomous landing conditions is valuable.

Response:

We thank the reviewer for recognizing these additions. These reproducibility details are retained in Section 2.2.2 of the current manuscript.
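For readers unfamiliar with the two items named in this comment, the following minimal sketch shows a position-only ENU-to-NED conversion and a timeout check of the kind that could trigger a landing fallback. It is an illustration under stated assumptions, not the manuscript's implementation: the function names and the 1.0 s timeout are hypothetical, and orientation conversion (which requires a frame rotation as well) is not covered.

```python
from typing import Tuple

Vec3 = Tuple[float, float, float]

POSE_TIMEOUT_S = 1.0  # hypothetical timeout; the actual value is not stated here


def enu_to_ned(p_enu: Vec3) -> Vec3:
    """Convert a position from ENU (east, north, up) to NED (north, east, down)."""
    east, north, up = p_enu
    # Swap the horizontal axes and negate the vertical axis.
    return (north, east, -up)


def should_land(now_s: float, last_pose_s: float,
                timeout_s: float = POSE_TIMEOUT_S) -> bool:
    """Return True when no pose has arrived within the timeout window."""
    return (now_s - last_pose_s) > timeout_s


if __name__ == "__main__":
    print(enu_to_ned((1.0, 2.0, 3.0)))
    print(should_land(now_s=10.0, last_pose_s=8.5))
```

The same axis swap and sign flip maps NED back to ENU, so applying `enu_to_ned` twice returns the original vector.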

 

Comment 7: The explanation of the necessity of filtering, the filtered overlay versus raw data, the retention ratio, and the hot-zone ratio represents a clear improvement. However, the reviewer requested a before-and-after comparison, a justification for the threshold selection, and a separate analysis of the impact of excluded segments. Therefore, a table specifying fixed parameter values for DBSCAN, gating, smoothing, Kalman, and hot-zone should be provided, along with quantitative retention rates per run, excluded segment lengths or ratios, and changes in error statistics before and after filtering.

Response:

We agree that the filtering workflow should be documented more explicitly. The revised manuscript (Section 2.4.1, Lines 397–414) already reports all fixed parameter values, aggregate retention ranges, hot-zone definitions, and post-conditioning uncertainty statistics. Figure 11 overlays the raw UWB scatter with the conditioned reference trajectory for each path, providing the requested before-and-after visual comparison.

To give the reviewer full transparency beyond the manuscript text, we additionally provide the following two summary tables and the before-conditioning residual statistics in this response:

  • Parameter specification (Table A). Table A consolidates all fixed conditioning parameters (DBSCAN, kinematic gating, median smoothing, Gaussian smoothing, Kalman refinement, and hot-zone flagging) from the manuscript text into a single table with their functional roles and selection rationale. All values were fixed across trajectories and chosen to suppress multipath-driven excursions relative to the vendor-specified UWB dynamic accuracy and the motion scale of the greenhouse missions.
  • Run-wise retention and exclusion (Table B). Table B breaks out the per-trajectory retention statistics that the manuscript reports in aggregate: raw sample count, retention after each rejection stage, the overall excluded ratio, final reference-point count, and hot-zone ratio.
  • Before-and-after error statistics. Before conditioning, the raw UWB trajectory residuals showed MAD = 1.61–1.72 cm, p95 = 13.36–15.02 cm, and p99 = 37.04–42.41 cm. After conditioning, these reduced to MAD = 0.26–1.25 cm, p95 = 1.39–8.01 cm, and p99 = 1.80–16.40 cm, as reported in the manuscript.
  • Impact of excluded segments. The rejected samples are spatially concentrated near metal-rack boundaries and corridor transitions, where non-line-of-sight propagation is expected. The raw-versus-conditioned overlay in Figure 11 supports that the retained reference follows the same central path geometry in these three trajectories. We report the retention ratios and post-conditioning uncertainty explicitly and use the conditioned path as a bounded evaluation reference, rather than claiming that filtering eliminates all reference bias.
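The before-and-after statistics quoted above (MAD, p95, p99) can be computed from a list of residuals as in the sketch below. This is an illustrative assumption-laden example, not the authors' script: MAD is assumed here to mean median absolute deviation (the response does not expand the acronym), the percentile uses linear interpolation between closest ranks, and the function names are hypothetical.

```python
import statistics
from typing import Dict, Sequence


def percentile(values: Sequence[float], q: float) -> float:
    """Percentile with linear interpolation between closest ranks (q in [0, 100])."""
    if not values:
        raise ValueError("values must not be empty")
    xs = sorted(values)
    rank = (len(xs) - 1) * q / 100.0
    lo = int(rank)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (rank - lo)


def mad(values: Sequence[float]) -> float:
    """Median absolute deviation about the median (assumed meaning of MAD)."""
    med = statistics.median(values)
    return statistics.median([abs(v - med) for v in values])


def residual_summary(residuals_cm: Sequence[float]) -> Dict[str, float]:
    """Summarize residuals with the three statistics used in the response."""
    return {
        "MAD": mad(residuals_cm),
        "p95": percentile(residuals_cm, 95.0),
        "p99": percentile(residuals_cm, 99.0),
    }


if __name__ == "__main__":
    # Synthetic residuals for demonstration only; not data from the study.
    print(residual_summary([0.0, 1.0, 2.0, 3.0, 4.0]))
```

Running the same summary on the raw and on the conditioned residuals of one trajectory gives the paired figures needed for a before-and-after comparison.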

 

Table A. Fixed UWB Signal-Conditioning Parameters

| Stage | Parameter(s) | Fixed value | Role in pipeline | Selection basis |
| --- | --- | --- | --- | --- |
| DBSCAN | eps, minPts | 0.01, 10 | Remove spatially isolated multipath outliers | Set to the spatial resolution of the UWB tag under static conditions. |
| Kinematic gate | Window | 0.2 s | Reject non-physical local jumps | Corresponds to the maximum plausible displacement at the monitored flight speed. |
| Median smoothing | Window | 7 samples | Suppress short residual spikes | Short support window to remove spikes without reshaping turns. |
| Gaussian smoothing | Window, sigma | 10 samples, 2 | Average local fluctuation with center weighting | Balances noise reduction with preservation of local trajectory geometry. |
| Kalman refinement | Q, R | 3e-4, 1.0 | Generate the final conditioned reference path | Favors a smooth motion prior over noisy raw UWB observations. |
| Hot-zone flagging | Window, threshold | 0.2 s, sigma > 8 cm | Flag locally degraded reference segments | Exposes degraded regions transparently without excluding them from the dataset. |
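Two of the stages in Table A can be sketched in a few lines to show how the listed parameters would enter a pipeline. This is a hedged illustration, not the authors' code: the 7-sample median window and the 0.2 s / 8 cm hot-zone settings come from Table A, but the centered-window handling, the use of a population standard deviation, the choice of residuals as the flagged quantity, and all function names are assumptions.

```python
import statistics
from typing import List, Sequence

MEDIAN_WINDOW = 7         # samples, from Table A
HOT_ZONE_WINDOW_S = 0.2   # seconds, from Table A
HOT_ZONE_SIGMA_M = 0.08   # 8 cm threshold, from Table A


def median_smooth(values: Sequence[float], window: int = MEDIAN_WINDOW) -> List[float]:
    """Centered running median; the window shrinks near the sequence edges."""
    half = window // 2
    out = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        out.append(statistics.median(values[lo:hi]))
    return out


def flag_hot_zones(times_s: Sequence[float], residuals_m: Sequence[float],
                   window_s: float = HOT_ZONE_WINDOW_S,
                   sigma_m: float = HOT_ZONE_SIGMA_M) -> List[bool]:
    """Flag samples whose local standard deviation exceeds sigma_m (assumed rule)."""
    flags = []
    for t in times_s:
        # Collect residuals inside a window centered on the current sample.
        local = [r for tt, r in zip(times_s, residuals_m)
                 if abs(tt - t) <= window_s / 2]
        sigma = statistics.pstdev(local) if len(local) > 1 else 0.0
        flags.append(sigma > sigma_m)
    return flags


if __name__ == "__main__":
    # Synthetic data for demonstration only.
    print(median_smooth([0.0, 0.0, 0.0, 10.0, 0.0, 0.0, 0.0]))
    print(flag_hot_zones([0.0, 0.04, 0.08, 1.0], [0.0, 0.0, 0.5, 0.0]))
```

Flagged samples are kept in the dataset in this sketch, in line with the table's statement that hot-zone flagging exposes degraded regions without excluding them.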

 

Table B. Run-Wise Retention and Exclusion Summary

| Trajectory | Raw samples (N) | After DBSCAN (%) | After kinematic gate (%) | Excluded by DBSCAN + gate (%) | Final reference points (n) | Hot-zone ratio (%) |
| --- | --- | --- | --- | --- | --- | --- |
| Linear | 9,500 | 21.3 | 20.0 | 80.0 | 189 | 8.8 |
| Loop | 16,787 | 34.7 | 32.9 | 67.1 | 552 | 10.9 |
| Zigzag | 21,592 | 46.8 | 46.2 | 53.8 | 996 | 9.4 |
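The relationship between the columns of Table B can be checked with a small script: the excluded ratio is the complement of the retention after the kinematic gate. The sketch below uses the values from Table B; the class and property names are hypothetical, and the retained-sample count is only approximate because the percentages in the table are rounded.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RunRetention:
    name: str
    raw_samples: int
    after_dbscan_pct: float
    after_gate_pct: float

    @property
    def excluded_pct(self) -> float:
        """Share of raw samples removed by DBSCAN plus the kinematic gate."""
        return round(100.0 - self.after_gate_pct, 1)

    @property
    def retained_samples(self) -> int:
        """Approximate sample count surviving both rejection stages."""
        return round(self.raw_samples * self.after_gate_pct / 100.0)


# Values copied from Table B.
RUNS: List[RunRetention] = [
    RunRetention("Linear", 9500, 21.3, 20.0),
    RunRetention("Loop", 16787, 34.7, 32.9),
    RunRetention("Zigzag", 21592, 46.8, 46.2),
]

if __name__ == "__main__":
    for run in RUNS:
        print(run.name, run.excluded_pct, run.retained_samples)
```

The computed excluded ratios match the "Excluded by DBSCAN + gate" column, which confirms that the column is derived from the post-gate retention rather than reported independently.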

 

 

 

Comment 8: Softening the statement by attributing the cause to lighting/reflection with "consistent with" is appropriate. Additionally, presenting median, p-value, and Cliff's delta for the trajectory effect enhances persuasiveness.

Response:

We thank the reviewer for this assessment. The revised text maintains the causal language at the level of observed association rather than direct proof, and the statistical summary (median, Mann-Whitney p, Cliff's delta) is retained in Section 3.2.2.
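Cliff's delta, one of the statistics retained in Section 3.2.2, is an effect size defined as the probability that a value from one group exceeds a value from the other, minus the reverse probability. The following minimal sketch implements that standard definition by direct pair counting; the function name and the example data are illustrative assumptions and are not taken from the manuscript.

```python
from typing import Sequence


def cliffs_delta(a: Sequence[float], b: Sequence[float]) -> float:
    """Cliff's delta: P(a > b) - P(a < b) over all cross-group pairs."""
    if not a or not b:
        raise ValueError("both groups must be non-empty")
    greater = sum(1 for x in a for y in b if x > y)
    less = sum(1 for x in a for y in b if x < y)
    return (greater - less) / (len(a) * len(b))


if __name__ == "__main__":
    # Synthetic error samples for demonstration only.
    errors_path_a = [3.0, 4.0, 5.0]
    errors_path_b = [1.0, 2.0]
    print(cliffs_delta(errors_path_a, errors_path_b))
```

The value ranges from -1 to 1, with 0 indicating complete overlap in rank terms. It is commonly reported alongside a Mann-Whitney test (for example `scipy.stats.mannwhitneyu`), since the p-value alone does not convey the magnitude of the difference.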

 

Comment 9: Limiting the handheld endurance test's purpose to temporal stability verification and stating that the relocalization-disabled setting was applied equally to the flight test is appropriate. Moreover, the logical connection between handheld RMSE and flight RMSE during low-speed flights is sound.

Response:

We appreciate the reviewer's confirmation. The endurance test remains scoped to temporal stability verification in Section 3.1.2, and the cross-reference to flight-condition RMSE is retained in Section 3.2.

 

Comment 10: Eliminating assumptions about causes, such as heat accumulation and memory issues, in favor of reporting only observed phenomena is beneficial. Transparently acknowledging the instrumentation gap is also appropriate for defensive purposes.

Response:

We agree. The runtime discussion in Section 3.1.2 now reports only the observed rate degradation and its bounded operational impact, without attributing it to a specific mechanism.

 

Comment 11: Specifying the FN/FPR/IDSW decomposition of flower tracking is appropriate and explaining that undercount bias arises from FN-dominant error is logical. However, the reviewer requested a correction model, cross-validation, and inclusion of temporal covariates. The response indicates that a correction model was not introduced due to data limitations; thus, this requirement is not fully met. Consequently, sentences implying that regularization corrects bias should be removed, limiting the contribution to "temporal trend recovery."

Response:

We agree with this qualification. We do not claim a post-hoc correction model or cross-validated bias correction in this revision. The contribution is limited to error decomposition and temporal trend recovery, not absolute count correction. Any wording that could imply regularization corrected the undercount bias has been removed, and bias-aware correction modeling is noted as future work in Section 3.3.3 (Lines 657–658).

Correspondingly, the Introduction (Lines 116–117) now states that the proof-of-concept flower counting pipeline supports temporal trend recovery despite systematic undercounting, and the Conclusions (Line 679) bound the claim to trend-level longitudinal monitoring.

 

Comment 12: The response clearly states the principle of sentence simplification and provides examples of actual corrections, which is satisfactory. The presentation of a writing rule separating observation and inference is particularly noteworthy.

Response:

We thank the reviewer for recognizing this improvement. We have maintained this writing discipline (one concept per sentence, metrics before interpretation, observation separated from inference) throughout the final text.

Reviewer 5 Report

Comments and Suggestions for Authors

Dear Authors,
thank you for your feedback. I believe the manuscript is now ready for publication.

Author Response

Comment: Dear Authors, thank you for your feedback. I believe the manuscript is now ready for publication.

Response:

We sincerely thank the reviewer for the positive assessment and for the constructive suggestions provided across the review process.
