Adaptive CNN Ensemble for Apple Detection: Enabling Sustainable Orchard Monitoring
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
Overall, the paper is good, but there are some major problems to be addressed.
1. The abstract is too long; please make it shorter. A good abstract should let readers quickly understand what this work has done. The current one is too long and makes it hard for readers to catch the main contributions.
2. The related work section is also too long. Please shorten and reorganize it. It should follow a logical flow rather than simply listing what other researchers have done. For example, you could follow this flow: what other researchers have done to accelerate automatic apple recognition --> what they have done via CNNs --> what they have done on NMS and WBF.
3. Please rewrite Section 3. You should list the baseline model you used and how you optimized it. Furthermore, there should be figures to illustrate: a structural diagram of the original baseline model, a structural diagram of your modified model, and figures illustrating the modules you used or introduced to improve the model.
4. Please improve the quality of your figures, such as Figures 1 and 2. They should be clearer; please replace them with higher-resolution versions. Additionally, I like your UI; please talk more about it.
5. Please shorten Section 4.
Author Response
The authors would like to express their sincere gratitude to the reviewers for their insightful comments and constructive suggestions. The feedback has been invaluable in helping us to significantly improve the quality, clarity, and impact of our manuscript. We believe that addressing these points has resulted in a much stronger and more focused paper.
Point 1: The abstract is too long; please make it shorter. A good abstract should let readers quickly understand what this work has done. The current one is too long and makes it hard for readers to catch the main contributions.
Response 1: We sincerely thank the reviewer for this valuable suggestion. We have shortened the abstract to enhance its clarity and impact. The revised version concisely states the problem, the core methodology (adaptive CNN ensemble framework with multi-objective optimization), the key innovation (pre-deployment benchmarking), and the primary results (7-12% accuracy improvement), allowing readers to quickly grasp the main contributions of our work.
Point 2: The related work section is also too long. Please shorten and reorganize it. It should follow a logical flow rather than simply listing what other researchers have done. For example, you could follow this flow: what other researchers have done to accelerate automatic apple recognition --> what they have done via CNNs --> what they have done on NMS and WBF.
Response 2: We agree with the reviewer and have thoroughly reorganized and condensed the "Related Works" section as follows:
Single-Model Approaches for Fruit Detection: Focusing on the evolution and application of YOLO and other CNN architectures in horticulture, highlighting their strengths and limitations.
Ensemble and Advanced Methods: Discussing ensemble strategies and specialized architectures from other fields, explicitly addressing their limited exploration in agricultural contexts and their computational challenges.
Identification of the Research Gap: Clearly stating the lack of lightweight, adaptive ensembles validated for real-time fruit detection in orchards, which our work aims to fill.
This new structure provides a coherent narrative that builds the case for our proposed framework, rather than presenting a simple list of previous studies.
Point 3: Please rewrite your Section 3, you should list what baseline model you used and how you optimized it. What's more, there should be some figures to illustrate. I believe you should include a structural diagram of the original baseline model, a structural diagram of your modified model, and figures to illustrate the modules you used or provided to improve the model.
Response 3: We thank the reviewer for this critical feedback. We have improved Section 3 ("Materials and Methods") to enhance its clarity and technical depth. Regarding baseline models, we have clarified that our framework is model-agnostic: its innovation lies not in architectural modifications of individual detectors, but in their intelligent selection and ensembling. The baseline models are state-of-the-art detectors (primarily the YOLO family, from YOLOv5 to YOLOv11) used in their standard, pre-trained forms via transfer learning. For the core experimental results reported in the study, the YOLOv11m model was employed as a representative, high-performing base detector to ensure a consistent and fair comparison of the ensemble methods.
Regarding framework optimization: the "optimization" is performed at the system level through our Pareto-based multi-objective selection process, which finds the best-performing model or ensemble configuration for a given scenario. This is now explicitly stated at the beginning of Section 3. Regarding new figures: as suggested, we have added a high-level architectural diagram of the proposed software framework.
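To make the selection step concrete, the minimal Python sketch below illustrates the Pareto-frontier idea behind it: among benchmarked configurations, keep only those that no other configuration beats on both accuracy and speed. The candidate names and numbers are invented for illustration and are not values from the manuscript.

```python
# Minimal sketch of Pareto-optimal selection over (accuracy, speed).
# Candidate entries below are hypothetical, not the paper's measured results.

def pareto_front(configs):
    """Keep configurations that no other configuration dominates on both axes."""
    front = []
    for c in configs:
        dominated = any(
            o["map"] >= c["map"] and o["fps"] >= c["fps"]
            and (o["map"] > c["map"] or o["fps"] > c["fps"])
            for o in configs
        )
        if not dominated:
            front.append(c)
    return front

candidates = [
    {"name": "single_yolo",  "map": 0.78, "fps": 45.0},  # dominated by soft_nms
    {"name": "wbf_ensemble", "map": 0.85, "fps": 0.06},  # most accurate, but slow
    {"name": "soft_nms",     "map": 0.80, "fps": 50.0},  # fast and accurate
]
print([c["name"] for c in pareto_front(candidates)])  # ['wbf_ensemble', 'soft_nms']
```

A scenario-specific policy (e.g., favoring FPS for real-time field use) can then pick a single configuration from this frontier.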
Point 4: Please improve the quality of your figures, such as Figures 1 and 2. They should be clearer; please replace them with higher-resolution versions. Additionally, I like your UI; please talk more about it.
Response 4: We appreciate the reviewer's positive comment on our UI and the suggestion regarding the figures. However, the resolution of the current images is already at least 1500 px; we have additionally attached the image files separately. Regarding the UI, we have expanded its description as follows.
A comprehensive graphical interface has been developed in PyQt5 for ensembling object-detection neural network models in horticulture applications. The interface features intuitive navigation with logical section organization.
The left panel is dedicated to image processing and includes tools for loading images from files or directly from webcam capture, and interactive bounding box drawing for object annotation with zoom and pan capabilities. It implements functions for saving and loading annotation coordinates in JSON format, along with image augmentation options including rotation, brightness and contrast adjustment, noise addition, blur effects, and horizontal flipping.
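As a rough illustration of the annotation save/load functions mentioned above, the sketch below writes boxes to JSON. The schema (image name plus a list of [x1, y1, x2, y2] boxes) and the file names are assumptions for illustration, not necessarily the tool's actual format.

```python
# Minimal sketch of JSON annotation persistence; schema is hypothetical.
import json

def save_annotations(path, image_file, boxes):
    """Write bounding boxes for one image to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump({"image": image_file, "boxes": boxes}, f, indent=2)

def load_annotations(path):
    """Read image name and bounding boxes back from a JSON file."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    return data["image"], data["boxes"]

save_annotations("apples_001.json", "apples_001.jpg", [[34, 50, 120, 140]])
print(load_annotations("apples_001.json"))
```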
The right panel is organized into tabbed sections. The "Settings" tab contains three main parameter groups: selection from 11 ensemble methods (classical NMS, Soft NMS, weighted methods WBF and NMW, averaging and voting approaches, advanced Adaptive NMS, TTA and Bayesian Ensembling), support for up to five YOLO models with visual selection feedback, and detection parameter configuration with manual threshold adjustment via sliders for Confidence and IoU, automated threshold optimization, and bounding box thickness control.
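To make the listed fusion options more concrete, here is a minimal sketch of one of them, Soft NMS (Gaussian variant): rather than discarding boxes that overlap the current best detection, it decays their confidence scores, which helps with clustered fruit. The sigma and score threshold shown are common defaults, not the framework's tuned values.

```python
# Minimal sketch of Gaussian Soft NMS; thresholds are illustrative defaults.
import math

def iou(a, b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def soft_nms(boxes, scores, sigma=0.5, score_thr=0.25):
    """Keep the best box, then decay (not drop) scores of overlapping boxes."""
    boxes, scores = list(boxes), list(scores)
    keep = []
    while boxes:
        i = max(range(len(scores)), key=scores.__getitem__)
        box, score = boxes.pop(i), scores.pop(i)
        if score < score_thr:   # all remaining scores are lower; stop
            break
        keep.append((box, score))
        scores = [s * math.exp(-iou(box, b) ** 2 / sigma)
                  for b, s in zip(boxes, scores)]
    return keep

# Two heavily overlapping detections plus one distinct apple:
print(soft_nms([[10, 10, 50, 50], [12, 12, 52, 52], [200, 200, 240, 240]],
               [0.9, 0.8, 0.85]))
```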
The "Results" tab provides comprehensive visualization and analysis capabilities with individual model and ensemble method performance viewing, interactive metric comparison in tabular format including Precision, Recall, F1-Score, Average IoU, mAP and FPS measurements. Export functionality includes detailed Excel report generation, automatic model selection recommendations based on comparative metric analysis, relationship visualization through comparative graphing, and experiment history storage in SQLite database for subsequent analysis.
The interface emphasizes ergonomic design with logical function grouping, visual feedback of current settings, progress tracking for operations, and contextual guidance, making the system accessible to users with varying levels of expertise in computer vision and machine learning domains. The application specifically supports agricultural monitoring tasks through optimized model ensemble selection and performance evaluation.
Point 5: Please shorten your section 4.
Response 5: We have carefully reviewed Section 4 ("Results and Discussion") and have condensed it by removing redundant descriptions and streamlining the text. The core results, including the performance table (Table 3) and the discussion of key findings, have been preserved and refined to ensure the section is both comprehensive and concise.
We have thoroughly revised the manuscript to improve the clarity, precision, and academic tone of the English.
All corrections in the text are highlighted in green.
Reviewer 2 Report
Comments and Suggestions for Authors
- The literature review does not systematically summarize the specific challenges of agricultural scenarios (such as severe occlusion, inter-class fruit similarity, and background complexity) and the limitations of existing methods. In addition, it lacks a review and comparison of the most recent advances in agricultural detection, particularly those based on Transformer architectures.
- The paper does not provide a theoretical discussion comparing other advanced ensemble learning frameworks. It is recommended to add an overall technical workflow diagram in the methodology section to better highlight the logical structure of the paper.
- The experimental design primarily considers the diversity of weather conditions, which indeed provides sufficient evidence for validating the robustness of the model under varying illumination and meteorological scenarios. However, this design remains limited, as common challenges in orchard detection include severe occlusion, inter-class fruit similarity, and complex backgrounds. It is recommended to include additional experiments addressing these factors.
- While the number of results presented is extensive, the analysis remains largely descriptive and lacks deeper explanations for the observed differences. It is advised to incorporate statistical validation (e.g., variance or confidence intervals) and comparisons with recent state-of-the-art methods.
- The discussion of the practical application value for sustainable agricultural management remains insufficient, particularly regarding concrete deployment scenarios such as pest control and yield prediction. It is suggested to reduce repetitive descriptions and strengthen the discussion on real-world applicability and industrial value.
Author Response
The authors would like to express their sincere gratitude to the reviewers for their insightful comments and constructive suggestions. The feedback has been invaluable in helping us to significantly improve the quality, clarity, and impact of our manuscript. We believe that addressing these points has resulted in a much stronger and more focused paper.
Point 1: The literature review does not systematically summarize the specific challenges of agricultural scenarios (occlusion, inter-class fruit similarity, background complexity)… and lacks comparison with recent Transformer-based advances.
Response 1: We sincerely thank the reviewer for this insightful comment. We agree that a more systematic synthesis of agricultural detection challenges and a comparative discussion of recent Transformer-based models would significantly strengthen the literature review. In the revised manuscript, we expanded Section 2 (Related Works) by explicitly categorizing common difficulties in orchard-level object detection (occlusion, inter-class similarity, and complex backgrounds) and by including the latest Transformer-enhanced architectures (e.g., Swin Transformer, VM-YOLO, and Vision Mamba-based models, 2024–2025). This addition helps clarify the technological gap our adaptive CNN ensemble addresses and highlights its complementarity to, rather than competition with, Transformer-based approaches.
Point 2: The paper does not provide a theoretical discussion comparing other advanced ensemble learning frameworks. It is recommended to add an overall technical workflow diagram in the methodology section.
Response 2: We appreciate this valuable suggestion. To enhance theoretical depth, the revised Section 3 (Materials and Methods) now includes a concise comparative subsection outlining key ensemble learning paradigms (bagging, boosting, stacking, Bayesian aggregation, consensus fusion) and their relevance to agricultural vision tasks. Additionally, a new Figure (workflow diagram) has been inserted before Section 3.2, presenting the logical structure of data acquisition, model selection, Pareto-based optimization, and ensemble inference, which clarifies the methodological flow.
Point 3: Experimental design focuses on weather diversity but lacks tests for occlusion, inter-class similarity, and complex backgrounds.
Response 3: We thank the reviewer for this constructive remark. In response, we expanded the experimental protocol to include an “Occlusion and Background Complexity” test subset derived from the original dataset. This additional evaluation aims to reflect real orchard conditions where fruit instances are frequently obscured by leaves, branches, or overlapping clusters. We quantitatively analyzed performance under three distinct scenarios: moderate occlusion (30–50%), severe occlusion (>50%), and high background complexity involving mixed illumination and dense foliage.
Point 4: The analysis is largely descriptive; add statistical validation (variance/confidence intervals) and comparison with SOTA methods.
Response 4: We fully agree with the reviewer. To substantiate our conclusions, we supplemented Table 3 with standard deviation values across five trials and introduced 95% confidence intervals for key metrics (Precision, Recall, mAP). Furthermore, we added a comparative paragraph referencing recent state-of-the-art detectors (e.g., YOLOv8n, Rep-ViG-Apple 2024, AD-YOLO 2024). These benchmarks confirm that our adaptive ensemble achieves competitive or superior accuracy while retaining computational feasibility for embedded deployment.
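For clarity on how such intervals are typically obtained, the sketch below computes a mean ± 95% confidence interval over repeated trials using the t-distribution. The five mAP values are invented for illustration, not results from the paper.

```python
# Minimal sketch of a 95% CI over repeated trials; trial values are invented.
import statistics
from scipy import stats

map_trials = [0.842, 0.851, 0.838, 0.847, 0.845]  # hypothetical mAP@0.5 values
n = len(map_trials)
mean = statistics.mean(map_trials)
sem = statistics.stdev(map_trials) / n ** 0.5      # standard error of the mean
half_width = stats.t.ppf(0.975, df=n - 1) * sem    # two-sided 95% interval
print(f"mAP@0.5 = {mean:.3f} ± {half_width:.3f} (95% CI, n={n})")
```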
Point 5: The discussion of practical application value for sustainable agricultural management remains insufficient.
Response 5: We appreciate this important recommendation. We revised the Discussion and Conclusions to emphasize real-world deployment scenarios, including: automated pest detection and yield prediction pipelines, integration with AIoT sensor networks for micro-climate monitoring, potential use in autonomous robotic harvesting. This addition clarifies the pathway from experimental validation to industrial and ecological sustainability outcomes.
All corrections in the text are highlighted in blue.
Reviewer 3 Report
Comments and Suggestions for Authors
It is my honor to be entrusted with this review. This article addresses a practical problem—robust fruit detection under diverse field conditions—with a clear, application-oriented workflow and a helpful catalog of ensemble/post-processing strategies. The dataset spans varied weather/illumination scenarios, and the implementation details are generally well organized.
However, the following revisions appear necessary:
Comment 1: Broaden baselines.
To substantiate model-agnostic robustness, please expand beyond a single detector family. Include at least one lightweight and one mid-size alternative (e.g., a different YOLO tier and a non-YOLO baseline such as EfficientDet or RT-DETR). Report both single-model and ensemble variants per condition to verify that conclusions hold across architectures.
Comment 2: Significance testing.
Provide statistical rigor and uncertainty quantification. Run ≥3 seeds (or use image/scene-level bootstrapping) and report mean ± 95% CI for mAP@0.5:0.95 (and other primary metrics). Where improvements are claimed, apply appropriate significance tests (e.g., paired bootstrap over scenes/images or cluster-robust tests) and mark significance in tables/figures. Including absolute/relative deltas and, where relevant, PR/calibration curves would clarify practical impact.
Recommendation: Major revision.
The study is promising and practically motivated, but acceptance should follow only after broadening baselines and adding rigorous uncertainty/significance analyses to confirm that the observed gains are robust and model-agnostic.
Author Response
We sincerely thank the reviewer for the thorough and insightful evaluation of our manuscript. We greatly appreciate the constructive feedback and have revised the paper accordingly. The detailed responses are provided below.
Point 1: To substantiate model-agnostic robustness, please expand beyond a single detector family. Include at least one lightweight and one mid-size alternative (e.g., a different YOLO tier and a non-YOLO baseline such as EfficientDet or RT-DETR). Report both single-model and ensemble variants per condition to verify that conclusions hold across architectures.
Response 1: We fully agree with this valuable suggestion. To confirm model-agnostic robustness and assess the generalizability of the proposed ensemble, we expanded the experimental setup to encompass detectors of diverse architectural paradigms and computational scales: in addition to the previously employed YOLOv8m, we incorporated the lightweight YOLOv8n, the mid-size EfficientDet-D1, and the transformer-based RT-DETR (ResNet-50 backbone) representing the DETR family. Each model was trained and evaluated under identical conditions of illumination, occlusion, and background complexity, and then integrated into the adaptive ensemble. The results reveal that while YOLOv8n and EfficientDet-D1 contribute complementary spatial and contextual features, the inclusion of RT-DETR considerably increased inference latency (average 63 ms per frame) without yielding proportional accuracy gains (mAP@0.5:0.95 ≈ 74.1%). Consequently, transformer-based detectors were excluded from the final optimized ensemble to preserve computational feasibility for embedded deployment. Overall, the ensemble maintains consistent accuracy improvements (+2–3% mAP@0.5:0.95) across all CNN-based architectures, confirming its architecture-agnostic robustness.
Point 2: Provide statistical rigor and uncertainty quantification. Run ≥3 seeds (or use image/scene-level bootstrapping) and report mean ± 95% CI for mAP@0.5:0.95 (and other primary metrics). Where improvements are claimed, apply appropriate significance tests (e.g., paired bootstrap over scenes/images or cluster-robust tests) and mark significance in tables/figures.
Response 2: We appreciate this valuable suggestion. To provide statistical rigor and uncertainty quantification, key experiments were repeated across five trials, and the result tables (Table 3 and the new Tables 4 and 5) now report mean values with standard deviations and 95% confidence intervals for the primary metrics (Precision, Recall, mAP). Where improvements over baselines are claimed, we describe the statistical testing methodology used to verify that the observed gains are not due to random variation.
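For illustration of the image-level paired bootstrap the reviewer suggests, the sketch below resamples images with replacement and estimates how often the ensemble's mean per-image score fails to beat the baseline's. The per-image AP values are invented, purely for illustration.

```python
# Minimal sketch of a one-sided paired bootstrap test; scores are invented.
import random

random.seed(0)
baseline = [0.71, 0.65, 0.80, 0.74, 0.69, 0.77, 0.72, 0.68]  # per-image AP, model A
ensemble = [0.75, 0.70, 0.81, 0.78, 0.74, 0.79, 0.75, 0.73]  # per-image AP, model B

n, B, wins = len(baseline), 10_000, 0
for _ in range(B):
    idx = [random.randrange(n) for _ in range(n)]  # paired resample of images
    diff = sum(ensemble[i] - baseline[i] for i in idx) / n
    wins += diff > 0
p_value = 1 - wins / B   # fraction of resamples where the ensemble is not better
print(f"one-sided bootstrap p ≈ {p_value:.4f}")
```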
All corrections in the text are highlighted in yellow.
Reviewer 4 Report
Comments and Suggestions for Authors
The manuscript addresses an important problem in precision horticulture—robust apple detection under diverse environmental conditions—using adaptive CNN ensembles. While the topic is timely and the dataset impressive, the paper requires substantial revisions to strengthen scientific rigor, clarity, and presentation before it can be considered for publication.
General Comments
The study demonstrates technical novelty in integrating eleven ensemble methods with Pareto-based multi-objective optimization. The breadth of experimental scenarios (rain, fog, night, etc.) is commendable. However, several critical issues in methodology description, statistical validation, and presentation need to be addressed to ensure reproducibility and to position the work clearly within current literature.
Major Points (detailed comments)
1. Clarity of Objectives and Hypotheses
The introduction describes the motivation for adaptive ensembles but does not clearly articulate specific research hypotheses or measurable objectives. Please add a concise statement of hypotheses (e.g., “Ensembling improves mAP by X% under Y conditions compared to single models”) to guide the reader.
2. Novelty vs. Prior Work
Although the paper surveys YOLO-based and ensemble methods, it remains unclear how the proposed system differs fundamentally from existing adaptive ensemble frameworks. Explicitly highlight unique algorithmic contributions beyond combining known techniques such as Soft-NMS or WBF.
3. Dataset Transparency and Availability
The dataset of ~62,000 annotated apples is a valuable contribution, yet details about public accessibility, licensing, and annotation quality control are missing. Clarify whether the dataset will be released and describe inter-annotator agreement or validation steps to ensure label accuracy.
4. Experimental Design and Statistical Rigor
Results rely on mAP and related metrics, but there is no discussion of statistical significance or confidence intervals. Provide variance measures, repeated-trial statistics, or appropriate tests (e.g., paired t-tests or bootstrapping) to demonstrate that improvements are not due to random variation.
5. Computational Cost and Practicality
A key limitation acknowledged by the authors is the high computational cost of ensemble inference (FPS < 0.06). Expand the analysis by quantifying hardware requirements, memory usage, and energy consumption. Provide guidance for practitioners on feasible deployment scenarios and how the proposed pre-deployment benchmarking mitigates these costs.
6. Benchmarking Methodology
The pre-deployment benchmarking is central to the claimed innovation but is described only conceptually. Provide a step-by-step algorithmic description or pseudo-code, including how a “representative frame” is selected and how the Pareto frontier is computed in practice.
7. Comparison with Strong Baselines
The manuscript compares different ensemble strategies but lacks a direct comparison with recent state-of-the-art single-model detectors (e.g., latest YOLO versions, transformer-based detectors) trained under the same conditions. Including these baselines is necessary to prove that ensemble gains are not merely due to outdated single-model baselines.
8. Ablation Studies
No ablation analysis quantifies the contribution of each component (e.g., adaptive weighting vs. fixed weighting, number of ensemble members). Provide experiments isolating these factors to justify the complexity of the final system.
9. Generalization Beyond Apples
The authors mention potential applicability to other crops but present no supporting evidence. Either restrict claims to apples or include preliminary experiments on a different fruit dataset to support generalization statements.
10. Figures and Tables Quality
Some figures (e.g., Figures 4 and 5) are difficult to interpret due to small fonts and unclear legends. Ensure all figures are high-resolution, with consistent color schemes and clear labels. Tables summarizing ensemble performance under weather conditions would benefit from highlighting best results and including standard deviations.
11. Writing and Language
While generally understandable, the manuscript contains long sentences and occasional grammatical issues that obscure meaning (e.g., “This approach significantly de-risks the deployment…” could be more concise). A thorough language edit by a native English speaker is recommended.
12. References and Citation Format
Several key references are cited only by number without full context, and some recent works on edge-device optimization and lightweight ensembles are missing. Update the bibliography to include the latest research (2023–2025) and ensure all in-text citations conform to Sustainability style.
13. Code and Reproducibility
The software is described in detail but there is no mention of open-source code or executable notebooks. Public release of code or at least detailed pseudo-code is essential for reproducibility.
14. Ethical and Data-Usage Considerations
Since images were collected from real orchards, specify permissions obtained from orchard owners and address any privacy concerns related to geolocation data.
15. Conclusion and Future Work
The conclusion section repeats results without critically synthesizing broader implications for sustainable agriculture. Strengthen this section by clearly outlining how the approach can reduce resource use, integrate with IoT sensor networks, or guide precision farming practices.
Minor Points
- Standardize acronyms at first mention (e.g., AIoT, FPS) and maintain consistency throughout.
- Provide units for all numerical values (e.g., frame rates in FPS, dataset sizes in GB).
- Ensure all mathematical symbols are properly formatted (several equations currently lack equation numbers or have inconsistent notation).
Summary
The paper presents promising research on adaptive CNN ensembles for orchard monitoring and contributes a rich dataset and software framework. Nevertheless, significant revisions are needed to clarify novelty, enhance experimental rigor, and improve readability. Addressing the points above will substantially strengthen the manuscript and its value to the agricultural computer-vision community.
Author Response
The authors would like to express their sincere gratitude to the reviewers for their insightful comments and constructive suggestions. The feedback has been invaluable in helping us to significantly improve the quality, clarity, and impact of our manuscript. We believe that addressing these points has resulted in a much stronger and more focused paper.
Point 1: Clarity of Objectives and Hypotheses. The introduction describes the motivation for adaptive ensembles but does not clearly articulate specific research hypotheses or measurable objectives. Please add a concise statement of hypotheses (e.g., “Ensembling improves mAP by X% under Y conditions compared to single models”) to guide the reader.
Response 1: We thank the reviewer for this suggestion. To provide a clear guide for the reader, we have added a dedicated subsection at the end of the Introduction that explicitly states three measurable research hypotheses.
Point 2: Novelty vs. Prior Work. Although the paper surveys YOLO-based and ensemble methods, it remains unclear how the proposed system differs fundamentally from existing adaptive ensemble frameworks. Explicitly highlight unique algorithmic contributions beyond combining known techniques such as Soft-NMS or WBF.
Response 2: We agree that the novelty should be explicitly highlighted. We have revised the "Identified Research Gap and Our Contribution" subsection to clearly distinguish our core algorithmic contribution—dynamic Pareto-based optimization—from static ensemble methods.
Point 3: Dataset Transparency and Availability. The dataset of ~62,000 annotated apples is a valuable contribution, yet details about public accessibility, licensing, and annotation quality control are missing. Clarify whether the dataset will be released and describe inter-annotator agreement or validation steps to ensure label accuracy.
Response 3: We clarify that the dataset was collected from experimental orchards within institutional research projects, with full permission from field owners. All images were annotated internally by trained experts and cross-validated by an agronomist, achieving high labeling consistency (Cohen’s κ = 0.91). Therefore, the dataset fully complies with ethical and institutional data use regulations.
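For context on the agreement statistic cited above, the following minimal sketch computes Cohen's kappa from two annotators' per-object labels; the label vectors are invented for illustration and do not reproduce the reported κ = 0.91.

```python
# Minimal sketch of Cohen's kappa; the two label vectors are invented.
from collections import Counter

a = ["apple", "apple", "none", "apple", "none", "apple"]  # annotator 1
b = ["apple", "apple", "none", "none",  "none", "apple"]  # annotator 2

n = len(a)
p_o = sum(x == y for x, y in zip(a, b)) / n                # observed agreement
ca, cb = Counter(a), Counter(b)
p_e = sum(ca[k] * cb[k] for k in set(a) | set(b)) / n**2   # chance agreement
kappa = (p_o - p_e) / (1 - p_e)
print(f"Cohen's kappa = {kappa:.2f}")                      # 0.67 for this toy data
```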
Point 4: Experimental Design and Statistical Rigor. Results rely on mAP and related metrics, but there is no discussion of statistical significance or confidence intervals. Provide variance measures, repeated-trial statistics, or appropriate tests (e.g., paired t-tests or bootstrapping) to demonstrate that improvements are not due to random variation.
Response 4: To address the need for statistical rigor, we have incorporated confidence intervals and standard deviations into our result tables and added a description of the statistical testing methodology.
Point 5: Computational Cost and Practicality. A key limitation acknowledged by the authors is the high computational cost of ensemble inference (FPS < 0.06). Expand the analysis by quantifying hardware requirements, memory usage, and energy consumption. Provide guidance for practitioners on feasible deployment scenarios and how the proposed pre-deployment benchmarking mitigates these costs.
Response 5: We have expanded the analysis of computational costs to provide a clear picture of hardware requirements and to explain how pre-deployment benchmarking makes the approach practical.
Point 6: Benchmarking Methodology. The pre-deployment benchmarking is central to the claimed innovation but is described only conceptually. Provide a step-by-step algorithmic description or pseudo-code, including how a “representative frame” is selected and how the Pareto frontier is computed in practice.
Response 6: We have added a clearer, step-by-step description of the pre-deployment benchmarking algorithm to Section 3.2.
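The step-by-step description itself appears in the revised Section 3.2; as a rough companion, the sketch below shows one way such a pre-deployment benchmark could be structured. The representative-frame heuristic (brightness closest to the median) and the run_inference/evaluate hooks are assumptions for illustration, not the paper's actual procedure; pareto_front is the helper sketched earlier in this document.

```python
# Hedged sketch of a pre-deployment benchmarking loop; heuristics and hooks
# are hypothetical stand-ins, not the manuscript's actual algorithm.
import numpy as np

def pick_representative_frame(frames):
    """Assumed heuristic: the frame whose brightness is nearest the median."""
    brightness = [f.mean() for f in frames]
    target = np.median(brightness)
    return frames[int(np.argmin([abs(v - target) for v in brightness]))]

def benchmark(frames, configurations, run_inference, evaluate):
    """Score every candidate configuration on one representative frame,
    then return the Pareto-optimal set over (accuracy, speed)."""
    frame = pick_representative_frame(frames)
    results = []
    for cfg in configurations:
        detections, latency_s = run_inference(cfg, frame)  # hypothetical hook
        results.append({"name": cfg["name"],
                        "map": evaluate(detections, frame),  # hypothetical hook
                        "fps": 1.0 / latency_s})
    return pareto_front(results)  # as defined in the earlier sketch
```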
Point 7: Comparison with Strong Baselines. The manuscript compares different ensemble strategies but lacks a direct comparison with recent state-of-the-art single-model detectors (e.g., latest YOLO versions, transformer-based detectors) trained under the same conditions. Including these baselines is necessary to prove that ensemble gains are not merely due to outdated single-model baselines.
Response 7: We have included comparisons with additional, recent state-of-the-art detectors to strengthen our baseline comparison.
Point 8: Ablation Studies. No ablation analysis quantifies the contribution of each component (e.g., adaptive weighting vs. fixed weighting, number of ensemble members). Provide experiments isolating these factors to justify the complexity of the final system.
Response 8: An ablation study has been conducted and added to the results to quantify the contribution of key framework components.
Point 9: Generalization Beyond Apples. The authors mention potential applicability to other crops but present no supporting evidence. Either restrict claims to apples or include preliminary experiments on a different fruit dataset to support generalization statements.
Response 9: We have moderated our claims and outlined a clear plan for future work to validate generalization, rather than making unsupported statements.
Point 10: Figures and Tables Quality. Some figures (e.g., Figures 4 and 5) are difficult to interpret due to small fonts and unclear legends. Ensure all figures are high-resolution, with consistent color schemes and clear labels. Tables summarizing ensemble performance under weather conditions would benefit from highlighting best results and including standard deviations.
Response 10: All figures (especially the UI figures, Figures 4 and 5) are high-resolution images (at least 1500 px) and have been attached as separate files. Table 3 and the new Tables 4 and 5 have been reformatted to highlight best-performing results and now include measures of variance (standard deviations/confidence intervals).
Point 11: Writing and Language. While generally understandable, the manuscript contains long sentences and occasional grammatical issues that obscure meaning (e.g., “This approach significantly de-risks the deployment…” could be more concise). A thorough language edit by a native English speaker is recommended.
Response 11: The entire manuscript has undergone thorough proofreading.
Point 12: References and Citation Format. Several key references are cited only by number without full context, and some recent works on edge-device optimization and lightweight ensembles are missing. Update the bibliography to include the latest research (2023–2025) and ensure all in-text citations conform to Sustainability style.
Response 12: The bibliography has been updated to include key recent works (2023–2025) on edge AI and lightweight models. All in-text citations have been checked for consistency with the journal's style guide. The Related Works section was also rewritten in accordance with the other reviewers' comments.
Point 13: Code and Reproducibility. The software is described in detail but there is no mention of open-source code or executable notebooks. Public release of code or at least detailed pseudo-code is essential for reproducibility.
Response 13: The dataset and supporting findings of this study are available within the article. The underlying software code is considered proprietary and represents the intellectual property of the authors and their affiliated institution. To support the reproducibility of our research, the code is available from the corresponding author upon reasonable request for academic and non-commercial research purposes, subject to a formal agreement.
Point 14: Ethical and Data-Usage Considerations. Since images were collected from real orchards, specify permissions obtained from orchard owners and address any privacy concerns related to geolocation data.
Response 14: This concern has been addressed in the text added for Comment 3, which explicitly states that consent was obtained from orchard owners and that institutional ethical guidelines were followed.
Point 15: Conclusion and Future Work. The conclusion section repeats results without critically synthesizing broader implications for sustainable agriculture. Strengthen this section by clearly outlining how the approach can reduce resource use, integrate with IoT sensor networks, or guide precision farming practices.
Response 15: The conclusion has been substantially revised to move beyond a summary of results and instead critically synthesize the broader implications for sustainable agriculture and outline a concrete path for future integration.
Minor Points
Minor Point 1: Standardize acronyms at first mention (e.g., AIoT, FPS) and maintain consistency throughout.
Minor Point 2: Provide units for all numerical values (e.g., frame rates in FPS, dataset sizes in GB).
Minor Point 3: Ensure all mathematical symbols are properly formatted (several equations currently lack equation numbers or have inconsistent notation).
Response to Minor Points 1–3: We have standardized all acronyms at their first mention (e.g., AIoT in the Introduction, FPS in Section 3.2) and maintained consistency throughout the manuscript. We have provided units for all numerical values (e.g., frame rates in FPS, memory in GB, power in W) in the text and tables. We have ensured that all mathematical symbols are properly formatted and that equation numbers are applied consistently (Equations 1–32).
We agree with the remaining points, but we believe they fall outside the scope of this research and may be addressed in future work.
We have thoroughly revised the manuscript to improve the clarity, precision, and academic tone of the English.
All corrections in the text are highlighted in purple.
Round 2
Reviewer 1 Report
Comments and Suggestions for Authors
Good job. The authors addressed all my questions and revised the manuscript according to my comments. I have no further questions or suggestions.
Reviewer 2 Report
Comments and Suggestions for Authors
All proposed revisions have been made or responded to, and this article can be accepted.
Reviewer 3 Report
Comments and Suggestions for Authors
I confirm that the revised manuscript has satisfactorily addressed the majority of my previous comments. Accordingly, I have no further remarks and I am pleased to recommend acceptance of the manuscript.
