Detection of Apple Trees in Orchard Using Monocular Camera
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
1. The paper provides some introduction to both SSD and YOLO models, but lacks intuitive architectural diagrams of the models. It is recommended to supplement the paper with model structure diagrams or flowcharts, highlighting the advantages of YOLO in feature extraction, to help readers understand how differences in network design affect detection performance. Additionally, the paper lacks a workflow diagram, and it is suggested to add one to summarize the author's work process for the readers.
2. The experimental data includes only 213 images, which is too small a dataset to demonstrate generalization capability. Moreover, the experiment lacks consideration of external influencing factors such as lighting, seasonal changes, and weather variations, which could introduce errors and reduce the robustness of the findings.
3. The overall depth of innovation in the paper is somewhat lacking, particularly in the fourth section, where the main content focuses on parameter tuning of the YOLO model and derives the optimal parameter combination through comparative experiments. However, this experimentation essentially constitutes a parameter tuning process for the YOLO model and lacks sufficient innovation, failing to fully reflect the core contribution of the paper. It is suggested to optimize or improve key modules of the YOLO model to better align with the specific needs of the research, thereby enhancing the paper's innovativeness and academic value.
4. The current comparison only involves YOLOv4 and SSD. It is recommended to include performance comparisons with the latest models (such as YOLOv8, DETR, or deeplabv3+) to more comprehensively evaluate the advancement of the method. Additionally, the trade-offs (accuracy vs. speed) of lightweight models (such as YOLO Nano) in orchard scenarios could be explored. The paper briefly mentions semantic segmentation models, stating only that they are more troublesome to annotate and train. It is hoped that a quantitative comparison between the target detection model and the semantic segmentation model can also be conducted.
5. The conclusion section needs to supplement discussions on the limitations of the model (such as adaptability to sudden rainfall events, complexity of parameter tuning) and future improvement directions (such as integrating multi-source meteorological data, exploring lightweight deployment). Additionally, potential extended applications of the model in agricultural irrigation or disaster warning could be proposed to enhance its practical value.
Comments on the Quality of English Language
It should be improved.
Author Response
We would like to thank the reviewer for their thoughtful feedback. Responses for each of the comments are given below.
- comment 1: The paper provides some introduction to both SSD and YOLO models, but lacks intuitive architectural diagrams of the models. It is recommended to supplement the paper with model structure diagrams or flowcharts, highlighting the advantages of YOLO in feature extraction, to help readers understand how differences in network design affect detection performance. Additionally, the paper lacks a workflow diagram, and it is suggested to add one to summarize the author's work process for the readers.
- response 1: We have added a flowchart that provides an overview of the method, along with a corresponding explanation in the text. As architectural diagrams of each of the models can be found in the corresponding papers, we did not see the added value in replicating the figures of other authors.
An overview of the experiments performed in this paper is given in figure~\ref{fig:flowchart}. First, we collected data at an orchard dedicated to agricultural research, as explained in section~\ref{sect:data}. Then, orchard trees were annotated using LabelImg and saved to the Pascal VOC format, as explained in section~\ref{sect:ann}. For data loaded into YOLO, the annotation data was additionally converted to the YOLO format. Model hyperparameters for SSD and YOLO, which are explained in more detail in section~\ref{sect:method}, were varied as explained in section~\ref{sect:train} to optimize the model via transfer learning. Finally, object detection results were evaluated as described in section~\ref{sect:eval}.
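For readers unfamiliar with the annotation-format conversion mentioned above, a minimal sketch of a Pascal VOC to YOLO conversion is given below. This is an illustrative example only; the file name, directory layout, and single-class list are hypothetical and are not taken from the manuscript.

```python
import xml.etree.ElementTree as ET

def voc_to_yolo(xml_path, class_names):
    """Convert one Pascal VOC annotation file to YOLO-format label lines."""
    root = ET.parse(xml_path).getroot()
    img_w = float(root.find("size/width").text)
    img_h = float(root.find("size/height").text)
    lines = []
    for obj in root.findall("object"):
        cls_id = class_names.index(obj.find("name").text)
        box = obj.find("bndbox")
        xmin = float(box.find("xmin").text)
        ymin = float(box.find("ymin").text)
        xmax = float(box.find("xmax").text)
        ymax = float(box.find("ymax").text)
        # YOLO format: class x_center y_center width height, normalized to [0, 1]
        xc = (xmin + xmax) / 2.0 / img_w
        yc = (ymin + ymax) / 2.0 / img_h
        bw = (xmax - xmin) / img_w
        bh = (ymax - ymin) / img_h
        lines.append(f"{cls_id} {xc:.6f} {yc:.6f} {bw:.6f} {bh:.6f}")
    return lines

# Hypothetical usage for a single-class (tree) dataset
print("\n".join(voc_to_yolo("annotations/tree_001.xml", ["tree"])))
```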
- comment 2: The experimental data includes only 213 images, which is too small a dataset to demonstrate generalization capability. Moreover, the experiment lacks consideration of external influencing factors such as lighting, seasonal changes, and weather variations, which could introduce errors and reduce the robustness of the findings.
- response 2: While we acknowledge the importance of a diverse dataset, the site of the field experiments is unfortunately known for being covered by thick clouds for a large percentage of the year, so building a dataset that explicitly takes into account the variation in sunlight would require far more extensive planning, potentially over the span of months or more. Additionally, capturing all potential variations in weather patterns would require a specialized waterproof system that is also able to operate in sub-zero environments, which is not viable with the current experimental setup.
- comment 3: The overall depth of innovation in the paper is somewhat lacking, particularly in the fourth section, where the main content focuses on parameter tuning of the YOLO model and derives the optimal parameter combination through comparative experiments. However, this experimentation essentially constitutes a parameter tuning process for the YOLO model and lacks sufficient innovation, failing to fully reflect the core contribution of the paper. It is suggested to optimize or improve key modules of the YOLO model to better align with the specific needs of the research, thereby enhancing the paper's innovativeness and academic value.
- response 3: The goal of our research was to achieve good object detection performance on trees: complex, irregular objects that are difficult to distinguish from one another when arranged in rows, as in the environment found in an orchard. These properties are not well captured by the common objects found in typical benchmark datasets, so originality can be found in the types of objects we are detecting. Since this could be achieved via transfer learning, we did not find it necessary to alter the core structure of the YOLO model.
- comment 4: The current comparison only involves YOLOv4 and SSD. It is recommended to include performance comparisons with the latest models (such as YOLOv8, DETR, or deeplabv3+) to more comprehensively evaluate the advancement of the method. Additionally, the trade-offs (accuracy vs. speed) of lightweight models (such as YOLO Nano) in orchard scenarios could be explored. The paper briefly mentions semantic segmentation models, stating only that they are more troublesome to annotate and train. It is hoped that a quantitative comparison between the target detection model and the semantic segmentation model can also be conducted.
- response 4: We have added some additional explanation as to why segmentation models were not analyzed in this study. Basically, the models listed above assume that a single instance of a segmented object has a continuous set of pixels with the same label for that object, which is not observed in our dataset for trees. As for the additional models proposed, DeepLabv3+ is several years older than YOLOv4, and while DETR was proposed at the same time as YOLOv4, it achieves basically the same accuracy on benchmark datasets with less than half the FPS, so we have not examined these models in this study.
For this dataset in particular, it is difficult to determine which leaves belong to which trees, so annotating segmentation masks by hand would likely lead to many errors in the ground truth data. In addition, since the narrower branches of the trees are not always clearly shown in all images, attempts at instance segmentation would lead to inference results that lack large parts of trees due to an apparent lack of pixel continuity between these parts. Thus, we have chosen to focus on object detection models in this study.
- comment 5: The conclusion section needs to supplement discussions on the limitations of the model (such as adaptability to sudden rainfall events, complexity of parameter tuning) and future improvement directions (such as integrating multi-source meteorological data, exploring lightweight deployment). Additionally, potential extended applications of the model in agricultural irrigation or disaster warning could be proposed to enhance its practical value.
- response 5: Comments on the limitations of this study have been added in two parts in the discussion as follows.
Limitations of this study include a small dataset that may not reflect the wide variety of meteorological conditions that orchards experience, such as snowfall, rain, and more extreme events. In order to incorporate these changes into our model, long-term observational studies of the orchard environment using systems that can record in extreme environments should be implemented. Development of a diverse dataset could also be assisted by the use of style transfer methods that use generative deep learning models, such as diffusion models.
Thus, an object detection algorithm that projects two-dimensional predictions onto a three-dimensional space could be proposed. Then, by combining visual orchard data with other types of IoT devices such as temperature and humidity sensors, an automated multi-modal approach to orchard reconstruction that reduces the workload of farmers can be imagined. By incorporating deep learning methods into this digital twin, extensions such as the optimization of agricultural operations such as irrigation, as well as the detection of rare events such as natural disasters, could also be implemented.
- comment 6: Comments on the Quality of English Language: It should be improved.
- response 6: The primary author of this manuscript is a native English speaker with substantial experience proofreading journal articles for both collaborators and other researchers. In addition, they have received compensation for this service on par with market value on several occasions. Therefore, we believe that claims that this manuscript does not have sufficient English proficiency require more specific evidence than simply "It should be improved."
Reviewer 2 Report
Comments and Suggestions for Authors
Scientific Review Report
Title:
Detection of Apple Trees in an Orchard Using a Monocular Camera
Authors:
Stephanie Nix, Airi Sato, Hirokazu Madokoro, Satoshi Yamamoto, Yo Nishimura, Kazuhito Sato
1. Brief Description
The paper describes a deep learning approach for the detection of apple trees in an orchard as one of the first steps in developing agricultural digital twins. The paper compares two state-of-the-art object detection algorithms, namely the Single Shot MultiBox Detector (SSD) and You Only Look Once (YOLOv4), on a custom orchard dataset. The results show that YOLOv4 outperforms SSD by a large margin in terms of mean Average Precision (mAP), reflecting its capability for more complex feature extraction. This work adds value to smart farming by addressing issues associated with automated orchard monitoring and providing a vision for integrating machine vision into farming.
Main Contributions and Strengths:
- Topic well aligned with smart farming and automation in general.
- Comparative research on two of the most widely known object detection models.
- Application implications for digital twins in agriculture.
- Rather extensive experimental analysis, including variation of parameters such as batch sizes, frozen layers, etc.
2. General Concept Comments
Novelty and Scope:
The article addresses a significant lacuna, as it targets the detection of entire apple trees rather than individual fruits; such studies are far fewer in number than fruit detection studies. The research falls within the domain of smart farming and farm automation and should therefore discuss in detail what it offers for real-world applications, e.g., scalability, cost-effectiveness, and compatibility with existing farm management systems.
Hypothesis and Objectives:
There is no clear hypothesis stated in this manuscript. A clearly stated hypothesis could be included to enhance its scientific weight.
Methodological Inaccuracies, Statistical Analysis and Dataset:
In terms of methodology, the strengths of this work are reflected in the transparent experimental design and detailed training conditions: optimizer, batch sizes, frozen layers. The mAP is employed as an assessment measure, which is sufficient to compare the performance of the detection models. However, weaknesses are evident when it comes to the size of the dataset (relatively small, only 213 images), which may limit the generalizability of the findings. While the authors acknowledged this, a more concrete discussion of how that weakness will be overcome by future research would be desirable. Furthermore, the choice to work with a monocular camera alone, and the resulting compromises in detection accuracy and viability, merit further discussion. Statistical tests, i.e., significance tests or confidence intervals, comparing SSD and YOLO are not present; their inclusion would have rendered the results much stronger.
Scope and Relevance:
The "Related Works" chapter provides a good summary overview of digital twins and smart farming. The literature review, however, misses some recent research from the past five years on deep learning for agriculture. It would be helpful to summarize research on hybrid object detection models or multi-sensor data-based research to strengthen the context. The literature cited is full of theoretical standpoints related to digital twins but scarce in references to real-world implementations or case studies where digital twins have been used in orchards or similar agro-ecosystems. In addition, if I am not mistaken, according to the Instructions for Authors this chapter should be part of the "Introduction." According to the Instructions for Authors of the journal Agriculture, the sections of a research paper should consist of: "Introduction, Materials and Methods, Results, Discussion, Conclusions (optional)".
References:
Furthermore, when citing references, both in the article itself and in the References section, the authors did not fully follow the Instructions for Authors. Please correct this. Additionally, there is no evidence of excessive self-citation.
3. Specific Comments
Introduction:
Lines 10-21: The introduction effectively frames the problem but could benefit from a more explicit research question or hypothesis.
Lines 22-30: While the global role of apple farming is strongly argued, a quantitative comparison with other high-value crops in smart farming applications would render the argument stronger.
Related Work:
Lines 44-78: The agricultural digital twins section is informative; however, there are no specific examples regarding successful adoptions. References could be added for an overview on digital twins in agriculture to strengthen the section.
Lines 79-97: The LiDAR vs. RGB-based approach - the comparison is relevant but brief. A table presenting some advantages and disadvantages for both methods would be welcome.
Materials and Methods:
Lines 99-108: The dataset is well described, but very limited images, 213 in number, is a cause for concern regarding model generalization. There is a need for discussion on how such limitation would affect model training and performance.
Lines 120-159: The descriptions of the SSD and YOLO architectures are good but a bit technical for the general reader. Accessibility could be improved by the inclusion of additional simplified diagrams.
Lines 193-206: It's okay to use mAP as a sole metric; this could be complemented by other metrics, such as F1-score or precision-recall (PR) curves, which would enable a more fine-grained analysis.
Results and Discussion:
Lines 236-241: Although YOLO outperforms others in the results, this discussion does not go further into the practical meaning of such results, e.g., requirements concerning computation and possible challenges on deployment.
Lines 262-270: The identification of dataset size as a limitation is accurate but would be reinforced by specific suggestions for dataset expansion (e.g., inclusion of images under varying environmental conditions).
Lines 271-276: The discussion of future directions is forward-looking but needs specificity. For example, how would incorporating Elman feedback loops or neural radiance fields specifically enhance detection accuracy?
Figures and Tables:
Lines 208-230: The figures, Figures 3-5, are useful, but their readability could be improved by increasing the size of the labels and normalizing colour schemes.
References:
References are relevant but somewhat scanty and include relatively few recent publications. The paper would benefit from a review of the literature for 2022-2025 to increase its currency. Finally, regarding referencing: in-text and References listing, authors have followed the Instructions for Authors only partially. This needs rectification. There doesn't appear to be excessive self-citation.
4. Evaluation Criteria
1. Clarity and Relevance: The paper is clear and relevant to the field but fails to clearly state the practical implications and limitations of the paper.
2. Citations: The citations are sufficient but need to include more recent sources, especially in "Related Work."
3. Experimental Design: The experimental design is sound but is constrained by the limited dataset. Recommend conducting statistical tests to add further credibility to the results.
4. Figures and Tables: Generally well-organized but need more readability and consistency. More material should be incorporated, such as annotated examples of the dataset.
5. Discussion & Conclusions: Adequate evidence is provided but should be extended to cover broader implications for agricultural practice.
6. English Language: The language is appropriate but needs some minor proofreading for grammatical correctness.
5. Recommendations
Major:
1. Increase the dataset size and conduct more validation experiments under different conditions.
2. Add statistical tests to make the comparison between YOLO and SSD more robust.
3. Update the references and examples in the "Introduction (Related Work)" section.
Minor:
1. Improve readability for figures and tables.
2. Clearly mention the research question/hypothesis in the introduction and explain it in the Discussion section, referring to the results obtained from your research.
3. Talk about the scalability and feasibility of implementing the proposed solution.
Final Recommendation: Major Revisions
This contribution is novel and valid in the context of digital twin and smart farming applications. While scientifically correct, its impact and readability will be significantly improved by the inclusion of the limitations and more context.
Author Response
We would like to thank the reviewer for their thoughtful feedback. Responses for each of the comments are given below.
- comment 1: Lines 10-21: The introduction effectively frames the problem but could benefit from a more explicit research question or hypothesis.
- response 1: We have included the following additional information about our research question in the introduction.
While much research has been performed on the detection of objects with easily recognizable characteristics, such as fruits and their flowers, other objects such as trees prove more challenging. In particular, trees captured in images have several characteristics that make detection challenging, such as a lack of continuity among pixels belonging to the same tree, which leads to difficulty in distinguishing trees partially occluded by other trees. Accurately detecting the number and position of orchard trees is important for extracting the state of the orchard, detecting rare events that impact its health such as extreme weather, harvest management, and other applications.
- comment 2: Lines 22-30: While the global role of apple farming is strongly argued, a quantitative comparison with other high-value crops in smart farming applications would render the argument stronger.
- response 2: We are unsure as to what kind of quantitative argument is being requested, but as the issues discussed around apple farming apply to many types of crops, we have added the following generalization to the end of the specified paragraph.
As physically demanding operations and a large body of necessary domain-specific knowledge are issues across a large swath of agriculture, such smart farming systems would also have a large demand among farmers of many high-value crops.
- comment 3: Lines 79-97: The LiDAR vs. RGB-based approach - the comparison is relevant but brief. A table presenting some advantages and disadvantages for both methods would be welcome.
- response 3: We have added an additional table on page 3 that gives some advantages and disadvantages.
- comment 4: Lines 99-108: The dataset is well described, but very limited images, 213 in number, is a cause for concern regarding model generalization. There is a need for discussion on how such limitation would affect model training and performance.
- response 4: Strategies for improving the dataset are explained in more detail in the response to comment 8.
- comment 5: Lines 120-159: The description of the SSD and YOLO architectures are good but a bit technical for the general reader. Access could be improved by the inclusion of additional simplified diagrams.
- response 5: As architectural diagrams of each of the models can be found in the corresponding papers, we did not see the added value in replicating the figures of other authors.
- comment 6: Lines 193-206: It's okay to use mAP as a sole metric; this could be complemented by other metrics, such as F1-score or precision-recall (PR) curves, which would enable a more fine-grained analysis.
- response 6: In evaluating the models in this paper, we have followed the primary usage of AP of leading papers in object detection, such as those introducing the SSD and YOLO v4 models. It is unclear how much additional value would be obtained by appending metrics that use the same precision and recall that are incorporated in the calculation of mAP.
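To make this point concrete, the sketch below shows that the F1-score is simply the harmonic mean of the same precision and recall values that already underlie a (simplified, non-interpolated) AP computation; the operating points are hypothetical and are not results from the manuscript.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall at a single operating point."""
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)

def average_precision(recalls, precisions):
    """Simplified, non-interpolated AP: area under the PR curve (recalls ascending)."""
    ap, prev_r = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - prev_r) * p
        prev_r = r
    return ap

# Hypothetical PR-curve operating points for one class
recalls = [0.2, 0.4, 0.6, 0.8]
precisions = [0.95, 0.90, 0.80, 0.65]
print("AP ~", round(average_precision(recalls, precisions), 3))
print("F1 at the final point ~", round(f1_score(precisions[-1], recalls[-1]), 3))
```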
- comment 7: Lines 236-241: Although YOLO outperforms others in the results, this discussion does not go further into the practical meaning of such results, e.g., requirements concerning computation and possible challenges on deployment.
- response 7: YOLO is a popular model that already has reports of deployment on devices as simple as the Raspberry Pi 4, including the official TensorFlow GitHub repository (https://github.com/tensorflow/examples/tree/master/lite/examples/object_detection/raspberry_pi), so we do not anticipate difficulties that rise to the level of independent investigation.
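As a rough illustration of how lightweight such an edge deployment can be, a minimal sketch using the TensorFlow Lite Python interpreter is shown below; the model file name is hypothetical, and the output tensor layout depends on how the detection model was exported.

```python
import numpy as np
from tflite_runtime.interpreter import Interpreter  # pip install tflite-runtime

# Hypothetical quantized detection model exported for edge devices
interpreter = Interpreter(model_path="tree_detector.tflite")
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Dummy frame with the shape and dtype the model expects (e.g. 1 x H x W x 3)
_, h, w, _ = input_details[0]["shape"]
frame = np.zeros((1, h, w, 3), dtype=input_details[0]["dtype"])

interpreter.set_tensor(input_details[0]["index"], frame)
interpreter.invoke()

# Inspect the output tensors; their meaning (boxes, classes, scores) is model-specific
for out in output_details:
    print(out["name"], interpreter.get_tensor(out["index"]).shape)
```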
- comment 8: Lines 262-270: The identification of dataset size as a limitation is accurate but would be reinforced by specific suggestions for dataset expansion (e.g., inclusion of images under varying environmental conditions).
- response 8: We have added the following description of the limitations of this dataset in the discussion.
Limitations of this study include a small dataset that may not reflect the wide variety of meteorological conditions that orchards experience, such as snowfall, rain, and more extreme events. In order to incorporate these changes into our model, long-term observational studies of the orchard environment using systems that can record in extreme environments should be implemented. Development of a diverse dataset could also be assisted by the use of style transfer methods that use generative deep learning models, such as diffusion models.
- comment 9: Lines 271-276: The discussion of future directions is forward-looking but needs specificity. For example, how would incorporating Elman feedback loops or neural radiance fields specifically enhance detection accuracy?
- response 9: While previous versions of this paper referenced Elman feedback loops, a search of the submitted paper did not find any mention of the Elman feedback loops raised in this comment. The reference to neural radiance fields does not directly discuss detection accuracy, but rather other image-based methods that can be used in combination with object detection methods to produce a more comprehensive orchard management system.
- comment 10: Lines 208-230: The figures, Figures 3-5, are useful, but their readability could be improved by increasing the size of the labels and normalizing colour schemes.
- response 10: Figures 3-5 are produced automatically via the detection modules of each separate model. While concerns about the readability of the labels are understood, modification of the repositories of separate research groups to standardize the detection output visualizations is outside the scope of this study; rather, the different representations emphasize the fact that the results of different models are being shown.
- comment 11: References are relevant but somewhat scanty and include relatively few recent publications. The paper would benefit from a review of the literature for 2022-2025 to increase its currency. Finally, regarding referencing: in-text and References listing, authors have followed the Instructions for Authors only partially. This needs rectification. There doesn't appear to be excessive self-citation.
- response 11: Out of the references listed in the references section, 30% date from within the 2022-2025 range requested. The references section was produced automatically by pdflatex/bibtex using the MDPI latex template and a references list managed by Zotero, as recommended in the Instructions for Authors, so without specific examples, this does not seem to be a problem that the authors can easily address. Reference numbers have been added to references that were only referenced using a \citeauthor{} command.
Round 2
Reviewer 1 Report
Comments and Suggestions for Authors
Main concerns have been addressed. Some minor points remain: some related works should be added.
1. Winding pathway understanding based on angle projections in a field environment. 2023.
2. Bending Path Understanding Based on Angle Projections in Field Environments. 2024.
3. Muddy irrigation ditch understanding for agriculture environmental monitoring. 2024.
4. Vanishing point estimation inspired by oblique effect in a field environment. 2024.
Reviewer 2 Report
Comments and Suggestions for Authors
Thank you for your thorough and thoughtful revisions in response to the provided feedback. I appreciate the effort you have put into addressing each comment in detail and enhancing the clarity and comprehensiveness of the manuscript.
After reviewing your responses and the corresponding modifications, I find that the paper has been suitably improved and that the concerns raised have been adequately addressed. The revisions effectively strengthen the manuscript, and no further changes are necessary.
I support the publication of the paper in its current form.