Tomato Recognition and Localization Method Based on Improved YOLOv5n-seg Model and Binocular Stereo Vision
Round 1
Reviewer 1 Report
Dear Author,
This manuscript is original and innovative. However, the aim of the manuscript was not stated briefly in the abstract and introduction sections. The authors have emphasized the importance and justification of the work with references. The model and software used in the study are described precisely in the Materials and Methods section. The results are discussed in detail with relevant references, and the conclusions are adequately emphasized. The authors' own articles were not used as references in this manuscript. In summary, this manuscript is an original, innovative, and recent study in my opinion. However, in which environment was this application carried out? On tomatoes grown in a greenhouse, or on tomatoes grown outdoors? This should be specified in the Materials section of the study. This manuscript is of a quality suitable for publication in this journal.
Comments for author File: Comments.pdf
Author Response
Dear Editor and Reviewers,
We would like to express our sincere gratitude again to the Editor and Reviewers for the constructive and positive comments concerning our manuscript entitled “Tomato Recognition and Localization Method Based on Improved YOLOv5n-seg Model and Binocular Stereo Vision” (ID: 2572617). These comments are valuable for revising the paper and provide important guidance for our research. We have studied the comments carefully and made the corresponding corrections. The revised portions are highlighted in red in the revised manuscript.
The main corrections in the paper and the responses to the reviewer’s comments are as follows:
Reviewer 1:
- In which environment was this application carried out? On tomatoes grown in a greenhouse, or on tomatoes grown outdoors? This should be specified in the Materials section of the study.
Response: Thank you for your valuable comment. The environment for this application is tomatoes grown in a greenhouse, which we have now stated in the Research Materials section, as shown in lines 131-132. To further highlight this environment, we have added lines 151-152 and 156-159. To emphasize it more clearly in the abstract, we have revised the abstract; the modifications are marked in red in the revised manuscript. To emphasize it more clearly in the introduction, we have also revised the introduction to remove redundant statements in lines 112-117.
Reviewer 2 Report
This paper introduces an architecture based on YOLO-tomato for tomato recognition and localization. Here are my suggestions for the authors.
1. The name assigned to the model, YOLO-Tomato, collides with the one proposed by the following relevant citation, which should be included: https://doi.org/10.3390/s20072145. Please underline the difference to avoid misunderstanding from both general and expert audiences.
2. The Introduction proposes a brief state-of-the-art, which does not account for several novel applications of YOLO to closely related fields. Please find enclosed the following relevant citations to be included to improve this section.
a. https://doi.org/10.1016/j.biosystemseng.2021.08.015
b. https://doi.org/10.3390/agronomy12020319
c. https://doi.org/10.1016/j.compag.2023.107757
3. Section 2.1, rows 110 – 113: the statement is too generic. Please provide accurate information about the different degrees of tomato plant foliage occlusion, fruit overlapping, and lighting conditions. This can help with reproducibility.
4. Section 2.1: Did the train/test split account for the number of labels in each image? In other words, did the authors also consider the number of labels during splitting, ensuring the 8:2 ratio?
5. Did the authors consider other mechanisms for gathering the three–dimensional point cloud (e.g., structured light, line laser)?
6. From the specifics (the Intel i9-12900H CPU, mainly), it seems the authors used laptop-based hardware for the experiments. However, one of the main aims of such a work should be testing the feasibility of the proposed approach directly on the field, as stated in the introduction, where the authors aimed to propose a lightweight tomato recognition model. Hence, I suggest extending the experiment using a System-on-a-Chip, such as an NVIDIA Jetson, to evaluate the feasibility of the model directly on the field.
7. Table 3 provides the AP for the four compared models. Since all the values are very similar, I suggest the authors compare the results at the second decimal digit.
While the work appears promising, several problems should be addressed before publication. Hence, I recommend a major revision.
Author Response
Dear Editor and Reviewers,
Thank you again for the constructive and positive comments concerning our manuscript (ID: 2572617). As before, we have studied the comments carefully and made the corresponding corrections, and the revised portions are highlighted in red in the revised manuscript.
The main corrections in the paper and the responses to the reviewer’s comments are as follows:
- The name assigned to the model, YOLO-Tomato, collides with the one proposed by the following relevant citation, which should be included: https://doi.org/10.3390/s20072145. Please underline the difference to avoid misunderstanding from both general and expert audiences.
Response: Thank you for your valuable comments. We have changed the original name assigned to the model, YOLO-Tomato, to YOLO-TomatoSeg to avoid conflicts with other papers and to ensure the uniqueness of the revised name.
- The Introduction proposes a brief state-of-the-art, which does not account for several novel applications of YOLO to closely related fields. Please find enclosed the following relevant citations to be included to improve this section.
- https://doi.org/10.1016/j.biosystemseng.2021.08.015
- https://doi.org/10.3390/agronomy12020319
- https://doi.org/10.1016/j.compag.2023.107757
Response: Thank you for your valuable suggestions. Following your request, we have improved this section by adding the relevant references in the introduction, e.g., in lines 57-66 and 69-70. At the same time, aware of the shortcomings of the introduction, we have added further references concerning the high complexity of current instance segmentation models, as shown in lines 78-86, so that the introduction leads more logically into our research objectives.
- Section 2.1, rows 110 – 113: the statement is too generic. Please provide accurate information about the different degrees of tomato plant foliage occlusion, fruit overlapping, and lighting conditions. This can help with reproducibility.
Response: Thank you for your valuable suggestions. Following your request, we have provided accurate information about the different degrees of tomato plant foliage occlusion, fruit overlapping, and lighting conditions, as shown in lines 139-145.
- Section 2.1: Did the train/test split account for the number of labels in each image? In other words, did the authors also consider the number of labels during splitting, ensuring the 8:2 ratio?
Response: Thank you for raising this question. Since the dataset we constructed contains only sample labels of the single category of ripe tomato, and was divided completely at random into training and test sets at a ratio of 8:2, the ratio of the number of sample labels in the training set to that in the test set should, in theory, also be close to 8:2. In fact, according to our statistics, the training set contains 3765 labels and the test set contains 1026 labels, a ratio that is indeed close to 8:2.
- Did the authors consider other mechanisms for gathering the three–dimensional point cloud (e.g., structured light, line laser)?
Response: Thank you for raising this important question. Our current work utilizes binocular cameras to gather the 3D point cloud data. We agree that exploring other mechanisms like structured light and line laser could provide useful comparative analysis. For this initial study, we chose the binocular camera because it costs less than structured light and line laser, does not require additional energy to work, and can not only obtain point cloud data but also take visible light images for tomato identification. However, we recognize that each approach has its own advantages and limitations. In future work, we will explore experiments combining visible light cameras with depth sensors to improve the accuracy of fruit recognition and positioning. We thank you for highlighting this line of inquiry worthy of further investigation, which will greatly enhance the thoroughness of our research.
- From the specifics (the Intel i9-12900H CPU, mainly), it seems the authors used laptop-based hardware for the experiments. However, one of the main aims of such a work should be testing the feasibility of the proposed approach directly on the field, as stated in the introduction, where the authors aimed to propose a lightweight tomato recognition model. Hence, I suggest extending the experiment using a System-on-a-Chip, such as an NVIDIA Jetson, to evaluate the feasibility of the model directly on the field.
Response: Thank you for your valuable suggestion. As you pointed out, evaluating the model's feasibility in the field on a System-on-a-Chip would indeed be more direct. However, due to the current lack of the relevant devices and our unfamiliarity with model migration, we are unable to conduct the extended experiments within a short period of time; we will carry out research on model deployment and application in the future. Nevertheless, compared with the original model and other models, our model has been significantly reduced in size, parameters, and FLOPs, indicating that the lightweight goal has been achieved. Meanwhile, as described in References 1-4 below, the smaller the model size and the fewer the parameters and FLOPs, the lighter the model, which is more conducive to deployment and application on terminal devices. In Reference 5, a model of 33.7 MB in size (ours is 2.52 MB) detects zanthoxylum images at 3024×3024 resolution (ours is 1024×1024) at 33.23 FPS on an NVIDIA Jetson TX2, which indirectly demonstrates the feasibility of our model in the field.
1. https://doi.org/10.3389/fpls.2023.1166296
2. https://doi.org/10.1371/journal.pone.0282297
3. https://doi.org/10.3390/rs13091619
4. https://doi.org/10.1016/j.compag.2022.107534
5. https://doi.org/10.3390/s22020682
- Table 3 provides the AP for the four compared models. Since all the values are very similar, I suggest the authors compare the results at the second decimal digit.
Response: Thank you for your valuable suggestions. Following your request, we now report the AP values in Table 3 to two decimal places for comparison. Meanwhile, for consistency, we have also reported all AP values in Tables 2 and 4 and in the manuscript text to two decimal places.
Round 2
Reviewer 2 Report
The authors greatly improved the quality of the manuscript, which can now be considered for publication.