Article
Peer-Review Record

Realtime Picking Point Decision Algorithm of Trellis Grape for High-Speed Robotic Cut-and-Catch Harvesting

Agronomy 2023, 13(6), 1618; https://doi.org/10.3390/agronomy13061618
by Zhujie Xu 1, Jizhan Liu 1,2,3,*, Jie Wang 1, Lianjiang Cai 1, Yucheng Jin 1,2,3, Shengyi Zhao 1,2,3 and Binbin Xie 1
Reviewer 1: Anonymous
Reviewer 2:
Reviewer 3: Anonymous
Reviewer 4:
Submission received: 8 May 2023 / Revised: 3 June 2023 / Accepted: 13 June 2023 / Published: 15 June 2023
(This article belongs to the Special Issue AI, Sensors and Robotics for Smart Agriculture)

Round 1

Reviewer 1 Report

The article presented a real-time picking point decision algorithm for a novel high-speed cut-and-catch grape harvesting robot. The topic of the article is interesting and has practical significance. Minor revisions are recommended according to the following comments:

Question 1: The abstract should be polished to clearly highlight the novelties of the work and the main results.

Question 2: Some up-to-date literature should be added in the introduction.

Question 3: As shown in Table 2, the distinction between picking efficiency and picking speed should be clarified. It is not clear whether these indicators are measured while the robot chassis is stationary or moving.

Question 4: The details in Figure 2 are not clearly shown and are inconsistent with the other robotic models in the article.

Question 5: The quality of the figures should be improved. I suggest using higher-resolution images or, better, images in vector format.

Question 6: Check each reference format carefully.

Author Response

Dear reviewer, thank you for your guidance!

Our article, "Realtime Picking Point Decision Algorithm of Trellis Grape for High-Speed Robotic Cut-and-Catch Harvesting", has been revised. We greatly appreciate your professional review of our article. The changes have been highlighted in the text of the manuscript, and the modifications are described and explained in detail below where necessary:

Comments 1: The abstract should be polished to clearly highlight the novelties of the work and the main results.

Response: We sincerely thank the reviewer for careful reading. The abstract has been refined and now summarizes the novelties and main results of the work.

Comments 2: Some up-to-date literature should be added in the introduction.

Response: Some up-to-date literature has been added to the introduction.

Comments 3: As shown in Table 2, the distinction between picking efficiency and picking speed should be clarified. It is not clear whether these indicators are measured while the robot chassis is stationary or moving.

Response: Picking efficiency refers to the ratio between the number of picking tasks completed per unit time and the total task amount. Picking speed refers to the time required to complete a single picking task. The picking efficiency index in Table 3 has been replaced with the picking speed index. The robot adopts a stop-and-go harvesting method: when the target grapes are within the robot's three-dimensional region of interest, the robot stops walking, harvests, and records the relevant indexes at that moment.

The relevant textual narrative has been added to Section 3.4 and highlighted.
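For concreteness, here is a minimal Python sketch of how the two indexes relate under this stop-and-go scheme; the cycle times and task counts are hypothetical, and this is not the code used on the prototype:

```python
# Hypothetical stop-and-go harvesting log: seconds spent on each single
# picking task, and how many of the total tasks succeeded.
cycle_times = [8.5, 9.1, 8.8, 9.4]      # s per picking task (hypothetical)
completed, total_tasks = 4, 5           # one task failed (hypothetical)

# Picking speed: time required to complete a single picking task.
picking_speed = sum(cycle_times) / len(cycle_times)

# Picking efficiency, per the definition above: tasks completed per unit
# time, taken as a ratio of the total task amount.
picking_efficiency = (completed / sum(cycle_times)) / total_tasks

print(f"picking speed: {picking_speed:.1f} s/task")
print(f"picking efficiency: {picking_efficiency:.4f}")
```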

Comments 4: The details in Figure 2 are not clearly shown and are inconsistent with other robotic models in the article.

Response: Thanks for the reviewer's suggestion; Figure 2 has been replaced so that it is consistent with the other robot models in the article.

Comments 5: The quality of the figures should be improved. I suggest using images with higher resolution or better images in vector format.

Response: To keep the manuscript small enough to upload to the submission system, the image quality was initially slightly reduced; higher-resolution images have now been provided.

Comments 6: Check each reference format carefully.

Response: Each reference format has been checked.

Author Response File: Author Response.docx

Reviewer 2 Report

1. The authors have improved the YOLO model by adding a depth input, a research hotspot in which depth information is used to filter out the background in fruit detection tasks. This has been addressed in several works, and the research progress on this topic needs to be added in the first chapter. For example, https://doi.org/10.1002/rob.22041, https://doi.org/10.1016/j.compag.2023.107741, and https://doi.org/10.1016/j.compag.2022.107034.


2. The improved model needs to be evaluated using metrics such as mAP, precision and recall; to be fair, some other models such as YOLOX, RTMDet, YOLOv7 and YOLOv8 need to be introduced.


3. The error between the true and predicted positions is mentioned in the positioning error; how do you obtain the error between the two in real-world measurements?


4. An analysis of the failures could be added, reporting the reasons for failure in the form of tables or pictures, which could be an inspiration to other researchers.

Minor editing of English language required

Author Response

Dear reviewer, thank you for your guidance!

We are sorry for the low-level errors in the manuscript; they have been corrected. The article was not concise or focused enough, so its novelty was not clearly expressed. After reading the articles you recommended, we have polished the presentation of the paper and refined its innovation points, and the content has been improved by adding comparative experiments. Your suggested content has been added to the article and highlighted; below is a detailed explanation of the questions that need a focused answer:

Comments 1: The authors have improved the YOLO model by adding a depth input, a hotspot that enables the background to be filtered by depth information in the fruit detection tasks. This has been mentioned in several works, and research progress on this part of the study needs to be added in the first chapter.

Response: Thank you for your suggestions. The above literature has been read and properly cited in the Introduction, and the additions have been highlighted.

Comments 2: The improved model needs to be measured using metrics such as mAP, Precision and recall, and to be fair, some other models such as YOLO X, RTMDet, YOLO v7 and YOLO v8 need to be introduced.

Response: We sincerely thank the reviewer for the valuable suggestions. We agree that further study and more data would be useful for measuring the improved model's performance. The detection performance of YOLO X and YOLO v7 on different grape samples has been added for comparison in Figure 21 and Table 1 in Section 3.2, presented as sample detection-effect diagrams under different interferences. The paper mainly improves the YOLO v4 model, so no comparison with the RTMDet model was performed, and due to time constraints the study of YOLO v8 has not been carried out.

However, we will include more deep-learning models in our future research and discussion on the localization of grape picking points.
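For reference, the metrics named above follow directly from raw detection counts; a minimal sketch with hypothetical counts is given below (mAP additionally averages precision over recall thresholds and classes, which is omitted here):

```python
# Hypothetical detection counts for one grape class.
tp, fp, fn = 180, 20, 15   # true positives, false positives, false negatives

precision = tp / (tp + fp)                     # correct share of detections
recall = tp / (tp + fn)                        # detected share of true grapes
f1 = 2 * precision * recall / (precision + recall)

print(f"P = {precision:.3f}, R = {recall:.3f}, F1 = {f1:.3f}")
```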

Comments 3: The error between the true and predicted position is mentioned in the positioning error, how do you get the error between the two in real world measurements?

Response: The spatial coordinates of the predicted picking points are calculated by the positioning model. The true picking point is obtained with the X-SEL controller of the SCARA robotic arm: the arm is driven until the outer edge of the disc knife's maximum cutting region is close to the grape stem, and the spatial coordinates of the disc-cutter center read from the X-SEL controller at that moment are recorded as the true picking point.

The relevant textual narrative has been added to Section 3.3 and highlighted.
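For illustration, a minimal sketch of this error computation with hypothetical coordinates (in practice the predicted point comes from the positioning model and the true point from the X-SEL controller readout):

```python
import numpy as np

# Hypothetical coordinates in millimetres.
predicted = np.array([132.4, 87.9, 415.2])   # from the positioning model
true_point = np.array([129.8, 90.1, 411.7])  # disc-cutter centre, X-SEL readout

error_xyz = predicted - true_point           # per-axis positioning error
error_3d = np.linalg.norm(error_xyz)         # overall Euclidean error

print("per-axis error (mm):", error_xyz)
print(f"3-D positioning error (mm): {error_3d:.2f}")
```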

Comments 4: An analysis of the failures could be added, reporting the reasons for failure in the form of tables or pictures, which could be an inspiration to other researchers.

Response: There are two main reasons for picking failure.

(1) Strong lighting blurs the edge outline of the grapes, so the position information of the acquired grape bounding box is inaccurate: the acquired bounding box is smaller than the complete one, and accurate center-point information cannot be extracted from it.

(2) For grapes of light weight, the disc knife rotates too fast, which can throw the fruit behind the knife disc and cause the picking to fail.

The relevant textual narrative has been added to the Discussion part and highlighted.

Author Response File: Author Response.docx

Reviewer 3 Report

This manuscript presents a trellis grape cutting point decision algorithm based on YOLOv4-SE. The algorithm is designed by considering the real planting conditions of trellis grapes, and it shows good recognition performance for trellis grapes in various states. The authors conducted field picking experiments in the orchard using a prototype, demonstrating the effectiveness of the cutting point prediction algorithm for trellis grape picking. With some further revision, the manuscript would be an even stronger work.

1. By reading this manuscript, it is evident that the authors have conducted extensive work and experiments. The amount of work is substantial, and each proposed approach is supported by sufficient theory and experiments. However, as a journal paper, this manuscript, at 23 pages, seems too long. The authors have dedicated a significant portion of the manuscript to introducing image-processing principles and the structure of the prototype, yet these contents are not strongly related to the topic of the manuscript, which is the real-time picking point decision algorithm. It would be better to reduce these parts.


2. The layout of the manuscript could be optimized. For example, could the arrangement of the image-processing figures on page 14 be changed?


3. By reading the content in Section 3.2, I am pleased to see that YOLOv4-SE achieves good recognition performance in the application scenario of grape trellises, especially under strong lighting conditions. It is well known that YOLOv4-SE, compared to the original YOLOv4, introduces the Squeeze-and-Excitation (SE) module to enhance the model's representation capability. This module helps the model better learn the relationships between feature channels, thereby improving object detection accuracy. However, in practical applications, YOLOv4-SE may not necessarily outperform the original YOLOv4 in recognition performance. I believe it is not convincing enough to justify the application of YOLOv4-SE to the purple and green grape picking task under complex conditions solely by comparing the recognition performance of YOLOv4-SE and the original YOLOv4. It is necessary to introduce common object detection models such as Faster R-CNN, SSD, RetinaNet, etc., for comparison. By comparing the performance of each model, such as precision-recall curves, computational efficiency, model quality, robustness, etc., the practicality of the final chosen model can be demonstrated.


4. There is a missing section between Sections 3.2 and 3.4 in the manuscript; it appears the chapter numbering is incorrect.


5. In line 368, is the reference to Z incorrect? Based on my understanding of the context, should Z be changed to Z1?


6. In line 407, in Equation 4, according to the indication in Figure 18, should the vertical coordinate related to the predicted cutting point be denoted as T?


7. In Figure 22, the field work was conducted on flat roads between orchard greenhouses. However, I would like to know how the situation would be addressed if the prototype entered more complex terrain within the orchard. Due to the tilt of the prototype, there may be significant errors in the depth data captured by the Realsense camera, and the predicted cutting point's three-dimensional coordinates obtained after coordinate transformation may deviate significantly from the actual positions. How was this situation addressed?


The English writing of this manuscript needs further optimization and refinement, including but not limited to tense: the past tense should be used instead of the present tense for work that has already been done, and other tense problems in the article also need to be revised.

Author Response

Dear reviewer, thank you for your guidance!

We are sorry for the low-level errors in the manuscript; they have been corrected. The article was not concise or focused enough, so its novelty was not clearly expressed. Following your suggestions, we have polished the presentation of the paper and refined its innovation points, and the content has been improved by adding comparative experiments. Your suggested content has been added to the article and highlighted; below is a detailed explanation of the questions that need a focused answer:

Comments 1: By reading this manuscript, it is evident that the authors have conducted extensive work and experiments. The amount of work is substantial, and each proposed approach is supported by sufficient theory and experiments. However, as a journal paper, the length of this manuscript, which is 23 pages, seems to be too long. The authors have dedicated a significant portion of the manuscript to introducing image processing principles and the structure of the prototype. However, these contents are not strongly related to the topic of the manuscript, which is the real-time picking point decision algorithm. It would be better to reduce the content of these parts.

Response: We sincerely thank the reviewer for careful reading, and we hope that this manuscript articulates our research clearly. The picking-point positioning algorithm in the paper was designed and improved around the prototype itself: combined with the depth-distance threshold, the grape image obtained by the Realsense D455 camera is processed to improve the detection and localization performance of the improved model in the trellis environment.

However, we have still shortened the manuscript by tightening the exposition and cutting some unnecessary text.

Comments 2: The layout of the manuscript could be optimized. For example, the images related to image processing on page 14, could the arrangement of these images be changed?

Response: Thanks to the reviewer's suggestion, we adjusted the layout many times and found that the vertical arrangement works best. We have therefore scaled down these images to optimize the layout of the manuscript.

Comments 3: By reading the content in Section 3.2, I am pleased to see that YOLO v4-SE achieves good recognition performance in the application scenario of grape trellises, especially under strong lighting conditions. It is well known that YOLO v4-SE, compared to the original YOLO v4, introduces the Squeeze-and-Excitation (SE) module to enhance the model's representation capability. This module helps the model better learn the relationships between feature channels, thereby improving object detection accuracy. However, in practical applications, YOLO v4-SE may not necessarily outperform the original YOLO v4 in recognition performance. I believe it is not convincing enough to justify the application of YOLO v4-SE to the purple and green grape picking task under complex conditions solely by comparing the recognition performance of YOLO v4-SE and the original YOLO v4. It is necessary to introduce common object detection models such as Faster R-CNN, SSD, RetinaNet, etc., for comparison. By comparing the performance of each model, such as precision-recall curves, computational efficiency, model quality, robustness, etc., the practicality of the final chosen model can be demonstrated.

Response: We sincerely thank the reviewer for the valuable suggestions. We agree that further study and more data would be useful for measuring the improved model's performance. The detection performance of YOLO X, YOLO v7 and Faster R-CNN on different grape samples has been added for comparison in Figure 21 and Table 1, presented as sample detection-effect diagrams under different interferences. The paper mainly improves the YOLO v4 model, so no comparison with the SSD and RetinaNet models was performed. Faster R-CNN is a representative two-stage model, and its detection performance on trellis grapes is also of great significance for proving the practicability of the improved YOLO v4-SE model. We will include more deep-learning models in our future research and discussion on the localization of grape picking points.

The relevant textual narrative has been added to Section 3.2 and highlighted.

Comments 4: There is a missing section between Sections 3.2 and 3.4 in the manuscript. It appears that there is an incorrect chapter numbering.

Response: We are really sorry for our careless mistakes, and thank you for the reminder. The chapter numbering has been corrected.

Comments 5: In line 368, is the reference to Z incorrect? Based on my understanding from the context, should Z be changed to Z1?

Response: We sincerely thank the reviewer for careful reading. In line 368, Z represents the depth direction of the coordinate axes; in line 369, Z has been changed to Z1.

Comments 6: In line 407, in Equation 4, according to the indication in Figure 18, should the vertical coordinate related to the predicted cutting point be denoted as T?

Response: We apologize for our carelessness. The typo has been revised in the resubmitted manuscript, and the vertical coordinate related to the predicted cutting point has been corrected.

Comments 7: In Figure 22, the field work was conducted on flat roads between orchard greenhouses. However, I would like to know how the situation would be addressed if the prototype enters a more complex terrain within the orchard. Due to the tilt of the prototype, there may be significant errors in the depth data captured by the Realsense camera, and the predicted cutting point's three-dimensional coordinates obtained after coordinate transformation may deviate significantly from the actual positions. How was this situation addressed?

Response: We sincerely thank the reviewer for this excellent and valuable advice. The research object of this paper is the trellis vineyard. Trellis vineyards are semi-artificial natural environments, and the ground is regular and flat when they are constructed. When the robot encounters a ditch or a pipe, the fuselage will tilt; however, this rarely happens. In future research, we will consider using an inertial measurement unit, electronic compass, or gyroscope to measure the roll angle, pitch angle, heading angle and altitude (ground fluctuation height) of the prototype, and introduce these parameters into the algorithm model to mitigate this problem and improve the robot's scene adaptability and anti-interference ability.

The relevant textual narrative has been added to the Conclusions and Future Work part and highlighted.
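As a sketch of this future-work idea only (the axis conventions, the angle source, and any mounting offsets are assumptions, not part of the current system), the measured roll and pitch could be used to rotate a camera-frame point into a gravity-aligned frame before the coordinate transformation:

```python
import numpy as np

def correct_for_tilt(point_cam, roll, pitch):
    """Rotate a camera-frame point by the measured roll/pitch (radians)."""
    cr, sr = np.cos(roll), np.sin(roll)
    cp, sp = np.cos(pitch), np.sin(pitch)
    Rx = np.array([[1, 0, 0], [0, cr, -sr], [0, sr, cr]])  # roll about x
    Ry = np.array([[cp, 0, sp], [0, 1, 0], [-sp, 0, cp]])  # pitch about y
    return Ry @ Rx @ point_cam

# Hypothetical picking point (metres) and a small tilt of the fuselage.
point = np.array([0.12, 0.08, 0.42])
print(correct_for_tilt(point, np.deg2rad(3.0), np.deg2rad(-2.0)))
```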

Author Response File: Author Response.docx

Reviewer 4 Report

In this work, the authors proposed a system to detect grape picking points and a robotic system to harvest in real time. The reported method is potentially useful for practical application.

The comments are the following:

1. The authors should check English expressions. For instance:

“Due to the intensification of grape industrialization and the increase in demand, realizing the automation of harvesting has become a realistic demand [2]” on line 36 is an incomplete sentence; “. What’ more” on line 93.

2. ROI: please explain it at first use.

3. Why do the authors not use depth images to remove background, far, and close objects before passing them to the convolutional neural network?

4. Why do the authors use colorful depth images instead of grayscale depth images such as Z16?

5. How many classes did the authors use for training YOLO v4, and what are they? The distinction between complete grapes, overlapping grapes, and occluded grapes is key to eliminating overlapping and occluded grapes, but the authors do not provide numerical values to show this ability of the models and do not describe this process clearly.

6. What will the system do if some overlapping grapes become complete grapes after picking the front grapes?

7. Figure 21 should be carefully resized and its small images put in good order.

8. Please express clearly how the authors obtained the true picking point position values that are compared with the predicted position values.

9. The collision rate column has values in (%), but the values in this table's column are not numbers.

The authors should check English expressions.

Author Response

Dear reviewer, thank you for your guidance!

We are sorry for the low-level errors in the manuscript; they have been corrected. The article was not concise or focused enough, so its novelty was not clearly expressed. Some of the questions you raised prompted deep reflection and provided inspiration for our future research. Your suggested content has been added to the article and highlighted; below is a detailed explanation of the questions that need a focused answer:

Comments 1: The authors should check English expressions.

Response: We apologize for our English expression. We tried our best to improve the manuscript and made a number of changes, and the article has been professionally polished. These changes do not influence the content or framework of the paper. We earnestly appreciate the reviewer's careful work and hope that the corrections will meet with approval.

Comments 2: ROI: please, explain it at the first use.

Response: Thank you for your reminder. The full name of ROI is region of interest. We have explained it at first use and written it out in full. The complete meaning of the three-dimensional ROI is explained in Section 2.1.2 and highlighted.

Comments 3: Why do authors not use depth images to remove background, far, and close objects before passing them to the convolution neural network?

Response: We sincerely thank the reviewer for careful reading. When using the model for grape detection on the actual prototype, we first remove the interference of the irrelevant background through the distance-threshold information in the depth image, and then input the image to the YOLO v4-SE model for detection. The highlighted part of Section 2.2.1 gives the threshold interval used for interference removal. However, considering the duration and complexity of model training, the distance threshold was not introduced during training.
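A minimal sketch of such a depth-threshold pre-filter, assuming an RGB frame already aligned to a depth frame in millimetres (e.g. from the Realsense D455); the 300-1500 mm interval here is a placeholder for the actual interval given in Section 2.2.1:

```python
import numpy as np

def remove_background(rgb, depth_mm, near=300, far=1500):
    """Zero out pixels whose depth lies outside [near, far] millimetres."""
    mask = (depth_mm >= near) & (depth_mm <= far)
    filtered = rgb.copy()
    filtered[~mask] = 0          # black out background before detection
    return filtered

# Dummy frames standing in for an aligned RGB/depth pair.
rgb = np.zeros((480, 640, 3), dtype=np.uint8)
depth = np.full((480, 640), 2500, dtype=np.uint16)   # everything too far
foreground = remove_background(rgb, depth)           # fully blacked out here
```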

Comments 4: Why do authors use colorful depth images instead of using grayscale depth images such as Z16?

Response: We sincerely thank the reviewer for careful reading. The original input of YOLO v4-SE is three-channel RGB, so we convert the Z16 grayscale depth map into a three-channel color depth map. In the color depth image we not only retain the distance information of the grayscale image but also combine the color information of different backgrounds and targets. By fusing the RGB image and the corresponding color depth image in a multi-modal manner, more grape feature information can be extracted.
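A minimal sketch of such a Z16-to-colour conversion using OpenCV; the colormap choice and normalization range are illustrative assumptions rather than the exact scheme used in the paper:

```python
import cv2
import numpy as np

def colorize_depth(depth_mm, max_mm=1500):
    """Map a 16-bit depth frame to an 8-bit, three-channel colour image."""
    depth_8u = np.clip(depth_mm / max_mm * 255, 0, 255).astype(np.uint8)
    return cv2.applyColorMap(depth_8u, cv2.COLORMAP_JET)   # H x W x 3 (BGR)

depth = np.random.randint(0, 2000, (480, 640)).astype(np.uint16)  # dummy Z16
color_depth = colorize_depth(depth)     # now matches a 3-channel RGB input
```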

Comments 5: How many classes in Yolo 4 did authors use for training, and what are they? The distinction between complete grapes, overlapping grapes, and occluded grapes is key to eliminating overlapping and occluded grapes, but the authors do not provide the value in numbers to show this ability of the models and do not describe this process clearly.

Response: We sincerely thank the reviewer for careful reading. Three classes of labels were used to train YOLO v4-SE: grape1 (complete grapes), grape2 (overlapping grapes) and grape3 (occluded grapes). The generalization ability and robustness of the model were improved by increasing the number of overlapping and occluded grape samples in the dataset. This paper focuses on solving the picking-point positioning problem of a cut-harvesting grape picking robot; eliminating overlapping and occluded grapes is only an effective means of reducing interference when positioning picking points. Therefore, this paper does not conduct further experiments on the detection performance of the model for overlapping and occluded grapes.

According to your valuable suggestions, we have improved the content of Section 2.3.2 and highlighted it. In the future, we will conduct more performance-testing experiments on the YOLO v4-SE model and analyze the resulting data.

Comments 6: What will the system do if some overlapping grapes become complete grapes after picking the front grapes?

Response: Considering the fluency and integrity of the manipulator's movement and the robot's work efficiency, even though overlapping grapes become complete grapes after the previous grapes are picked, the robot still picks the complete grapes imaged in the previous field of view. The robot picks according to the sequence of picking-point coordinates stored in the PLC controller. After the current complete grapes have been picked, the image is refreshed to identify the now-exposed grapes and re-plan their picking sequence.

The relevant textual narrative has been added to Section 2.3.5 and highlighted.
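The behaviour described above can be summarized as a short pseudo-workflow (illustrative only, not the authors' PLC program; detect_complete_grapes() and pick() are hypothetical helpers):

```python
from collections import deque

def harvest_in_view(detect_complete_grapes, pick):
    """Pick all grapes detected as complete, one frame at a time."""
    while True:
        # One image refresh: picking points of currently complete grapes,
        # in the stored picking sequence.
        queue = deque(detect_complete_grapes())
        if not queue:
            break                      # nothing complete left in this view
        while queue:
            pick(queue.popleft())      # follow the stored sequence strictly
        # Loop back: the refreshed image exposes previously overlapping
        # grapes that have now become complete, and re-plans their sequence.
```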

Comments 7: Figure 21 should be carefully resized and put small images in good order.

Response: Thank you for your suggestion; Figure 21 has been resized and its images put in good order.

Comments 8: Please express clearly how the authors can get the true picking point position values to compare with prediction position values.

Response: The spatial coordinates of the predicted picking points are calculated by the positioning model. The true picking point is obtained with the X-SEL controller of the SCARA robotic arm: the arm is driven until the outer edge of the disc knife's maximum cutting region is close to the grape stem, and the spatial coordinates of the disc-cutter center read from the X-SEL controller at that moment are recorded as the true picking point.

The relevant textual narrative has been added to Section 3.3 and highlighted.

Comments 9: The column collision rate has values in (%), but the values of this table‘s column are not numbers.

Response: We apologize for our carelessness. The columns of this table indicate whether the robot collides when picking under different fruit densities, so they should not have values in (%). We have corrected the mistake in Table 3 and highlighted the corresponding part of Section 3.4.

Author Response File: Author Response.docx

Round 2

Reviewer 3 Report

The revised manuscript has shown significant improvements. It is evident that the author has made thorough revisions to the content, optimizing the language and making the expressions clearer. Additionally, minor errors in the manuscript have been rectified.

 

The author has added YOLO X, YOLO v7, and Faster R-CNN models as comparisons to evaluate detection performance on grape samples under different conditions. Performance metrics such as P-R curves and F1 scores for these models have been provided, which strongly support the manuscript's argument that the improved YOLO v4-SE demonstrates better recognition capability for trellis grapes.

 

Furthermore, regarding my concern about the positioning errors of the Realsense depth camera in complex terrain, the author has mentioned in future work the potential use of inertial units, electronic compasses, or gyroscopes to measure the prototype's tilt and subsequently introduce corrections into the algorithm.

 

The revised manuscript exhibits clear expression, reasonable experimental methods, and sufficient theoretical and empirical data to support the work. The research effort has been extensive, and the proposed methods hold practical value for recognition, localization, and picking in trellis grapes.

English writing has improved considerably compared to the pre-revision manuscript.

Reviewer 4 Report

The paper is revised properly.

Thanks for revising.
