Fresh Tea Leaf-Grading Detection: An Improved YOLOv8 Neural Network Model Utilizing Deep Learning
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
In the paper "Tea Leaf Grading Detection: An Improved YOLOv8 Neural Network Model Utilizing Deep Learning" by Zejun Wang et al., the authors present the results of an improved YOLOv8 model for tea leaf pattern recognition.
The research is a step toward automating the rapid and accurate picking of large-leaf tea leaves and the mechanized classification of quality grades.
The novelty of the model is the application of new algorithms for obtaining the input data used to train the neural network in the YOLOv8 model: the Swin Transformer method and the Efficient Multi-Scale Attention Module with Cross-Spatial Learning.
The methodology of these new methods is described rather clearly but some explanations are missing:
1) What do “EMA” and “C2f” in Figure 3 mean?
2) What do “LN” and “MLP” in Figure 5 mean?
3) What is Omega in Equations (1) and (2) and how does it follow from these equations that the computational complexity is reduced?
4) The meaning of Equations (3) and (4) is not clear.
5) What is “MLP” in Equations (5)-(6)?
6) Equation (7) contains different notations, “Ch” and “C’h”; perhaps this is a typo?
Judging by the graphs in Figure 10, the authors managed to improve the YOLOv8 algorithm in Recall and F1.
The references are appropriate.
I have some notes concerning the Figures:
1) The pictures in Figure 20 are very small, and the labels are blurry. I think the figure should be broken down into smaller ones and enlarged.
2) It is not clear what the yellow broken line and blue straight line in Figures 9c,d and 13b mean.
Author Response
Thank you very much for taking the time to review this manuscript. We really appreciate your comments and suggestions. We have considered them carefully and tried our best to address every one of them.
1) What do “EMA” and “C2f” in Figure 3 mean?
Modification instructions: Thank you for reviewing our manuscript and for your valuable feedback. Figure 3 illustrates the structure of the improved YOLOv8 network in this study. "EMA" stands for the Efficient Multi-Scale Attention Module with Cross-Spatial Learning, the attention module added in this work to reduce the weight of irrelevant features in complex backgrounds and thereby enhance detection accuracy. "C2f" denotes the feature-extraction module of the YOLOv8 backbone, a CSP-style bottleneck with two convolutions that splits the feature map, passes one branch through a stack of bottleneck blocks, and concatenates all intermediate outputs before a final convolution.
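To make the C2f structure described above concrete, below is a minimal sketch of a C2f-style block following the commonly described Ultralytics YOLOv8 layout. It is illustrative only, not the authors' implementation; the class names, channel counts, and the simplified Bottleneck design are our own assumptions.

```python
# Minimal sketch of a C2f-style block (CSP bottleneck with two convolutions),
# as commonly described for the YOLOv8 backbone. Illustrative only; channel
# counts and the Bottleneck design are assumptions, not the authors' code.
import torch
import torch.nn as nn


class Bottleneck(nn.Module):
    """Two 3x3 convolutions with a residual connection."""
    def __init__(self, channels: int):
        super().__init__()
        self.cv1 = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1, bias=False),
                                 nn.BatchNorm2d(channels), nn.SiLU())
        self.cv2 = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1, bias=False),
                                 nn.BatchNorm2d(channels), nn.SiLU())

    def forward(self, x):
        return x + self.cv2(self.cv1(x))


class C2f(nn.Module):
    """Split features, run n bottlenecks, concatenate every intermediate output."""
    def __init__(self, c_in: int, c_out: int, n: int = 2):
        super().__init__()
        self.c = c_out // 2
        self.cv1 = nn.Conv2d(c_in, 2 * self.c, 1)             # project into two branches
        self.cv2 = nn.Conv2d((2 + n) * self.c, c_out, 1)      # fuse all branches
        self.blocks = nn.ModuleList([Bottleneck(self.c) for _ in range(n)])

    def forward(self, x):
        y = list(self.cv1(x).chunk(2, dim=1))                 # split into two halves
        for block in self.blocks:
            y.append(block(y[-1]))                             # each block feeds on the last output
        return self.cv2(torch.cat(y, dim=1))


if __name__ == "__main__":
    feat = torch.randn(1, 64, 80, 80)
    print(C2f(64, 128, n=2)(feat).shape)                       # torch.Size([1, 128, 80, 80])
```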
2) What do “LN” and “MLP” in Figure 5 mean?
Modification instructions: Thank you for your review of our manuscript and for the valuable suggestions you have provided. In response to your inquiry regarding the meanings of "LN" and "MLP" in Figure 5, we have made the necessary revisions and additions within the text. "LN" stands for "Layer Normalization," a normalization technique used to stabilize the training process of deep networks and reduce internal covariate shift. "MLP" stands for "Multi-Layer Perceptron," which refers to the fully connected layer structure used for feature transformation in the Swin Transformer. It is capable of learning complex nonlinear relationships.
3) What is Omega in Equations (1) and (2) and how does it follow from these equations that the computational complexity is reduced?
Modification instructions: Thank you for your review of our paper and for raising these questions. The symbol Ω in Equations (1) and (2) denotes the computational complexity of the corresponding attention mechanism. The Swin Transformer backbone first partitions the image into patches, grouping every 4×4 block of adjacent pixels into one patch and flattening it along the channel dimension. The feature-extraction network then produces feature maps of different sizes through four stages; in the second, third, and fourth stages, the features from the previous stage are downsampled to reduce their size and then passed to a series of block structures for further processing and feature extraction. Each block is composed of a layer-normalization layer, a window attention module, a shifted-window attention module, and a multi-layer perceptron. The feature maps are divided into non-overlapping windows of a fixed M×M size, and self-attention is computed within each window separately; this is the Window Multi-head Self-Attention (W-MSA). Because attention is restricted to fixed-size windows, the term that grows quadratically with the number of patches in the original self-attention calculation is replaced by one that grows only linearly, which is how Equations (1) and (2) show the reduction in computational complexity.
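For reference, and assuming Equations (1) and (2) follow the standard Swin Transformer complexity formulation (an assumption on our part, since the manuscript's equations are not reproduced here), the comparison can be written as:

```latex
% Assumed standard Swin Transformer form of Eqs. (1)-(2).
% h, w : feature-map height and width in patches; C : channel dimension;
% M    : fixed window size (e.g., M = 7).
\Omega(\mathrm{MSA})     = 4\,hwC^{2} + 2\,(hw)^{2}C
\Omega(\text{W-MSA})     = 4\,hwC^{2} + 2\,M^{2}hwC
```

Because M is a small constant, the 2(hw)^2·C term that grows quadratically with the number of patches in standard multi-head self-attention is replaced by 2M^2·hw·C, which grows only linearly in hw, so W-MSA is substantially cheaper on large feature maps.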
We have made detailed supplements and revisions in the paper to ensure that our arguments are clearer and more convincing. Thank you again for your valuable feedback, and we look forward to your further guidance.
4) The meaning of the Equations (3) (4) is not clear
Modification instructions: Thank you for your review of our paper and for the feedback provided. In response to the issue of unclear meanings regarding Equations (3) and (4), we have made the following revisions:
In Equations (3)-(6), W-MSA denotes the window-based multi-head self-attention module, SW-MSA denotes the shifted-window multi-head self-attention module, LN denotes the layer-normalization module, and MLP denotes the multi-layer perceptron. z^{l-1} refers to the output features of the multi-layer perceptron in block l-1; ẑ^l denotes the output features of the W-MSA module in block l; z^l denotes the output features of the multi-layer perceptron in block l; ẑ^{l+1} denotes the output features of the SW-MSA module in block l+1; and z^{l+1} denotes the output features of the multi-layer perceptron in block l+1.
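To make the data flow of Equations (3)-(6) concrete, below is a minimal sketch of a pair of consecutive Swin Transformer blocks. The W-MSA and SW-MSA modules are stood in for by ordinary multi-head attention, so the sketch only illustrates the pre-norm residual structure of the equations, not the actual (shifted-)window computation; all class names and sizes are our own assumptions.

```python
# Minimal sketch of the residual structure in Eqs. (3)-(6) for a Swin block pair.
# W-MSA / SW-MSA are replaced by plain multi-head attention placeholders; in the
# real model they operate on (shifted) local windows. Names/sizes are assumptions.
import torch
import torch.nn as nn


class MLP(nn.Module):
    """Two-layer feed-forward network used inside each block."""
    def __init__(self, dim: int, hidden_ratio: int = 4):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim * hidden_ratio), nn.GELU(),
                                 nn.Linear(dim * hidden_ratio, dim))

    def forward(self, x):
        return self.net(x)


class SwinBlockPair(nn.Module):
    """A W-MSA block followed by an SW-MSA block."""
    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.ln1, self.ln2, self.ln3, self.ln4 = (nn.LayerNorm(dim) for _ in range(4))
        self.w_msa = nn.MultiheadAttention(dim, num_heads, batch_first=True)    # stand-in for W-MSA
        self.sw_msa = nn.MultiheadAttention(dim, num_heads, batch_first=True)   # stand-in for SW-MSA
        self.mlp1, self.mlp2 = MLP(dim), MLP(dim)

    def forward(self, z):                                    # z = z^{l-1}, shape (B, N, C)
        a = self.ln1(z)
        z_hat = self.w_msa(a, a, a)[0] + z                   # Eq. (3): ẑ^l = W-MSA(LN(z^{l-1})) + z^{l-1}
        z_l = self.mlp1(self.ln2(z_hat)) + z_hat             # Eq. (4): z^l = MLP(LN(ẑ^l)) + ẑ^l
        b = self.ln3(z_l)
        z_hat2 = self.sw_msa(b, b, b)[0] + z_l               # Eq. (5): ẑ^{l+1} = SW-MSA(LN(z^l)) + z^l
        return self.mlp2(self.ln4(z_hat2)) + z_hat2          # Eq. (6): z^{l+1} = MLP(LN(ẑ^{l+1})) + ẑ^{l+1}


if __name__ == "__main__":
    tokens = torch.randn(1, 49, 96)                          # one 7x7 window of 96-dim tokens
    print(SwinBlockPair(96)(tokens).shape)                   # torch.Size([1, 49, 96])
```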
5) What is “MLP” in Equations (5)-(6)?
Modification instructions: Thank you for your review of our paper and for the feedback provided. Regarding the unclear meanings of Equations (5) and (6), we have made comprehensive revisions within the text. Specifically, MLP denotes the multi-layer perceptron used within each Swin Transformer block.
6) Equation (7) contains different notations, “Ch” and “C’h”; perhaps this is a typo?
Modification instructions: Thank you for pointing out the issue with the notation in Equation (7) of our paper. Upon careful examination, we realized it was indeed a typographical error. We have standardized the notation in the revised manuscript and ensured consistency in the use of symbols throughout the text. Specifically, we have corrected "C'h" in Equation (7) to "Ch" to maintain consistency with the rest of the document. We have also reviewed the entire paper to ensure there are no other similar errors.
We apologize for this oversight and appreciate your meticulous review, which has helped to enhance the quality of our paper.
7) The pictures in figure 20 are very small, and the inscriptions are blurry. I think the drawing should be broken down into smaller ones and made larger.
Modification instructions: Thank you for your valuable feedback on the figures in our paper. Following your suggestions, we have made the following modifications to the images within the text: the image resolution has been adjusted to meet the journal's requirement of 300 dpi, ensuring clarity in both printing and display. In consideration of the congestion in the charts, we have divided the figures into several smaller ones, each clearly labeled to facilitate a better understanding of the content for the readers. We have also optimized the text and labels within the figures to ensure that they are of an appropriate size and clarity for reading. Thank you again for your valuable comments, and we look forward to your further guidance.
8) It is not clear what the yellow broken line and blue straight line in figures 9 c, d and 13 b mean
Modification instructions: Thank you for pointing this out. In the loss and convergence curves of the improved YOLOv8 model (Figures 9c,d and 13b), the two lines show the same training record in two forms. The blue solid line plots the raw values recorded on the training and validation sets during training, such as the loss and accuracy at each epoch, and is used to show the model's performance on the training data. The yellow dashed line plots the smoothed version of those values, reflecting the model's intermediate state during training and its behaviour on the validation set, and is therefore used to assess the model's generalization. Together, the two lines allow the training effectiveness and generalization capability of the model to be evaluated.
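As an illustration of how such paired curves are typically produced (this is a generic sketch with synthetic data, not the script behind Figures 9 and 13; the smoothing window and colours are assumptions), a raw per-epoch loss series can be drawn as a solid blue line with its smoothed version overlaid as a dashed yellow line:

```python
# Generic sketch: raw per-epoch loss (solid blue) vs. its smoothed version
# (dashed yellow). Synthetic data and a moving-average window are assumptions
# for illustration only; they are not the authors' actual training records.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
epochs = np.arange(1, 301)
raw_loss = 2.0 * np.exp(-epochs / 80) + 0.05 * rng.standard_normal(epochs.size)  # noisy decaying loss

window = 15                                                  # moving-average smoothing window
smoothed = np.convolve(raw_loss, np.ones(window) / window, mode="same")

plt.plot(epochs, raw_loss, color="tab:blue", linestyle="-", label="raw per-epoch loss")
plt.plot(epochs, smoothed, color="gold", linestyle="--", label="smoothed loss")
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.legend()
plt.show()
```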
Author Response File: Author Response.pdf
Reviewer 2 Report
Comments and Suggestions for Authors
Manuscript evaluation
The manuscript proposes an enhanced method for tea classification and detection based on the YOLOv8 network, integrating a Hierarchical Vision Transformer and a Multi-Scale Attention Module to reduce computational load and improve accuracy. The authors note that replacing the loss function with SIoU accelerated model convergence and increased accuracy. The results are promising, showing significant improvements in metrics such as Precision, Recall, and mAP, along with an increase in FPS rates. The enhanced model offers broad prospects for efficient and rapid recognition, which could support the development of tea harvesting robots and classification devices.
General concept comments
The authors do a good job of reviewing the context of the tea industry, highlighting the challenges faced, such as the need for automation and precise detection of tea leaves. They effectively justify the transition to advanced technologies like deep learning and machine learning algorithms by pointing out the limitations of traditional methods. The identified knowledge gap is clearly addressed, and the research objectives are well defined. The references used are relevant and help to appropriately ground the state of the art, making the research well-founded and aligned with the field’s needs.
The materials used to illustrate the proposed method were adequately detailed, allowing for a clear understanding of the process. The methodology was well described, with a solid framework that facilitates further exploration by interested readers. Additionally, the controls used appear sufficient to meet the research objectives, ensuring the reliability and relevance of the results presented.
In the analysis results section, the authors focused on the main findings, remaining consistent with the stated objectives. Tables and figures were appropriately used to assist in presenting and interpreting the results. However, it is worth noting that the titles of the figures and tables are somewhat brief and could be more descriptive, making them more self-explanatory and facilitating comprehension without the need to refer back to the text.
The authors adequately discussed their results in the "Results" section and provided a specific section for further discussion. However, I believe this discussion section could be expanded by incorporating references to the literature cited throughout the text. By including more comparisons with other studies, the authors could better contextualize their findings, offering a deeper and more robust analysis of the implications of their results in relation to existing work in the field. This would also help strengthen the relevance and originality of the research in the academic context.
In the conclusion section, the authors clearly highlighted the main findings, aligning them with the research objectives. However, it is important to note that the discussion section should ideally come before, as an analysis of the results, providing a more in-depth reflection on the implications of the findings. The conclusion section would work better at the end, synthesizing the discussions and reinforcing the research implications, which would help provide a more cohesive closing to the paper.
Specific Comments
As mentioned earlier, the manuscript is well-structured, and the results are relevant to the field. Below, I present suggestions to improve the quality of the text.
1) Overall, the titles of the tables and figures are overly concise. The titles should be more descriptive and self-explanatory.
2) Line 322: Correctly cite the software used.
3) The discussion section could be expanded, including comparisons with other works in the literature and citing relevant references.
4) The authors present the conclusions first and then the discussion. I believe these sections could be swapped, as the conclusions should ideally close the manuscript.
Author Response
Thank you very much for taking the time to review this manuscript. We really appreciate your comments and suggestions. We have considered them carefully and tried our best to address every one of them.
1) Overall, the titles of the tables and figures are overly concise. The titles should be more descriptive and self-explanatory.
Modification instructions: Thank you for your valuable feedback on the captions of the figures and tables in our paper. We have made the following revisions to all table and figure titles:
- We have enhanced the descriptiveness of the titles to ensure that each title accurately summarizes the content of the figure or table and can be understood independently of the text.
- For figures containing multiple subplots, we have ensured that the titles include all subplots and their corresponding letter labels to improve the completeness of the information.
- We have employed active voice and strong verbs in the titles to make them more direct and impactful.
We believe these modifications will enhance the readability and comprehension of the figures, allowing readers to quickly grasp the core messages of the visuals. We have highlighted these sections in red in the revised manuscript for your easy location and review.
Thank you again for your valuable comments, and we look forward to your further guidance.
2) Line 322: Correctly cite the software used.
Modification instructions: Thank you for your review of our paper and for the suggestions provided. In response to your recommendations, we have made the following revisions to the manuscript:
Table 2. Experimental environment configuration and parameter settings.
| Configuration item | Configuration parameter |
| --- | --- |
| Operating System | Windows 10 |
| CPU | Intel(R) Core(TM) i7-11700 |
| Random Access Memory | 2933 MHz DDR4 ECC |
| Solid State Disk | M.2 1TB PCIe NVMe Class 50 |
| GPU | NVIDIA RTX A6000 |
| Compilation Language | Python 3.9.7 |
| Frameworks | PyCharm 2019 |
| CUDA | CUDA Version: 12.2 |
| Epochs | 1000 |
| Batch size | 128 |
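For completeness, below is a minimal sketch of how the epoch and batch-size settings listed in Table 2 could be passed to the Ultralytics YOLOv8 training API. The dataset file and the pretrained-weights choice are hypothetical placeholders; the improved model described in the paper uses a modified network definition rather than the stock weights.

```python
# Minimal sketch of a YOLOv8 training run with the settings from Table 2.
# "tea_grading.yaml" and "yolov8s.pt" are hypothetical placeholders, not the
# authors' actual configuration or modified network definition.
from ultralytics import YOLO

model = YOLO("yolov8s.pt")          # pretrained baseline weights (placeholder choice)
model.train(
    data="tea_grading.yaml",        # dataset description file (hypothetical path)
    epochs=1000,                    # as listed in Table 2
    batch=128,                      # as listed in Table 2
    device=0,                       # single NVIDIA GPU
)
```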
3) The discussion section could be expanded, including comparisons with other works in the literature and citing relevant references.
Modification instructions: Thank you for your review of our paper and for the valuable feedback provided. Following your suggestions, we have expanded the discussion section, with specific revisions as follows:
1. Enhanced Literature Comparison: We have included a comparison with relevant literature in the discussion, providing a detailed analysis of the similarities and differences between our study's results and those of other research, emphasizing the advantages of our model in tea grading recognition.
2. Citation of Relevant References: In the expanded discussion, we have cited multiple relevant references to support our comparative analysis and to provide readers with a more comprehensive background.
3. In-depth Analysis of Research Results: We have conducted a more in-depth analysis of the performance improvements of the improved YOLOv8 model, discussing the specific reasons for the model enhancements and their implications for practical applications.
We believe these revisions will make the discussion section more comprehensive and insightful, aiding readers in better understanding the contributions and significance of our research. Thank you again for your valuable comments, and we look forward to your further guidance.
The revised discussion section is shown below:
- Discussion
This study addresses the issue of tea grading recognition for Yunnan large-leaf tea plants by proposing a deep learning model based on the improved YOLOv8. By incorporating Hierarchical Vision Transformer using Shifted Windows, Efficient Multi-Scale Attention Module with Cross-Spatial Learning, and the SIoU loss function, our model has demonstrated excellent performance in tea grading recognition tasks. The following is an in-depth discussion of the results of this study.
The improved YOLOv8 model shows significant enhancements in Precision, Recall, F1, and mAP compared to the original YOLOv8 model. Building on the YOLOv8 network framework, this study presents a tea grading recognition method in which parts of the original YOLOv8 structure are replaced with the Hierarchical Vision Transformer using Shifted Windows to reduce the computational burden of image-intensive tasks and lower computational costs; the Efficient Multi-Scale Attention Module with Cross-Spatial Learning is added to diminish the weight of irrelevant features in complex backgrounds, thereby enhancing detection accuracy; and the loss function is replaced with SIoU, which not only improves the model's convergence speed but also localizes the target positions more accurately. The improved YOLOv8 model increases Precision, Recall, F1, and mAP by 3.39%, 0.86%, 2.20%, and 2.81%, respectively, over the original YOLOv8. Compared with the detection models proposed by Zhiyong Gui et al. [15], Shuang Xie et al. [16], and Shudan Guo et al. [17] for tea leaf recognition detection, accuracy is improved by 2.86%, 4.53%, and 5.70%, respectively. The improved YOLOv8 model in this study also outperforms several other mainstream deep learning models, such as YOLOv5, YOLOX, Faster RCNN, and SSD, in tea grading recognition tasks, with marked gains in both FPS and mAP. This indicates that the improved YOLOv8 maintains a high recognition rate while also detecting rapidly, which is of great significance for real-time tea grading recognition.
Although the model in this study has achieved good results in tea grading recognition tasks, there are still some limitations. First, the performance of the model largely depends on the quality and diversity of the training data. Future work can focus on expanding the tea grading image dataset to enhance the model's generalization capabilities. Second, this study mainly focuses on tea grading recognition; future exploration can build multimodal fresh leaf representation and visual recognition models to further improve recognition accuracy and robustness. Finally, considering the need for model deployment to edge devices, future research can focus on model lightweighting to adapt to resource-constrained devices.
4) The authors present the conclusions first and then the discussion. I believe these sections could be swapped, as the conclusions should ideally close the manuscript.
Modification instructions: Thank you for your review of our paper and for the valuable suggestions provided. In accordance with your recommendations, we have adjusted the structure of the paper, swapping the order of the conclusion and discussion sections. Now, the discussion section precedes the conclusion, aligning more closely with the conventional structure of academic papers and rendering the logical flow of the paper clearer and more coherent.
Author Response File: Author Response.pdf
Reviewer 3 Report
Comments and Suggestions for Authors
The article entitled "Tea Leaf Grading Detection: An Improved YOLOv8 Neural Network Model Utilizing Deep Learning" presents findings on the recognition of Yunnan large-leaf tea trees. It offers a comprehensive introduction and clearly highlights its objectives.
The methodology section provides the process undertaken to achieve the results with YOLOv8, YOLOv5, YOLOX, Faster RCNN, and SSD deep learning models presented in an organized and logical manner. These findings are particularly relevant for the tea-picking robot industry. Overall, the article is well-structured; however, I propose some suggestions to enhance its clarity and comprehension.
Suggestions:
- Figure Titles: The figure titles should be more informative. Generally, figures and tables should be self-explanatory and understandable without requiring readers to delve into the text.
- Separation of Sections: It is essential to separate Section 3, Model Training and Result Analysis, from the discussion. The discussion section should be used to:
- Interpret the results.
- Highlight the most significant findings.
- Provide detailed comparisons with the original models.
- Explore potential future research directions, which would greatly enrich the article.
- Minor Errors: I have underlined minor errors in the article, such as in lines 372 and 380, where abbreviations are repeated unnecessarily. Please ensure these are corrected and verify that similar issues do not occur elsewhere.
These adjustments would improve the overall quality and readability of the article.
Comments for author File: Comments.pdf
Author Response
Thank you very much for taking the time to review this manuscript. We really appreciate your comments and suggestions. We have considered them carefully and tried our best to address every one of them.
1.Figure Titles: The figure titles should be more informative. Generally, figures and tables should be self-explanatory and understandable without requiring readers to delve into the text.
Modification instructions: We have carefully revised the titles of the figures in our paper based on your valuable feedback. Here are the changes we have made:
1. We have ensured that each figure title contains sufficient information to convey the core content and purpose of the figure directly to the reader, independent of the main text.
2. All figure titles have been rephrased to enhance their descriptiveness, allowing the figures to be self-explanatory and understandable even without reading the body text.
3. We have paid particular attention to the accuracy and clarity of the titles to ensure they accurately reflect the data and analysis results presented in the figures.
We believe these modifications will make the figures more intuitive and easier to understand, thereby improving the readability and professionalism of the paper. Thank you for your suggestions, and we look forward to your further guidance.
2.Separation of Sections: It is essential to separate Section 3, Model Training and Result Analysis, from the discussion. The discussion section should be used to:
Interpret the results.
Highlight the most significant findings.
Provide detailed comparisons with the original models.
Explore potential future research directions, which would greatly enrich the article.
Modification instructions: Thank you for your review of our paper and for the valuable suggestions provided. We have adjusted the structure of the paper according to your recommendations, with the specific modifications as follows:
1. Section Separation: We have clearly separated Section 3, "Model Training and Results Analysis," from the discussion section, making the content of each part more distinct and independent.
2. Expansion of the Discussion Section: In the revised discussion section, we have provided a detailed interpretation of the research findings, highlighted the most significant discoveries, and offered a detailed comparison with the original model. Furthermore, we have explored potential directions for future research to enrich the content of the paper.
We believe these changes will enhance the logical flow and readability of the paper, allowing readers to better understand our research outcomes and their implications. Thank you again for your valuable feedback, and we look forward to your further guidance.
3.Minor Errors: I have underlined minor errors in the article, such as in lines 372 and 380, where abbreviations are repeated unnecessarily. Please ensure these are corrected and verify that similar issues do not occur elsewhere.
Modification instructions: We have carefully reviewed the minor errors you pointed out in the paper and have made the necessary corrections. Specifically:
1. Correction of Repeated Abbreviations: We have amended the unnecessary repeated abbreviations at lines 372 and 380. These repetitions were likely due to oversights during the editing process. We have now ensured that these abbreviations appear only once in the text, with the full term provided upon their initial use.
2. Comprehensive Check: In addition to the locations you indicated, we have conducted a thorough check of the entire article to ensure that no similar errors occur. We have scrutinized the article line by line to ensure the consistency and accuracy of the use of abbreviations and terminology.
We greatly appreciate your meticulous review and valuable feedback, which has helped us to enhance the quality of our paper. We believe that with these modifications, the article's expression is clearer, and the errors have been corrected.
Thank you again for your feedback, and we look forward to your further guidance.
Author Response File: Author Response.pdf