Review Reports
- Baishao Zhan,
- Ming Li and
- Wei Luo
- et al.
Reviewer 1: Anonymous
Reviewer 2: Yang Jae Kang
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
Thank you for the interesting paper.
Using k-fold cross-validation is recommended to consolidate the results.
The paper’s application topic (detecting regions of interest and classifying the type of infection in tea leaves) is interesting and is probably an important application in biology. From the computer vision and ML perspective, the main contribution is what the authors present as iterative region of interest encoding, and the idea is interesting and innovative. However, although the authors made a good effort to explain the "iterative region of interest encoding", the figures and the wording could be improved to make the explanation more comprehensive. For example, it is not clear how positions are sampled from the feature map that the first convolution model extracts from the inputs, or how they are fed to the transformer. Fig. 3 looks great, but neither the graph nor the caption gives the complete picture of how it works.
The contribution would also be much stronger if the authors published their source code and, if possible, the data. Offering both would allow researchers to reproduce the results and build on them to advance the idea.
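As an illustration of the k-fold recommendation above, the following is a minimal sketch of a stratified k-fold split in plain Python. It is not the authors' pipeline: the function name `stratified_k_fold`, the class names, and the sample counts are all hypothetical, and a real evaluation would train and score the model once per fold and report the mean and spread of each metric.

```python
import random
from collections import defaultdict


def stratified_k_fold(labels, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs, keeping class proportions
    roughly equal across folds. Hypothetical helper for illustration."""
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # Shuffle each class, then deal its indices round-robin across folds.
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)

    # Each fold serves as the test set exactly once.
    for i in range(k):
        test_idx = sorted(folds[i])
        train_idx = sorted(
            idx for j, fold in enumerate(folds) if j != i for idx in fold
        )
        yield train_idx, test_idx


# Hypothetical dataset: 3 disease classes with 10 images each.
labels = [c for c in ("class_a", "class_b", "class_c") for _ in range(10)]
splits = list(stratified_k_fold(labels, k=5))
```

Every sample appears in exactly one test fold, so averaging the per-fold scores uses all of the data for evaluation without ever testing on a sample the model was trained on in that fold.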
Author Response
Hello, thank you for your valuable comments. This is my response.
Author Response File:
Author Response.docx
Reviewer 2 Report
Comments and Suggestions for Authors
Suggestions and Issues Identified:
Lines 56-58:
The sentence "Traditional convolution performs well in local feature extraction, but there are still shortcomings in global feature extraction" appears incomplete. A more comprehensive explanation is needed of which specific issues arise in global feature extraction. The structure of the Iteration ViT mentioned later acts as a bottleneck for local feature extraction, but what advantages does it offer from a global feature extraction perspective?
Lines 99-109:
Although there is an explanation given in this section, it would be beneficial to provide more detailed insights into the advantages of global feature extraction (computational costs, speed, performance, etc.). A comparative figure outlining the structure of CNN and Iteration ViT would also be helpful.
Line 110:
It would be beneficial to include sample images of various diseases from the IMAGE DATA for a more complete understanding.
Lines 313-314, 331:
The text states that "The cutting parameters of patch _ 16 sizes in P, R, F1, and ACC exceeded the cutting parameters of patch _ 8 size, with F1 being 10.8% higher and ACC exceeding 3.5%". However, in Table 1, the performance of the patch size 16*16 seems to be lower than 8*8. This contradiction needs clarification.
Line 394:
The reliability of the results presented in Figure 5 is questionable. Is the same outcome achieved upon repeated trials? More information on this would be helpful.
Line 468:
There appears to be a typo; "EffificientNet" should be "EfficientNet". It would also be beneficial to include comparison analyses with other models utilizing ViT (e.g., ViT-e, ViT-G).
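The question about repeated trials (Line 394 comment) is commonly answered by reporting the mean and standard deviation of a metric over several runs with different random seeds. A minimal sketch follows; the helper name `summarize_runs` and the accuracy values are hypothetical placeholders, not results from the paper.

```python
import statistics


def summarize_runs(accuracies):
    """Return (mean, sample standard deviation) over repeated runs."""
    mean = statistics.mean(accuracies)
    std = statistics.stdev(accuracies)  # sample std, needs >= 2 runs
    return mean, std


# Hypothetical accuracies from five runs with different seeds.
runs = [0.90, 0.92, 0.91, 0.89, 0.93]
mean, std = summarize_runs(runs)
print(f"accuracy = {mean:.3f} +/- {std:.3f} over {len(runs)} runs")
```

A small standard deviation relative to the gap between compared models suggests the reported ranking is stable across trials.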
Comments on the Quality of English Language
None
Author Response
Hello, thank you for your valuable comments. This is my response.
Author Response File:
Author Response.docx