Open Access Article

Peer-Review Record

CottonCapT6: A Multi-Task Image Captioning Framework for Cotton Disease and Pest Diagnosis Using CrossViT and T5

Appl. Sci. 2025, 15(19), 10668; https://doi.org/10.3390/app151910668

by Chenzi Zhao^1,2,3, Xiaoyan Meng^1,2,3,*, Bing Bai^1,2,3 and Hao Qiu^1

Reviewer 1: Anonymous

Reviewer 2: Genaro Soto Zarazua

Submission received: 10 September 2025 / Revised: 27 September 2025 / Accepted: 30 September 2025 / Published: 2 October 2025

(This article belongs to the Section Agricultural Science and Technology)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

The manuscript presents a relevant, novel, and technically well-executed contribution to the field of agricultural artificial intelligence. The combination of CrossViT and T5, together with the construction of the CottonDP dataset, adds originality and potential value. However, issues related to dataset representativeness, limited qualitative analysis, and insufficient discussion of practical applications and transferability must be addressed before publication.

Comments for author File: Comments.pdf

Author Response

Answers to Reviewer 1:

Response to comments

Comments 1: The title is precise and accurately conveys both the research focus and the methodological approach. The abstract clearly states the motivation, the proposed method, and the main results. However, it is overly technical and heavily focused on metrics, which may reduce accessibility for a broader audience. A stronger emphasis on practical implications and limitations would improve its balance.

Response: Thank you very much for your detailed feedback and valuable suggestions. In response to your comment that the abstract is overly technical and focuses too heavily on metrics, we have replaced the technical descriptions of the model structure with descriptions of the model's functionality (Lines 6-8, 12-13). Additionally, the metrics section has been simplified, retaining only the key evaluation metrics (Lines 10-12). Furthermore, in response to your suggestion to “place a stronger emphasis on practical implications and limitations,” we have revised the abstract accordingly (Lines 13–16). Specifically, we now emphasize the potential of our framework for real-world deployment in cotton farms to support pest control personnel and farmers, while also clearly acknowledging its current limitations in terms of generalizability to other crops and environmental conditions.

Comments 2: The introduction provides a solid background on the economic and agricultural importance of cotton, while also reviewing related work in deep learning and image captioning across agriculture, medicine, and remote sensing. The literature review is comprehensive and logically structured, but somewhat lengthy and occasionally repetitive. The research gap is clearly identified, although comparisons with the most recent state-of-the-art multimodal models could be better highlighted to reinforce the novelty of the contribution.

Response: Thank you very much for your detailed feedback and valuable suggestions. In response to your valuable suggestion regarding the introduction being somewhat lengthy and occasionally repetitive, we have streamlined the original text. Specifically, we removed redundant summary paragraphs concerning the limitations of traditional cotton disease and pest diagnosis methods and simplified several overly wordy sentences in the literature review. In addition, to better highlight comparisons with cutting-edge models, we have incorporated a comparative analysis with recent state-of-the-art multimodal models, including BLIP-3o, Florence-2, and LLaMA-4 (Lines 82-86).

Comments 3: This section is well organized and details the research framework, dataset construction, preprocessing steps, and model architecture. The CottonDP dataset is a valuable contribution; however, since it is mainly built from web-sourced images, concerns arise about representativeness, annotation reliability, and real-world applicability. The two-stage training strategy (classification pretraining followed by captioning) is thoroughly described, though at times excessively technical. While reproducibility is partially addressed through details on hardware and hyperparameters, more explicit information on expert validation of annotations and data/code availability would strengthen this section.

Response: Thank you very much for your detailed feedback and valuable suggestions. In response to your suggestion that the method section was overly technical, we have revised the original detailed technical descriptions of the model architecture and two-stage training process to focus more on the overall design motivation and logical workflow (Lines 122-135, 244-246, 281-292, 304-307, 314-322, 342-344). This adjustment aims to improve the readability and clarity of the content. Secondly, thank you for your suggestion to provide more information on expert annotation validation. We have added relevant details in the paper, outlining the expert validation process employed during the data annotation phase to ensure the quality and reliability of the annotations (Lines 213-216). Additionally, in response to your suggestion to provide explicit information on expert annotation validation and data/code availability to strengthen credibility, the experimental data can be made available upon request through the corresponding author. We will provide anonymized sample data for verification. However, since this research is part of an ongoing funded project, and to avoid any potential impact on subsequent reviews and intellectual property rights, we are unable to fully disclose the source code at this stage. We apologize for any inconvenience this may cause and sincerely appreciate your understanding.

Comments 4: The results are well structured and show convincing evidence that the proposed model outperforms traditional CNNs and baseline Transformers. Quantitative results are comprehensive, supported by tables and figures, though some visualizations could be improved for clarity. The qualitative analysis is informative but limited to a small number of examples, which restricts the reader's ability to fully assess the model's performance across diverse scenarios.

Response: Thank you for your valuable suggestion regarding the clarity of the visualizations. In response, we have prepared high-definition versions of the images and submitted them with the manuscript. However, because the Overleaf platform has limitations when compiling large images, we did not embed all of them directly in the main text, to avoid potential errors during PDF generation. We kindly ask for your understanding in this matter. Additionally, regarding your comment on the limited number of qualitative analysis examples, we have revised the manuscript to include three additional generation examples for different diseases and pests. This ensures that each type is adequately represented, thereby enhancing the credibility and comprehensiveness of the results (Pages 21-22).

Comments 5: The discussion acknowledges key limitations such as class confusion, lack of detail in generated captions, and dataset size constraints. However, it reads more as a list of weaknesses than a critical analysis. It would benefit from deeper exploration of potential improvements and future directions, such as synthetic data augmentation, adaptation to multi-label or mixed-symptom cases, or strategies for deploying the model on low-cost devices in agricultural settings.

Response: We sincerely thank you for your insightful suggestions on the discussion section. In response, we have expanded our discussion of future research directions. Specifically, to address the issue of class confusion, we plan to incorporate metric learning and generative models to enhance the model’s ability to distinguish between similar symptoms (Lines 529-533). To improve the practicality of the generated text, we intend to integrate agricultural knowledge bases and apply reinforcement learning for further optimization (Lines 537-544). In terms of data, we will construct multi-label datasets and utilize generative techniques to synthesize composite symptom images, thereby enriching sample diversity (Lines 545-548). Furthermore, we will explore end-to-end pretraining paradigms and employ model compression techniques to improve inference efficiency and deployment feasibility (Lines 550-557). Your suggestions have significantly enhanced the depth and forward-looking nature of our discussion, and we are truly grateful for your valuable feedback.

Comments 6: The conclusions provide a coherent summary of the study's findings and restate the effectiveness of the proposed framework. Nevertheless, they are general in scope and do not adequately highlight the practical implications for precision agriculture or propose concrete steps for field validation and scalability. Expanding this section with clearer projections for future applications would strengthen the manuscript's impact.

Response: We sincerely thank the reviewer for the constructive suggestion regarding the need to highlight practical implications and future applications in the conclusions. In response, we have revised the conclusion section to more clearly emphasize both the immediate and long-term practical impact of our work. Specifically, we now describe how CottonCapT6 can assist agricultural extension services and agronomists by generating standardized and interpretable reports from field images, reducing reliance on subjective assessments and improving the speed and consistency of disease logging. Furthermore, we have explicitly addressed limitations and outlined concrete next steps for transitioning from laboratory validation to field deployment, including collaborative trials with agricultural research stations, development of a lightweight mobile application using model compression techniques, and validation under diverse real-world conditions. Future directions are also detailed, such as multi-label training, advanced data augmentation, adaptation to other high-value crops via transfer learning, integration of knowledge graphs, and optimization for edge deployment (Lines 561-564, 567-590, 593-595). These additions aim to strengthen the practical relevance and scalability of the proposed framework. We believe these revisions significantly enhance the depth, clarity, and forward-looking perspective of the conclusion section, and we sincerely appreciate your valuable guidance.

Comments 7: The references are up to date and relevant, covering work up to 2025. The list is extensive, but could be streamlined by removing redundancies and focusing on the most critical and directly related studies.

Response: We sincerely thank the reviewer for the valuable suggestion regarding the references. We have streamlined the reference list by removing redundant entries that cited similar studies, and have focused on the most critical and directly related works to improve the relevance and conciseness of the references (Pages 25-26). We believe this revision will further enhance the academic quality and readability of the paper.

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

The manuscript, titled "CottonCapT6: A Multi-task Image Captioning Framework for Cotton Disease and Pest Diagnosis Using CrossViT and T5," presents an original and well-founded contribution to the field of artificial intelligence applied to agriculture, specifically to the diagnosis of cotton diseases and pests using advanced image captioning techniques. The combination of CrossViT and T5 in a two-stage framework is innovative and addresses a real need in precision agriculture. The construction of the CottonDP dataset is a valuable contribution to the community. The overall structure is clear, and the experimental results are robust and convincing. However, there are some errors that need to be corrected before the article can be considered for publication. Below are my comments for correction:

The Introduction needs a final paragraph that explicitly states the objective of the article (the aim for this paper was...), and the transition between paragraphs can be improved for fluidity.

Methodology: The access dates (e.g., L142, 143) should be reviewed for redundant descriptions. It would be important to specify the use of the validation set during training of the captioning model. Image titles should also be spoken in the third person and should be legible without referring to the text. Check all titles and correct them. I consider Table 1 irrelevant and can be included in the text.

Results: Figure 9 could be presented differently for better understanding. It would be interesting to include a brief discussion of the practical significance of the improvements in CIDEr and BLEU.

Conclusions: Statements should not be made without quantification, be careful in this and practical implications and future directions can be emphasized without repeating results merely for the purpose of prospecting.

I find abbreviations at the beginning of the manuscript rather than at the end more useful, but this will depend on whether the journal's format allows it.

Comments on the Quality of English Language

There are phrasing errors, inconsistent academic tone, and occasional use of informal language. L33: "more and more achievements" can be written as "increasingly significant achievements"

L26: "But it is constrained..." can be said as "However, it is constrained..."

L115, 120: "We also proposed" is not appropriate to speak in the first person, but in the third person.

The use of "Ours" in tables and figures comes from the previous comment; it should be edited.

Author Response

Answers to Reviewer 2:

Response to comments

Comments 1: The Introduction needs a final paragraph that explicitly states the objective of the article (the aim for this paper was...), and the transition between paragraphs can be improved for fluidity.

Response: We sincerely thank you for your valuable comments on the Introduction. In response, we have explicitly stated the objective of this study in the concluding paragraph of the Introduction (Lines 93-96), and have improved the flow and logical connection between paragraphs by adding transitional sentences (Lines 27-30, 47-49). We believe these revisions significantly enhance the structural clarity and readability of the Introduction, and we sincerely appreciate your guidance.

Comments 2: Methodology: The access dates (e.g., L142, 143) should be reviewed for redundant descriptions. It would be important to specify the use of the validation set during training of the captioning model. Image titles should also be spoken in the third person and should be legible without referring to the text. Check all titles and correct them. I consider Table 1 irrelevant and can be included in the text.

Response: We sincerely thank the reviewer for the detailed suggestions on the Methodology section. We have carefully addressed each point as follows: First, redundant descriptions of access dates, such as those in lines 142 and 143, have been streamlined (Lines 119-122). Second, we have added a detailed explanation of the use of the validation set in the “Experimental Setup” subsection (Lines 349-356). Third, all image titles have been individually optimized to ensure that the information is complete and clear. Finally, the content of the original Table 1 has been integrated into the main text, and the redundant table has been removed (Lines 176-180). We believe these revisions significantly enhance the clarity and readability of the Methodology section and sincerely appreciate your guidance.

Comments 3: Results: Figure 9 could be presented differently for better understanding. It would be interesting to include a brief discussion of the practical significance of the improvements in CIDEr and BLEU.

Response: We sincerely thank the reviewer for the constructive suggestions regarding the Results section. In response, we have made focused improvements to Figure 9 by redrawing the confusion matrix heatmap to enhance visual clarity and adding detailed figure captions. The new captions clearly specify the meanings of rows and columns (predicted vs. true classes), explain the correspondence between the color gradient and sample counts (deep blue representing high counts, white representing low counts), and indicate that diagonal elements correspond to correct classifications while off-diagonal elements highlight misclassifications (Page 17). Additionally, we have added a brief discussion in the text on the improvements in CIDEr and BLEU scores, emphasizing how these enhancements contribute to generating more accurate and semantically rich image captions (Lines 471-472, 474-476, 483-493). We believe these revisions strengthen both the clarity and practical significance of the Results section.
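For readers unfamiliar with how such a heatmap is read, the following is a minimal sketch of the counts behind a confusion matrix. It is not the authors' code: the class names and labels are hypothetical, and the orientation (rows as true classes, columns as predicted classes) is an assumption, since the response does not state which axis is which.

```python
from collections import Counter


def confusion_matrix(true_labels, predicted_labels, classes):
    """Return a square matrix where entry [i][j] counts the samples whose
    true class is classes[i] and whose predicted class is classes[j].
    Assumed orientation: rows = true classes, columns = predicted classes."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    counts = Counter(zip(true_labels, predicted_labels))
    return [[counts[(t, p)] for p in classes] for t in classes]


def accuracy(matrix):
    """Fraction of samples on the diagonal (correct classifications)."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total if total else 0.0


# Hypothetical class names and labels, for illustration only.
classes = ["aphid", "leaf_spot", "healthy"]
y_true = ["aphid", "aphid", "leaf_spot", "healthy", "leaf_spot"]
y_pred = ["aphid", "leaf_spot", "leaf_spot", "healthy", "leaf_spot"]

cm = confusion_matrix(y_true, y_pred, classes)
for name, row in zip(classes, cm):
    print(f"{name:>10}: {row}")
print("accuracy:", accuracy(cm))
```

In this toy example the single off-diagonal count is an aphid image predicted as leaf spot, which is the kind of cell a heatmap would highlight as a misclassification.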

Comments 4: Conclusions: Statements should not be made without quantification, be careful in this and practical implications and future directions can be emphasized without repeating results merely for the purpose of prospecting.

Response: We sincerely thank the reviewer for the constructive suggestion regarding the Conclusions. In response, we have revised this section to avoid making statements without quantitative support, while still highlighting the practical implications and future directions of our work. Specifically, we now describe how CottonCapT6 can support agricultural extension services and agronomists by generating standardized and interpretable reports from field images, which reduces reliance on subjective assessments and improves the speed and consistency of disease logging. We have also clearly addressed the limitations of the study and outlined concrete next steps for transitioning from laboratory validation to field deployment, including collaborative trials with agricultural research stations, development of a lightweight mobile application using model compression techniques, and validation under diverse real-world conditions. Future directions are further detailed, such as multi-label training, advanced data augmentation, adaptation to other high-value crops via transfer learning, integration of knowledge graphs, and optimization for edge deployment (Lines 561–564, 567–590, 593–595). These revisions emphasize practical relevance and scalability without merely repeating the results. We believe these changes significantly improve the clarity, depth, and forward-looking perspective of the Conclusions, and we sincerely appreciate your valuable guidance.

Comments 5: I find abbreviations at the beginning of the manuscript rather than at the end more useful, but this will depend on whether the journal's format allows it.

Response: We sincerely thank the reviewer for the suggestion regarding the placement of abbreviations. We agree that presenting them at the beginning could be more convenient for readers; however, in accordance with the formatting requirements of the journal, we have placed the abbreviations at the end of the manuscript. We appreciate your understanding.

Comments 6: There are phrasing errors, inconsistent academic tone, and occasional use of informal language.

L33: "more and more achievements" can be written as "increasingly significant achievements"

L26: "But it is constrained..." can be said as "However, it is constrained..."

L115, 120: "We also proposed" is not appropriate to speak in the first person, but in the third person.

The use of "Ours" in tables and figures comes from the previous comment; it should be edited.

Response: Thank you for your valuable feedback regarding phrasing, tone consistency, and formal language. In response, we have made the following revisions:

L33: The phrase "more and more achievements" has been revised to "increasingly significant achievements" for a more formal and academic tone.
L26: The phrase "But it is constrained..." has been changed to "However, it is constrained..." to ensure consistency and formality.
L97, L99: The sentences have been rewritten in the passive voice, in line with the academic convention of using the third person in scientific writing.
Use of "Ours": The term "Ours" in tables and figures has been deleted, as per your recommendation to avoid informal language and maintain consistency with academic writing standards.

We believe these changes enhance the clarity and formality of the manuscript, and we sincerely appreciate your guidance in improving the overall quality of the paper.

Author Response File: Author Response.pdf
