CornViT: A Multi-Stage Convolutional Vision Transformer Framework for Hierarchical Corn Kernel Analysis
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
It is an enjoyable and understandable study, and the authors have clearly put a lot of effort into hierarchical corn kernel analysis with a multi-stage convolutional vision transformer. However, in my opinion, a series of revisions is necessary before the article can be accepted. The revisions concern both the technical and presentation aspects of the article, as listed below:
1. In the Abstract, it is recommended to add more detail about the dataset used.
2. In the Introduction, the pros and limitations of previous research should be added, showing how the proposed network addresses some of those limitations.
3. In the Introduction, add more literature review, especially research related to deep learning algorithms for precision agriculture, such as: DOI: https://doi.org/10.1016/j.atech.2025.101472.
4. In the Materials and Methods, it is recommended to add a flowchart showing the workflow of the proposed method.
5. In the Materials and Methods, it is recommended to add more detail about the blocks of the proposed network in a figure.
6. In the Results, it is recommended to compare the proposed method with other state-of-the-art baseline CNN methods.
7. In general, the visualization analysis in this paper is not sufficient; more visual analysis is recommended.
Author Response
Comments 0:
It is an enjoyable and understandable study, and the authors have clearly put a lot of effort into hierarchical corn kernel analysis with a multi-stage convolutional vision transformer. However, in my opinion, a series of revisions is necessary before the article can be accepted. The revisions concern both the technical and presentation aspects of the article, as listed below:
Response 0:
We thank the reviewer for the careful reading and constructive suggestions. The manuscript has been revised accordingly, and the comments have been addressed. Responses for each of the comments are specified below.
----------
Comment 1:
In the Abstract, it is recommended to add more detail about the dataset used.
Response 1:
We agree and have revised the abstract to include concrete details about the curated datasets used at each stage. The abstract now specifies the number of kernels in each stage-specific subset and clarifies that these subsets are derived from a publicly available dataset via manual relabeling and filtering.
----------
Comment 2:
In the Introduction, the pros and limitations of previous research should be added, showing how the proposed network addresses some of those limitations.
Response 2:
We have expanded the Introduction to more clearly summarize both the strengths and limitations of prior work on corn kernel and seed-quality analysis. The highlighted text discusses the contributions of earlier approaches (line 44) and explicitly identifies key limitations (lines 52 and 59). We then position the proposed framework as addressing these limitations (line 94).
----------
Comment 3:
In the Introduction, add more literature review, especially research related to deep learning algorithms for precision agriculture, such as: DOI: https://doi.org/10.1016/j.atech.2025.101472.
Response 3:
We appreciate this suggestion and have added a new paragraph in the introduction (line 70) to provide broader context on the use of deep learning in precision agriculture. In addition to the reference you recommended, we have included two more relevant articles, which we believe enhance the discussion significantly.
----------
Comment 4:
In the Materials and Methods, it is recommended to add a flowchart showing the workflow of the proposed method.
Response 4:
We thank the reviewer for pointing this out and have added a flowchart (Figure 1) in Section 3.1 that summarizes the overall workflow of the method. The new figure shows the input, Stage 1, Stage 2, Stage 3, and the final hierarchical label tuple.
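For concreteness, below is a minimal sketch of the hierarchical inference flow that the flowchart summarizes. The function and class-label names ("pure", "flat", "embryo-up", etc.) are illustrative assumptions, not the authors' actual code; only the stage ordering and the gating (Stage 2 runs on pure kernels only, Stage 3 on pure-flat kernels only) follow the manuscript.

```python
def cornvit_pipeline(kernel_image, stage1, stage2, stage3):
    """Return a hierarchical label tuple (purity, shape, orientation).

    stage1/stage2/stage3 stand in for the three trained classifiers;
    the label strings used below are illustrative only.
    """
    purity = stage1(kernel_image)           # Stage 1: purity
    if purity != "pure":
        return (purity, None, None)         # impure kernels stop here

    shape = stage2(kernel_image)            # Stage 2: shape (pure kernels only)
    if shape != "flat":
        return (purity, shape, None)        # only pure-flat kernels continue

    orientation = stage3(kernel_image)      # Stage 3: embryo orientation
    return (purity, shape, orientation)

# Usage with placeholder classifiers standing in for the trained models:
print(cornvit_pipeline("kernel.png",
                       stage1=lambda img: "pure",
                       stage2=lambda img: "flat",
                       stage3=lambda img: "embryo-up"))
# -> ('pure', 'flat', 'embryo-up')
```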
----------
Comment 5:
In the Materials and Methods, it is recommended to add more detail about the blocks of the proposed network in a figure.
Response 5:
We have added a new block diagram (Figure 6) in Section 3.4.1 that shows how CvT-13 is used within each stage of the proposed network. The figure shows the internal blocks of a single stage: three CvT transformer stages, global pooling, and a 2-unit classification head. We have also added a short passage of text that references the figure.
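For illustration, a minimal PyTorch sketch of the single-stage structure shown in Figure 6 follows. The stand-in backbone and the 384-channel feature width (the final-stage embedding size of CvT-13) are assumptions for demonstration; the authors' implementation may differ.

```python
import torch
import torch.nn as nn

class CornViTStage(nn.Module):
    """One stage as in Figure 6: a CvT-13 backbone (three convolutional
    transformer stages), global average pooling, and a 2-unit head.
    `backbone` is assumed to return a feature map of shape (B, C, H, W)."""
    def __init__(self, backbone, feat_dim=384, num_classes=2):
        super().__init__()
        self.backbone = backbone
        self.pool = nn.AdaptiveAvgPool2d(1)      # global pooling
        self.head = nn.Linear(feat_dim, num_classes)

    def forward(self, x):
        feats = self.backbone(x)                 # (B, C, H, W)
        pooled = self.pool(feats).flatten(1)     # (B, C)
        return self.head(pooled)                 # (B, 2) logits

# Demonstration with a dummy backbone (a real CvT-13 would go here):
stage = CornViTStage(nn.Conv2d(3, 384, kernel_size=32, stride=32))
logits = stage(torch.randn(1, 3, 224, 224))     # shape (1, 2)
```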
----------
Comment 6:
In the Results, it is recommended to compare the proposed method with other state-of-the-art baseline CNN methods.
Response 6:
We evaluated CornViT against two strong and widely used CNN backbones under identical training protocols. We have added a new Table 9 that summarizes the comparative performance across all three stages (line 488). We also expanded the description of the baseline CNNs in Section 4.1 to clarify that ResNet-50 and DenseNet-121 are treated as representative state-of-the-art baselines.
----------
Comment 7:
In general, the visualization analysis in this paper is not sufficient; more visual analysis is recommended.
Response 7:
We thank you for this comment and agree that additional visual analysis strengthens the paper. We have added a new subsection in the Results, Section 4.6 Visual Analysis, together with a new Figure 10. This section (line 508) provides qualitative visual analyses of CornViT’s predictions. The figure shows representative examples from each stage, including both correctly classified and misclassified kernels, with the true and predicted labels indicated for each example. We also highlight typical error modes (e.g., borderline impurities, intermediate shapes, subtle embryo cues).
We appreciate your helpful feedback that has improved the clarity and completeness of the manuscript.
Reviewer 2 Report
Comments and Suggestions for Authors
Overall Recommendation: Minor Revision
The paper proposes CornViT, a three-stage CvT-13 based framework for single-kernel analysis (purity, shape, embryo orientation), together with carefully curated datasets and a simple web interface. The work is technically solid, well written, and practically relevant for seed-quality assessment. The main limitations are the relatively narrow range of baselines, the controlled imaging conditions, and some missing connections to recent work on agricultural product classification.
Abstract and keywords
Points to improve:
The abstract is quite dense and could be slightly shortened by removing a few adjectives without losing content. The sentence listing all three accuracies and baselines is long and might be easier to read if split into two sentences.
Introduction
Points to improve:
The introduction focuses mainly on kernel inspection and seed quality, but only briefly touches on the broader context of agricultural product classification using deep learning. To make the work better anchored in that community, I suggest adding a short paragraph that mentions recent advances in image-based classification of agricultural products. In particular, it would be natural to briefly cite Succulent-YOLO (2025), as an example of a modern YOLO-based pipeline for agricultural product classification using UAV imagery and CLIP-enhanced features, and the work “Assessing kernel processing score of harvested corn silage in real-time using image analysis and machine learning” (2022), which is directly related to corn kernel quality assessment using imaging and machine learning.
These references would help readers see CornViT as part of a broader trend towards automated, image-based assessment of crop and grain products.
Problem formulation and dataset preparation
Points to improve:
There is no quantitative measure of how many images were relabeled or discarded during curation. While not strictly necessary, providing approximate numbers or percentages would give readers a better sense of the effort and the level of noise in the original dataset.
Minor issues:
Class names for “silkcut” and related terms are used consistently, but it may be helpful to define “silkcut” once for readers who are not familiar with seed-industry terminology.
Preprocessing and CornViT architecture
Points to improve:
The paper mentions that the internal pretrained flag is disabled and weights are loaded manually. This is a minor implementation detail that could be shortened in favour of a short paragraph discussing why head-only fine-tuning was chosen instead of partial unfreezing, especially for the more complex Stage 3. This could also be revisited in the discussion as a limitation and opportunity for further improvement.
Minor issues:
I did not see major typographical errors here. Please just double-check capitalization of “ImageNet-22k” and ensure that the links in the references are correctly formatted for the final journal style.
Algorithms and evaluation metrics
Points to improve:
The algorithms rely on the reader understanding that D2 contains only pure kernels and D3 only pure-flat kernels. Although this is described earlier, a short reminder in the algorithm description would improve self-containment.
The metrics section could briefly state which metric is used as the primary model selection criterion. For example, it seems accuracy is emphasized, but accuracy can be misleading with class imbalance; the authors do report macro and weighted scores, but it would be good to state that model selection is based on validation accuracy while also inspecting F1.
Experimental results
Only two CNN baselines are considered. They are reasonable choices, but given the strong focus on agricultural imaging in the paper, it would be informative to mention at least briefly whether any lighter CNNs (for example EfficientNet or MobileNet) were tested and how they performed, even if only in an appendix or a short remark.
The independence assumption used when estimating an approximate end-to-end accuracy of about eighty per cent is clearly labelled as a rough lower bound. Still, it might be better to add one sentence explaining that a direct end-to-end evaluation of the full pipeline on a joint test set would give a more accurate picture and could be included in future work.
Small editorial point: when reporting accuracies, sometimes ranges like “76.56–81.02%” are written. It would be clearer to explicitly say “from 76.56 to 81.02 percent” in continuous prose.
Discussion and Conclusions
The discussion could briefly relate the proposed method to the broader topic of agricultural product classification again, with one sentence indicating how similar hierarchical frameworks could be applied to other products (for example fruits or silage quality), and citing Succulent-YOLO and the 2022 corn silage study as complementary lines of work.
Some sentences repeat numerical results already mentioned several times in the paper. It may be possible to shorten this section slightly without losing important information.
Author Response
Comments 0:
The paper proposes CornViT, a three-stage CvT-13 based framework for single-kernel analysis (purity, shape, embryo orientation), together with carefully curated datasets and a simple web interface. The work is technically solid, well written, and practically relevant for seed-quality assessment. The main limitations are the relatively narrow range of baselines, the controlled imaging conditions, and some missing connections to recent work on agricultural product classification.
Response 0:
We thank you for reading and commenting on the manuscript. The paper has been revised substantially, and the comments have been addressed. Replies for each of the comments are specified below, with a description of the corresponding changes made to the manuscript based on each comment.
----------
Abstract and keywords
Comment:
The abstract is quite dense and could be slightly shortened by removing a few adjectives without losing content. The sentence listing all three accuracies and baselines is long and might be easier to read if split into two sentences.
Response:
We agree and have simplified the abstract. We removed several non-essential adjectives to reduce density while keeping all core information. We rewrote the sentence that listed all three accuracies and the CNN baselines as two shorter sentences: one that reports the three CornViT accuracies and one that summarizes the CNN baselines for comparison.
----------
Introduction
Comment:
The introduction focuses mainly on kernel inspection and seed quality, but only briefly touches on the broader context of agricultural product classification using deep learning. To make the work better anchored in that community, I suggest adding a short paragraph that mentions recent advances in image-based classification of agricultural products. In particular, it would be natural to briefly cite Succulent-YOLO (2025), as an example of a modern YOLO-based pipeline for agricultural product classification using UAV imagery and CLIP-enhanced features, and the work “Assessing kernel processing score of harvested corn silage in real-time using image analysis and machine learning” (2022), which is directly related to corn kernel quality assessment using imaging and machine learning.
These references would help readers see CornViT as part of a broader trend towards automated, image-based assessment of crop and grain products.
Response:
We thank you for this comment and have expanded the introduction to better connect CornViT to the broader agricultural product classification literature. We added a dedicated paragraph (line 70) that briefly describes Succulent-YOLO and the 2022 corn silage study, anchoring this work more clearly in that community.
----------
Problem formulation and dataset preparation
Comment 1:
There is no quantitative measure of how many images were relabeled or discarded during curation. While not strictly necessary, providing approximate numbers or percentages would give readers a better sense of the effort and the level of noise in the original dataset.
Response 1:
We have revised the Dataset Preparation subsection (line 272) to include quantitative information about the curation. We now state the total number of images originally examined from the downloaded dataset, the number retained in the curated pool, the fact that the remainder were discarded due to duplication, and the approximate percentage of images that required re-labeling.
Comment 2:
Class names for “silkcut” and related terms are used consistently, but it may be helpful to define “silkcut” once for readers who are not familiar with seed-industry terminology.
Response 2:
We have added a short definition for “silkcut” (line 230) at its first occurrence.
----------
Preprocessing and CornViT architecture
Comment 1:
The paper mentions that the internal pretrained flag is disabled and weights are loaded manually. This is a minor implementation detail that could be shortened in favour of a short paragraph discussing why head-only fine-tuning was chosen instead of partial unfreezing, especially for the more complex Stage 3. This could also be revisited in the discussion as a limitation and opportunity for further improvement.
Response 1:
Thank you for this comment; we have substantially revised Section 3.4.2 (line 352). We removed the low-level implementation detail about disabling the internal pretrained flag and loading weights manually, and replaced it with a concise explanation of our head-only fine-tuning strategy, explicitly stating the rationale behind selecting this approach. We also note that more aggressive fine-tuning, especially for Stage 3 (embryo orientation), is a promising direction for improving performance. In the Discussion, we now explicitly revisit head-only fine-tuning as a limitation and mention partial unfreezing and advanced adaptation as future work.
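To make the distinction concrete, here is a minimal PyTorch sketch of head-only fine-tuning versus partial unfreezing; the toy model and attribute names are hypothetical, not the authors' code.

```python
import torch
import torch.nn as nn

class TinyStage(nn.Module):
    """Toy stand-in for a CvT-13 stage: a `backbone` plus a `head`."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(8, 8)
        self.head = nn.Linear(8, 2)

    def forward(self, x):
        return self.head(self.backbone(x))

model = TinyStage()
for p in model.parameters():           # freeze everything first...
    p.requires_grad = False
for p in model.head.parameters():      # ...then unfreeze only the head
    p.requires_grad = True
# Partial unfreezing (the future-work direction noted above) would also
# re-enable gradients for, e.g., the last backbone block here.

optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3
)
```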
Comment 2:
I did not see major typographical errors here. Please just double-check capitalization of “ImageNet-22k” and ensure that the links in the references are correctly formatted for the final journal style.
Response 2:
We have double-checked the capitalization in all occurrences of “ImageNet-22k” and the links in the references to ensure consistent formatting.
----------
Algorithms and evaluation metrics
Comment 1:
The algorithms rely on the reader understanding that D2 contains only pure kernels and D3 only pure-flat kernels. Although this is described earlier, a short reminder in the algorithm description would improve self-containment.
Response 1:
We have made the algorithms more self-contained by adding a clarifying statement right before them (line 377) that reminds the reader what each dataset contains (D2 holds only pure kernels; D3 holds only pure-flat kernels).
Comment 2:
The metrics section could briefly state which metric is used as the primary model selection criterion. For example, it seems accuracy is emphasized, but accuracy can be misleading with class imbalance; the authors do report macro and weighted scores, but it would be good to state that model selection is based on validation accuracy while also inspecting F1.
Response 2:
We have updated the Evaluation Metrics subsection (line 405) to explicitly describe our model selection criterion. We now state that validation accuracy is used as the primary selection metric for choosing the final checkpoint for each stage.
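For clarity, a generic sketch of the checkpoint-selection loop implied by this criterion follows; the stubbed training and evaluation functions are placeholders, not the authors' code, and macro-F1 is logged alongside for imbalance-aware inspection.

```python
import random

def train_one_epoch(model):          # stub: the real training step goes here
    pass

def evaluate(model):                 # stub: returns (val_accuracy, macro_f1)
    return random.random(), random.random()

model = {"weights": None}            # stand-in for a real nn.Module
best_acc, best_ckpt = 0.0, None
for epoch in range(10):
    train_one_epoch(model)
    acc, macro_f1 = evaluate(model)
    # Select on validation accuracy; macro-F1 is still logged so that
    # class-imbalance effects remain visible during inspection.
    if acc > best_acc:
        best_acc, best_ckpt = acc, dict(model)
```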
----------
Experimental results
Comment 1:
Only two CNN baselines are considered. They are reasonable choices, but given the strong focus on agricultural imaging in the paper, it would be informative to mention at least briefly whether any lighter CNNs (for example EfficientNet or MobileNet) were tested and how they performed, even if only in an appendix or a short remark.
Response 1:
Thank you for the comment. Our goal was to compare a strong, commonly used pair of CNN backbones against the proposed CvT-based architecture, rather than to exhaustively benchmark all CNN families. We have added a new paragraph (line 416) clarifying that we did not evaluate additional lightweight CNNs. We also added a new summary table in the Results section that directly compares CornViT to these two CNN baselines across all stages, making the existing comparison more visible.
Comment 2:
The independence assumption used when estimating an approximate end-to-end accuracy of about eighty per cent is clearly labelled as a rough lower bound. Still, it might be better to add one sentence explaining that a direct end-to-end evaluation of the full pipeline on a joint test set would give a more accurate picture and could be included in future work.
Response 2:
We have updated the Overall Pipeline Behaviour and Error Propagation subsection. Immediately after presenting the approximate 0.803 end-to-end accuracy estimate (line 504), we added a sentence noting that a more accurate picture of real-world performance would require a direct end-to-end evaluation of the full three-stage pipeline on a joint test set covering all purity, shape, and orientation combinations, and we explicitly flag this as future work. Thank you for this suggestion.
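In equation form, the independence-based estimate is the product of the per-stage accuracies; the A_i below are symbolic stand-ins, since only the resulting value of approximately 0.803 appears in the manuscript.

```latex
% Independence assumption: stage errors are treated as uncorrelated, so the
% end-to-end accuracy is approximated by the product of stage accuracies
% (hence the "rough lower bound" reading in the manuscript).
\hat{A}_{\mathrm{pipeline}} \approx A_1 \times A_2 \times A_3 \approx 0.803
```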
Comment 3:
Small editorial point: when reporting accuracies, sometimes ranges like “76.56–81.02%” are written. It would be clearer to explicitly say “from 76.56 to 81.02 percent” in continuous prose.
Response 3:
We have reviewed the entire manuscript and updated all such occurrences, which we agree improves readability.
----------
Discussion and Conclusions
Comment 1:
The discussion could briefly relate the proposed method to the broader topic of agricultural product classification again, with one sentence indicating how similar hierarchical frameworks could be applied to other products (for example fruits or silage quality), and citing Succulent-YOLO and the 2022 corn silage study as complementary lines of work.
Response 1:
We have extended the Discussion section to reconnect CornViT to the broader agricultural product classification context. We added a short paragraph (line 600) explaining that similar hierarchical frameworks could be applied to other products, such as fruit grading and silage quality assessment, along with relevant citations for Succulent-YOLO and the 2022 corn silage study. This helps position CornViT as part of a wider family of deep-learning solutions for agricultural product quality assessment.
Comment 2:
Some sentences repeat numerical results already mentioned several times in the paper. It may be possible to shorten this section slightly without losing important information.
Response 2:
We have carefully revised the Discussion and Conclusions sections to reduce redundancy. We removed or shortened sentences that repeated detailed numerical results already presented in the Abstract, Results, and tables. We kept only the most essential summary figures.
We thank the reviewer again for the thoughtful and constructive feedback. We believe the revisions have substantially improved both the clarity and the broader positioning of the manuscript.
Round 2
Reviewer 1 Report
Comments and Suggestions for Authors
All the comments have been addressed. The article is recommended for publication.
Reviewer 2 Report
Comments and Suggestions for Authors
This manuscript can be accepted without revision.

