Ornamental Potential Classification and Prediction for Pepper Plants (Capsicum spp.): A Comparison Using Morphological Measurements and RGB Images as Data Source
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
- In the introduction section, clearly emphasize the innovative aspects of this research and its breakthrough contributions compared to previous studies, to highlight the significance of the study.
- The introduction insufficiently discusses the application of RGB imaging and deep learning methods in horticultural plant research; additional relevant literature should be incorporated.
- The dataset used in this study is relatively small (only 120 samples), which may adversely affect the generalization performance of the model. Since ViT models typically require large-scale datasets, the current results might entail an overfitting risk.
- Please clarify the rationale behind selecting the XGBoost and ViT models, including their specific advantages over alternative methods such as Convolutional Neural Networks (CNN).
- While the ViT model utilizes Linformer to reduce the complexity of the attention mechanism—demonstrating a concern for computational efficiency—other crucial model design details (such as learning rate strategies, regularization techniques, and weight initialization methods) are not described in sufficient detail and should be further supplemented.
- The ViT model achieved an accuracy of 99.2% under the expert evaluation criterion; however, specific error-case analyses beyond the confusion matrix are absent, making it difficult to determine whether the model excessively relies on particular features (e.g., background noise).
- Descriptions of image preprocessing and dataset partitioning are insufficient. For instance, it is unclear whether the test set contains data from the same varieties or environmental conditions as the training set, which may introduce bias.
- Experimental results indicate minimal accuracy variation between different image resolutions (300×200 px and 600×400 px); please analyze and clarify the underlying reasons for this phenomenon.
- Although comparisons between controlled and uncontrolled greenhouses were mentioned, the manuscript lacks detailed discussion on this factor. It is recommended that the authors elaborate on how environmental factors (such as illumination, temperature, and humidity) impact phenotypic traits and model prediction performance.
- Please explain the specific biological or phenotypic mechanisms enabling predictions of ornamental potential up to seven weeks in advance. Specifically, which early features (such as plant morphology, growth habits, or leaf color) strongly influence subsequent ornamental traits?
- The manuscript English expression is generally clear; however, some sentences are lengthy, and certain terminologies are inconsistently used. Professional language editing is suggested to enhance accuracy and ensure terminology uniformity.
Author Response
- In the introduction section, clearly emphasize the innovative aspects of this research and its breakthrough contributions compared to previous studies, to highlight the significance of the study.
Thank you for this helpful suggestion. We have added a sentence to the introduction to further clarify the contribution of our study.
“As far as the authors are aware, this work is the first to forecast the ornamental potential of pepper plants (Capsicum spp.) multiple weeks ahead of time using image-based deep learning models.”
- The introduction insufficiently discusses the application of RGB imaging and deep learning methods in horticultural plant research; additional relevant literature should be incorporated.
We appreciate the suggestion. A paragraph highlighting the main applications of RGB images and deep learning methods targeting ornamental plants has been included in the introduction.
- The dataset used in this study is relatively small (only 120 samples), which may adversely affect the generalization performance of the model. Since ViT models typically require large-scale datasets, the current results might entail an overfitting risk.
We sincerely appreciate the reviewer’s insightful comment regarding the dataset size and its potential impact on model generalization. The discussion already notes that this constitutes a limitation of the study. We acknowledge that Vision Transformer (ViT) models often benefit from large-scale datasets due to their high capacity. However, in this study, we employed several strategies to mitigate overfitting and enhance generalization despite the limited sample size (N=120):
Data Augmentation: We applied augmentation techniques (flipping) to artificially diversify the training data, thereby improving the model’s robustness (a minimal sketch is given after this list).
Regularization: Linformer’s low-rank projection reduces overfitting by design.
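For illustration only, the following is a minimal sketch of such a flipping-based augmentation pipeline, assuming a torchvision workflow; the exact transforms and parameters used in the study are not reproduced here.

```python
from PIL import Image
from torchvision import transforms

# Hypothetical augmentation pipeline: the study mentions flipping; the remaining
# values (working resolution, flip probability) are illustrative only.
train_transforms = transforms.Compose([
    transforms.Resize((200, 300)),           # lower working resolution (height, width)
    transforms.RandomHorizontalFlip(p=0.5),  # flipping augmentation referred to above
    transforms.ToTensor(),
])

dummy = Image.new("RGB", (600, 400))         # stand-in for a rotational plant photograph
tensor = train_transforms(dummy)             # resulting tensor shape: (3, 200, 300)
```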
While we agree that a larger dataset would further validate the findings, our experiments demonstrate consistent performance metrics, suggesting minimal overfitting.
The following text appears in the Discussion section:
“The main drawback of this study is the relatively small dataset used, which may constrain the generalizability of the results, particularly for more complex or deeper model architectures. Future work should aim to expand the dataset across different growing conditions, developmental stages, and genotypic backgrounds.”
- Please clarify the rationale behind selecting the XGBoost and ViT models, including their specific advantages over alternative methods such as Convolutional Neural Networks (CNN).
Thank you for your thoughtful question regarding our model selection. We chose XGBoost and Vision Transformer (ViT) for the following reasons:
XGBoost:
Handles structured data efficiently: For tabular or feature-based data, XGBoost often outperforms deep learning models like CNNs due to its robustness, interpretability, and ability to handle missing values.
Computationally efficient: It requires fewer resources for training compared to deep neural networks while achieving competitive accuracy.
Feature importance: Provides clear insights into which features drive predictions, which is valuable for our analysis.
Vision Transformer:
Global context modeling: Unlike CNNs, which process images locally through convolutional filters, ViT leverages self-attention to capture long-range dependencies across the entire image, improving performance on tasks requiring holistic understanding.
Scalability: ViT has demonstrated superior performance on large-scale datasets, and with proper pre-training, it can generalize well even with limited data.
State-of-the-art results: In our experiments, ViT achieved higher accuracy compared to traditional CNNs for the given task.
While CNNs remain a strong baseline for image-related tasks, we found that ViT’s attention mechanism provided better performance for our specific dataset. That said, we acknowledge CNNs’ advantages in scenarios with limited data or requiring inductive bias for spatial hierarchies.
We appreciate your engagement. We have added sentences to the text to further clarify this aspect:
“The rationale behind the use of this algorithm is that it handles structured data efficiently, it is computationally efficient and provides insights about feature importance as well.”
“A Vision Transformer (ViT) model was used for image-based classification due to its state-of-the-art capabilities and global context modeling, in contrast to alternatives such as Convolutional Neural Networks (CNNs), which focus on local context instead.”
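As an illustration of the feature-importance rationale quoted above, the following is a hypothetical sketch of fitting an XGBoost classifier on tabular morphological data and reading out the resulting importances; the feature names and hyperparameters are examples, not the study's actual configuration.

```python
import numpy as np
from xgboost import XGBClassifier

# Hypothetical morphological feature matrix for 120 plants (dataset size as in the study);
# the feature names echo variables mentioned in the discussion but are illustrative here.
feature_names = ["plant_height", "canopy_diameter", "growth_habit", "leaf_density"]
X = np.random.rand(120, len(feature_names))
y = np.random.randint(0, 2, size=120)        # dummy ornamental-potential labels

model = XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1)
model.fit(X, y)

# Feature importance: one of the stated reasons for choosing XGBoost for the tabular data
for name, score in zip(feature_names, model.feature_importances_):
    print(f"{name}: {score:.3f}")
```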
- While the ViT model utilizes Linformer to reduce the complexity of the attention mechanism—demonstrating a concern for computational efficiency—other crucial model design details (such as learning rate strategies, regularization techniques, and weight initialization methods) are not described in sufficient detail and should be further supplemented.
Thank you for your comment. We now cite the ViT implementation we built upon and provide the GitHub repository for readers interested in further implementation details. Furthermore, we added this part to the article:
“was built upon the following GitHub repository [33]”
- Phil Wang (lucidrains). Vision Transformer (ViT) - PyTorch Implementation - GitHub repository. https://github.com/lucidrains/vit-pytorch, 2020. Accessed: 2025-06-11.
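For readers who want a concrete starting point, the sketch below shows how the cited repository combines a Linformer attention module with its efficient ViT wrapper; the hyperparameters are illustrative and do not reflect the exact configuration, learning-rate schedule, or initialization used in the study.

```python
import torch
from linformer import Linformer
from vit_pytorch.efficient import ViT

# Linformer replaces full self-attention with a low-rank projection (rank k),
# which is the efficiency/regularization property discussed in the responses above.
efficient_transformer = Linformer(
    dim=128,
    seq_len=(256 // 32) ** 2 + 1,  # number of patches + 1 class token
    depth=6,
    heads=8,
    k=64,
)

model = ViT(
    dim=128,
    image_size=256,
    patch_size=32,
    num_classes=2,                 # illustrative label scheme, not the study's exact classes
    transformer=efficient_transformer,
)

img = torch.randn(1, 3, 256, 256)  # dummy RGB input
logits = model(img)                # shape: (1, 2)
```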
- The ViT model achieved an accuracy of 99.2% under the expert evaluation criterion; however, specific error-case analyses beyond the confusion matrix are absent, making it difficult to determine whether the model excessively relies on particular features (e.g., background noise).
We appreciate the reviewer’s valuable feedback regarding the need for deeper error-case analysis. While the confusion matrix provides a high-level overview of misclassifications, we agree that further investigation into specific failure modes, such as potential overreliance on background features or noise, would strengthen the interpretability of our results. Some of the authors of this article are currently working on this issue for a future article in progress.
Since the article is already lengthy, the authors have decided to reserve this analysis for future work. To address this issue, we will:
Perform Error-case Sampling: Select and qualitatively analyze misclassified instances (i.e., false positives/negatives) to identify patterns (such as biases toward background artifacts, lighting conditions, or spurious correlations); a minimal sketch of this step is given after this list.
Feature Attribution Analysis: Use explainability tools (attention maps in ViT) to highlight which image regions most influenced the model’s decisions, explicitly testing for dependency on background noise.
Augmentation Testing: Evaluate performance under controlled perturbations (masking foreground/background) to isolate feature reliance.
Thank you again for this insightful suggestion.
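As a simple illustration of the planned error-case sampling step, the following hypothetical sketch collects misclassified test images for later qualitative review; the data are dummy placeholders, since the analysis itself is deferred to future work as stated above.

```python
import numpy as np

# Dummy ground-truth labels, model predictions, and image identifiers
y_true = np.array([0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 0, 0, 1])
image_ids = np.array(["p01_a00", "p02_a00", "p03_a00", "p04_a00", "p05_a00"])

# Error-case sampling: keep only the misclassified instances for qualitative review
errors = image_ids[y_true != y_pred]
print(errors)  # images to inspect for background, lighting, or other spurious cues
```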
- Descriptions of image preprocessing and dataset partitioning are insufficient. For instance, it is unclear whether the test set contains data from the same varieties or environmental conditions as the training set, which may introduce bias.
To address the reviewer's concern about potential bias in dataset partitioning, we rigorously designed our splitting strategy to preserve dataset independence and prevent data leakage by partitioning at the plant level (rather than mixing images from the same plant across datasets). Here’s the justification:
- Preventing Data Leakage and Overfitting
Problem if split by image: If individual rotational photographs (n=19) from the same plant were distributed across train/validation/test sets, the model could inadvertently "memorize" features specific to that plant (for example, unique scars, leaf orientations, or pot placement). This would artificially inflate performance metrics because the test set would not be truly independent.
Our solution: By keeping all images of a single plant within the same dataset partition, we ensure no overlap in biological identity or microenvironment. This mimics real-world conditions where the model must generalize to entirely new plants, not just new angles of seen plants.
- Controlling for Bias Across Treatments
Balanced representation: While partitioning by plant, we ensured that each dataset (train/validation/test) contained plants from all treatments, blocks, and greenhouse conditions in representative proportions. This prevents bias toward specific experimental conditions.
Stratification: We stratified splits to preserve the distribution of key variables (treatment and greenhouse) in each partition, avoiding accidental concentration of a specific condition in one dataset.
Thus, while environmental conditions (greenhouse) are represented across partitions, no two images from the same plant appear in different partitions. This ensures the model is tested on novel combinations of treatment × plant × environment, not just novel angles of familiar plants.
To clarify in the manuscript, we have added this text:
"Dataset partitioning was performed at the plant level to prevent data leakage. All 19 rotational images of a single plant were assigned to the same partition (train, validation, or test), ensuring that the model’s performance reflects generalization to new plants under consistent conditions. Partitions were stratified by treatment and growth conditions to maintain balanced representation across experimental factors."
- Experimental results indicate minimal accuracy variation between different image resolutions (300×200 px and 600×400 px); please analyze and clarify the underlying reasons for this phenomenon.
Thanks for your comment. We agree that a further study is needed to validate this conclusion by testing more resolutions and determining whether the trend holds or differences appear at some point.
We believe that the observed behavior is in agreement with what can be expected due to these facts:
- Ornamental value is inherently tied to human perception, which is robust to resolution changes within reasonable limits. The model might be mimicking this behavior by relying on similar high-level cues instead of fine-grained details.
- Sufficient information is retained at the lower resolution since key features are preserved. We believe that for ornamental value prediction, the critical visual features (color distribution, texture, shape, flower arrangement) may not require extremely high resolution to be discernible.
The following text has been incorporated into the discussion in this sense:
“… increasing the RGB image resolution beyond 200×300 px does not improve the accuracy, and thus this resolution seems sufficient for the model to retain enough detail about the plant under assessment. Nevertheless, this should be further validated in future work by testing more resolutions with a bigger dataset as well.”
- Although comparisons between controlled and uncontrolled greenhouses were mentioned, the manuscript lacks detailed discussion on this factor. It is recommended that the authors elaborate on how environmental factors (such as illumination, temperature, and humidity) impact phenotypic traits and model prediction performance.
We are grateful for the contribution made by the reviewer. Environmental factors (light, temperature, and humidity) influence several aspects of plant development but were not used as input variables in our predictive models. A paragraph with this focus was added to the discussion section. However, this study did not aim to evaluate the effect of environmental conditions on the development of pepper plants. We believe that sampling images under the conditions described (controlled and uncontrolled greenhouses) added representativeness to the dataset used. Future work may evaluate the influence of these environmental variables on ornamental classification.
Added text:
“The potential benefits of developing environment-specific ViT models (for controlled vs. uncontrolled greenhouses) or treatment-specific variants warrant further investigation.”
- Please explain the specific biological or phenotypic mechanisms enabling predictions of ornamental potential up to seven weeks in advance. Specifically, which early features (such as plant morphology, growth habits, or leaf color) strongly influence subsequent ornamental traits?
We are grateful for the contribution made by the reviewer. The phenotypic expression of the observed characteristics depends on several biological processes that are strongly influenced by genotype and environment. Variables such as photosynthetic rate, transpiration, and stomatal opening, which were not measured in this work, are related to the variables of importance in the model, such as plant height, canopy diameter, growth habit, and leaf density. It is not possible to state precisely which mechanisms were involved, and this issue certainly merits further investigation in future work. However, in general, these are the characteristics that enabled the prediction, and they are highlighted in a section added to the discussion.
Author Response File: Author Response.pdf
Reviewer 2 Report
Comments and Suggestions for Authors
By comparing morphological measurement with RGB image analysis, this paper constructs a model based on Vision Transformer (ViT) and XGBoost to achieve classification and early prediction of the ornamental potential of Capsicum plants. The study found that a single RGB image can achieve high-precision prediction 7 weeks before fruit maturity (accuracy>85%), and low-resolution (300×200 pixels) images can achieve classification performance comparable to morphological measurement (accuracy>91%). This method verifies the feasibility of computer vision to replace traditional morphological measurement and provides a new paradigm for non-invasive phenotyping for ornamental plant breeding. The article can be accepted for publication after minor revisions.
1. Strengthen the statement of research innovations in the abstract and clearly distinguish the difference in prediction performance between morphological measurement and image analysis methods. The unique advantages of the ViT model in time series prediction and its innovative significance for breeding decisions can be supplemented.
2. The discussion section needs to deeply analyze the biological mechanism of the divergence between the Veiling Holambra standard and expert evaluation, and it is recommended to combine the dynamic analysis of plant development to analyze the time-varying correlation between morphological indicators and ornamental traits in the standard.
3. In the method section, the image preprocessing process should be refined, the processing method of RAW format data and the basis for selecting the downsampling algorithm should be supplemented, and the specific impact of the data enhancement strategy on the generalization ability of the model should be clarified.
4. The literature review should include the recent progress in the application of deep learning methods in phenotypic prediction, and supplement the performance difference between CNN and ViT in plant image analysis to strengthen the theoretical basis.
5. The conclusion section should emphasize the universal value of the methodology, extend the discussion of the application potential of the framework in other ornamental crops, and quantify the benefits of shortening the breeding cycle brought by the prediction model.
Author Response
By comparing morphological measurement with RGB image analysis, this paper constructs a model based on Vision Transformer (ViT) and XGBoost to achieve classification and early prediction of the ornamental potential of Capsicum plants. The study found that a single RGB image can achieve high-precision prediction 7 weeks before fruit maturity (accuracy>85%), and low-resolution (300×200 pixels) images can achieve classification performance comparable to morphological measurement (accuracy>91%). This method verifies the feasibility of computer vision to replace traditional morphological measurement and provides a new paradigm for non-invasive phenotyping for ornamental plant breeding. The article can be accepted for publication after minor revisions.
Strengthen the statement of research innovations in the abstract and clearly distinguish the difference in prediction performance between morphological measurement and image analysis methods. The unique advantages of the ViT model in time series prediction and its innovative significance for breeding decisions can be supplemented.
Thanks for your suggestion. A paragraph was added to the abstract with the performance figures for both approaches.
- The discussion section needs to deeply analyze the biological mechanism of the divergence between the Veiling Holambra standard and expert evaluation, and it is recommended to combine the dynamic analysis of plant development to analyze the time-varying correlation between morphological indicators and ornamental traits in the standard.
We appreciate the reviewer's valuable suggestion; a paragraph was added in the discussion dealing with this issue.
In the method section, the image preprocessing process should be refined, the processing method of RAW format data and the basis for selecting the downsampling algorithm should be supplemented, and the specific impact of the data enhancement strategy on the generalization ability of the model should be clarified.
Thanks for your comment. We did not use RAW images in this case, although we acquired them for future use. Regarding the downsampling, the Lanczos algorithm was employed. This has been clarified in the Methods section.
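As a minimal sketch of this downsampling step (Pillow is assumed here; the file names and the exact pipeline are illustrative):

```python
from PIL import Image

img = Image.new("RGB", (600, 400))                      # stand-in for a full-size photograph
small = img.resize((300, 200), resample=Image.LANCZOS)  # Lanczos filter, (width, height) in px
small.save("plant001_300x200.jpg")
```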
The literature review should include the recent progress in the application of deep learning methods in phenotypic prediction, and supplement the performance difference between CNN and ViT in plant image analysis to strengthen the theoretical basis.
We appreciate the suggestion. A paragraph highlighting the main applications of RGB images and deep learning methods targeting ornamental plants has been included in the introduction.
- The conclusion section should emphasize the universal value of the methodology, extend the discussion of the application potential of the framework in other ornamental crops, and quantify the benefits of shortening the breeding cycle brought by the prediction model.
We appreciate the suggestions; a paragraph was inserted in the discussion listing the benefits of anticipating plant selection by seven weeks within a breeding program, and we believe that this benefit has been well quantified. Regarding the generalization of the methodology to other ornamental crops, this should be further examined in future studies, but we believe the methodology can be applied.
Author Response File: Author Response.pdf
Reviewer 3 Report
Comments and Suggestions for Authors
It is not clear what is the difference between the two cases with corresponding confusion matrices in Figure 10 and Figure 8. The same applies for those cases of Figures 9 and 11. It seems to me that the difference lies in the image size, and the size should be specified in the figure descriptions.
How were the XGBoost model parameters determined? How were the ViT architecture parameters chosen?
How were the morphological descriptors calculated from image analysis? What image morphology operators were used?
Author Response
It is not clear what is the difference between the two cases with corresponding confusion matrices in Figure 10 and Figure 8. The same applies for those cases of Figures 9 and 11. It seems to me that the difference lies in the image size, and the size should be specified in the figure descriptions.
Thank you for your comment; you are right, the difference lies in the use of images at 600 × 400 px versus 300 × 200 px resolution. This information has been included in the figure captions to clarify this aspect. Thanks.
How were the XGBoost model parameters determined? How were the ViT architecture parameters chosen?
The XGBoost hyperparameters were optimized using a grid search. The ViT architecture was chosen following recommendations in the literature for small datasets and the image sizes used, although further optimization is possible; this has been left as a future line of research.
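For illustration, a hypothetical sketch of such a grid search over XGBoost hyperparameters is shown below (the parameter grid and data are placeholders, not the values actually searched in the study).

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

X = np.random.rand(120, 6)              # 120 plants, dummy morphological features
y = np.random.randint(0, 2, size=120)   # dummy ornamental-potential labels

param_grid = {                           # illustrative grid only
    "n_estimators": [100, 200, 400],
    "max_depth": [3, 4, 6],
    "learning_rate": [0.01, 0.1, 0.3],
}
search = GridSearchCV(XGBClassifier(), param_grid, cv=5, scoring="accuracy")
search.fit(X, y)
print(search.best_params_, search.best_score_)
```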
How were the morphological descriptors calculated from image analysis? What image morphology operators were used?
The morphological descriptors were obtained as indicated in the Methodology (Section 2.3, Morphological measurements); no image analysis techniques were used, as all measurements were taken manually.
Author Response File: Author Response.pdf
Round 2
Reviewer 1 Report
Comments and Suggestions for Authors
Can be accepted.