Comprehensive Evaluation of Paprika Instance Segmentation Models Based on Segmentation Quality and Confidence Score Reliability
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
1. Although the paper mentions that the dataset comes from large-scale commercial greenhouses, it is recommended to further clarify the diversity and representativeness of the dataset. In particular, does it cover different lighting conditions, growth stages, and varieties of chili fruit?
2. It is recommended to explore how to dynamically adjust the confidence threshold in practical applications to adapt to environmental changes when discussing the reliability of confidence scores.
3. It is suggested to propose more specific future research directions in the conclusion section, such as how to further improve the segmentation quality of the model in complex environments.
Author Response
We would like to express our sincere gratitude for the numerous appropriate and valuable comments we have received regarding this submitted paper. We deeply appreciate the feedback from various perspectives. Below, we have outlined our responses and actions taken in response to each comment, and we kindly ask for your consideration in reviewing them. For your convenience, we have attached a version of the manuscript with all changes highlighted.
Comments 1: Although the paper mentions that the dataset comes from large-scale commercial greenhouses, it is recommended to further clarify the diversity and representativeness of the dataset. In particular, does it cover different lighting conditions, growth stages, and varieties of chili fruit?
Response 1: Thank you for your thoughtful comment. The primary objective of this study is to establish reliable evaluation metrics for paprika detection models, rather than to create a comprehensive dataset covering diverse conditions. For this purpose, we deliberately chose controlled conditions: consistent LED lighting during nighttime and a single variety (Capsicum annuum L. 'Nagano') in commercial greenhouses. This controlled setting was essential to minimize confounding variables and ensure the validity of our proposed evaluation metrics. We acknowledge that our dataset does not aim for diversity or broad representativeness across different lighting conditions, growth stages, or varieties. As you correctly noted, we have now added a clearer description of the growth stages included in our study to Section 3.1 (P3L120). We have revised the discussion section (P14L455-L460) to explicitly state that our proposed metrics were validated under these specific conditions, and that their applicability to diverse environments would require further investigation. This clarification ensures readers understand the scope and limitations of our contribution.
Comments 2: It is recommended to explore how to dynamically adjust the confidence threshold in practical applications to adapt to environmental changes when discussing the reliability of confidence scores.
Response 2: Thank you for your suggestion. We have added a discussion in Section 5.3 (P13L439-L446) about dynamic confidence threshold adjustment methods, referencing relevant studies [30, 31] and potential applications for immature fruit detection and seasonal variations. While not implemented in our current work, these approaches are important for future practical applications.
Comments 3: It is suggested to propose more specific future research directions in the conclusion section, such as how to further improve the segmentation quality of the model in complex environments.
Response 3: Thank you for your suggestion. Following your recommendation, we have added specific future research directions to Section 6 (P14L480-L489), including technical improvements for complex environments, extension to other crops and varieties, and integration with commercial agricultural systems.
Author Response File: Author Response.pdf
Reviewer 2 Report
Comments and Suggestions for Authors
This paper proposes an evaluation method for paprika instance segmentation models that considers not only traditional mask AP metrics but also introduces AJI and confidence score reliability assessments, showing a degree of innovation and practical value. It is recommended for publication after minor revision. Some concerns:
- The theoretical basis for the confidence score reliability assessment is not sufficiently discussed.
- The rationale for model selection could be more detailed.
- To include detailed statistical information about the dataset, such as the distribution of image counts in the training, validation, and test sets.
- To add comparative experiments under different lighting conditions.
- Supplementary data on computational resource consumption is suggested.
- To include comparisons with the latest related works.
- The discussion on the limitations of the method is not in-depth enough.
- Provide a more detailed rationale for model selection.
Author Response
We would like to express our sincere gratitude for the numerous appropriate and valuable comments we have received regarding this submitted paper. We deeply appreciate the feedback from various perspectives. Below, we have outlined our responses and actions taken in response to each comment, and we kindly ask for your consideration in reviewing them. For your convenience, we have attached a version of the manuscript with all changes highlighted.
Comments 1: The theoretical basis for the confidence score reliability assessment is not sufficiently discussed.
Response 1: Thank you for this important comment. To address your concern about the theoretical basis for confidence score reliability assessment, we have added a detailed explanation in Section 3.3.4 (P8L252-P9L267). Specifically, we have formulated the variance of confidence score thresholds and demonstrated the mathematical relationship with the correlation coefficient of the Segmentation Reliability Diagram. This provides a clear theoretical justification for using the correlation coefficient as a reliability metric in our proposed method. We believe this addition strengthens the theoretical foundation of our approach.
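The core idea behind this assessment, namely that a reliable model's confidence scores should correlate with the mask quality it actually achieves, can be sketched as follows. This is an illustrative computation over made-up (confidence, IoU) pairs, not the exact formulation added in Section 3.3.4:

```python
from statistics import mean

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Illustrative per-detection (confidence, mask IoU) pairs; a well-calibrated
# model should show a strong positive correlation between the two.
confidences = [0.95, 0.90, 0.80, 0.65, 0.50, 0.35]
ious        = [0.92, 0.88, 0.75, 0.70, 0.45, 0.30]

print(f"reliability correlation r = {pearson_r(confidences, ious):.3f}")
```

A correlation near 1 indicates that confidence is a trustworthy proxy for segmentation quality; a value near 0 indicates the scores carry little information about mask quality.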
Comments 2: The rationale for model selection could be more detailed.
Response 2: Thank you for this comment. While we had briefly mentioned the rationale for model selection in Section 3.2, we have now expanded it with more detailed explanations based on your suggestion (P4L132-L135). We have cited a review paper [18] and clearly stated that Mask R-CNN was selected for its robustness in detecting small and densely clustered fruits, while the YOLO series was chosen for its high detection accuracy and faster inference speed, both being widely used in agricultural applications. We have also added more detailed discussion on their proven track records in agricultural segmentation tasks.
Comments 3: To include detailed statistical information about the dataset, such as the distribution of image counts in the training, validation, and test sets.
Response 3: Thank you for this comment. The distribution of image counts in the dataset is provided in Section 3.1 (P4L123, training set: 1,042 images, validation set: 190 images, test set: 372 images). If this does not adequately address your concern, please let us know.
Comments 4: To add comparative experiments under different lighting conditions.
Response 4: Thank you for this comment. We recognize that comparative experiments under different lighting conditions are important for validation. However, in this study, we intentionally maintained constant lighting conditions to focus on the comparative analysis of evaluation metrics and model performance. We acknowledge this as one of the limitations of our study (P14L455-L460) and have added it to the conclusion section (P14L485-L486) as a future research direction.
Comments 5: Supplementary data on computational resource consumption is suggested.
Response 5: Thank you for this comment. We have added quantitative comparisons of computational resource consumption in Section 4.3 (P11L352-L356). Specifically, we evaluated each model using multiply-accumulate operations (MACs) and the number of learnable parameters. Our results show that YOLO11 achieves the lowest values for both MACs (10.4 GFLOPs) and parameter count (2.8 M), confirming its suitability for real-time applications.
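As a rough illustration of how such complexity figures arise, the parameter and MAC counts of a single convolutional layer can be derived analytically. These are generic textbook formulas, not the tool used to obtain the numbers reported in Section 4.3:

```python
def conv2d_params(c_in, c_out, k, bias=True):
    """Learnable parameters of a 2-D convolution: one k*k*c_in kernel
    (plus optional bias) per output channel."""
    return c_out * (c_in * k * k + (1 if bias else 0))

def conv2d_macs(c_in, c_out, k, h_out, w_out):
    """Multiply-accumulate operations for one forward pass: each output
    pixel of each output channel consumes c_in * k * k multiplies."""
    return c_out * c_in * k * k * h_out * w_out

# Example: a 3x3 convolution mapping 64 -> 128 channels on a 56x56 map.
print(conv2d_params(64, 128, 3))        # 73,856 parameters
print(conv2d_macs(64, 128, 3, 56, 56))  # ~231 million MACs
```

Summing such per-layer counts over a whole network yields the model-level MACs and parameter totals used to compare architectures independently of any particular device.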
Comments 6: To include comparisons with the latest related works.
Response 6: Thank you for this comment. Following your suggestion, we have added comparisons with the latest related works in Section 5.2 (P13L405-L418). Specifically, we compared our YOLO11 results with the sweet pepper detection study of Escamilla et al. [15]. While they achieved an AP@50 of 0.803 and an F1-score of 0.77 using YOLOv5, our YOLO11 achieved a Box AP@50 of 87.16% and an F1-score of 82.28%, demonstrating superior performance in both metrics. We attribute this improvement to our nighttime imaging under stable illumination conditions and the adoption of the latest YOLO11 model.
Comments 7: The discussion on the limitations of the method is not in-depth enough.
Response 7: Thank you for this comment. We agree that our discussion of methodological limitations was insufficient. To address this, we have added a new Section 5.4 (P13L454-P14L467) to thoroughly discuss the limitations of our proposed method. Specifically, we address three main limitations: (1) the constraint of using only a single pepper variety photographed during winter nights, which limits generalizability to other crops, pepper varieties, or environmental conditions, (2) our focus on evaluation metrics rather than developing models optimized for complex environments, and (3) our emphasis on offline server-based processing rather than real-time detection on edge devices.
Comments 8: Provide a more detailed rationale for model selection.
Response 8: Thank you for this comment. This addresses the same concern regarding model selection rationale as raised in Comments 2, where we have provided a detailed response. We have expanded Section 3.2 (P4L132-L135) to include comprehensive explanations for our model selection from various perspectives, including track records in agricultural applications, computational efficiency, and segmentation accuracy.
Author Response File: Author Response.pdf
Reviewer 3 Report
Comments and Suggestions for Authors
Dear authors, please find my comments about your manuscript in the following paragraphs.
Clearly, the study relates to real applications in agriculture, particularly concerning harvesting and productivity forecasting of peppers (paprika) in commercial greenhouses.
The introduction of the aggregated Jaccard index (AJI) and the correlation coefficient (R) as metrics complementing the classical average precision (AP), together with the use of a "segmentation reliability diagram" to assess the correlation between confidence and the actual quality of the segmentation, is a notable strength. The data, collected in a true agricultural environment under controlled conditions (nighttime, LED lighting, and shading techniques), increase the reliability and reproducibility of the results. The evaluation of a range of segmentation models (Mask R-CNN, MS R-CNN, YOLO11, Mask R-CNN-SW) is very thorough, and the trade-offs between detection accuracy and segmentation quality are considered. The scientific and methodological contribution is important and well defined.
The paper reports relevant innovation in segmentation within precision agriculture.
Points to improve
1. The introduction is long-winded; it could be made more objective and pertinent by moving succinctly and directly to the scientific gap it addresses.
2. The analysis does not consider temporal variations (e.g., seasonal plant growth conditions) that could be relevant for continuous IoT applications.
3. An indication of how the method applies beyond paprika would broaden its applicability to other crops in general.
4. Although the data were collected in an actual environment, the study is restricted to a greenhouse and a specific kind of paprika, which makes it difficult to generalize to other crops or environments.
5. The system architecture for image acquisition lacks humidity, temperature, and luminosity environmental sensors; such additions would greatly enhance contextual inference. Processing occurs in a centralized, offline manner, and the article does not address the feasibility of integrating such processing with edge computing architectures or distributed systems, as typically found in IoT networks.
6. Some images would benefit from higher contrast, in particular the green/red mask overlays.
7. Figure 8 is truly significant for the visual comparison of the models, but it would be more informative if the quantitative IoU values were included in the image.
8. The resolution of the images is adequate for a scientific publication, but samples of more pronounced defects, such as extreme cases, are lacking.
9. Detection devices must communicate in real time with gateways situated in remote agricultural environments. The cloud could be used for data aggregation for incremental model training. Integration of light, temperature, humidity, and CO₂ sensors would help in the dynamic adjustment of the model confidence threshold to allow adaptable behavior. Future implementations could involve embedded devices (Jetson, Coral TPU) along with IoT environmental sensors (temperature, humidity, luminosity) to strengthen field deployment.
10. Latency should be minimized and scalability maximized in multi-greenhouse systems.
11. Running detection locally avoids transferring all data to the cloud.
12. Real-time segmentation can guide robotic arms during harvest automation by integrating tactile and visual feedback.
13. Incremental learning techniques built into low-energy devices would adapt easily to plant cycles.
14. Inference time, energy consumption, and computing resource consumption are very relevant to IoT/edge applications, but no such metrics have been reported here.
15. In my opinion, sweeping the threshold from 0.05 to 0.95 in steps of 0.1 might be too coarse and thereby potentially exclude some relevant peaks.
16. The analysis would benefit from further metrics, such as F1-score and precision-recall AUC, to complement the performance investigation at various thresholds.
17. A comparative presentation of model failures, such as false positives and false negatives, should be included to demonstrate the limits of the models.
18. Extend the final section with topics such as integration into commercial agricultural platforms, scalability of the system, and limitations across different paprika varieties.
Author Response
We would like to express our sincere gratitude for the numerous appropriate and valuable comments we have received regarding this submitted paper. We deeply appreciate the feedback from various perspectives. Below, we have outlined our responses and actions taken in response to each comment, and we kindly ask for your consideration in reviewing them. For your convenience, we have attached a version of the manuscript with all changes highlighted.
Comments 1: The introduction is long-winded; it could be made more objective and pertinent by moving succinctly and directly to the scientific gap it addresses.
Response 1: Thank you for your feedback. We agree that the introduction was too lengthy before reaching the research gap. We have condensed the introductory paragraphs (P1L30-L41) to more quickly present the scientific gap our study addresses. The revised introduction now provides a more succinct path to the core research problem.
Comments 2: The analysis does not consider temporal variations (e.g., seasonal plant growth conditions) that could be relevant for continuous IoT applications.
Response 2: Thank you for your comment. We agree that temporal variations are important for continuous IoT applications. While our current study focuses on metrics and model analysis with winter season data, we have acknowledged this limitation in Section 5.4 (P14L455-L460) and listed seasonal variation analysis as future work (P14L485-L486) in the conclusion section.
Comments 3&4: An indication of how the method applies beyond paprika would broaden its applicability to other crops in general. Although the data were collected in an actual environment, the study is restricted to a greenhouse and a specific kind of paprika, which makes it difficult to generalize to other crops or environments.
Response 3&4: Thank you for your comments. As comments 3 and 4 appear to be related concerns, we will address them together. While our proposed metrics could be applicable to other crops, the scope of this study is specifically limited to a particular variety of paprika, and we did not aim to generalize our approach to other varieties or crops. However, we recognize the need for validation with other varieties or crops. Therefore, we have included the application to other varieties and crops in our future research directions in the conclusion section (P14L483-L485).
Comments 5: The system architecture for image acquisition lacks humidity, temperature, and luminosity environmental sensors; such additions would greatly enhance contextual inference. Processing occurs in a centralized, offline manner, and the article does not address the feasibility of integrating such processing with edge computing architectures or distributed systems, as typically found in IoT networks.
Response 5: Thank you for your comment. We agree that incorporating environmental sensors (humidity, temperature, and luminosity) and implementing edge computing architectures or distributed systems as found in modern IoT technologies would be interesting and promising directions for monitoring systems. While our current study concentrates on offline analysis to establish baseline metrics and model evaluation methods, we recognize these as essential next steps. We have addressed these points in our limitations (P14L465-L467) and future work sections (P14L486-L489), particularly noting the need for edge device implementation and real-time processing capabilities.
Comments 6: Some images would benefit from higher-contrast captions, in particular the green/red mask overlays.
Response 6: Thank you for your comment. We have recreated Figure 8 (P11L339) with opaque and high-contrast colors to replace the semi-transparent masks that were difficult to see.
Comments 7: Figure 8 is truly significant for visual comparison of the models, but it would be more interesting if the quantitative IoU values were involved in the image.
Response 7: Thank you for your comment. IoU values are already displayed in Figure 8 at the bottom-right of each image panel. Please let us know if this response does not address your concern.
Comments 8: Resolution of images is adequate for a scientific publication but lacks samples for more pronounced defects such as extreme cases.
Response 8: Thank you for this valuable comment. We acknowledge that our original submission lacked examples of detection failures in more extreme cases. To address this concern, we have added representative detection failure cases for both Mask R-CNN and YOLO11 in Figure 9 (P11L350), specifically focusing on small fruits occluded by lower leaves. The Mask R-CNN example demonstrates an overconfident false positive detection, while the YOLO11 example shows an underconfident false negative case. Additionally, we have provided a detailed analysis of these failure cases in Section 4.2 (P11L342-L348). We believe these additions better clarify the limitations of our proposed method and highlight specific areas for future improvement.
Comments 9-14: Detection devices must communicate in real time with gateways situated in remote agricultural environments. The cloud could be used for data aggregation for incremental model training. Integration of light, temperature, humidity, and CO₂ sensors would help in the dynamic adjustment of the model confidence threshold to allow adaptable behavior. Future implementations could involve embedded devices (Jetson, Coral TPU) along with IoT environmental sensors (temperature, humidity, luminosity) to strengthen field deployment. Latency should be minimized and scalability maximized in multi-greenhouse systems. Running detection locally avoids transferring all data to the cloud. Real-time segmentation can guide robotic arms during harvest automation by integrating tactile and visual feedback. Incremental learning techniques built into low-energy devices would adapt easily to plant cycles. Inference time, energy consumption, and computing resource consumption are very relevant to IoT/edge applications, but no such metrics have been reported here.
Response 9-14: Thank you for this important comment. We agree that inference time, energy consumption, and computing resource utilization are critical metrics for IoT/edge applications. Since these metrics are highly device-dependent, we chose to evaluate computational complexity using more universal indicators: multiply-accumulate operations (MACs) and the number of learnable parameters. As shown in the results added to Section 4.3 (P11L352-L356), YOLO11 achieves the smallest values for both MACs and parameter count, confirming its suitability for real-time applications. While our current study focuses on algorithm-level performance evaluation, we recognize that the suggested implementation on edge devices and integration with environmental sensors represent important next steps for practical agricultural deployment. We plan to address these implementation aspects in future research (P14L486-L489).
Comments 15: In my opinion, the choice of taking the threshold in the range from 0.05 to 0.95 in steps of 0.1 might be too big and thereby potentially exclude some relevant peaks.
Response 15: Thank you for this valuable comment. We agree that the 0.1 step size might miss important peaks. To address this concern, we have changed the step size to 0.01 for the AJI (Aggregated Jaccard Index) confidence score threshold analysis, recalculated all results, and updated Table 1 (P10L314) and the corresponding descriptions in the manuscript (P10L310-L313). The relative ranking of model performance showed no significant changes.
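For readers unfamiliar with the metric, the combination of AJI and a fine-grained confidence threshold sweep can be sketched as below. This toy version assumes instances are represented as sets of pixel coordinates and uses a simple greedy matching; it is an illustration, not the authors' implementation:

```python
def aji(gt_instances, pred_instances):
    """Aggregated Jaccard Index over instances given as sets of pixels.

    Each ground-truth instance is greedily matched to the unused prediction
    with the highest IoU; intersections and unions of matched pairs are
    aggregated, and the pixels of unmatched ground truths and unmatched
    predictions are added to the denominator as a penalty.
    """
    used, inter_sum, union_sum = set(), 0, 0
    for gt in gt_instances:
        best_iou, best_j = 0.0, None
        for j, pred in enumerate(pred_instances):
            if j in used:
                continue
            union = len(gt | pred)
            iou = len(gt & pred) / union if union else 0.0
            if iou > best_iou:
                best_iou, best_j = iou, j
        if best_j is not None:
            used.add(best_j)
            inter_sum += len(gt & pred_instances[best_j])
            union_sum += len(gt | pred_instances[best_j])
        else:
            union_sum += len(gt)          # unmatched ground truth
    for j, pred in enumerate(pred_instances):
        if j not in used:
            union_sum += len(pred)        # unmatched prediction penalty
    return inter_sum / union_sum if union_sum else 0.0

def sweep(gt, preds_with_scores):
    """AJI at confidence thresholds 0.05 .. 0.95 in steps of 0.01."""
    out = {}
    for t100 in range(5, 96):             # integer steps avoid float drift
        t = t100 / 100
        kept = [m for m, s in preds_with_scores if s >= t]
        out[round(t, 2)] = aji(gt, kept)
    return out

# Toy example: two ground-truth instances, one good high-confidence
# prediction and one spurious low-confidence prediction.
gt = [{(0, 0), (0, 1)}, {(5, 5)}]
preds = [({(0, 0), (0, 1)}, 0.9), ({(9, 9)}, 0.3)]
curve = sweep(gt, preds)
print(curve[0.05], curve[0.5])  # AJI rises once the spurious mask is filtered
```

With the 0.01 step the sweep evaluates 91 thresholds, so narrow peaks in the AJI-versus-threshold curve are far less likely to be skipped than with a 0.1 step.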
Comments 16: The analysis would benefit from further metrics, such as F1-Score and Precision-Recall AUC, to complement the performance investigation at various thresholds.
Response 16: Thank you for this valuable suggestion. We have conducted experiments with the F1-Score and added the results to Table 1 (P10L314). Additionally, we have included a comparison with recent state-of-the-art studies in Section 5.2 (P13L405-L418) as part of our analysis. Regarding Precision-Recall AUC, the Average Precision (AP) metric used in object detection is calculated from the interpolated Precision-Recall curve and is closely related to PR-AUC, providing similar information. Therefore, we believe that the mAP metric already reported in our results sufficiently addresses this aspect of performance evaluation.
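The relationship mentioned here, F1 at a fixed operating point versus AP as an integral of the interpolated precision-recall curve, can be illustrated with a small sketch (illustrative numbers, not values from the paper):

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall at one operating point."""
    s = precision + recall
    return 2 * precision * recall / s if s else 0.0

def average_precision(recalls, precisions):
    """AP via all-point interpolation of the PR curve.

    Assumes `recalls` is sorted ascending. Each precision is replaced by
    the maximum precision at any recall >= r (the right-to-left envelope),
    then the interpolated curve is integrated over recall.
    """
    interp = list(precisions)
    for i in range(len(interp) - 2, -1, -1):
        interp[i] = max(interp[i], interp[i + 1])
    ap, prev_r = 0.0, 0.0
    for r, p in zip(recalls, interp):
        ap += (r - prev_r) * p
        prev_r = r
    return ap

recalls    = [0.2, 0.4, 0.6, 0.8, 1.0]
precisions = [1.0, 0.9, 0.7, 0.75, 0.5]
print(f1(0.8, 0.85))                          # ≈ 0.824
print(average_precision(recalls, precisions)) # ≈ 0.78
```

Because AP integrates the interpolated precision-recall curve over all thresholds, it conveys essentially the same information as PR-AUC, which is the basis of the argument above.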
Comments 17: A comparative representation of model failures, such as false positives and false negatives, should be included so as to demonstrate the limits of the models.
Response 17: Thank you for this comment. As mentioned in our response to Comment 8, we have added representative detection failure cases to Figure 9 (P11L350). Specifically, we present false positive examples for Mask R-CNN, which tends to be overconfident, and false negative examples for YOLO11, which tends to be underconfident. These examples provide important insights into the characteristic failure patterns of each model. We have also added a detailed analysis of these failure cases in Section 4.2 (P11L342-L348).
Comments 18: Extend the final section with topics like integration possibilities into commercial agricultural platforms, scalability of the system, and limits on different paprika varieties.
Response 18: Thank you for your suggestion. We have expanded the conclusion section with specific future research directions (P14L480-L489), including: improving segmentation quality in complex environments, extending to other crops and paprika varieties, validating under different imaging conditions and seasons, and developing edge computing integration for commercial platforms. These additions address the practical considerations necessary for real-world implementation.
Author Response File: Author Response.pdf
Round 2
Reviewer 3 Report
Comments and Suggestions for Authors
The authors have incorporated all of the comments into the manuscript.