by Steven H. Lee, Sean McDowell, Charles Leslie, et al.

Reviewer 1: Anonymous
Reviewer 2: Anonymous

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

This manuscript compares a traditional thresholding-based pipeline with a machine-learning Mask-R-CNN pipeline for high-throughput walnut kernel phenotyping. The objective is to quantify color (and size) more accurately and reproducibly than human scoring across three years of breeding material.


Strengths

  1. Very large dataset (over 92,000 kernels from 4,500 trees) and three years of observations.
  2. Direct comparison of a deep-learning approach with a lightweight thresholding approach under otherwise identical imaging conditions.
  3. Clear description of image acquisition, annotation, training and validation, and open availability of code/data.
  4. High correlation of the two methods for key traits (lightness and kernel size).


Limitations

  1. All images were acquired under highly controlled conditions; the applicability of the approach to less controlled environments (e.g. packhouses) has not yet been evaluated.
  2. The CNN model is trained on a very small annotated set (13 images) and uses only two classes (background vs. walnut); this may limit generalization to more complex tasks (a minimal two-class configuration is sketched after this list).
  3. The discussion could be expanded to include more recent non-invasive plant phenotyping work using CNNs and color features and to highlight possible extensions beyond kernel color.
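On limitation 2, a minimal sketch of what a two-class (background vs. walnut) Mask-R-CNN configuration typically looks like is given below. It assumes the torchvision implementation and is illustrative only, not the authors' actual code; the point worth making in the manuscript is that a COCO-pretrained backbone, rather than the 13 annotated images alone, carries most of the learning burden.

```python
# Illustrative sketch only (torchvision implementation assumed, not the authors' code):
# a Mask R-CNN with two classes, background vs. walnut kernel, fine-tuned from
# COCO-pretrained weights so that a very small annotated set can still be workable.
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
from torchvision.models.detection.mask_rcnn import MaskRCNNPredictor

NUM_CLASSES = 2  # background + walnut kernel

model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")

# Replace the box-classification head for the two-class problem.
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, NUM_CLASSES)

# Replace the mask-prediction head accordingly.
in_channels = model.roi_heads.mask_predictor.conv5_mask.in_channels
model.roi_heads.mask_predictor = MaskRCNNPredictor(in_channels, 256, NUM_CLASSES)
```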


Specific comments

Title
You may consider explicitly mentioning “walnut kernel color phenotyping” instead of “staged images” for better discoverability.

Abstract
Add a sentence at the end highlighting the potential for transferring this approach to other plant phenotyping tasks.
Consider briefly stating how many annotated images were used to train the CNN (this is strikingly low and may interest readers).

Introduction
Lines 51–66: You may wish to refer to the use of artificial neural networks and color features for non-invasive estimation of hydration level to show that similar methods have been applied successfully in other case studies (e.g. Taheri-Garavand et al. 2021, Acta Physiologiae Plantarum 43:78).
Lines 74–80: When introducing CNNs, consider citing Fanourakis et al. (2024, Plant Growth Regulation 102:485–496), which applied convolutional neural networks to multispectral images to derive tissue water content. This study highlights the power of CNNs on color/multispectral inputs and would help strengthen the rationale for your Mask-R-CNN approach.
Lines 103–114: You could briefly explain why CIELAB may be preferable to RGB for kernels.
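One short argument the authors could make: the L* channel of CIELAB isolates lightness from chroma and is designed to be approximately perceptually uniform, whereas raw RGB intensities confound the two. A minimal, purely illustrative conversion (assuming scikit-image; the file name is a placeholder):

```python
# Illustrative sketch (scikit-image assumed; "kernel.png" is a placeholder file name):
# convert an RGB kernel image to CIELAB and summarize its lightness.
from skimage import io, color

rgb = io.imread("kernel.png")[..., :3]   # drop any alpha channel
lab = color.rgb2lab(rgb)                 # L* in [0, 100]; a*, b* roughly in [-128, 127]

mean_lightness = lab[..., 0].mean()
print(f"mean L* (lightness): {mean_lightness:.1f}")
```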

Materials and Methods
Lines 172–193: Perhaps clarify how often the color swatch (Figure 1A) was checked to ensure calibration.
Lines 195–201: Lens correction – specify whether any residual distortion remained after correction.
Lines 210–231: Annotation – impressive that only 13 images were annotated; please justify why this is sufficient and how it might affect generalizability.
Lines 233–243: Training/validation/test split – please check the numbers: here the text says five images for training, four for validation, and four for testing (total 13), but line 430 says “just five images in the training dataset, five images in the test dataset, and three images in the validation dataset”. Please make these consistent.
Lines 245–257: Output features – indicate whether any additional color metrics (e.g. a*, b*) are included.
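If a* and b* are available, reporting per-kernel means would cost little. A hedged sketch follows; the array names and mask format are my assumptions, not the authors' output format.

```python
# Illustrative sketch: per-kernel mean L*, a*, b* from one instance mask.
# `lab` is an (H, W, 3) CIELAB image and `mask` an (H, W) boolean array for a single
# kernel; both are placeholder names, not taken from the authors' pipeline.
import numpy as np

def kernel_color_stats(lab: np.ndarray, mask: np.ndarray) -> dict:
    pixels = lab[mask]                         # (N, 3) array of L*, a*, b* values
    mean_L, mean_a, mean_b = pixels.mean(axis=0)
    return {"L": float(mean_L), "a": float(mean_a),
            "b": float(mean_b), "area_px": int(mask.sum())}
```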

Results
Figure 3: Perhaps explicitly label r² values on subplots.
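A small helper along these lines (matplotlib/numpy, purely illustrative; the function and variable names are mine) would make the agreement between methods immediately readable on each panel:

```python
# Illustrative sketch: annotate a subplot with the r-squared of the plotted data.
import numpy as np
import matplotlib.pyplot as plt

def annotate_r2(ax, x, y):
    """Write the r² of (x, y) in the top-left corner of the given axes."""
    r = np.corrcoef(x, y)[0, 1]
    ax.text(0.05, 0.95, f"$r^2$ = {r**2:.2f}",
            transform=ax.transAxes, ha="left", va="top")

# fig, axes = plt.subplots(1, 3)
# annotate_r2(axes[0], cnn_lightness, threshold_lightness)  # placeholder arrays
```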
Lines 321–327: You note CNN kernel size is smaller due to conservative annotation. It would be helpful to quantify the mean difference (%) between the two methods.
Lines 366–375: Lightness thresholds – briefly state how you chose cut-off values.
Figure 5 (lines 378–395): Show sample size for each class (blanks vs. non-blanks) on the confusion matrices.
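For the two points above (the choice of lightness cut-offs and per-class sample sizes), a purely illustrative sketch follows; the L* boundaries (40, 55, 70) and the arrays `l_star` and `human_grade` are invented placeholders, not the authors' values.

```python
# Illustrative sketch: assign color grades from fixed L* cut-offs and report
# per-class sample sizes next to the confusion matrix. All values are placeholders.
import numpy as np
from sklearn.metrics import confusion_matrix

GRADES = ["Amber", "Light Amber", "Light", "Extra-light"]  # darkest to lightest
CUTOFFS = [40, 55, 70]                                     # placeholder L* boundaries

def grade_from_lightness(l_star):
    """Map per-kernel mean L* values to grade labels via np.digitize."""
    return np.array(GRADES)[np.digitize(l_star, CUTOFFS)]

# predicted = grade_from_lightness(l_star)                 # placeholder L* array
# cm = confusion_matrix(human_grade, predicted, labels=GRADES)
# for label, n in zip(GRADES, cm.sum(axis=1)):
#     print(f"{label}: n = {n}")                           # counts to show on the figure
```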

Discussion
Although only one named cultivar (“Chandler”) was used as the reference in the study, the dataset actually encompasses material from roughly 3000 trees (UC Davis walnut breeding program). This breadth suggests that the approach is likely to be applicable to a wide range of new genotypes, although some degree of cultivar-specific calibration may still be required.

Lines 399–407: You emphasize CNN robustness to variable placement but also mention only staged images were used. It would be valuable to speculate on performance in less controlled conditions and whether transfer learning or larger training sets would be needed.
Lines 409–416: Fragmented kernels. Maybe mention that additional mask classes (kernel vs. embryo) could be trained in the CNN.
Lines 427–439: Small training set – stress that increasing training data should further improve CNN performance.
Lines 493–513: Human scoring variability – interesting; you could link to broader literature on the unreliability of human color scoring and the need for objective imaging.

Motivation for future studies: Besides size, additional features could be extracted from the same images to further improve grading. These include defect or damage detection (cracks, blanks, insect damage) through multi-class segmentation, as well as moisture-related proxies (Taheri-Garavand et al., 2021, Acta Physiol Plant 43:78; Fanourakis et al., 2024, Plant Growth Regul 102:485–496), given that moisture content is an essential determinant of kernel quality (Gama et al., 2018, Scientia Hortic 242:116–126; Peng et al., 2021, J. Spectrosc. 9986940).

In the present study, grading is performed from digital images rather than from a single trait such as kernel weight. While weight-based grading is technically simple, it captures only mass and ignores other visual quality attributes. The image-based approach presented here allows simultaneous extraction of multiple criteria — size, shape, color lightness, blank detection — which are highly relevant to both breeding selection and market classification. Once the imaging system and software are calibrated, this method can process large batches automatically with minimal operator intervention, providing a more objective and comprehensive grading than weight alone.

The manuscript clearly demonstrates the research utility of the proposed pipeline. It would strengthen the paper to comment on the potential for commercialization — for example, whether the approach could be integrated into existing commercial grading equipment or adapted to non-staged, high-throughput environments.

Although data from three harvest years are included, the Discussion does not explicitly address year-to-year variability in kernel traits or model performance. Such an analysis would be valuable: it could reveal how stable the image-based grading pipeline is under different seasonal or environmental conditions, and whether retraining or recalibration is needed each year. Discussing this aspect would strengthen the paper and give readers a clearer picture of the model’s robustness for long-term use.

Figures and tables
Make sure axis labels include units (pixels, % etc.).
Consider adding a schematic of the full pipeline (image capture → CNN/thresholding → output) early in the paper.

Author Response


Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

- The authors should rewrite the section of 1. Introduction.

  • The authors should delete the following five paragraphs: the paragraphs spanning lines 51–61, lines 62–73, lines 74–80, lines 87–93, and lines 103–114. It seems most of the content of these five paragraphs can be found in existing references and is not very related to the main topic of the manuscript.
  • The authors need to explain why they compare Mask-R-CNN with thresholding methods. The authors state that "thresholding methods for image segmentation have fallen out of favor compared to machine learning methods" (lines 59–60), which suggests that Mask-R-CNN is most likely better than thresholding methods; it may therefore be unnecessary to benchmark against a weaker method such as thresholding. The authors should substitute one of the YOLO series of methods, such as YOLO11, or one of the Transformer series of methods, such as RT-DETR, for the thresholding method used in the manuscript (a minimal YOLO11 segmentation sketch is given after this list). The authors may wish to read the reference YOLO-ALDS: an instance segmentation framework for tomato defect segmentation and grading based on active learning and improved YOLO11.
  • The authors may need to focus on computer-vision-based or deep-learning-based grading of walnut kernels in section 1. Introduction. Namely, the authors may need to clarify the overall task of this manuscript, such as segmentation or segmentation-based grading, since a task such as segmentation is quite different from a task such as segmentation-based grading.
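For reference, a minimal sketch of what a YOLO11 instance-segmentation baseline could look like with the Ultralytics package is shown below. The weight file, dataset YAML, and training settings are placeholders chosen for illustration, not details from the manuscript.

```python
# Illustrative sketch (Ultralytics package assumed): train and apply a YOLO11
# segmentation baseline. "walnut_seg.yaml", the weights, and the settings are
# placeholders, not values from the manuscript.
from ultralytics import YOLO

model = YOLO("yolo11n-seg.pt")                      # pretrained segmentation weights
model.train(data="walnut_seg.yaml", epochs=100, imgsz=640)

results = model("staged_tray_image.jpg")            # placeholder test image
kernel_masks = results[0].masks                     # per-kernel instance masks
```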

- The authors should revise the title of the manuscript.

  • Keywords such as segmentation or segmentation-based grading should be included.
  • The thresholding method should be replaced by one of the YOLO series of methods, such as YOLO11, or one of the Transformer series of methods, such as RT-DETR.
  • The revised title should be within two lines.
  • The authors should delete the period (".") in the title.

- The authors should substitute one of the YOLO series of methods, such as YOLO11, or one of the Transformer series of methods, such as RT-DETR, for the thresholding method in Figure 1(c). The authors should then move Figure 1 to after the paragraph spanning lines 155–162 and before the title of 2.2. Manual Color Grading.

- The authors should check whether the link https://www.mdpi.com/article/doi/s1 in line 515 is correct, since a 404 error appears when visiting this link.

- The authors need to cite Table 1 at least once in the main text before Table 1.

- The authors should move Figure 2 to after the paragraph spanning lines 187–193 and before the title in line 194.

- The authors should redraw Figure 2 by transforming the current engineering perspective into a scientific perspective, as follows.

  • The authors should delete the seven consecutive rectangular boxes starting with Legend in the bottom left corner and texts inside them.
  • The authors should add as many images as possible to the elements in Figure 2. For example, the authors need to add several representative images of walnut kernels and an image of the image capture system to the box Image Capture / Image Acquisition and Pre-processing in Figure 2.
  • The authors should substitute one of the YOLO series of methods, such as YOLO11, or one of the Transformer series of methods, such as RT-DETR, for the thresholding method in the elements (r), (r1), (r2) and (r3) of the current Figure 2.
  • The authors should add two large dashed rectangular boxes each with a lighter different background color, one of which places the boxes of (c), (c1), (c2) and (c3) in current Figure 2 and the other places the boxes of updated (r), (r1), (r2) and (r3) in Figure 2.
  • The authors should merge or delete other less important boxes, such as CVAT, Image Alignment, Train, etc., in Figure 2.

- The authors should ensure that the titles of the related subsections in section 2. Materials and Methods are the same as the corresponding elements in Figure 2. The authors should remove the parentheses and numbering in Figure 2 and in the corresponding titles.

  • Image Acquisition and Pre-processing (a) VS (a) Image Capture
  • Lens Correction / Image Adjustment (b) VS (b) Distortion Correction
  • Blank Kernel Removal VS ?
  • CNN Method (c) VS (c) CNN Segmentation
  • CVAT Annotation (c1) VS (c1) Image Annotation
  • Model Training (c2) VS (c2) Train Model
  • Mask-R-CNN Image Processing (c3) VS (c3) PyTorch Image Detection
  • Updated title of 2.6. VS (r) R Thresholding
  • Updated title of 2.6.1. VS (r1) Image Correction (r2) Coordinate Mapping
  • Updated title of 2.6.2. VS (r3) R Thresholding
  • QR Detection and Method Comparisons (d) VS (d) Comparative Analysis In R

- The authors need to delete the thresholding-based results and add results for one of the YOLO series of methods, such as YOLO11, or one of the Transformer series of methods, such as RT-DETR, in section 3. Results.

- The authors need to fix or unify the number of walnut kernel grades in the manuscript. Currently, it seems that some experiments use four grades, i.e., Extra-light, Light, Light Amber, and Amber, while other experiments use five grades, i.e., Blank, Extra-light, Light, Light Amber, and Amber. Besides, the authors need to report further results for all four or five grades, including accuracy, precision, recall, F1 score, and confusion matrices.
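If the authors follow this suggestion, the per-grade metrics can be generated in a single call. A purely illustrative sketch with tiny invented label lists (scikit-learn assumed):

```python
# Illustrative sketch: per-grade precision, recall, and F1 with scikit-learn.
# The two label lists below are tiny invented examples, not data from the study.
from sklearn.metrics import classification_report, confusion_matrix

GRADES = ["Blank", "Extra-light", "Light", "Light Amber", "Amber"]

human   = ["Light", "Light", "Amber", "Extra-light", "Blank", "Light Amber"]
machine = ["Light", "Light Amber", "Amber", "Extra-light", "Blank", "Light Amber"]

print(classification_report(human, machine, labels=GRADES, digits=3, zero_division=0))
print(confusion_matrix(human, machine, labels=GRADES))
```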

- The authors should open a new section, i.e., 5. Discussion and Conclusion.

- The authors should check whether it is suitable to have two subsections numbered 2.5.

- The authors should check whether the () in line 47 is suitable.

Comments on the Quality of English Language

The authors should further polish the English of the manuscript.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

The revision comprehensively addressed my technical, conceptual, and editorial requests. The methods are now transparent and replicable, and the quantitative analysis and discussion of industrial relevance have been expanded.