Review Reports
- Pengfei Hao,
- Jianpeng An and
- Qing Cai
- et al.
Reviewer 1: Anonymous
Reviewer 2: Kangyu Ji
Reviewer 3: Anonymous
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
The manuscript is missing a clear hypothesis statement and a statement of the statistical experimental design and controls.
The Materials and Methods section needs more detail. The paragraph starting at line 137 omits how the plants were selected for harvesting: one square meter centered on the image, only the portions of the plants visible in the image, or something else? Harvest bias relative to imaged plant density can be dramatically influenced by the method used, and readers cannot duplicate this work without complete details on the methods.
Materials and Methods: Deficiencies for Reproducibility
While the computational methods are clear, the agronomic and data collection methods lack critical details required for duplication.
Major Deficiencies:
- Missing Genetic Material Information: This is the most significant omission. The study uses "a collection of 833 rapeseed materials" and splits folds "at the cultivar level". However, the manuscript never states how many distinct cultivars this collection represents, nor does it provide a list of their names or types. It is impossible to reproduce this study without knowing the genetic source material used.
- Missing Image Pre-processing Details: The authors describe the camera used but omit the pre-processing pipeline for the images. What was the input resolution of the images fed to the ViT model (e.g., 224x224, 518x518)? This is a fundamental hyperparameter. How were the original, high-resolution smartphone images cropped or resized? Was the small pink "label paper" masked or cropped out, or was the model trained to ignore it?
- Missing Field Condition Details: The images were captured "in the field". However, no details about the illumination conditions are provided. Were all 833 images captured on the same day under consistent light (e.g., "clear, sunny conditions") or across multiple days with varying conditions (e.g., "overcast, partial sun, and direct light")? This is critical for a computer vision study, as lighting is a primary confounding variable.
- Field Plot Design: The text describes agronomic practices (planting density, fertilizer) but not the field plot design. Was this a collection of single plants in one large block, or was it a formal experimental trial with replication (e.g., a randomized complete block design)? This context is important.
- Pre-training Data Split: During the SSL pre-training on the 4,993 public images, was a validation set used to monitor the 300-epoch training, or was the entire dataset used (as is common in SSL)? This should be clarified.
- Line 165 "Self-Supervised Pre-training with DINOv2" is too vague about how the teacher network is obtained. Without a teacher, the method falls apart, yet the manuscript omits this critical factor.
- Much more detail is needed on exactly how the cross-fold study was conducted. It is critically important that the test set was never shown to the model before the final test. With only 833 images, a tiny fraction of what is normally used to train CNN models even with transfer learning, this reviewer is highly skeptical that this was handled correctly and will not assume that it was. You need to state explicitly how the cross-fold method was conducted. Typically, three independent image sets are used {training, test, validation}: the training set is used to train the model, the test set is used to assess how well that model works, and this is repeated until a suitable model is generated. The final model is then validated against the never-before-used validation set to establish its validation accuracy. This work further complicates the approach by partitioning along cultivars, which is wise; but the manuscript omits how the {train, test, validation} image sets are kept independent to ensure the model does not simply memorize the images.
- Lacking details on which image augmentations were used to expand the dataset.
- Missing agronomic data: growing degree days at the time of image acquisition and at harvest.
- Must also include the leaf area index so the reader knows the percentage of canopy fill when the test was conducted; this would also provide insight into crop morphology. Was this a normal crop canopy density at harvest, or was the study conducted well before full canopy was established?
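The cultivar-level, leakage-free splitting the reviewer asks the authors to document could be reported with a short sketch like the one below. This is not the authors' pipeline; the function name and data are hypothetical, and it only illustrates the key property being requested: all images of a cultivar land in the same fold.

```python
from collections import defaultdict

def cultivar_level_folds(cultivar_of_image, n_folds=5):
    """Partition image indices into folds so that every image of a
    given cultivar lands in the same fold (no genetic leakage)."""
    by_cultivar = defaultdict(list)
    for idx, cultivar in enumerate(cultivar_of_image):
        by_cultivar[cultivar].append(idx)
    folds = [[] for _ in range(n_folds)]
    # Assign whole cultivars to folds, largest first, to balance sizes.
    for _cultivar, idxs in sorted(by_cultivar.items(),
                                  key=lambda kv: -len(kv[1])):
        smallest = min(range(n_folds), key=lambda f: len(folds[f]))
        folds[smallest].extend(idxs)
    return folds

# Hypothetical example: 8 images from 4 cultivars, 2 folds.
labels = ["A", "A", "B", "C", "C", "C", "D", "B"]
folds = cultivar_level_folds(labels, n_folds=2)
# Every cultivar's images stay together in exactly one fold.
for cultivar in set(labels):
    holding = {f for f, fold in enumerate(folds)
               if any(labels[i] == cultivar for i in fold)}
    assert len(holding) == 1
```

Reporting the split this concretely (or citing an equivalent off-the-shelf utility such as scikit-learn's GroupKFold) would resolve the independence question raised above.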
Summary: too many details are missing from the Materials and Methods section; it must be improved.
Author Response
Please see the attachment.
Author Response File:
Author Response.pdf
Reviewer 2 Report
Comments and Suggestions for Authors
The authors adopted a vision transformer architecture for above-ground biomass estimation of rapeseed from top-down RGB images. Both fresh weight and dry weight ground-truth values were used as model outputs, and the authors demonstrated that combining the two targets led to improved results.
[General]
The introduction section is well written. The authors have also thoroughly considered the data scarcity problem in plant phenotyping and proposed transfer learning and fine-tuning methods that perform well. However, the discussion following the performance metrics is limited, and I therefore recommend major revisions to strengthen the narrative.
[Major]
1. How can the model predictions be interpreted? Which parts of the image does the model focus on most? The authors could consider visualizing the attention maps of the ViT model or applying any model-agnostic explainable AI method to identify the key components of the image that contribute to the regression predictions.
2. The authors should discuss the reason why the predicted (y) versus true (x) fittings consistently have a slope < 1, as observed in Figures 3 and 5, and how such bias could be quantified and corrected in future work for actual applications.
3. Is there any data hallucination (e.g., when inputting soil RGB images without rapeseed seedlings or other types of plants)? The authors could include some failure cases to illustrate and discuss the limitations of the model as well.
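One model-agnostic way to realize the attention-map suggestion in point 1 is attention rollout (Abnar & Zuidema), which fuses per-layer ViT attention into a single saliency map. The sketch below assumes the per-layer attention tensors have already been extracted from the model; the shapes and toy data are illustrative only.

```python
import numpy as np

def attention_rollout(attn_layers):
    """Fuse per-layer ViT attention maps into one saliency map:
    average over heads, add the residual identity, renormalize,
    then multiply the layer matrices together."""
    n_tokens = attn_layers[0].shape[-1]
    rollout = np.eye(n_tokens)
    for attn in attn_layers:          # attn: (heads, tokens, tokens)
        a = attn.mean(axis=0)         # average over attention heads
        a = a + np.eye(n_tokens)      # account for residual connections
        a = a / a.sum(axis=-1, keepdims=True)
        rollout = a @ rollout
    # Row 0 is the CLS token; its weights over the patch tokens
    # form the saliency map (reshape to the patch grid to overlay).
    return rollout[0, 1:]

# Toy example: 3 layers, 2 heads, 1 CLS token + 4 patch tokens.
rng = np.random.default_rng(0)
layers = [rng.random((2, 5, 5)) for _ in range(3)]
layers = [a / a.sum(axis=-1, keepdims=True) for a in layers]  # row-stochastic
saliency = attention_rollout(layers)
assert saliency.shape == (4,) and np.all(saliency >= 0)
```

Overlaying such a map on the input image would show whether the regression attends to plant canopy rather than soil or the label paper, directly addressing points 1 and 3.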
[Minor]
4. Line 112 "significantly shortening the breeding cycle": How much time (in percentage) could the proposed pipeline save during an actual breeding cycle, considering both the growing and phenotyping stages? The authors could also use a table or figure to better illustrate this.
5. Lines 238–240 appear to describe data-splitting strategies that conflict with each other.
Line 238 "To prevent any data leakage based on genetic similarity, the dataset was partitioned into five folds at the cultivar level, ..."
Line 240 "The entire dataset was randomly partitioned into five equally sized, non-overlapping folds."
6. Figure 4: Since the number of samples is low for the 5-fold CV (N = 5), all individual RMSE data points should be plotted on top of (or next to) the bar plot.
7. Table 3: RRMSE is dimensionless and should not have units in g.
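The unit issue in point 7 can be made concrete with a short sketch: RRMSE divides the RMSE (in grams) by the mean observed value (also in grams), so the units cancel and the result is a dimensionless ratio, usually reported as a percentage. The values below are made up for illustration.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error, in the units of y (here grams)."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
                     / len(y_true))

def rrmse(y_true, y_pred):
    """Relative RMSE: RMSE divided by the mean observed value.
    The grams cancel, so RRMSE is dimensionless (often given
    as a percentage), never in g."""
    return rmse(y_true, y_pred) / (sum(y_true) / len(y_true))

y_true = [10.0, 20.0, 30.0]   # g fresh weight (made-up values)
y_pred = [12.0, 18.0, 33.0]
print(f"RMSE  = {rmse(y_true, y_pred):.2f} g")     # carries units of g
print(f"RRMSE = {100 * rrmse(y_true, y_pred):.1f} %")  # unitless ratio
```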
Author Response
Please see the attachment.
Author Response File:
Author Response.pdf
Reviewer 3 Report
Comments and Suggestions for Authors
The authors developed a combined framework for biomass estimation utilizing the DINOv2 framework and multitask regression learning by incorporating a multilayer perceptron. This is an important innovation, as it allows the model to learn general plant features without the need for vast amounts of labelled data. The authors report strong predictive performance on a rapeseed imaging dataset.
While the overall methodology is solid, a few aspects could be further improved or clarified. I recommend focusing more on dataset diversity and model explainability.
- The dataset used for fine-tuning the model is relatively small (N=833) and does not include a diversity of species, yet the authors have demonstrated good performance. It would be helpful to elaborate on the potential for model overfitting on this small rapeseed dataset. In addition, more evidence or testing on other species would be beneficial to show how well the proposed framework generalizes to different plants.
- It would be valuable to discuss potential methods for interpreting the model’s decisions (Explainable AI), especially in terms of understanding which features are most influential for predicting biomass.
Minor correction:
- How many hold-out sets were used (line 255)?
- When you compare performance against the baseline, which baseline parameters do you mean? Can you please explain (line 256)?
- Can you please state in the captions of Tables 1 and 2 for which model (single-task regression or multitask regression) the results are provided?
- It is not recommended to use the abbreviation MAE for Masked Autoencoder (subsection 3.3, line 365, Table 4), as MAE conventionally denotes mean absolute error.
Author Response
Please see the attachment.
Author Response File:
Author Response.pdf
Round 2
Reviewer 2 Report
Comments and Suggestions for Authors
The authors have addressed my comments, and I recommend publishing after a minor fix.
Figure 6: The images in the top row of (a) original and (b) attention are not from the same source. The bottom row is correct.