1. Introduction
Ensuring oilseed security remains a persistent national challenge for China, the world’s largest oilseed consumer [1,2]. Rapeseed breeding programs are central to addressing this supply deficit, but their progress is hampered by significant bottlenecks. The traditional breeding cycle is notoriously slow, largely due to laborious and time-intensive field phenotyping, the long growth period of winter rapeseed, and an aging agricultural workforce ill-equipped for such manually intensive tasks [3]. This phenotyping bottleneck creates an urgent need for high-throughput, accurate, and automated tools to accelerate genetic selection [4].
Among the myriad of traits, Above-Ground Biomass (AGB), typically quantified as Fresh Weight (FW) and Dry Weight (DW), stands out as a primary indicator of plant vigor, resource assimilation, and yield potential [5,6]. Unlike 2D metrics such as the Leaf Area Index (LAI), which only capture surface coverage, AGB reflects the plant’s complete 3D structure and density. Early and accurate estimation of AGB is therefore critical for breeders to evaluate nutrient status, forecast yield, and make timely selections [7,8].
Despite its importance, quantifying AGB at scale remains a major hurdle. The “gold standard” method—destructive harvesting, oven-drying, and weighing—is fundamentally incompatible with modern, high-throughput breeding [9]. It is laborious, costly, and, most critically, prevents the longitudinal tracking of promising individuals, as the plant must be destroyed to be measured. This has spurred a decades-long search for reliable, non-destructive proxies for biomass estimation [10].
The rise of high-throughput phenotyping (HTP), powered by computer vision and accessible sensors like smartphone cameras or UAVs, has opened new frontiers [11,12,13]. AI-driven approaches have shown promise in diverse agricultural tasks, from yield estimation in tea to disease detection [14,15]. However, initial attempts often relied on classical computer vision. Techniques like color-space thresholding (e.g., in HSV or L*a*b*) to segment plants from soil [16,17] are notoriously fragile. Their performance is highly sensitive to fluctuating ambient light, shadows, and complex backgrounds (e.g., soil debris), making them unsuitable for robust field deployment [18].
Modern deep learning, using Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs), marked a significant advance in visual recognition [19,20]. However, these models introduce their own set of profound challenges when applied to biomass estimation. First, the core task of inferring 3D mass from a 2D image is non-trivial. Many models default to a proxy task, like segmenting the visible leaf area. This approach is fundamentally flawed. Plant canopies are dense, 3D structures. Occlusion—where leaves overlap—is not an edge case, but a primary feature of healthy growth [21]. A model that only quantifies visible pixels cannot differentiate between a flat, sparse plant and a dense, multi-layered one; it fails to capture the very 3D density that defines biomass [22]. Second, the “data hunger” of these supervised models is a critical bottleneck. Training a ViT requires vast, meticulously annotated datasets [23]. For biomass, this is an economic impossibility, as each label requires the same destructive, time-consuming harvesting we seek to avoid. This “annotation bottleneck” is compounded by domain shift, where models trained in one environment fail in another [24,25]. Therefore, a clear research gap exists: we need a methodology that can learn to infer 3D mass from 2D occluded views and do so without an impractically large labeled dataset.
To address these twin challenges of 3D inference and data scarcity, we leverage a two-stage transfer learning framework centered on self-supervised learning (SSL) [26]. SSL allows a model to learn rich visual representations from massive, unlabeled image corpora, mitigating the annotation bottleneck. We first pre-train a DINOv2 ViT backbone [27] on a large, curated corpus of diverse public plant images (deliberately excluding rapeseed). This process forces the model to learn a fundamental “language” of plant morphology and structure [28]. Subsequently, this powerful, pre-trained feature extractor is fine-tuned on our small, custom-labeled dataset of 833 rapeseed images. To further enhance robustness and leverage the strong physiological correlation between biomass components, we employ a Multi-Task Learning (MTL) strategy. The model is trained to simultaneously predict both FW and DW using parallel regression heads. This MTL objective acts as a powerful regularizer, forcing the shared model backbone to discover more fundamental, generalizable features related to plant mass and 3D structure, rather than overfitting to a single predictive target [29].
Focused on the practical challenges in rapeseed breeding, this work aims to automate manual phenotyping by developing and validating a data-efficient deep learning approach. This approach enables accurate, high-throughput biomass estimation (fresh and dry weight) from top-down RGB imagery, guided by two key goals: (i) to investigate the efficacy of self-supervised pre-training on diverse, non-target plant datasets as a strategy to create robust and generalizable features; and (ii) to develop and validate a deep regression model that leverages these SSL features to directly estimate biomass, a 3D attribute, from 2D RGB imagery, thereby implicitly handling the challenges of leaf occlusion and canopy density. By achieving these goals, this work aims to significantly increase the efficiency of breeding pipelines and ultimately contribute to shortening the rapeseed breeding cycle.
By pivoting self-supervised learning from commonplace tasks (e.g., leaf area estimation) to the underexplored challenge of biomass prediction, our research establishes a novel and data-efficient framework. It specifically addresses the critical challenges of data scarcity and the difficult inference of plant mass and density from occluded 2D images of rapeseed seedlings. We therefore hypothesize that (i) self-supervised pre-training on diverse, non-rapeseed plant datasets yields more transferable and robust visual representations for biomass estimation than conventional ImageNet pre-training or training from scratch, and (ii) a multi-task learning framework jointly predicting FW and DW serves as an effective regularizer that improves generalization compared with single-task models. Designed to address practical challenges encountered by breeders, this research aims to replace manual labor with AI technology. Specifically, compared to the traditional destructive workflow, which requires days of harvesting and drying, the proposed imaging pipeline significantly reduces phenotyping labor time, allowing for near-real-time decision making in the field.
2. Materials and Methods
2.1. Field Trials and Plant Materials
The experiment was carried out in 2024–2025 at the Huzhou Academy of Agricultural Sciences in Zhejiang Province, China (30°48′35.8″ N, 120°11′32.4″ E). Rapeseed (Brassica napus L.) was sown on 20 October 2024 and transplanted to the field on 15 November 2024, with all plants adhering to consistent agronomic practices throughout the growth cycle. The trial was established as a single observation nursery rather than a replicated multi-factor experiment: all cultivars and breeding lines were planted in contiguous rows within one field under uniform fertilization, irrigation, and pest control, without additional agronomic treatments. The 833 experimental plants used for imaging and biomass measurement were single, pre-tagged individuals sampled from this nursery, with a one-to-one correspondence between each tagged plant, its top-down image, and its harvested above-ground biomass. The planting density was 225,000 plants·hm⁻². The soil type was paddy soil, and a rapeseed-specific slow-release fertilizer (N:P₂O₅:K₂O = 25:7:8) was applied at 750 kg·hm⁻².
2.2. Rapeseed Image Acquisition and Ground-Truth Fresh and Dry Weight Measurement
Top-down images were captured on 1 January 2025 for a collection of 833 rapeseed plants that displayed diversity in leaf morphology and growth habit, with all individuals between the seedling and 5-leaf stages. All images were acquired on this single date under stable natural light, producing diffuse illumination with soft shadows. No artificial lights or camera flash were used. These 833 plants belonged to a panel of 5 distinct winter rapeseed cultivars and advanced breeding lines grown in contiguous blocks within the same observation nursery. Each image captured a single tagged plant from one cultivar, with small portions of neighboring plants occasionally appearing near the image borders. To avoid data leakage and to explicitly evaluate generalization to unseen genotypes, we defined the 5-fold cross-validation splits at the cultivar level, such that all images from a given genetic background were assigned to exactly one fold and never appeared in both the training and test sets.
Image acquisition was carried out in the field, and a stick marked at 60 cm was used to keep the shooting height constant. Images were taken with a smartphone camera (Xiaomi 14 Pro, Xiaomi Corporation, Beijing, China; 16 GB RAM + 512 GB ROM, Qualcomm Snapdragon 8 Gen 3) at f/1.42, 1/954 s exposure, and ISO 50. Within each cultivar plot, individual plants were pre-tagged in the field using a 4 cm × 3 cm pink label card placed next to the stem, and the camera was positioned approximately vertically above each tagged plant so that it appeared centered in the frame. Only the tagged plant served as the experimental unit: immediately after imaging, the above-ground organs of the same tagged plant (all rosette leaves and stems emerging from the crown) were cut at the soil surface and harvested for FW and DW measurement, while any neighboring plants that appeared at the image edges were left intact. No fixed-area (e.g., 1 m²) quadrats were used. Each image was thus matched to exactly one rapeseed seedling and its 4 cm × 3 cm label card. Examples of the acquired field images are shown in Figure 1.
After photographing, all aboveground organs were weighed to determine fresh weight using a high-precision balance (to 0.01 g). All samples were dried in a forced-air oven at 105 °C for 1 h, then at 75 °C until constant mass; dry mass was determined using the same balance and weighing procedure as for fresh mass.
2.3. Public Datasets for Self-Supervised Pre-Training
To build a robust and generalized feature extractor, we constructed a curated pre-training dataset by combining three complementary public datasets. This selection was designed to provide the model with a broad understanding of plant morphology, growth stages, and in-field conditions while deliberately excluding our target crop, rapeseed.
Our pre-training dataset was constructed to capture a wide array of visual information. It includes the CVPPP dataset [30,31], which provides clear, top-down views of Arabidopsis and tobacco rosettes. This component teaches the model the fundamental patterns of leaf overlap common in rosette-structured plants, which is highly relevant to young rapeseed. We then added the Plant Seedlings dataset [32], which contains top-down images of 12 different crop and weed species. This forces the model to generalize its understanding from a few rosette types to a wider variety of early-stage plant morphologies. Finally, to ensure real-world applicability, we incorporated the VegAnn dataset [33]. With its 26+ crop species in cluttered agricultural scenes, this dataset teaches the model to identify target plants amidst complex backgrounds, varying soil colors, real-world shadows, and diverse lighting conditions.
By combining these three datasets, we force the model to learn features that are simultaneously robust to leaf occlusion (from CVPPP), inter-species variation (from Plant Seedlings), and background noise (from VegAnn), creating a powerful foundation for the downstream rapeseed biomass estimation task.
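To make the corpus construction concrete, the sketch below shows one way to aggregate the three public datasets into a single unlabeled image pool in PyTorch. The directory paths are hypothetical placeholders, and the official DINOv2 training code uses its own data-loading machinery, so this is illustrative only.

```python
# Minimal sketch of assembling the unlabeled pre-training corpus.
# Paths are hypothetical; DINOv2 pre-training consumes images only, no labels.
from pathlib import Path

from PIL import Image
from torch.utils.data import ConcatDataset, Dataset


class UnlabeledImages(Dataset):
    """Yields raw RGB images from a folder tree, ignoring any annotations."""

    def __init__(self, root, transform=None):
        self.paths = sorted(p for p in Path(root).rglob("*")
                            if p.suffix.lower() in {".png", ".jpg", ".jpeg"})
        self.transform = transform

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, i):
        img = Image.open(self.paths[i]).convert("RGB")
        return self.transform(img) if self.transform else img


corpus = ConcatDataset([
    UnlabeledImages("data/cvppp"),            # rosette occlusion patterns
    UnlabeledImages("data/plant_seedlings"),  # inter-species variation
    UnlabeledImages("data/vegann"),           # cluttered field backgrounds
])
```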
2.4. Model Architecture and Training Strategy
Our methodology for above-ground biomass estimation is based on a two-stage transfer learning framework, designed to maximize data efficiency and predictive accuracy. The overall pipeline is illustrated in Figure 2. The first stage consists of self-supervised pre-training, where a ViT backbone learns robust and generalizable visual representations from a large, unlabeled dataset of diverse public plant imagery. In the second stage, this pre-trained backbone is adapted for our specific task through a multi-task supervised fine-tuning strategy. During this stage, the model, appended with parallel regression heads, is trained on our labeled rapeseed dataset to simultaneously predict FW and DW. The subsequent subsections detail each component of this framework.
2.4.1. Self-Supervised Pre-Training with DINOv2
To build a powerful feature extractor without relying on labeled data, we employed a self-supervised pre-training strategy based on the DINOv2 framework [27]. DINOv2 is a state-of-the-art method that learns robust visual representations by distilling knowledge from a teacher network to a student network, using only a large dataset of unlabeled images. This approach is particularly well-suited for our study, as it allows the model to learn fundamental patterns of plant morphology from a wide range of crop imagery before being exposed to our specific rapeseed dataset.
The core mechanism of DINOv2 is a form of self-distillation, as illustrated in Figure 2a. Both the student and teacher networks share the same DINOv2 ViT-S/14 architecture. During training, a single input image from our aggregated public plant datasets is processed to create multiple augmented views, specifically a set of high-resolution “global” crops and lower-resolution “local” crops. All crops are passed through the student network, while only the global crops are passed through the teacher network. The objective is to train the student network to match the output probability distribution of the teacher network for the same global crops. This probability distribution, $P$, is generated from the network’s output logits, $z$, using a temperature-scaled softmax function:

$$P^{(i)} = \frac{\exp\left(z^{(i)} / \tau\right)}{\sum_{k=1}^{K} \exp\left(z^{(k)} / \tau\right)},$$

where $\tau$ is a temperature parameter that controls the sharpness of the distribution, and $K$ is the dimension of the output logits. Different temperatures are used for the student ($\tau_s$) and the teacher ($\tau_t$). The final training objective is achieved by minimizing the cross-entropy loss between these two distributions.
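A minimal sketch of this distillation objective is given below, assuming batched logits from the two networks; the temperature values are typical defaults from the DINO family, not values reported in this study.

```python
import torch.nn.functional as F

def dino_loss(student_logits, teacher_logits, tau_s=0.1, tau_t=0.04):
    """Cross-entropy between temperature-scaled softmax distributions.

    teacher_logits are detached (stop-gradient), so only the student
    receives gradients; tau_t < tau_s sharpens the teacher's targets.
    """
    p_teacher = F.softmax(teacher_logits.detach() / tau_t, dim=-1)
    log_p_student = F.log_softmax(student_logits / tau_s, dim=-1)
    return -(p_teacher * log_p_student).sum(dim=-1).mean()
```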
A critical component of this framework is how the teacher network is updated. Instead of being trained via backpropagation, the teacher’s weights ($\theta_t$) are an exponential moving average (EMA) of the student’s weights ($\theta_s$). This update is performed at each training step according to the following rule:

$$\theta_t \leftarrow \lambda \theta_t + (1 - \lambda)\,\theta_s,$$

where $\lambda$ is a momentum coefficient that smoothly increases from 0.996 to 1 during training. This process, which incorporates a stop-gradient operation to prevent trivial solutions, creates a more stable and effective “teacher,” guiding the student to learn discriminative and semantically rich features. For our study, this pre-training phase was conducted on the combined dataset of 4993 images from the CVPPP, Plant Seedlings, and VegAnn datasets for 300 epochs, leveraging the official DINOv2 implementation and its recommended hyperparameters. The result of this stage is a ViT backbone with a deep understanding of general plant features, poised for effective fine-tuning on our specific biomass regression task.
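In code, the EMA rule amounts to a single in-place update per parameter pair, as in the sketch below (in practice, buffers such as normalization statistics would be synchronized analogously):

```python
import torch

@torch.no_grad()
def ema_update(teacher, student, momentum):
    """theta_t <- momentum * theta_t + (1 - momentum) * theta_s."""
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(momentum).add_(p_s, alpha=1.0 - momentum)
```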
2.4.2. Fine-Tuning for Multi-Task Biomass Regression
Following the self-supervised pre-training stage, the ViT backbone was adapted for the downstream task of biomass estimation using a multi-task learning (MTL) framework. This approach was chosen based on the hypothesis that forcing the model to simultaneously learn two related-but-distinct tasks—predicting Fresh Weight (FW) and Dry Weight (DW)—would act as a powerful regularizer. By learning a shared representation that benefits both tasks, the model is compelled to capture more fundamental and robust visual features corresponding to plant mass, density, and structure, rather than overfitting to superficial cues for a single task.
To achieve this, the pre-trained ViT backbone was utilized as a shared feature extractor. We appended two parallel, lightweight regression heads to the model, as shown in Figure 2b. Each head consists of a Multi-Layer Perceptron (MLP) with 2 fully-connected layers and a ReLU activation function. Both heads take the same high-dimensional class token embedding from the ViT output. One head is trained to map this embedding to a single numerical value for predicted FW, while the second head maps it to a prediction for DW.
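The sketch below illustrates this two-head architecture, assuming the backbone returns the 384-dimensional class token of ViT-S/14 (as the DINOv2 torch.hub models do); the hidden width of 256 is an illustrative assumption, not a value reported in the text.

```python
import torch
import torch.nn as nn


class BiomassRegressor(nn.Module):
    """Shared ViT backbone with parallel MLP regression heads for FW and DW."""

    def __init__(self, backbone, embed_dim=384, hidden_dim=256):
        super().__init__()
        # e.g. torch.hub.load('facebookresearch/dinov2', 'dinov2_vits14')
        self.backbone = backbone

        def make_head():
            return nn.Sequential(nn.Linear(embed_dim, hidden_dim),
                                 nn.ReLU(),
                                 nn.Linear(hidden_dim, 1))

        self.head_fw = make_head()
        self.head_dw = make_head()

    def forward(self, x):
        feat = self.backbone(x)  # [B, embed_dim] class token embedding
        return self.head_fw(feat).squeeze(-1), self.head_dw(feat).squeeze(-1)
```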
The fine-tuning process was conducted on our custom dataset of 833 labeled rapeseed images. The model was trained end-to-end to minimize a combined loss function, $\mathcal{L}_{total}$, which sums the losses from both tasks:

$$\mathcal{L}_{total} = \mathcal{L}_{FW} + \mathcal{L}_{DW}.$$

For each task-specific loss ($\mathcal{L}_{FW}$ and $\mathcal{L}_{DW}$), we employed the Smooth L1 Loss. This loss function was chosen for its robustness to outliers, which is critical for regression on real-world biological data. It behaves like the L2 loss for small errors, providing smooth gradients, while behaving like the L1 loss for larger errors, which prevents exploding gradients. To further mitigate the risk of overfitting inherent to small datasets (n = 833), we relied on the strong regularization effect of the MTL framework and the robustness of the low-learning-rate DINOv2 backbone. By transferring features learned from diverse species in the pre-training stage, the model avoids learning dataset-specific artifacts.
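Under the equal task weighting implied by the summed objective, the combined loss can be sketched as:

```python
import torch.nn.functional as F

def multitask_loss(pred_fw, pred_dw, target_fw, target_dw):
    """L_total = SmoothL1(FW) + SmoothL1(DW), with equal task weights."""
    return (F.smooth_l1_loss(pred_fw, target_fw)
            + F.smooth_l1_loss(pred_dw, target_dw))
```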
During this stage, the weights of the entire model were unfrozen. A differential learning rate was employed, where the newly added regression heads were trained with a higher learning rate (5 × 10⁻⁴), while the pre-trained ViT backbone was updated with a much smaller learning rate (5 × 10⁻⁵). This strategy preserves the powerful features learned during pre-training while allowing the heads to adapt quickly and the backbone to adjust subtly to the specific visual characteristics of our rapeseed data. The model was trained using the AdamW optimizer for 200 epochs. This fine-tuning process yields the final specialized model, optimized for accurate multi-task biomass prediction.
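Continuing the sketch above, the differential learning rates map naturally onto AdamW parameter groups (the learning rates and weight decay are those reported in Section 2.5.1):

```python
import torch

model = BiomassRegressor(backbone)  # from the sketch above
optimizer = torch.optim.AdamW(
    [
        {"params": model.backbone.parameters(), "lr": 5e-5},  # subtle backbone updates
        {"params": [*model.head_fw.parameters(),
                    *model.head_dw.parameters()], "lr": 5e-4},  # faster head adaptation
    ],
    weight_decay=0.05,
)
```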
2.5. Experimental Setup and Evaluation Metrics
2.5.1. Experimental Setup
All experiments were conducted using the PyTorch 2.0.1 deep learning framework on an NVIDIA RTX 3090 GPU. Our model employs a Vision Transformer Small (DINOv2 ViT-S/14) architecture, with the backbone initialized from the DINOv2 weights pre-trained on the public plant datasets. To ensure a robust and unbiased evaluation of our model on the limited dataset of 833 images, we implemented a 5-fold cross-validation procedure. To prevent any data leakage based on genetic similarity, the dataset was partitioned into five folds at the cultivar level, ensuring that all images of a given variety belonged to the same fold. This strategy tests the model’s ability to generalize to unseen genotypes. Specifically, the list of cultivars was randomly divided into five non-overlapping groups to form the folds. The training and evaluation process was iterated five times; in each iteration, one distinct fold was held out as the test set (150 to 200 images, depending on the cultivars in the fold), while the remaining four folds were used for training. For hyperparameter tuning and model selection within each iteration, one of these four training folds was designated as a validation set. Hyperparameters were independently tuned for each baseline model to ensure a fair comparison. The final performance of the model is reported as the mean and standard deviation across all five folds. Within this statistical design, all baseline and proposed models are trained and evaluated under the same cultivar-stratified folds, so that the from-scratch and ImageNet-pre-trained models act as explicit controls for testing the added value of domain-specific self-supervised pre-training and the multi-task learning framework.
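One way to realize this cultivar-level partitioning is scikit-learn's GroupKFold, sketched below; `images` and `cultivar_ids` are hypothetical per-image arrays, and shuffling the cultivar list beforehand reproduces the random fold assignment described above.

```python
from sklearn.model_selection import GroupKFold

# Cultivar-level 5-fold split: all images of a cultivar land in exactly one
# fold, so every test fold contains only unseen genotypes.
gkf = GroupKFold(n_splits=5)
for fold, (train_idx, test_idx) in enumerate(gkf.split(images, groups=cultivar_ids)):
    print(f"fold {fold}: {len(train_idx)} train images, {len(test_idx)} test images")
```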
The original smartphone images were resized directly to an input size of 448 × 448 pixels. This resizing maintained the general structure of the images while minimizing computational overhead. Pixel intensities were then normalized using the mean and standard deviation recommended for DINOv2. A small 4 cm × 3 cm pink label card placed next to each plant remained visible in the lower part of the frame; it contains only a numeric identifier and occupies a small portion of the image, and was therefore left unmasked. In practice, we observed no adverse effect on model convergence, suggesting that the network effectively learned to ignore this card.
During the fine-tuning within each fold, the model was trained for 200 epochs with a batch size of 64. Photometric augmentations (brightness/contrast/saturation jitter and a small hue shift) were applied to mitigate minor illumination variability. We utilized the AdamW optimizer with a starting learning rate of 5 × 10⁻⁵ and a weight decay of 0.05. A differential learning rate was employed, where the pre-trained ViT backbone was fine-tuned with a small learning rate of 5 × 10⁻⁵, while the newly added regression heads were trained with a larger rate of 5 × 10⁻⁴. A cosine annealing scheduler was employed to gradually decrease the learning rates. The model checkpoint that achieved the best average validation performance across both tasks (FW and DW) was selected for final evaluation on the respective hold-out test set.
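A sketch of the corresponding input pipeline and schedule is shown below; the jitter magnitudes are illustrative assumptions, while the normalization statistics are the standard ImageNet values used by DINOv2.

```python
import torch
from torchvision import transforms

train_tf = transforms.Compose([
    transforms.Resize((448, 448)),                     # direct resize, no cropping
    transforms.ColorJitter(brightness=0.2, contrast=0.2,
                           saturation=0.2, hue=0.02),  # photometric jitter
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],   # ImageNet / DINOv2 stats
                         std=[0.229, 0.224, 0.225]),
])
# Cosine annealing over the 200 fine-tuning epochs (stepped once per epoch).
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=200)
```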
2.5.2. Evaluation Metrics
To quantitatively assess the performance of our multi-task biomass regression model, we employed a set of four standard statistical metrics. These metrics were calculated independently for the FW and DW tasks. For all formulas, let $y_i$ be the ground-truth biomass value, $\hat{y}_i$ be the model’s predicted biomass value for the $i$-th sample, $\bar{y}$ be the mean of the ground-truth values, and $n$ be the total number of samples in the test set.
The primary metric is the Coefficient of Determination (R²), which measures the proportion of variance in the dependent variable that can be predicted from the independent variable(s):

$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}.$$

An R² value closer to 1.0 indicates a stronger correlation and a better model fit.
The Mean Absolute Error (MAE) measures the average of the absolute differences between predictions and actual values, treating all errors with equal weight and providing a direct interpretation of the average prediction error:

$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|.$$
In addition, we report error-based metrics in the original units of measurement (grams). The Root Mean Square Error (RMSE) quantifies the standard deviation of the prediction errors and is particularly sensitive to large errors due to the squaring term:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}.$$
Finally, we compute the Relative Root Mean Square Error (RRMSE) by normalizing the RMSE with the mean of the ground-truth observations. This dimensionless metric facilitates comparisons across datasets of different scales and is often expressed as a percentage:

$$\mathrm{RRMSE} = \frac{\mathrm{RMSE}}{\bar{y}} \times 100\%.$$
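For reference, all four metrics can be computed directly from their definitions, as in this NumPy sketch:

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return R^2, MAE (g), RMSE (g), and RRMSE (%) for one task."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    mae = np.mean(np.abs(y_true - y_pred))
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    rrmse = 100.0 * rmse / y_true.mean()
    return r2, mae, rmse, rrmse
```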
3. Results and Discussion
3.1. Overall Model Performance
The primary objective of this study was to develop a robust model for rapeseed above-ground biomass estimation. By leveraging a two-stage framework combining self-supervised pre-training with a multi-task fine-tuning strategy, our proposed model demonstrated high accuracy and stability. The performance for both FW and DW, evaluated through a rigorous 5-fold cross-validation method, is summarized in Table 1 and Table 2, respectively.
Our framework achieved strong predictive performance for both target variables. For FW, as shown in Table 1, the model achieved a mean R² of 0.842 ± 0.041, indicating a strong linear relationship between the model’s predictions and the ground-truth measurements. The model yielded a low RMSE of 3.324 ± 0.230 g and an MAE of 2.323 ± 0.525 g. Performance on the DW task was similarly robust (Table 2), with an R² of 0.829 ± 0.032, an RMSE of 0.414 ± 0.011 g, and an MAE of 0.315 ± 0.008 g. The low standard deviation across all metrics for both tasks highlights the model’s stability and its ability to generalize consistently across different subsets of the data.
To provide a robust horizontal comparison, our framework was evaluated against several baseline and alternative approaches on both tasks. As shown in both tables, our proposed method significantly outperforms models trained from scratch, such as a standard ResNet-50 or a ViT-Small. These “from scratch” models struggled to converge effectively on the limited training data, resulting in poor R² values and very high prediction errors for both FW and DW. More importantly, it also shows a marked improvement over ViT and ResNet-50 backbones pre-trained on the generic ImageNet dataset. This result strongly underscores the benefits of domain-specific self-supervised pre-training on plant-related imagery. The features learned from various plant morphologies appear to be more transferable to our specific biomass estimation task than those learned from general objects.
The strong quantitative performance of our proposed method is visually corroborated by the comparative scatter plots in Figure 3. The models trained from scratch, ResNet-50 (Figure 3a) and ViT (Figure 3b), exhibit extremely poor performance. Their scatter plots show significant dispersion and contain many points collapsed into horizontal lines, indicating the models failed to learn discriminative features and instead predicted a constant value for large subsets of the input data. In contrast, leveraging pre-training on ImageNet (Figure 3c,d) yields a dramatic improvement, with the points forming a much clearer linear trend. Finally, our proposed model (Figure 3e) clearly demonstrates superior performance; the data points are tightly clustered around the identity line with minimal dispersion.
However, we observed a slight tendency for the regression slope to be less than 1 across all models. This systematic bias is largely attributable to the data imbalance characteristic of agricultural datasets, specifically the long-tail distribution in which high-biomass samples (e.g., DW > 5 g) are significantly underrepresented. Consequently, the model tends to regress towards the mean of the well-represented low-biomass range to minimize the global loss. Future work could mitigate this by implementing weighted loss functions (e.g., assigning higher weights to rare samples) or stratified sampling strategies to improve predictive accuracy at the extremes.
The box plot of RMSE values for the DW task, as shown in Figure 4, further summarizes these findings. It visually highlights the large error distribution and instability of the models trained from scratch. Our proposed model not only achieves the lowest median error for DW prediction but also the most compact distribution, indicating superior stability across all cross-validation folds. A similar trend was observed for the Fresh Weight RMSE distributions. This comprehensive visual and quantitative comparison affirms that our framework, which integrates domain-specific self-supervised learning with a multi-task objective, provides a more robust and accurate solution than training from scratch or using generic ImageNet pre-training for this challenging agricultural regression task.
3.2. Ablation Studies
To systematically dissect our framework and quantify the contribution of each proposed component, we conducted a series of ablation studies. The goal was to validate the effectiveness of our two core components: (1) domain-specific SSL pre-training (compared to generic ImageNet pre-training) and (2) our MTL framework (compared to training for each task independently).
By evaluating different combinations of these components, we isolated their individual and synergistic impacts on the final performance. The comprehensive results of this analysis for both FW and DW are presented in Table 3.
The most significant performance gain was unequivocally attributed to the domain-specific self-supervised pre-training strategy. As shown in Table 3, this holds true for both multi-task and single-task settings. To isolate the effect of pre-training, the most direct comparison is between the ’ImageNet Pre-train (Single-Task)’ and ’SSL Pre-train (Single-Task)’ models. For FW, domain-specific SSL initialization alone boosted the R² from 0.736 to 0.817 and cut the RMSE from 4.505 g to 3.603 g. An even larger relative improvement was seen for DW, where the R² jumped from 0.663 to 0.798. This substantial improvement confirms our hypothesis that leveraging diverse, unlabeled plant data to learn foundational visual features of plant morphology is the most critical step for success on our specialized biomass estimation task.
Next, we evaluated the impact of our MTL framework. Table 3 shows that MTL provides a consistent performance boost over single-task training, regardless of the pre-training method used. This is evident when comparing the ’ImageNet (Single-Task)’ model with the ’ImageNet (Multi-Task)’ model, and is confirmed with our SSL models. For our final model, this is demonstrated by comparing the ’SSL Pre-train (Single-Task)’ model with our full ’SSL Pre-train (Multi-Task)’ model. For FW, applying MTL further reduced the RMSE from 3.603 g to 3.324 g. The same trend was observed for DW, where the RMSE dropped from 0.461 g to 0.414 g. This suggests that MTL acts as an effective regularizer. By forcing the model to learn a shared representation that is simultaneously beneficial for predicting both fresh and dry weight, the model is guided to capture more fundamental visual cues related to plant density and structure, rather than overfitting to superficial features of a single task.
The visual results in Figure 5 corroborate these quantitative findings for the DW task. The plots visually trace the step-by-step improvements detailed in the table: (a) the ’ImageNet (Single-Task)’ model shows significant dispersion; (b) adding MTL (’ImageNet (Multi-Task)’) tightens the cluster slightly; (c) switching to domain-specific SSL (’SSL (Single-Task)’) causes a dramatic improvement, bringing points much closer to the identity line; finally, (d) our full model (’SSL (Multi-Task)’) shows the tightest clustering and least variance, visually confirming its superior accuracy and stability. In summary, these studies provide clear evidence that while domain-specific SSL provides the foundational leap in performance, the MTL framework provides an additional, valuable contribution, working synergistically to achieve the best result.
3.3. Comparative Analysis of SSL Frameworks
To provide a broader context for our choice of DINOv2 and to address the landscape of SSL, we conducted a comparative analysis against other prominent SSL frameworks. We selected three representative methods from different families of SSL: Masked Autoencoders (MAE-SSL) [34] as a leading reconstruction-based method, SimCLR [35] as a foundational contrastive learning approach, and iBOT [36], which, like DINOv2, uses an online knowledge distillation strategy. Each of these frameworks offers a distinct philosophy for learning visual representations from unlabeled data, providing a comprehensive basis for comparison.
The models were pre-trained on the same public plant datasets and subsequently fine-tuned on our specific rapeseed dataset using the identical multi-task learning protocol. The performance of these frameworks on the biomass estimation task is summarized in Table 4. The results clearly position DINOv2 as a highly effective framework for this agricultural vision task, achieving the robust performance detailed throughout our study.
While all tested SSL methods are exceptionally powerful, the superior performance of DINOv2 may be attributed to its methodology’s alignment with our specific task requirements. Knowledge distillation frameworks like DINOv2 and iBOT are known to excel at learning strong, part-level semantic features. This is particularly crucial for biomass estimation, where the model must identify and aggregate information from small, potentially fragmented or occluded leaf segments to infer complex 3D attributes like density and volume, rather than just recognizing the plant’s holistic 2D shape. In contrast, while MAE learns excellent holistic representations through reconstruction, its focus may be less on the fine-grained semantics of plant structure. Similarly, contrastive methods like SimCLR are optimized for instance-level discrimination, which may be less critical than the part-level understanding required for our regression task. Therefore, the rich semantic features learned by DINOv2 appear to be the most transferable and effective for handling the complexities of in-field canopy images for biomass prediction.
3.4. Model Interpretability
To validate that the model’s predictions are based on relevant morphological features, we visualized the self-attention maps of the [CLS] token from the final layer of the ViT backbone. As illustrated in Figure 6, the attention mechanism focuses intensively on the leaf canopy, tracing the contours and overlapping regions of the rapeseed leaves, while effectively ignoring the soil background and shadows. This confirms that the model learns to identify and quantify plant tissue to estimate biomass, rather than overfitting to background noise.
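A sketch of how such maps can be extracted is shown below. The `get_last_selfattention` method is exposed by the original DINO ViT implementation; DINOv2 backbones may instead require an equivalent forward hook on the final attention block, so treat this as an assumed interface rather than the exact code used here.

```python
import torch

model.eval()
with torch.no_grad():
    # [1, n_heads, n_tokens, n_tokens]; assumed DINO-style API.
    attn = model.backbone.get_last_selfattention(img.unsqueeze(0))

cls_attn = attn[0, :, 0, 1:].mean(dim=0)  # [CLS] -> patch tokens, head-averaged
side = int(cls_attn.numel() ** 0.5)       # 448 / 14 = 32 patches per side
heatmap = cls_attn.reshape(side, side)    # upsample and overlay on the image
```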
4. Conclusions
This study presented and validated a novel deep learning framework for precisely estimating rapeseed above-ground biomass (FW and DW) from top-down 2D images, meeting the critical need for accurate and efficient phenotyping. Our approach moves beyond common 2D phenotyping to tackle the more complex task of inferring 3D mass. By integrating domain-specific SSL with an MTL fine-tuning strategy, our model achieved strong predictive performance for both FW (R² = 0.842) and DW (R² = 0.829), demonstrating high accuracy and robustness. Our analyses confirmed that (1) domain-specific SSL provides a substantial performance advantage over generic ImageNet pre-training, and (2) the MTL framework acts as an effective regularizer, further improving accuracy over single-task models. This work provides a powerful and data-efficient pipeline for non-destructive biomass phenotyping, demonstrating the potential of modern self-supervised learning to accelerate agricultural research. While validated on rapeseed, the underlying SSL+MTL framework relies on general plant features and holds potential for generalization to other crops, warranting further investigation in future multi-species studies.
Despite the promising results, this study has certain limitations. A primary constraint is the relatively small fine-tuning dataset, which bounds the model’s performance even though the results were robust. Additionally, the ground-truth biomass measurement method, despite its accuracy, is inherently destructive and time-consuming, creating a fundamental bottleneck for future dataset expansion.