1. Introduction
Accurate crop yield prediction is important for sustainable agricultural management and for food and economic stability [1,2]. As the global population grows [3] and climate variability intensifies, producing reliable forecasts has become increasingly important for planning and adaptation [4]. Robust yield estimates support decision-making across the food-supply chain, including government planning, farmers’ resource allocation, and risk management [5,6].
Conventional prediction methods consist of field surveys and process-based crop models. Although they offer valuable insights, they are constrained by cost, scalability, and simplifying assumptions that often break down under diverse farming conditions [7,8]. The expansion of Earth Observation (EO) remote-sensing technologies, such as satellite and Unmanned Aerial Vehicle (UAV) imagery, has transformed agricultural monitoring by delivering continuous, high-quality observations of crop growth [9,10]. This data abundance has shifted yield prediction toward data-driven methods, in which machine and deep learning models capture the nonlinear, intricate relationships between observations and yield [11].
Traditional approaches relied on machine learning models like Random Forests using engineered vegetation indices [12,13,14,15], though these often lacked generalization across agro-climatic regions [16]. Subsequent deep learning methods improved feature extraction via Convolutional Neural Networks (CNNs) [7,17] and captured temporal dynamics using Recurrent Neural Networks (RNNs) or attention mechanisms [18,19,20]. However, these models are typically trained from scratch on limited datasets, potentially missing the broader Earth-system context [9,10].
Recently, a new line of computer vision research driven by Vision Transformers (ViTs) and geospatial foundation models has emerged. ViTs use self-attention to capture long-range dependencies within imagery, addressing CNNs’ tendency to focus on local patterns [21]. For instance, Barman et al. introduced ViT-SmartAgri and demonstrated that ViT architectures combined with smartphone-based applications can outperform traditional models like InceptionV3 in real-time plant disease detection [22]. Similarly, Mehdipour et al. provided a survey highlighting that ViTs offer superior scalability and handling of long-range dependencies compared to CNNs, particularly in complex field environments [23]. Furthermore, hybrid approaches and lightweight ViTs are being actively explored to reduce computational costs while maintaining high accuracy in precision agriculture tasks [24]. Large-scale foundation models, like IBM–NASA’s Prithvi-EO-2.0-600M, are pre-trained on global satellite archives and provide generalizable representations that can be fine-tuned for downstream tasks, including agricultural prediction [25]. Together, these advances point toward more scalable and transferable frameworks for data-driven crop yield forecasting.
Despite this progress, several gaps remain. First, to the best of our knowledge, many existing models are trained from scratch on task-specific agricultural image sets that are relatively small [17,18,26], limiting generalization and forgoing the Earth-system knowledge embedded in pre-trained foundations. Second, most deep learning models target coarse county-level yield, which is useful for regional planning but insufficient for capturing intra-field variability (defined here as the localized yield differences within a single field boundary caused by soil heterogeneity and management zones), which is what farmers and operational decision-makers require. Third, higher-resolution efforts often frame yield as a classification problem (discretized categories), which cannot fully represent the continuous nature of productivity or mixed pixels. Fourth, many crop-yield models are treated as “black boxes” with limited interpretability. To earn trust from decision-makers, models should pair accuracy with insights into the agronomic processes driving predictions, such as the relative importance of specific growth stages or spectral signatures. To address these limitations, we introduce the FARM framework (Fine-tuning Agricultural Regression Models). FARM adapts a pre-trained ViT-based encoder for intra-field crop yield regression, providing a more detailed and flexible tool for mapping agricultural productivity. Building FARM on a foundation model allows us to leverage its rich, pre-trained understanding of the spatio-temporal structure of satellite data for our canola yield-prediction task.
We hypothesize that fine-tuning a pre-trained geospatial foundation model (Prithvi-EO) will result in lower error rates (lower RMSE) than similar architectures trained from scratch, owing to the model’s ability to transfer learned representations of complex Earth-surface dynamics. We further hypothesize that training FARM on a large supervised dataset with upsampled county-level labels can produce a robust base model that transfers effectively to true intra-field yield prediction when fine-tuned on a much smaller set of high-resolution yield-monitor labels. The objectives of this study are to:
Adapt and fine-tune a geospatial ViT-based foundation model for intra-field yield prediction and demonstrate the efficiency of foundation models for crop yield prediction in comparison with other state-of-the-art approaches.
Design a pixel-wise regression architecture that frames yield prediction as a continuous task at high spatial resolution, enabling the model to estimate a precise yield value for every image pixel.
Assess interpretability by quantifying and visualizing embedded attention mechanisms to identify influential phenological stages and spectral signatures that drive the model’s predictions.
Demonstrate, using an independent high-resolution yield monitor dataset, that fine-tuning FARM provides stronger performance than training the same architecture from scratch, establishing a practical transfer-learning pathway from regional monitoring to precision agriculture.
As a key contribution to crop-yield prediction, we pioneer the application of a large-scale geospatial foundation model (Prithvi-EO-2.0-600M) to yield forecasting in the Canadian Prairies, showing how pre-trained models can be effectively adapted for specialized agricultural tasks. While we focus on canola, the framework is designed to be adaptable to other crops.
The remainder of this paper is structured as follows. Section 2 details the methodology, including the datasets and the FARM architecture. Section 3 describes the experimental setup, model configurations, and training objectives. Section 4 presents and discusses the results, including performance metrics and baseline comparisons. Section 5 discusses the findings and limitations, and Section 6 concludes with a summary and directions for future research.
3. Experiments
3.1. Training Objective and Implementation
To train our model, we define an objective function and utilize a training pipeline with modern optimization and regularization techniques. The entire framework is implemented to ensure efficient, stable, and reproducible training.
The model’s goal is to predict a continuous yield value, $\hat{y}_{i,j}$, for each pixel in an input image, such that it closely matches the corresponding ground-truth yield, $y_{i,j}$. To quantify the discrepancy between predicted and true yield maps, we evaluate two loss functions.
The primary loss function used in our main experiments is the Mean Squared Error (MSE) loss. It is a standard and effective choice for regression tasks, calculated as the average of the squared differences between the predicted and actual yield values over all pixels in a batch. For a single predicted yield map $\hat{Y}$ and a ground-truth map $Y$, both of size $H \times W$, the MSE is defined as follows:

$$\mathcal{L}_{\mathrm{MSE}} = \frac{1}{HW} \sum_{i=1}^{H} \sum_{j=1}^{W} \left( \hat{y}_{i,j} - y_{i,j} \right)^2$$

MSE loss is differentiable and penalizes larger errors more heavily due to the squaring term, which encourages the model to avoid significant deviations.
To assess the model’s robustness to potential outliers in the ground-truth data, we also conducted experiments using the Huber loss. The Huber loss is a piecewise function that provides a compromise between the sensitivity of MSE and the robustness of Mean Absolute Error (MAE). It behaves quadratically for small errors but linearly for large errors, which reduces the influence of outliers that might otherwise dominate the gradient during training. This makes the Huber loss less sensitive to anomalous yield values in the dataset, often leading to better generalization. For a per-pixel residual $e_{i,j} = \hat{y}_{i,j} - y_{i,j}$, the loss is

$$\mathcal{L}_{\mathrm{Huber}}(e_{i,j}) = \begin{cases} \frac{1}{2} e_{i,j}^{2}, & |e_{i,j}| \le \delta \\ \delta \left( |e_{i,j}| - \frac{1}{2}\delta \right), & \text{otherwise} \end{cases}$$

where $\delta$ is a tunable hyperparameter that defines the threshold at which the function transitions from quadratic to linear.
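For concreteness, both losses can be computed with PyTorch’s built-in functionals as sketched below; the tensor shapes and the delta value of 1.0 are illustrative assumptions rather than the tuned settings of our experiments.

```python
import torch
import torch.nn.functional as F

# Illustrative shapes: a batch of 8 predicted yield maps, 224 x 224 pixels.
pred = torch.randn(8, 1, 224, 224)    # model output (standardized yield)
target = torch.randn(8, 1, 224, 224)  # ground-truth yield map

mse = F.mse_loss(pred, target)                 # quadratic penalty everywhere
huber = F.huber_loss(pred, target, delta=1.0)  # quadratic for |e| <= delta, linear beyond
```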
To facilitate the stable training of our architecture, we employ a deep supervision strategy through the use of an auxiliary head. This auxiliary head is attached to an intermediate layer of the UperNet decoder and provides an additional, coarser-grained prediction output. Its purpose is to inject an additional gradient signal directly into the middle of the network during backpropagation. This helps to mitigate the vanishing gradient problem, which can be a challenge in very deep models, and encourages the intermediate layers to learn more discriminative features. The auxiliary head has its own smaller UperNet decoder and a 1-channel regression output. Its loss, calculated using the same primary loss function (e.g., MSE), is added to the total loss of the main head, weighted by a factor of 0.2. The total loss for the model is therefore:

$$\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{main}} + 0.2 \, \mathcal{L}_{\mathrm{aux}}$$
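A minimal sketch of this objective follows, assuming hypothetical `main_pred` and `aux_pred` outputs and assuming the coarser auxiliary target is obtained by resizing the ground truth; the actual decoder wiring may differ.

```python
import torch.nn.functional as F

AUX_WEIGHT = 0.2  # weighting factor for the auxiliary head, as described above

def total_loss(main_pred, aux_pred, target):
    """Deep-supervision objective: main loss plus a down-weighted auxiliary loss."""
    loss_main = F.mse_loss(main_pred, target)
    # The auxiliary prediction is coarser; resize the target to match it (assumption).
    target_aux = F.interpolate(target, size=aux_pred.shape[-2:],
                               mode="bilinear", align_corners=False)
    loss_aux = F.mse_loss(aux_pred, target_aux)
    return loss_main + AUX_WEIGHT * loss_aux
```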
This deep supervision technique helps to regularize the model and often leads to faster convergence and improved final performance.
3.2. Evaluation Metrics
To evaluate the model’s performance, we utilize four standard regression metrics: Mean Absolute Error (MAE) [27], Root Mean Squared Error (RMSE) [27], Coefficient of Determination ($R^2$) [28], and Pearson’s Correlation Coefficient [29]. While RMSE and MAE quantify the magnitude of prediction error, $R^2$ and Pearson’s Correlation Coefficient assess the model’s ability to explain yield variability and capture linear trends in the data.
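These four metrics follow their standard definitions; a compact reference implementation over flattened pixel arrays might look like this:

```python
import numpy as np

def regression_metrics(y_pred: np.ndarray, y_true: np.ndarray) -> dict:
    """Compute MAE, RMSE, R^2, and Pearson's r over flattened pixel arrays."""
    y_pred, y_true = y_pred.ravel(), y_true.ravel()
    err = y_pred - y_true
    mae = np.mean(np.abs(err))
    rmse = np.sqrt(np.mean(err ** 2))
    ss_res = np.sum(err ** 2)                      # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2) # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    pearson = np.corrcoef(y_pred, y_true)[0, 1]
    return {"MAE": mae, "RMSE": rmse, "R2": r2, "Pearson": pearson}
```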
3.3. Implementation Details
The entire framework was implemented using the PyTorch (version 2.6.0) deep learning library and streamlined with PyTorch Lightning for organized and reproducible training. To ensure experimental reproducibility, the random seed was fixed to 0. We initialized the encoder using the pre-trained Prithvi EO V2 600M TL weights and trained the full architecture for 120 epochs with a batch size of 8. We utilized the AdamW optimizer, which decouples weight decay from gradient-based updates, applying a weight decay of 0.1 to improve generalization. To dynamically adjust the learning rate, we employed a Cosine Annealing scheduler preceded by a linear warm-up period of 20 epochs; the learning rate warmed up to its peak value and then decayed to its minimum following the cosine schedule. During training, data augmentation was applied on-the-fly, including random horizontal and vertical flips with a probability of 0.2 and Gaussian noise injection with a probability of 0.4 to enhance robustness against sensor noise. The UperNet decoder was configured with 1024 channels to effectively fuse multi-scale features, while the regression head utilized a dropout rate of 0.1 to mitigate overfitting. To accelerate training and reduce GPU memory consumption, we utilized bfloat16 (bf16) mixed-precision training. The experiments were conducted on a high-performance computing node equipped with NVIDIA GPUs with 48 GB of memory.
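The optimizer and learning-rate schedule described above can be assembled from standard PyTorch components, as sketched below. Since the exact peak and minimum learning rates are not restated here, `PEAK_LR` and `MIN_LR` are placeholders, and `model` is a stand-in for the FARM network.

```python
import torch
from torch.optim.lr_scheduler import LinearLR, CosineAnnealingLR, SequentialLR

PEAK_LR = 1e-4  # placeholder: peak learning rate reached after warm-up
MIN_LR = 1e-6   # placeholder: floor of the cosine decay

model = torch.nn.Linear(4, 1)  # stand-in for the FARM architecture

optimizer = torch.optim.AdamW(model.parameters(), lr=PEAK_LR, weight_decay=0.1)

# 20-epoch linear warm-up followed by cosine annealing over the remaining 100 epochs.
warmup = LinearLR(optimizer, start_factor=0.01, total_iters=20)
cosine = CosineAnnealingLR(optimizer, T_max=100, eta_min=MIN_LR)
scheduler = SequentialLR(optimizer, schedulers=[warmup, cosine], milestones=[20])
```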
3.4. Baseline
To further contextualize the performance of our foundation-model-based approach, we compare it against other state-of-the-art baseline architectures for spatio-temporal prediction tasks. To ensure a fair comparison, we adopted the 3D-CNN and DeepYield architectures proposed in [26] and [30], respectively, and trained them from scratch on our canola yield dataset under identical experimental settings. These models were selected because, as reported in [26,30], they have already demonstrated statistically significant performance gains over traditional machine learning baselines (including Random Forest, LASSO, and SVM) and other deep learning approaches. Consequently, they serve as the most reliable benchmarks for evaluating the advancements offered by FARM. However, since the original studies focused on a different crop type, a direct comparison would not have been equitable. Therefore, retraining the models on our dataset enabled a consistent, crop-specific performance evaluation. The 3D-CNN architecture follows an encoder–decoder design that organizes multi-temporal satellite image data into a unified spatiotemporal framework and employs 3D convolutional kernels to jointly learn spatial and temporal dependencies. DeepYield combines a 3D-CNN with ConvLSTM layers.
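For orientation, the sketch below illustrates the general encoder–decoder pattern of a 3D-CNN yield regressor, where 3D convolutions treat time as a third dimension so spatial and temporal features are learned jointly. It is a simplified stand-in, not the exact architecture of [26].

```python
import torch
import torch.nn as nn

class Tiny3DCNN(nn.Module):
    """Illustrative 3D-CNN regressor over stacked monthly imagery.

    Input: (batch, bands, time, height, width); output: a per-pixel yield map.
    """
    def __init__(self, bands: int = 6):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv3d(bands, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv3d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
        )
        # Collapse the temporal axis, then regress a single yield channel.
        self.pool_time = nn.AdaptiveAvgPool3d((1, None, None))
        self.head = nn.Conv2d(64, 1, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = self.encoder(x)                   # (B, 64, T, H, W)
        feats = self.pool_time(feats).squeeze(2)  # (B, 64, H, W)
        return self.head(feats)                   # (B, 1, H, W)
```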
4. Results
This section presents the evaluation results of our proposed intra-field crop yield regression (FARM) framework. First, we detail the quantitative performance, followed by a qualitative analysis of the generated yield values. Subsequently, we compare our model against other baselines to highlight the efficiency of the proposed architecture in predicting continuous yield values. Finally, we present an analysis of the model’s interpretability through its attention weights from temporal and spectral perspectives to explore the key components driving the final predictions.
4.1. Quantitative Evaluation
Table 2 presents a comparison of intra-field regression performance for three training configurations (loss functions)—MSE, Huber, and MSE + Aux—on the validation set. As described in the Methodology section, the model was trained under three distinct strategies: (i) using only the Mean Squared Error (MSE) loss, (ii) using only the Huber loss, and (iii) using the MSE loss combined with an auxiliary head for deep supervision.
The results are reported in both standardized form (as used during training) and de-standardized into common agricultural units (kg/ha and bu/ac) for practical interpretability. The model trained with the Huber loss achieved a slightly lower RMSE of 0.4677 and a higher $R^2$ of 0.7852 compared to the MSE-only model. This modest improvement indicates that while outliers are present in the dataset, they do not strongly influence the regression outcomes; nonetheless, employing a robust loss such as Huber yields a small gain in generalization.
The best model (FARM) achieved an $R^2$ of 0.7852, showing that it successfully explains over 78% of the variance in pixel-level canola yield within the validation dataset. The high Pearson’s Correlation demonstrates a strong linear relationship between the predicted and ground-truth yield values, confirming that the model’s predictions are consistently aligned with the actual yield trends.
Figure 4 presents a side-by-side comparison of predicted yield values with ground truth for an arbitrary location in our region of interest. The predicted yield map (Figure 4b) successfully captures the fine-grained, intra-field variability present in the ground-truth map (Figure 4a). A map of the residuals (Figure 4c) illustrates the spatial patterns of prediction errors, showing areas where the model achieves higher or lower accuracy.
4.2. Qualitative Evaluation
To visually assess the model’s spatial prediction capabilities, we compared the predicted yield maps against the ground truth. Figure 5 shows this for a representative sample from the validation set. The scatter plot of predicted versus true values (Figure 5b,c) shows a tight clustering of points along the identity line, confirming the high correlation reported in Table 2. The distribution of prediction errors is centered around zero and close to a normal distribution (Figure 5d), indicating that the model does not have a significant systematic bias.
4.3. Comparison with Baselines
Comparison with regression-based baselines:
To validate the effectiveness of adapting the large-scale Prithvi-EO-2.0 foundation model for this task, we compared our final model (FARM) against the state-of-the-art baselines; the results are presented in Table 3.
The results clearly indicate that our FARM model significantly outperforms the 3D-CNN and DeepYield models across all standard regression metrics. This higher performance can be attributed to two key factors. First, the Prithvi-EO-2.0-600M encoder has been pre-trained on a massive and diverse dataset of global satellite imagery. This provides our model with a rich, generalized understanding of vegetation dynamics, phenology, and land-surface patterns that a model trained from scratch on a specific agricultural dataset cannot easily acquire.
Second, the Vision Transformer architecture, with its self-attention mechanism, is inherently capable of capturing long-range spatial dependencies across an entire image chip. This may allow it to better model the contextual factors influencing yield variability than the localized receptive fields of the CNN-based encoders in the baseline models. These results support our hypothesis that fine-tuning large geospatial foundation models is a superior strategy for complex, dense prediction tasks like intra-field crop yield estimation. Because prior studies did not focus on canola, a direct comparison with all previously proposed baselines was not appropriate. We therefore restrict our baseline evaluation to 3D-CNN and DeepYield, which are reported as the strongest-performing models in those works [26,30].
4.4. Validation on High-Resolution Ground Truth Data
To address the limitations associated with training on upsampled county-level yield data, we performed a series of experiments using a distinct dataset containing high-spatial-resolution (10 m) ground-truth yield data. This dataset covers a limited number of fields in the Canadian Prairies, collected during the 2013 to 2024 growing seasons. Unlike the primary training set, these labels were not upsampled but represent true localized yield values.
We conducted three experiments to assess the performance of the FARM architecture and the transferability of the features learned from the upsampled data.
4.4.1. Experiment 1: Direct Inference (Zero-Shot Application)
We applied the FARM model (originally trained on upsampled county-level labels) directly to the new high-resolution dataset without any weight updates. The objective was to determine whether the model learned generalized spectral-yield relationships or merely memorized smoothed regional trends. All imagery originally at 10 m spatial resolution was upsampled to 5 m to match the 224 × 224 input size required by the county-level FARM model checkpoint. This transformation doubles the spatial dimensions (112 → 224) through interpolation applied directly to the raw imagery. Each 10 m pixel is subdivided into four 5 m pixels using standard resampling (bilinear interpolation), and no boundary-filling strategies (such as padding with mean values) are applied. The resulting 5 m rasters preserve the original spatial extent while producing inputs fully compatible with the higher-resolution model workflow. The model achieved an RMSE of 0.921 and an $R^2$ of 0.508. Figure 6 shows the qualitative and quantitative results of this experiment.
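This resampling step can be expressed directly with PyTorch’s interpolation utility, as sketched below; stacking the bands and time steps in the channel dimension is an illustrative assumption.

```python
import torch
import torch.nn.functional as F

# Illustrative chips: (batch, bands * time, 112, 112) rasters at 10 m resolution.
chips_10m = torch.randn(4, 30, 112, 112)

# Bilinear upsampling: each 10 m pixel becomes four 5 m pixels (112 -> 224),
# preserving the spatial extent with no padding or boundary filling.
chips_5m = F.interpolate(chips_10m, size=(224, 224), mode="bilinear",
                         align_corners=False)
```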
4.4.2. Experiment 2: Fine-Tuning on High-Resolution Ground-Truth Data
We utilized the pre-trained FARM model (from the main study) and fine-tuned it on the high-resolution dataset. The objective was to test whether the representations learned from the foundation model and refined on upsampled data serve as a strong initialization for high-resolution ground-truth data and precision agriculture tasks. The fine-tuned model achieved an RMSE of 0.628 and an $R^2$ of 0.768. Figure 7 shows the qualitative and quantitative results of this experiment.
4.4.3. Experiment 3: Training on High-Resolution Ground-Truth Data
We initialized the FARM architecture with the standard Prithvi-EO-2.0-600M weights (without the county-level pre-training stage) and trained it on the high-resolution ground-truth data from scratch. The training pipeline was modified slightly: a heteroscedastic loss replaced the MSE/Huber loss, and brightness and contrast augmentations were added on top of those used in the county-level model. This model achieved an RMSE of 0.557 and an $R^2$ of 0.675. Figure 8 shows the qualitative and quantitative results of this experiment.
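A sketch of a heteroscedastic objective of this kind follows, assuming the network emits a per-pixel mean and a positive variance estimate; PyTorch’s built-in Gaussian negative log-likelihood loss implements it directly.

```python
import torch
import torch.nn as nn

gaussian_nll = nn.GaussianNLLLoss()  # heteroscedastic regression loss

# Assumed two-channel output: per-pixel predicted mean and a raw variance score.
mean = torch.randn(8, 1, 224, 224)
var = torch.nn.functional.softplus(torch.randn(8, 1, 224, 224))  # variance must be positive
target = torch.randn(8, 1, 224, 224)

# Pixels with high predicted variance contribute less to the gradient.
loss = gaussian_nll(mean, target, var)
```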
Table 4 summarizes the performance metrics across the original and three experimental setups. These results confirm that the features learned from the upsampled county-level data are transferable and agronomically meaningful. In Experiment 1, the zero-shot application yielded an $R^2$ of 0.508, indicating that the model captures fundamental yield drivers even without seeing native high-resolution labels during training. However, Experiment 2 achieved the highest explanatory power ($R^2$ = 0.768) by leveraging the pre-trained weights from the main study (trained on upsampled data) and fine-tuning them on the high-resolution dataset. Notably, this approach outperformed Experiment 3 ($R^2$ = 0.675), where the model was initialized with the generic foundation weights and trained from scratch on the high-resolution data. This performance gap underscores the challenge of training deep architectures on limited datasets; the high-resolution dataset, while precise, contained too few samples for the model to converge on generalized features from scratch. The results suggest that the architecture is robust but that its potential is constrained by data volume in Experiment 3; it would likely achieve higher accuracy if trained on a larger corpus of high-resolution imagery. Consequently, using massive volumes of upsampled county-level data to create a foundation serves as a critical initialization step for tasks where granular ground truth is scarce.
4.5. Interpretability Analysis
The interpretability of deep learning models in agriculture is of paramount importance, as it enables stakeholders to understand the underlying factors driving model predictions and fosters trust in AI-assisted decision-making. Beyond achieving high predictive accuracy, interpretable models allow agronomists, producers, and policymakers to gain meaningful insights into how temporal patterns, such as crop growth dynamics and seasonal variability, and other environmental or management-related features contribute to yield outcomes. This interpretability analysis was performed for both the temporal and spectral (channel) aspects to understand which phenological stages and what spectral information the model considers most influential for predicting canola yield.
By analyzing the attention matrices from key transformer layers, we quantified the focus the model places on each of the five monthly time steps (May through September). The results reveal a compelling and highly interpretable pattern. In the earlier layers of the transformer, such as Layer 8 (Figure 9a), the model exhibits a strong, localized attention pattern, where each time step primarily attends to itself and its immediate neighbors. For instance, the July time step shows a dominant self-attention score, indicating the model is learning to consolidate information within this critical period. In deeper layers, such as Layer 16 (Figure 9b), the attention mechanism evolves to capture more complex, long-range temporal dependencies. Here, the model consistently assigns the highest importance to the mid-season months, with July emerging as the most influential time step, receiving significant attention from both earlier and later periods. This is quantitatively evidenced by the higher aggregate attention scores for July and August across multiple attention heads. This pattern aligns with established crop physiology: for canola, the flowering and early pod-filling stages occurring in July are paramount for yield determination, as they directly influence seed set and development. The model’s ability to autonomously identify and prioritize this peak growing season, without any explicit phenological guidance, underscores its capacity to learn biologically salient features directly from the spectral-temporal data. Furthermore, the attention maps show that while the late-season month of September receives less direct attention, it still plays a contextual role, likely helping the model discern maturity and senescence patterns. This temporal interpretability not only builds trust in the model’s predictions but also validates that the fine-tuned foundation model internalizes the growth dynamics of canola, effectively focusing its analytical power on the most agronomically decisive periods of the growing season.
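To make the aggregation concrete, the sketch below shows one way a (time × time) attention matrix can be derived from a single transformer layer, assuming patch tokens are ordered by time step; the token grouping is an illustrative assumption about the model internals.

```python
import torch

def timestep_attention(attn: torch.Tensor, tokens_per_step: int,
                       num_steps: int = 5) -> torch.Tensor:
    """Aggregate a ViT attention map into a (time x time) importance matrix.

    attn: (heads, tokens, tokens) attention weights from one transformer layer,
    with patch tokens assumed to be ordered by time step.
    """
    attn = attn.mean(dim=0)  # average over attention heads
    t, p = num_steps, tokens_per_step
    blocks = attn[: t * p, : t * p]
    # Average attention within each (query step, key step) block.
    blocks = blocks.reshape(t, p, t, p)
    return blocks.mean(dim=(1, 3))  # (t, t): how much each month attends to each month
```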
Additionally, it is important to understand which spectral bands the model uses to make its predictions. We conducted a channel-wise analysis of the magnitude of the learned patch-embedding weights corresponding to each of the six input spectral bands across all time steps. This technique shows the relative importance the model assigns to each band at the very first stage of processing, which provides insight into which spectral features are considered most informative. According to Figure 10, the analysis reveals that the Near-Infrared (NIR) and Short-Wave Infrared (SWIR 1 and SWIR 2) bands received the highest importance scores. This observation is consistent with established remote sensing principles for vegetation monitoring [31,32,33]. NIR, which underpins vegetation indices such as NDVI, is directly related to plant growth status. The SWIR bands are the next most significant; they are sensitive to the moisture content of vegetation and soil, which strongly influences crop yield. The visible Red band ranks next, owing to its role as the other component of NDVI. The Blue and Green bands were found to be the least influential.
5. Discussion
The results of our proposed FARM framework show a noticeable improvement in canola yield prediction, with an MAE of 2.83 bushels per acre. These gains underscore the value of incorporating a foundation model compared to hybrid spatiotemporal models like 3D-CNN and DeepYield. While 3D-CNNs, DeepYield, and similar hybrid architectures are efficient for learning spatiotemporal patterns, their knowledge is fundamentally limited to the canola-specific patterns they extract during training. By contrast, the superior performance of FARM can be directly attributed to its fine-tuning of the Prithvi-EO-2.0-600M foundation model. By leveraging a model pre-trained on a massive and diverse archive of global satellite imagery [25], FARM benefits from a rich, generalized understanding of vegetation dynamics, atmospheric conditions, and land surface phenology. This large and transferable knowledge base allows the model to interpret the nuances of canola growth in greater detail, leading to higher accuracy in yield prediction. Our work thus shifts from training task-specific models toward fine-tuning generalist foundation models for specialized agricultural applications, enabling more scalable solutions.
In addition, this study reformulates yield prediction as a high-resolution intra-field regression task, which marks a critical methodological advance for precision agriculture. While recent state-of-the-art frameworks, such as the GNN-RNN approach of [6] and DeepCropNet [19], have demonstrated high accuracy in capturing spatial-temporal dependencies, they primarily target county-level or regional aggregation. Although these models are valuable for macro-scale food supply planning, their resolution is insufficient for capturing the granular, intra-field heterogeneity driven by local soil and management conditions. Historically, high-resolution yield prediction has often been framed as a low-resolution classification problem, discretizing yields into a finite set of categories. Such a classification approach imposes artificial boundaries on a continuous biological process, resulting in significant loss of granularity and unrealistic sharp transitions in predicted productivity. Our regression-based FARM model, in contrast, generates continuous yield maps that represent the smooth spatial heterogeneity present within and across agricultural fields. These high-resolution maps can directly inform decision-makers and precision-farming operations.
A key consideration in interpreting these results is the generation of pixel-level training labels through the upsampling of county-level data. While this approach allows for large-scale training where granular data are scarce, it raises questions about the model’s ability to capture true intra-field heterogeneity versus simply smoothing regional averages. To validate the model’s capacity for high-resolution prediction, we conducted a supplementary analysis using a separate dataset containing ground-truth yield data at native 10 m resolution. As detailed in Section 4.4, we evaluated the model under three conditions: direct inference, fine-tuning, and training from scratch. The results demonstrate that the FARM model, initially trained on upsampled county-level data, learns generalized spectral-yield relationships rather than simply memorizing smoothed regional averages. While direct inference (zero-shot) on high-resolution data showed moderate performance ($R^2$ = 0.508), fine-tuning the pre-trained model on true high-resolution ground truth significantly improved accuracy ($R^2$ = 0.768). Crucially, this fine-tuned model outperformed a version of the architecture trained from scratch solely on high-resolution data ($R^2$ = 0.675). The lower performance of the model trained from scratch is primarily attributed to the limited number of images in the high-resolution dataset. Thus, our transfer-learning strategy currently offers the most viable solution for bridging the gap between data scarcity and precise intra-field prediction.
Another contribution of this work lies in the interpretability analysis of the FARM model. Temporal attention analysis (Figure 9) shows that the model autonomously learns to assign the highest importance to mid-season months, with July consistently emerging as the most influential time step. This learned behavior aligns with the established crop physiology of canola [34]. The period spanning July and August corresponds to the critical flowering and pod-filling stages, during which the plant’s sensitivity to environmental conditions is at its peak and the primary determinants of final seed yield are established. The model’s ability to identify and prioritize this critical time window from raw spectral-temporal data indicates that it captures fundamental drivers of crop development.
Furthermore, the analysis of which spectral bands the model leverages provides deeper insight into its decision-making process. The model assigns the highest importance to the Near-Infrared (NIR) and Short-Wave Infrared (SWIR 1 and SWIR 2) bands, which aligns with established remote sensing principles for vegetation monitoring [31,32,33]. Healthy vegetation strongly reflects NIR light, making this band important for plant analysis [2,4,5]. The SWIR bands are sensitive to the moisture content of both vegetation and soil, effective markers that significantly influence crop yield. The model’s reliance on these specific bands demonstrates that it has learned to focus on the spectral signatures most indicative of plant health and water status, which are key drivers of canola yield.
While the current findings demonstrate the efficacy of the proposed approach, it is important to acknowledge the limitations of this study, which define specific paths for future research. The FARM model was trained and validated on canola within the context of the Canadian Prairies. To validate broader applicability, future work will focus on extending the model through multi-crop fine-tuning strategies. Specifically, we aim to adapt the architecture for wheat yield prediction, investigating whether the shared geospatial representations in the foundation model can be leveraged to enhance performance across distinct crop types with varying phenological cycles. Although the model shows high performance using multi-temporal imagery alone, there is significant opportunity to enhance its predictive power by integrating meteorological and soil data, thereby providing a more holistic view of agricultural ecosystems.
To translate FARM from a research capability into a deployable precision agriculture tool, the generated pixel-level yield maps must integrate directly with Farm Management Information Systems (FMIS). In a practical workflow, these yield predictions would serve as the foundational layer for Variable Rate Application (VRA) prescriptions. By identifying high-yielding and low-yielding zones early in the season, agronomists could adjust nitrogen inputs or seeding rates dynamically to maximize Return on Investment (ROI) per acre. Additionally, these maps can guide targeted scouting, allowing growers to physically inspect anomaly zones detected by the model rather than relying on random field sampling.
However, several gaps remain before commercial deployment. First, the latency between satellite acquisition (HLS data availability) and inference needs to be minimized to ensure decisions can be made within tight operational windows. Second, future iterations must incorporate uncertainty quantification, providing users with a confidence interval alongside yield estimates to build trust in automated decision-making.
6. Conclusions
In this research, we introduced FARM, a foundation-model-based framework for intra-field crop yield prediction, and demonstrated its effectiveness on canola in the Canadian Prairies. Our work establishes the significant advantages of fine-tuning a large-scale, pre-trained geospatial foundation model, Prithvi-EO-2.0-600M, for specialized agricultural tasks, showing superior performance over models trained from scratch. In FARM, we also formulated yield prediction as a continuous, pixel-level regression task, addressing the limitations of low-resolution discrete classification. This approach generates high-resolution, continuous yield maps that capture the granular, intra-field variability important for precision agriculture applications. Furthermore, our analysis of the model’s temporal attention mechanisms showed that FARM prioritizes the most critical phenological stages for canola yield, in line with the current literature. This confirms its ability to learn agronomically meaningful patterns. Overall, the findings position foundation models as a powerful technology for advancing data-driven crop yield prediction.