A Multi-Farm Global-to-Local Expert-Informed Machine Learning System for Strawberry Yield Forecasting

The importance of forecasting crop yields in agriculture cannot be overstated. The effects of yield forecasting are observed in every aspect of the supply chain, from staffing to supplier demand, food waste, and other business decisions. However, the process is often inaccurate. This paper explores the potential of using expert forecasts to enhance the crop yield predictions of our global-to-local XGBoost machine learning system. Additionally, it investigates the ERA5 climate model's viability as an alternative data source for crop yield forecasting in the absence of on-farm weather data. We find that, by combining both the experts' pre-season forecasts and the ERA5 climate model with the machine learning model, we can, in most cases, obtain better forecasts that outperform the growers' pre-season forecasts and the machine learning-only models. Our expert-informed model attains yield forecasts for 4 weeks ahead with an average RMSE of 0.0855 across all the plots and an RMSE of 0.0872 with the ERA5 climate data included.


Introduction
The agricultural sector is key to the UK economy, utilising 71% of the UK's total land area [1]. Globally, the agricultural sector has faced substantial disruptions due to shifting geopolitics and the impact of COVID-19 on labour availability. In the UK, these challenges have been exacerbated by the complexities of the transitioning policies to restructure the industry post-Brexit [2]. One such policy is the UK transitioning away from the European Union's Common Agricultural Policy (CAP), which compensated growers based upon the amount of land that they farmed [1].
Amidst these evolving challenges, the ability to accurately forecast crop yields, both before and during the growing season, emerges as a critical tool for agricultural resilience and decision-making [3]. These forecasts are among the most valuable pieces of information the grower could be provided with [4]. Increasing the accuracy of these forecasts means we can reduce the business risks the growers take [5]. An example strawberry yield pattern from a real farm can be observed in Figure 1.
It is well-known that there will be several major fruit waves throughout the growing season. However, predicting when these waves begin is the difficult part [6]. This difficulty in forecasting is further compounded in tasks like predicting strawberry yields and prices, which are influenced by a myriad of complex factors. Variables such as the weather, soil conditions, and irrigation play crucial roles in determining the yield. The inherent uncertainty of these factors adds layers of complexity to the forecasting process, making it a challenging task [7]. Growers create their forecasts based upon their previous experience and the seasonal conditions and then use that information as a guide for management decisions [8].

Novelty and Contributions
Although there is extensive research in the field of crop yield forecasting, we aim to address the issue of low-resolution real-world data. Many studies use aerial imaging to improve their predictions [7]. However, for crops grown in polytunnels, this method is not feasible. Instead, we investigate the use of growers' own predictions as an input.
We propose a dynamic method that is able to use this expertise alongside an ML solution, building upon a unique dataset that includes real-world production data and forecasts from multiple farms across the UK. This research is also important as there is still a significant need to develop ML techniques for fresh produce, including strawberries [9]. Our system utilises a global-to-local method where we train a single model on the data from various farms and then use this single model to make individual predictions for all the farms and their respective plots. This paper delves into the intersection of machine learning and crop yield forecasting and investigates the integration of growers' expert knowledge with machine learning techniques. We examine how embedding growers' seasoned wisdom into our model can enhance the precision and reliability of our predictions.
This paper also aims to address a critical challenge in the realm of agricultural data management. A crop's yield largely depends on the weather conditions during the growing season [10]. Growers frequently rely on weather data to inform their decisions regarding agricultural practices, such as planting, irrigation, and pest control. However, we have observed that many growers often utilise weather data on an ad hoc basis without retaining them for future reference. This practice can lead to the under-utilisation of valuable historical weather information that could otherwise provide insights into crop trends and inform more resilient and sustainable agricultural strategies.
Specifically, we investigate the use of the ERA5 climate model as an alternative data source for crop yield forecasting when the weather data were not captured at a farm. The ERA5 climate model, also known as the "ERA5-Land Hourly-ECMWF Climate Reanalysis", is the "Fifth Generation of the European Centre for Medium-Range Weather Forecasts (ECMWF) Reanalysis". The model represents a high-resolution (11,132 m) and comprehensive dataset of various atmospheric variables, providing historical weather data on a global scale. The dataset encompasses a wide array of variables, including temperature, precipitation, wind speed, and more, with data available from 1979 to the present day [11]. By integrating the ERA5 climate model into agricultural practices, we aim to bridge the gap left by the lack of farm-specific weather data, thereby enhancing the accuracy and reliability of crop yield forecasts.
Addressing the issue of growers not retaining their weather data is not a quick fix; the change would involve infrastructural and behavioural changes. Even with growers adopting better data management practices moving forward, the challenge remains for historical data, which are currently unavailable. To tackle this issue comprehensively, we explore the potential of the ERA5 climate model as a valuable resource for providing historical weather data that can be incorporated into our dataset. By investigating the suitability of the ERA5 model, we aim to provide a method for utilising historical yield data where temperature data were not recorded.
Moreover, our study leverages the real-world data provided by Angus Soft Fruits (Arbroath, UK), a leading supplier of berries to UK and European retailers, enhancing the practical relevance of our findings. The valuable data provided by them have been instrumental in training our model (detailed in Section 3) and in evaluating the performance (in Section 4). The literature suggests that machine learning approaches are highly effective for yield forecasting, and our experiments confirm this. Notably, we incorporate XGBoost [12] into our end-to-end framework and benchmark against the growers' own forecasts.
In summary, our work makes the following contributions:
• Our proposed approach uses growers' pre-season forecasts alongside a machine learning model and the ERA5 climate model to develop a strawberry yield forecasting system;
• Inspired by real-data intricacies from multiple farms across the UK, we present a comprehensive end-to-end framework and a forecasting model that can be used to support the growers' decision-making process;
• With our global-to-local model, we demonstrate how data from multiple farms can be used to inform the decision-making at a local level, therefore supporting a global-to-local approach rather than the most commonly used local-to-local one.
The paper is structured as follows. Section 2 outlines other related work in the field of crop yield forecasting, highlighting the various methodologies and approaches previously utilised. Section 3 describes the materials and methods used in this study, including data collection, data wrangling, and the machine learning models employed. Section 4 presents the results of the experiments conducted, detailing the performance of the proposed expert-informed machine learning system compared to other models. Finally, Section 5 discusses the findings, their implications, and the potential for future research in this area.

Related Work
Agriculture is a vital part of the global economy. The pressure on agricultural systems across the world is only going to continue to increase with the growing human population [13]. This increasing demand necessitates innovation in the agri-food sector to enhance productivity and sustainability.
The agri-food sector has experienced notable advancements through the integration of machine learning and data-driven approaches, leading to a vast array of applications in this broad field. These technologies are paving the way for agriculture to evolve into a data-driven, intelligent, agile, and autonomous connected system of systems [14][15][16][17][18][19]. The sector has already observed the benefits of machine learning in a variety of different areas, including pest prediction and prevention [20,21].
Machine learning within agri-food has also achieved success with yield forecasting [22]. However, there are fewer examples of this when we restrict the success to only strawberries, although even then we can find successful applications [7,23,24].
Recent work on forecasting strawberry yields has shown promising results through the application of deep learning models. Notably, some studies have enhanced their predictive accuracy by incorporating satellite imagery and detailed soil parameter data [7].
However, these methods often rely on the availability of extensive environmental data and clear imagery for analysis. In our specific context, our data setup lacks the necessary sensors to collect detailed soil data. Moreover, even if these sensors were installed, we would not have the required historical data to make use of them at this stage. Furthermore, our crops are housed within polytunnels, which poses a unique challenge as they obstruct the view of satellite cameras, rendering satellite imagery ineffective for monitoring the crops within. This challenge necessitates alternative approaches to improve the yield forecasting.
In the context of strawberry yield forecasting, transformers have been successfully applied to predict the yield under varying conditions and settings [23,24]. However, the research utilised comprehensive datasets, including detailed irrigation information from tabletop systems, extensive environmental data from weather stations, and frequent yield quality reports from strawberry picking teams. These rich, precise, and high-resolution data sources contrast with the much less detailed real-world data we have access to, and even so, issues with data availability remained in that work. We conducted preliminary work and found that, with our low-resolution data, transformers were not suitable.
Other research has demonstrated the effectiveness of using unmanned aerial vehicles (UAVs) with mounted cameras for predicting strawberry yields and dry biomass, such as in the work presented by Zheng et al. [25]. However, this approach may not be feasible for the farms under our consideration in Scotland, where strawberries are cultivated in polytunnels to adapt to the colder climate. This is in contrast to the open-field cultivation practices common in Florida's warmer environment.
In summary, while significant advancements have been achieved in agricultural forecasting, challenges remain, particularly in contexts with limited data and unique environmental constraints. Our research aims to address these challenges by combining growers' forecasts with ERA5 climate reanalysis data to enhance the yield predictions, thereby providing a more robust and practical solution for real-world applications.

Angus Soft Fruits Data
We collaborated with Angus Soft Fruits (Arbroath, UK), a company that generously supplied us with both current and historical data on their soft fruit crop yields from farms all across Scotland and England. Additionally, they provided pre-season and weekly forecasts.

Pre-Season Forecast
For every year from 2020, we were provided with a document called the "pre-season forecast". This document aggregates the individual forecasts submitted by each farm to the company before the start of the growing season. It encompasses the growers' expectations for each plot within their farms, detailing the anticipated yield and the expected timing of these yields. Additionally, it includes other crucial information, such as the planting dates of the crops and the acreage of each plot.

Weekly Forecasts
Similarly to the pre-season forecast, each farm also provides weekly updates to the company. These are individual documents for each farm, for each week, containing the grower's updated forecasts. Often, these forecasts are less accurate than the forecasts at the beginning of the year. Management noted that this is because growers will often exaggerate the positive or negative effects on their crops from events such as drastic changes in weather or management decisions. Each week, when these are sent to head office, they are reviewed, and decisions regarding operations and logistics are updated with this new information.
However, these weekly updated forecasts are not compiled or stored, so obtaining historical data from previous years was not possible. Because we only had weekly forecasts from 2023, we were unable to incorporate them as a feature in our model. Instead, we used them solely for evaluating the effectiveness of our predictions over time compared to the growers' forecasts. However, throughout this project, Angus Soft Fruits has adopted a more data-centric approach, ensuring that, moving forward, these data will be retained. This shift means that, in the future, we will be able to leverage these data, full of latent variables, as an input for our model.

Data Wrangling
Although we had data ranging back to 2011, we elected to use data from 2020 onwards as we had the corresponding pre-season forecasts for these years. This provides us with three years (72% of the data) for training (2020, 2021, and 2022) and one year (28% of the data) for testing (2023). The dataset we compiled focused only on strawberries and utilised the date, received yield, farm name, plot name, plot acres, strawberry variety, tunnel type, plant age, and the growers' prediction for the week. The model was trained and tested on the plots of one specific strawberry variety across 6 farms. The plots all vary in size and shape, containing different numbers of polytunnels.
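The year-based split described above can be sketched as follows. This is a minimal illustration that assumes a pandas DataFrame with hypothetical `date`, `plot`, and `yield_kg` columns; the column names and values are placeholders, not the actual schema or data.

```python
import pandas as pd

# Hypothetical records; names and values are illustrative only.
records = pd.DataFrame({
    "date": pd.to_datetime([
        "2019-06-03", "2020-06-01", "2021-06-07",
        "2022-06-06", "2023-06-05", "2023-06-12",
    ]),
    "plot": ["P1"] * 6,
    "yield_kg": [110.0, 120.0, 135.0, 128.0, 140.0, 150.0],
})

FIRST_YEAR = 2020  # first season with a pre-season forecast
TEST_YEAR = 2023   # held-out season

year = records["date"].dt.year
usable = records[year >= FIRST_YEAR]

# Split by season rather than at random, so no future data leaks into training.
train = usable[usable["date"].dt.year < TEST_YEAR]
test = usable[usable["date"].dt.year == TEST_YEAR]
```

Splitting on the season boundary keeps the evaluation faithful to how the system would be used: the model only ever sees seasons that precede the one it forecasts.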
When using XGBoost for time-series forecasting, the size of the time-steps must be kept consistent. The strawberry harvests typically occur twice a week, although the frequency can vary. On days without harvests, there were no data points, leading to irregular time-steps in our dataset. To address this, we modified the dataset by recording a yield of 0 kg for days without harvests. This allowed us to maintain consistent daily records. We then aggregated these data weekly, providing a weekly yield for each plot at each farm throughout each year.
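The zero-filling and weekly aggregation steps can be sketched with pandas resampling. The example below is an assumption-laden illustration for a single plot: the `date` and `yield_kg` names are hypothetical, and the choice of weeks ending on Sunday is ours, as the week boundary is not specified in the text.

```python
import pandas as pd

# Hypothetical harvest log for one plot: rows exist only on harvest days.
harvests = pd.DataFrame({
    "date": pd.to_datetime(["2023-06-05", "2023-06-08", "2023-06-13", "2023-06-20"]),
    "yield_kg": [40.0, 55.0, 60.0, 30.0],
})

# Step 1: regular daily series; days with no harvest sum to 0 kg.
daily = harvests.set_index("date")["yield_kg"].resample("D").sum()

# Step 2: aggregate to weekly totals (weeks ending Sunday is an assumption).
weekly = daily.resample("W-SUN").sum()
```

For multiple farms and plots, the same two steps would be applied per group, for example with `groupby(["farm", "plot"])` before resampling.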
Another challenge we encountered was dealing with missing or incorrect data entries. It was common to find dates and values that were inaccurately recorded due to obvious typos in dates, crop types, and other variables. Correcting these errors was a necessary step as some level of data inconsistency is often inevitable in real-world scenarios. From the pre-season forecast, we had a "planted date" for most crops. For any crops where this field was left blank, we would check the previous years to see if this crop was present so that we could determine the first year the crop was harvested and thus determine a "planted date" we could use. If there were no previous harvests for the crop, this would imply this is a new crop and we would then use the current year as the planting year. We could then use this to create an age for each plant. This was important as the age of a crop affects its yield.
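The fallback logic for missing planting dates can be sketched as below. This is only an illustration: the column names are hypothetical, and treating the first harvest year as the planting year is an assumption about how the "planted date" was derived from earlier harvests.

```python
import pandas as pd

# Hypothetical pre-season records; planted_year is blank for plots B and C.
plots = pd.DataFrame({
    "plot": ["A", "B", "C"],
    "season": [2023, 2023, 2023],
    "planted_year": [2021.0, None, None],
})

# First year each plot appears in the historical harvest data (hypothetical).
# Plot C has no earlier harvests, so it is treated as a new crop.
first_harvest_year = pd.Series({"A": 2021, "B": 2022})

# Fallback order: recorded value, then first harvest year, then current season.
fallback = plots["plot"].map(first_harvest_year).fillna(plots["season"])
plots["planted_year"] = plots["planted_year"].fillna(fallback).astype(int)
plots["plant_age"] = plots["season"] - plots["planted_year"]
```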
As we were using XGBoost, we had to window the data differently from how we would for a deep learning network such as a Long Short-Term Memory Network (LSTM) or time-series transformers, which utilise a window size. We had to manually craft the input features to incorporate historical data. For example, when predicting yields 4 weeks ahead, we included the yields (Y) from 5, 6, 7, and 8 weeks prior from each specific plot as features in the model's input (X). This was applied not only to the yield data and the growers' forecasts but also to the historical temperature values that we pulled from the ERA5 system.
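The manual windowing can be sketched as per-plot lag columns. The example below assumes a tidy weekly table with hypothetical `plot`, `week`, and `yield_kg` columns, and uses the lags of 5 to 8 weeks named in the text; the helper `add_lag_features` is illustrative, not the authors' code.

```python
import pandas as pd

# Hypothetical weekly yields for two plots (10 weeks each).
weekly = pd.DataFrame({
    "plot": ["P1"] * 10 + ["P2"] * 10,
    "week": list(range(1, 11)) * 2,
    "yield_kg": [float(v) for v in range(10)] + [float(v) for v in range(100, 110)],
})

LAGS = (5, 6, 7, 8)  # lags in weeks, as described for the 4-weeks-ahead task


def add_lag_features(df, column, lags=LAGS):
    """Add one column per lag, shifting within each plot so values never cross plots."""
    out = df.sort_values(["plot", "week"]).copy()
    for lag in lags:
        out[f"{column}_lag{lag}"] = out.groupby("plot")[column].shift(lag)
    return out


features = add_lag_features(weekly, "yield_kg")
```

The same helper could be applied to the growers' forecast and temperature columns. Rows at the start of each plot's series have no history, so their lag columns are left as missing values.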
Our dataset contained categorical data, including variables like tunnel type, farm name, and plot name. To make these data compatible with our model, it was necessary to convert these categories into numerical form. We accomplished this using the LabelEncoder from the sklearn library. This method assigns a unique numerical value, starting from 0, to each category.
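A minimal sketch of this encoding step is given below, assuming hypothetical `farm_name` and `tunnel_type` columns with placeholder values. One encoder is kept per column so that the integer codes can be mapped back to the original labels later.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical categorical columns; the values are placeholders.
df = pd.DataFrame({
    "farm_name": ["Farm 2", "Farm 1", "Farm 2", "Farm 3"],
    "tunnel_type": ["Seaton", "Seaton", "Seaton", "Seaton"],
})

encoders = {}
for col in ["farm_name", "tunnel_type"]:
    enc = LabelEncoder()
    # Classes are sorted alphabetically and numbered from 0.
    df[col + "_id"] = enc.fit_transform(df[col])
    encoders[col] = enc
```

Tree-based models split on thresholds, so the arbitrary ordering that label encoding introduces is generally less of a concern than it would be for a linear model.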
After finalising the dataset, we normalised all the numerical values to a range between 0 and 1. Although this normalisation step did not significantly affect the model's performance, it was crucial for maintaining the data privacy of the farms involved. To accomplish this, we used a simple Min-Max scaler from sklearn, shown below:
$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$
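The scaling step can be sketched with sklearn's `MinMaxScaler`. The values below are invented, and fitting the scaler on the training seasons only and then reusing it on the test season is our assumption; the text does not state which data the scaler was fitted on.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical yields in kg; one feature column.
train_yield = np.array([[100.0], [250.0], [400.0]])
test_yield = np.array([[175.0], [460.0]])

scaler = MinMaxScaler()  # default feature_range is (0, 1)
train_scaled = scaler.fit_transform(train_yield)  # learns min=100, max=400
test_scaled = scaler.transform(test_yield)        # reuses the training min/max
```

Under this assumption, a test value above the training maximum maps to a value greater than 1, which is expected behaviour rather than an error.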

The Base Dataset
Our dataset is composed of 13 different plots from across 6 different farms that are distributed all across the UK. The yield distribution of crops across a season is highly variable between different varieties of crop, season of planting, and tunnel types. To narrow down the scope of this forecasting problem, we targeted 1 variety of crop, and specifically those of this particular variety that were grown in the Seaton Tunnel system. We still opted to keep "Tunnel Type" and "Variety" as features to ensure consistency when experimenting with a wider variety of fruit/tunnel types. We also removed any autumn crops from the dataset as these crops are less numerous and behave very differently.
The target variable (Y) across all datasets was the historical yield. The 13 features (X) of the base model included farm ID, plot ID, the age of the plant, tunnel type, variety, and the acreage of the plot. Additionally, the model incorporated various time-series features; these features were day of the week, quarter, month, year, day of the year, day of the month, and week of the year. The feature importance of the base dataset generated by XGBoost can be viewed in Figure 2, and the feature importance of the complete dataset can be observed in Figure 3.
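The seven calendar features listed above can be derived from a date column as in the following sketch. The helper name `add_calendar_features` and the column names are hypothetical, and using the ISO week number for "week of the year" is an assumption.

```python
import pandas as pd


def add_calendar_features(df, date_col="date"):
    """Derive the seven calendar features from a datetime column."""
    d = df[date_col].dt
    out = df.copy()
    out["day_of_week"] = d.dayofweek      # Monday=0 ... Sunday=6
    out["quarter"] = d.quarter
    out["month"] = d.month
    out["year"] = d.year
    out["day_of_year"] = d.dayofyear
    out["day_of_month"] = d.day
    out["week_of_year"] = d.isocalendar().week.astype(int)  # ISO week (assumption)
    return out


frame = add_calendar_features(pd.DataFrame({"date": pd.to_datetime(["2023-06-11"])}))
```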

Datasets for Comparison
With our cleaned data, we created four different datasets for our comparison:
1. The base dataset (base model);
2. The dataset with the growers' forecasts added (expert-informed model);
3. The dataset with the ERA5 weather data added (climate model);
4. The dataset with both the growers' forecasts and the ERA5 weather data added (expert-informed model plus climate).

ERA5 Temperature Data
Due to the lack of available weather data, we opted for the "ERA5-Land Hourly-ECMWF Climate Reanalysis" dataset as an alternative to provide us with temperature data. To effectively utilise the ERA5 data, we developed a Python (version 3.12) script that retrieved the information from Google Earth Engine, converted the data to Celsius, and then down-sampled the hourly data to a weekly mean temperature in order to match the resolution of the yield data and the growers' forecasts. Finally, the tool would match up the data to the location of the farm. This would become another feature in our model.
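The conversion and down-sampling steps can be sketched independently of the retrieval step. The example below assumes the hourly temperatures have already been downloaded into a pandas Series in Kelvin (the unit ERA5 temperatures are distributed in); the synthetic values, the helper name `weekly_mean_celsius`, and the Sunday week ending are illustrative assumptions, and the Earth Engine query itself is not shown.

```python
import pandas as pd

# Synthetic hourly temperatures in Kelvin for two weeks (placeholder values).
hours = pd.date_range("2023-06-05", periods=14 * 24, freq=pd.Timedelta(hours=1))
temp_k = pd.Series(285.15, index=hours)
temp_k.iloc[7 * 24:] = 290.15  # second week is 5 K warmer


def weekly_mean_celsius(hourly_kelvin):
    """Convert Kelvin to Celsius, then average to one value per week."""
    celsius = hourly_kelvin - 273.15
    return celsius.resample("W-SUN").mean()


weekly_c = weekly_mean_celsius(temp_k)
```

The resulting weekly series can then be joined to the yield table on farm and week, after which it is lagged in the same way as the yield columns.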
As we have emphasised the significance of weather data in our model, Angus Soft Fruits has been highly responsive by initiating weather sensor trials across a variety of polytunnels to gather data. However, a persistent challenge remains in historical data. To effectively train our model, we must have consistent historical weather data, spanning the past four years (the time of our dataset). We explored other options such as the Met Office in our quest for a reliable data source encompassing consistent historical and current data. Unfortunately, historical data availability from this source was severely limited in terms of locations. We also examined alternative systems like MODIS; however, we encountered considerable inconsistencies when comparing the data trends of these satellite/sensor data to on-site weather stations in Scotland.
As we are using this system for our temperature data, it is crucial to consider its limitations. The ERA5 meteorological variables are generally accurate for large plains or urban areas, while the applicability of the data in different environments remains inconclusive. For example, in mountainous areas, temperature, relative humidity, and horizontal wind speeds at the middle and lower levels deviate significantly from observations, especially during extreme weather events such as rainstorms and typhoons [26].

Comparison to a Farm Weather Station
The ERA5 model assimilates data from a wide range of sources, including potentially public weather stations. To test the viability of the data we extracted from the ERA5 model, we would need to compare them to those from an on-the-ground weather station. This led us to compare the ERA5 data with measurements from a private weather station located at an Angus Soft Fruits farm (examples shown in Figures 4 and 5).
The Pearson Correlation Coefficient between the ERA5 dataset and the farm's weather data was 0.938. This strong correlation underscores the reliability of the ERA5 dataset in reflecting actual weather conditions, even when compared to independent sources that are most certainly not being fed into the model. These results further reinforce the applicability of the ERA5 dataset for use in agriculture as accurate local data are essential.
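For reference, the Pearson correlation between two aligned temperature series can be computed as in the sketch below. The two arrays are invented placeholder values, not the ERA5 or farm station measurements, so the resulting coefficient is not the 0.938 reported above.

```python
import numpy as np


def pearson_r(a, b):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a_c = a - a.mean()
    b_c = b - b.mean()
    return float((a_c * b_c).sum() / np.sqrt((a_c ** 2).sum() * (b_c ** 2).sum()))


# Placeholder weekly mean temperatures in Celsius (not real measurements).
era5_temp = np.array([9.8, 11.2, 13.5, 15.1, 16.0, 14.2])
station_temp = np.array([10.5, 11.0, 14.2, 16.3, 16.1, 13.9])

r = pearson_r(era5_temp, station_temp)
```

Both series must cover the same timestamps at the same resolution before the comparison, which is why the hourly data are aggregated first.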

Models
In our research, we conducted a thorough comparison of various iterations of our XGBoost model with the growers' pre-season and mid-season forecasts. This comparison was motivated by the consistently strong performance of the XGBoost model in preliminary studies. Our focus was to evaluate how these iterations of the XGBoost model performed in contrast to the current gold standard growers' forecasts. This analysis aimed to explore the potential of combining XGBoost and the growers' manual predictions in enhancing predictive accuracy in the agricultural sector.

Random Forest
The Random Forest (RF) algorithm is an ensemble learning method used for classification and regression tasks. It operates by constructing a multitude of decision trees during training. RF is one of the most popular machine learning techniques widely used for regression due to its accuracy, versatility, and precision in its predictions [27]. This makes it a good baseline comparison for our XGBoost model. To determine the best structure for the Random Forest model, we used "GridSearch", which provided us with a structure of 'n_estimators' 1000 and 'max_depth' 15.
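A grid search of this kind can be sketched with sklearn's `GridSearchCV`. Everything below other than the two parameter names is an assumption: the data are synthetic, the candidate values are deliberately small so the example runs quickly, and the scoring metric and 3-fold cross-validation are illustrative choices rather than the settings used in the study.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data standing in for the real feature table.
rng = np.random.default_rng(0)
X = rng.random((60, 4))
y = 2.0 * X[:, 0] + rng.normal(0.0, 0.05, 60)

# Small illustrative grid; the study reports 1000 estimators and depth 15.
param_grid = {"n_estimators": [10, 50], "max_depth": [3, 15]}

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    scoring="neg_root_mean_squared_error",  # higher is better, so RMSE is negated
    cv=3,
)
search.fit(X, y)
best_params = search.best_params_
```

For time-ordered data, a splitter such as sklearn's `TimeSeriesSplit` may be a more faithful choice for `cv` than the default folds, since it avoids validating on data that precede the training fold.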

XGBoost
The XGBoost algorithm is a highly scalable end-to-end tree-boosting system used in machine learning for classification and regression tasks [12]. The algorithm is renowned not only for its precision and flexibility but also for its automatic handling of missing values [28].
XGBoost stands out as one of the most widely adopted implementations of gradient-boosting decision trees (GBDT) due to both its robustness and effectiveness. Gradient trees are formed one by one, each addressing the errors of its predecessor. It employs gradient boosting to aggregate predictions from all trees, assigning greater weight to the more accurate ones, and ultimately combines these predictions for a final decision [29]. This is shown in Figure 6. However, the aforementioned steps to clean up the data are essential as the XGBoost algorithm is sensitive to data quality. Noise or outliers in the dataset may affect the effectiveness of the model. XGBoost has often exhibited its capacity to surpass other models, including regular gradient-boosted decision trees, autoregressive integrated moving average (ARIMA), Prophet, and Long Short-Term Memory Networks (LSTMs) [30,31]. It can be difficult to tune the many parameters of the model; to determine the best structure for the XGBoost model, we used "GridSearch", which provided us with a structure of 'n_estimators' 1000, 'max_depth' 15, and 'learning_rate' 0.01.
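To illustrate the idea of trees being formed one by one, each addressing the errors of its predecessor, the sketch below implements plain gradient boosting for squared error using sklearn decision trees. It is a simplified stand-in and not XGBoost itself: it omits XGBoost's regularisation, second-order gradients, and missing-value handling, and the function names and synthetic data are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def fit_boosted_trees(X, y, n_estimators=50, learning_rate=0.1, max_depth=2):
    """Fit trees sequentially, each one to the residuals left by the ensemble so far."""
    y = np.asarray(y, dtype=float)
    base = float(y.mean())             # initial prediction: the mean target
    pred = np.full(len(y), base)
    trees = []
    for _ in range(n_estimators):
        residual = y - pred            # errors of the current ensemble
        tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
        tree.fit(X, residual)
        pred = pred + learning_rate * tree.predict(X)  # small corrective step
        trees.append(tree)
    return {"base": base, "trees": trees, "learning_rate": learning_rate}


def predict_boosted(model, X):
    """Sum the base prediction and the scaled contribution of every tree."""
    out = np.full(X.shape[0], model["base"])
    for tree in model["trees"]:
        out = out + model["learning_rate"] * tree.predict(X)
    return out


# Synthetic data for illustration only.
rng = np.random.default_rng(0)
X = rng.random((80, 3))
y = np.sin(4.0 * X[:, 0]) + 0.5 * X[:, 1]

model = fit_boosted_trees(X, y)
baseline_rmse = float(np.sqrt(np.mean((y - y.mean()) ** 2)))
boosted_rmse = float(np.sqrt(np.mean((y - predict_boosted(model, X)) ** 2)))
```

A small learning rate, such as the 0.01 reported above, makes each tree's correction smaller, which is why it is usually paired with a large number of estimators.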

Forecasting Framework
Figure 7 illustrates the workflow of the application. Our system is a multi-farm global-to-local model; it is one model trained on the data from many farms across the UK, which then makes predictions for each individual farm plot. Initially, the Dataframe Builder script is launched, which imports various CSV files containing historical yields from the growers' database, pre-season documents from previous years as well as the current year, and any mid-season forecasts available for the current year. Additionally, the model incorporates data from the ERA5 climate model. These data are retrieved by the weather data tool script that we created to pull the data from Google Earth Engine. The Dataframe Builder will then process and window these data so that they can then be fed into the XGBoost model to generate predictions.

Model Variations
To compare the different methods, we created four model variations for every farm plot (forecasts shown in Figure 8). We had the base model utilising plot acres, plant age, and historical yields, the expert-informed model, which added the growers' forecasts, the climate model, which added the temperature data from the ERA5 system, and the expert-informed + climate system, which added the data from both the growers' forecasts as well as the temperature data from the ERA5 system. A comparison of these models can be observed in Table 1. The results for Farm 1 Plot 2 for the 2023 predictions can be observed in Figure 8. The graphs for all the other farms and plots can be found attached in Appendix A (Figures A1-A12).

Random Forest Baseline
For each of the four model variations, we also created a baseline where we used Random Forest instead of XGBoost. This method did not perform as well as XGBoost. The results can be found in Appendix A in Table A1. However, comparisons between the various models including the RF methods have been included in Table 2 and comparisons with the growers' forecasts in Table 3.

Base Model vs. Expert-Informed Model
The expert-informed model is generally more accurate than the base model, with lower average Root Mean Square Error (RMSE; 0.0855 vs. 0.0939) and Mean Absolute Error (MAE; 0.0334 vs. 0.0395) values across multiple farm plots. However, this comes with slightly more variability in performance, as indicated by the higher standard deviations. A one-way ANOVA test with an alpha of 0.05 confirmed that the improvement from the base model to the expert-informed model was significant (p < 0.001). The full set of results can be found in Table 1.
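The two error metrics used throughout the comparisons can be computed as in the sketch below. The example values are invented and are on the same normalised 0 to 1 scale as the reported results; they are not taken from the study.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root Mean Square Error: penalises large errors more heavily."""
    err = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(err ** 2)))


def mae(y_true, y_pred):
    """Mean Absolute Error: the average size of the error."""
    err = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(err)))


# Placeholder normalised weekly yields and forecasts for one plot.
actual = [0.10, 0.40, 0.35, 0.05]
forecast = [0.12, 0.30, 0.35, 0.10]

plot_rmse = rmse(actual, forecast)
plot_mae = mae(actual, forecast)
```

Because RMSE squares the errors before averaging, it is never smaller than MAE on the same data, and a large gap between the two points to a few weeks with large misses, such as a mistimed fruit wave.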

Climate Model Data
The pattern that emerges from the analysis of the models indicates that the expert-informed model generally provides superior performance compared to the base model. As can be observed in Table 2, the integration of climate data into the base model also leads to improvements in the average RMSE (0.0939 vs. 0.0894) and MAE (0.0395 vs. 0.0365). The expert-informed + climate model frequently achieves the best results, underscoring the value of combining expert knowledge with climate data. The climate model alone also shows strong performance in specific instances, suggesting its utility in certain conditions. Overall, the data suggest a nuanced approach to model selection, where the choice of the model may depend on the specific characteristics and requirements of each farm plot. A one-way ANOVA test with an alpha of 0.05 confirmed that the improvement from the base model to the model utilising the climate data from the ERA5 model was significant (p < 0.001).
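A one-way ANOVA between two models' errors can be run with SciPy's `f_oneway`, as in the sketch below. The text does not state which samples were passed to the test, so comparing per-plot error values is an assumption, and the numbers are invented for illustration.

```python
import numpy as np
from scipy.stats import f_oneway

ALPHA = 0.05

# Hypothetical per-plot errors for two models (not the study's values).
base_errors = np.array([0.10, 0.11, 0.09, 0.12, 0.10, 0.11])
climate_errors = np.array([0.05, 0.06, 0.04, 0.05, 0.06, 0.05])

# Null hypothesis: both groups share the same mean error.
f_stat, p_value = f_oneway(base_errors, climate_errors)
is_significant = p_value < ALPHA
```

With only two groups, a one-way ANOVA is equivalent to an independent two-sample t-test; where the same plots are evaluated under both models, a paired test may be worth considering as well.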

Expert-Informed + Climate Model Data
Combining the approaches presents us with a new model. As can be observed in Table 2, it is superior on average to both the base model (RMSE 0.0872 vs. 0.0939; MAE 0.0342 vs. 0.0395) and the climate model (RMSE 0.0872 vs. 0.0894; MAE 0.0342 vs. 0.0365), but it is still beaten by the expert-informed model (RMSE 0.0872 vs. 0.0855; MAE 0.0342 vs. 0.0334). Although these results are very similar, the advantage of the expert-informed model alone is consistent enough to be statistically significant: a one-way ANOVA with an alpha of 0.05 gave a p-value of 0.001.

Comparisons with Grower Forecasts
In evaluating the effectiveness of our model, we decided to compare it against the performance of the growers' own forecasts (our baseline comparison). For our comparison, we used our expert-informed model, which was our best-performing model (see Tables 3 and 4). We compared against both their pre-season forecast and one of their mid-year forecasts from May. Upon calculating the average values for each method, it was found that our model (expert-informed) demonstrated the highest accuracy with the lowest RMSE and MAE values. Specifically, this method showed an average RMSE of 0.0855 and an average MAE of 0.0334, outperforming both the growers' pre-season (RMSE 0.1008; MAE 0.0412) and mid-season forecasts (RMSE 0.1310; MAE 0.0519).
Additionally, while the base model was the weakest among our models, it still performed better than the growers' forecasts. The base model achieved an average RMSE of 0.0939 and an average MAE of 0.0395, which were both lower than the growers' pre-season forecast (RMSE 0.1008; MAE 0.0412). This indicates that even the base model, without the expert-informed enhancements or climate data, still provides more accurate and precise predictions compared to the growers' own forecasts.
In summary, while the base model performs the worst among our models, it still outperforms the growers' forecasts. The addition of climate and expert data in various combinations is able to further improve on this, with the expert-informed model performing the best, with an average RMSE of 0.0855 and an average MAE of 0.0334.

Discussion
Our research underscores the value of integrating growers' forecasts into machine learning-based crop forecasting models.This approach effectively bridges the traditional agricultural knowledge with advanced computational techniques, yielding better yield forecasts.This is particularly important when dealing with the inherent complexities of real-world agricultural data; this hybrid method (expert-informed model) demonstrates its strength.
We evaluated the performance of the expert-informed model, which incorporates the growers' pre-season forecast, in comparison to the model without any of this information.Our analysis revealed that the expert-informed model demonstrates superior accuracy in yield prediction.This is evidenced by its consistently lower Root Mean Square Error (RMSE) and Mean Absolute Error (MAE) values when compared across multiple farm plots.Specifically, the expert-informed model achieved an average RMSE of 0.0855 in contrast to the base model's 0.0939, and an average MAE of 0.0334 compared to the base model's 0.0395.These results indicate a notable improvement in the models' accuracy.The inclusion of the growers' insights, derived from the years of experience and deep understanding of their lands, complements the data-driven aspects of our models, which is important when handling the messy, small real-world data where other more advanced techniques such as transformers proved not to be effective for time-series forecasting.This hybrid strategy, therefore, represents a promising direction for future research in agricultural forecasting, utilising this practical, ground-level perspective provided by the growers.
Although we have access to the data of entire farms across the company, data availability remains a limiting factor. Real-world data often present their own challenges, and we struggled to build a dataset with a strong independent variable. Going forward, the company we worked with, having seen the possibilities of ML and recognised the limitations of its current data collection, has already made an effort to gather more detailed data.
These improvements to the company's data collection mean that, in addition to pre-season forecasts, we will be able to build up a dataset of weekly updated forecasts. Using these weekly forecasts as an input would make our model more robust not only to dramatic or unforeseen weather conditions but also to on-farm events that may affect the yield.
Although these changes should have a substantial positive effect in the future and open the door to other models and methods, in the meantime we must make the most of the historical training data currently available; this is where data from other expert sources and the ERA5 model become important.
A significant aspect of our research focused on evaluating the potential of using historical ERA5 data as a feature in our predictive models, particularly as an alternative for instances where the growers may not have recorded their temperature data.This investigation stemmed from the need to provide a robust solution for those growers who might lack local temperature recordings, a challenge we encountered in agricultural data collection.
Our findings indicate that incorporating temperature values from ERA5 does increase the accuracy of the predictive models, even though the crops are cultivated within polytunnels. This is particularly noteworthy in agriculture, where precise temperature data are often crucial for accurate yield forecasts. The improved model performance with ERA5 data demonstrates that, even in the absence of locally recorded data, growers worldwide can still leverage machine learning techniques that use weather data to make informed forecasts and decisions.
Our research, however, does come with limitations. All the crops in our dataset were the same variety of strawberry, grown in the same type of polytunnel. Extending the system to handle an array of strawberry varieties, tunnel types, and out-of-season crops, i.e., autumn crops, would require substantial work. In contrast, other studies do not mention their models being so limited, and it can be assumed that their crops may be more varied [32]. Our model also lacks economic variables (e.g., market prices and input costs) and does not account for environmental factors beyond temperature (e.g., soil health and water availability), which are crucial for decision-making and comprehensive agricultural forecasting.
The model relies heavily on growers' initial forecasts and expertise, which might not be available or relevant for newer crops or varieties that the growers have not previously grown. The model may also prove less effective for more variable crops or climates, where the growers' initial forecasts will be less relevant.
Looking forward, our research will evolve to adapt to the microclimatic conditions of the polytunnels that growers in Scotland utilise. Specifically, we plan to use climate data gathered from within the polytunnels and analyse how these trends correlate with the ERA5 temperature data. The goal is to develop a model capable of using ERA5 data to infer the corresponding internal polytunnel temperatures.
By achieving this, we anticipate a further improvement in our models' predictions. This advancement could be a game-changer for growers, enabling them to use predictive modelling effectively even where they lack extensive historical weather data. This research not only broadens the applicability of our model but also aligns with the broader objective of making machine learning a universally accessible tool in agriculture.
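One simple starting point for such an inference model would be a linear calibration fitted to paired ERA5 and in-tunnel readings. The sketch below is only an illustration under that assumption: the linear form, the function names, and the sample temperatures are hypothetical, and the eventual model may well need to be non-linear or to include further inputs.

```python
def fit_linear(x, y):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need at least two paired observations")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict_tunnel_temp(era5_temp, slope, intercept):
    """Estimate the in-tunnel temperature from an ERA5 temperature."""
    return slope * era5_temp + intercept


# Hypothetical paired readings in degrees Celsius (illustrative only).
era5_temps = [10.0, 12.0, 14.0, 16.0]
tunnel_temps = [13.0, 16.0, 19.0, 22.0]

slope, intercept = fit_linear(era5_temps, tunnel_temps)
estimate = predict_tunnel_temp(11.0, slope, intercept)
```

Once fitted on a period where both sources exist, such a calibration could be applied to farms or seasons for which only ERA5 data are available.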

Conclusions
In this paper, we proposed an expert-informed global-to-local model designed for strawberry yield forecasting. The model incorporates real-world expert-generated data and, in most cases, achieves more precise predictions than both the experts' forecasts and a machine learning model based solely on the historical yield records, temperature readings from the ERA5 climate model, and various categorical variables. Our expert-informed model provides a 16.4% decrease in RMSE and a 20.9% decrease in MAE relative to the growers' forecasts, and a 9.4% decrease in RMSE and a 16.7% decrease in MAE relative to the standard machine learning model based on the historical yield records and various categorical variables. In addition, we observed that, where expert data are unavailable, integrating temperature data from the ERA5 climate model significantly enhances the accuracy of the forecasts. The standard machine learning model with the ERA5 climate data added provides a 4.9% decrease in RMSE and a 7.9% decrease in MAE over the standard model without these data, suggesting that such data should be considered in future forecasting systems. Finally, we believe our system can form the basis for future developments in this area that will leverage the already available historical data from farms to develop accurate forecasting models that can support the growers' decision-making process.
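For readers reproducing percentage decreases in error of this kind, the sketch below shows two common conventions: the decrease computed from errors already averaged over plots, and the average of the per-plot decreases. Which convention underlies the percentages above is not stated here, so this is an assumption to check; the function names and sample errors are hypothetical.

```python
def relative_decrease(baseline_error, model_error):
    """Percentage by which model_error is lower than baseline_error."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return 100.0 * (baseline_error - model_error) / baseline_error


def mean_relative_decrease(baseline_errors, model_errors):
    """Average of the per-plot percentage decreases."""
    if len(baseline_errors) != len(model_errors) or not baseline_errors:
        raise ValueError("inputs must be non-empty and of equal length")
    decreases = [
        relative_decrease(b, m) for b, m in zip(baseline_errors, model_errors)
    ]
    return sum(decreases) / len(decreases)


# Hypothetical per-plot errors for two plots (illustrative only).
baseline = [0.10, 0.20]
model = [0.08, 0.10]

# Convention 1: decrease computed from the errors averaged over plots.
decrease_of_means = relative_decrease(
    sum(baseline) / len(baseline), sum(model) / len(model)
)

# Convention 2: average of the per-plot decreases.
mean_of_decreases = mean_relative_decrease(baseline, model)
```

The two conventions generally give different figures, so percentages recomputed from rounded average errors need not match percentages averaged per plot.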

Appendix A. Additional Forecasting Plots
Appendix A.1 (a) Base Model

Figure 2. The feature importance plotted for the base dataset generated by XGBoost.

Figure 4. Temperature at an Angus Soft Fruits farm.

Figure 5. ERA5 data plotted against weather station data at an Angus Soft Fruits farm.

Figure 6. A simple visualisation of the XGBoost process.

Appendix A.2. Random Forest Results

Table 1. Combined model variations: MAE and RMSE (bold denotes lowest error).

Table 2. Mean RMSE and MAE values for the different models across all farms (ML: machine learning; bold denotes lowest error).

Table 3. Mean RMSE and MAE values for each method (ML: machine learning; bold denotes lowest error).

Table 4. Prediction comparisons for the growers' pre-season, mid-season, and the expert-informed model (bold denotes lowest error).

Table A1. Combined RF model variations: MAE and RMSE (bold denotes lowest error).