Soya Yield Prediction on a Within-Field Scale Using Machine Learning Models Trained on Sentinel-2 and Soil Data

: Agriculture is the backbone and the main sector of the industry for many countries in the world. Assessing crop yields is key to optimising on-ﬁeld decisions and deﬁning sustainable agricultural strategies. Remote sensing applications have greatly enhanced our ability to monitor and manage farming operation. The main objective of this research was to evaluate machine learning system for within-ﬁeld soya yield prediction trained on Sentinel-2 multispectral images and soil parameters. Multispectral images used in the study came from ESA’s Sentinel-2 satellites. A total of 3 cloud-free Sentinel-2 multispectral images per year from speciﬁc periods of vegetation were used to obtain the time-series necessary for crop yield prediction. Yield monitor data were collected in three crop seasons (2018, 2019 and 2020) from a number of farms located in Upper Austria. The ground-truth database consisted of information about the location of the ﬁelds and crop yield monitor data on 411 ha of farmland. A novel method, namely the Polygon-Pixel Interpolation, for optimal ﬁtting yield monitor data with satellite images is introduced. Several machine learning algorithms, such as Multiple Linear Regression, Support Vector Machine, eXtreme Gradient Boosting, Stochastic Gradient Descent and Random Forest, were compared for their performance in soya yield prediction. Among the tested machine learning algorithms, Stochastic Gradient Descent regression model performed better than the others, with a mean absolute error of 4.36 kg/pixel (0.436 t/ha) and a correlation coefﬁcient of 0.83%.


Introduction
With total farmland stretching over 124 Mha, soya (lat.Glycine max) is the fourth largest crop in the world and one of the most important sources of oil and protein for animal feed, human consumption and bio-fuel [1].Due to the plant's rich nutritional content, soya production has increased more than 13 times since the 1960s, with the average yield in 2016 spanning between 1.22 t/ha in India and 3.51 t/ha in the USA [2].It is estimated that 78-80% of soya produced globally is based on Genetically Modified Organism (GMO) technology [3,4] for improving the plant's yield, nutritional components and pesticide resistance [5].However, there is a concern of a part of the public and scientific community over GMO food and accompanying pesticides and their influence on humans and the environment [6][7][8].For this reason, non-GMO production has also witnessed the increase in popularity.Non-GMO production of soya is especially present in Europe, where stakeholders from the entire value chain and civil society are gathered under the umbrella of Donau Soja [9].Donau Soya is a non-profit organisation organised as an association and is based in Vienna, with three local offices in Serbia, Ukraine and Moldova as well as a representative in Romania.The organisation's network of farmers was the one to acquire the data for this study, while their motivation behind sharing the data was the development of a Machine Learning System (MLS) that they could use in their daily operations.
Modern technologies have made it possible to install different types of devices on the combine harvester, such as a yield monitor that uses the responses of different sensors to map the crop on the parcel [10].In the domain of precision agriculture, yield monitor devices provide a new and powerful tool for zone management and in-field comparisons [11].This means that by analysing data for a specific field, farmers gain new knowledge that allows them to better prepare the plan of agricultural operations for the next growing season.Yield monitor data are sent directly from the server via the Internet of Things (IoT) communications protocol [12].In this way, they are obtaining insights into the status of their crops at the time of harvest, sometimes even in real-time.On the other hand, these data are stored in the internal memory, which provides the possibility of further processing or analysis.
Recently, satellite platforms have become increasingly available for widespread use, both for research and commercial purposes.Remote sensing applications, which include satellite images, have greatly enhanced our ability to monitor and manage our natural resources, especially in the area of agriculture [13].Satellite-based monitoring enabled large-scale observation with a revisit period of 5-10 days for open access data and a daily resolution for the paid satellites.Due to their wide swath, satellite imagery has found its practical application in the field of crop condition monitoring on a global level [14].The crop response in satellite images depends on soil properties in the areas with homogeneous treatments [15].The usage of satellite images and vegetation indices allow the farmers to identify different management zones on a commercial farm [16].One approach is to map the crop yield, as one of the most important layers of information in agriculture production, using Scalable, a satellite-based Crop Yield Mapper (SCYM) [17].This method uses crop model simulations to train statistical models for different combinations of possible image acquisition dates within the Google Earth Engine platform.
On the large scale of crop yield prediction a non-linear quasi-Newton multi-variate optimization method has been tested in the Iowa state.This model takes into account remote sensing and surface parameters for estimating the annual average yield and achieved promising results for soya (R 2 = 0.86) and corn [18].Farmers traditionally estimate yield based on their previous experience and present weather, but there are other more advanced approaches based on simulation models (e.g., APSIM [19], DSSAT [20], WOFOST [21] and AquaCrop [22]) or data-driven models, which rely on crop reflectance in remote sensing and the current wealth of agro-environmental data offers a great opportunity to improve yield forecasts [23].Soya yield prediction based on satellite images and weather data at municipality-level achieved a mean absolute error (MAE) of 0.24 Mg/ha [24].Terrain topography can significantly affect crop yields.The effectiveness of each topographic derivative can be estimated using LiDAR (Light Detection and Ranging) data and geographically weighted regression (GWR) models.These models using topographic variables derived from LiDAR can effectively explain yield on an entire-field scale (R 2 = 0.71 for soya yield prediction) [25].Prediction within-field yield is quite a difficult challenge because of the high resolution of produced resulting maps.Crop-growth model based on meteo, soil and LAI (Leaf Area Index) retrieval from Sentinel-2 achieved promising results.The mean error ranging from negative to positive values was −365 to 411 kg/ha across the study fields [26].The results of the research clearly indicate that each additional source of error should be included in the simulations when using the crop model for yield prediction.The assimilation of crop model data with the Ensemble Kalman filter for correcting errors in the water balance of the world food studies (WOFOST) led to improved results in predicting wheat yields for the majority of regions (66%) [27].Additional assimilations of Sentinel-1 and Sentinel-2 data and incorporation into the WOFOST model achieved correlations of R 2 = 0.35 of observed and simulated yields and RMSE = 934 kg/ha [28].
In order to feed the world's growing population, food production and yields must increase significantly [29].For that reason, yield prediction is one of the most important tasks in machine learning (ML) applied in agriculture.Agricultural production is very sensitive to a wide range of factors such as the weather, soil, seed selection, fertilisation, homogeneous zone management and their complex interactions.
Global research suggests that climate variation explains about one-third of global yield variability [30].Weather effects, together with evapotranspiration of plants, were used to create a ML regression model for yield prediction [31].There are several ML techniques that have been applied for crop yield prediction approaches found in the literature.Deep Neural Networks (DNN) utilize the high processing power of modern servers and computers and offers a state-of-the-art modeling approach yield prediction. DNN structures for crop yield prediction are usually based on soil and weather input data for the purpose of model training.Yield prediction has been performed using DNN and the model performance achieved the correlation coefficient of 81.91% [32].Another approach is to use Convolutional Neural Networks (CNN) [33] in combination with publicly available remote sensing data such as MODIS Surface Reflectance to estimate yield on the country level [34,35].Artificial Neural Networks (ANN) proved to be a very effective model for predicting the maize and soya yield, achieving a coefficient of determination of 0.77 and 0.81, respectively [36].
Based on the pre-season yield prediction, soya varieties are chosen [37,38], while based on in-season yield prediction, farmers used that prediction to optimise the storage, logistics, agricultural operations and forward sales.In this manner, they are lowering the post-harvest losses, signing more favourable and less risky contracts and increasing profits.
The results are promising and can provide yield estimates at the farm level, which could be useful in refining broader scale (e.g., county, region) yield projections, but on within-field levels, the prediction errors were above 20% in the best case [39].For most of the field, the prediction accuracy of ANN models was low, with errors above 40%.
The main aim of this research was to evaluate MLS for soya yield prediction using Sentinel-2 multispectral images and soil parameters together as training data.In parallel, the problem of fitting data from different sources (such as satellite images, yield monitors, and field surveys) was considered and elaborated.Finally, the computing requirements for large scale applications of the different machine learning algorithms used in this attempt were tested.

Method Overview
The methodology followed in this research employs MLS to predict soya yield from a series of multi-source input parameters, such as yield monitor data, raw pixel values from Sentinel-2 multispectral images, vegetation indices derived from the same imagery and soil features recorded in pre-existing databases.The flow chart of the processing pipeline is shown in Figure 1.MLS consists of two parts.The first part, i.e., preprocessing, was employed to deal with the irregularities within the data, and the second one was used for modelling the yield using a ML approach.Each segment of preprocessing was performed successively for all three years, and the final database for training ML models was created (furthermore, the complete dataset).Codes for both parts of MLS are given in Supplementary Materials.

Data
The dataset used in this research comprises a set of field data derived from yield monitor device, imaging data observed from Sentinel-2 satellites and soil data obtained from SoilGrids system.
Field data used in this study were provided by Donau Soja [9], which acquired it through the organisation's network of farmers in Austria.The yield measurements were taken by combine harvesters during the 2018, 2019 and 2020 growing seasons in Upper Austria (Figure 2).There were 142 fields in total: 72 in the first, 53 in the second and 17 in the third year.Their total area was 411 ha for all three years with an average parcel size of 2.9 ha.Each of field is denoted with a unique identification number (Parcel ID).Sentinel-2 is a modern satellite mission for Earth observation.It is a constellation of two identical satellites (Sentinel-2A and Sentinel-2B) placed in the same sun-synchronous orbit, phased at 180°from each other.Spectral bands for the Sentinel-2 images are shown in Table 1.This study relied on Level-2A product of Sentinel-2 satellite imagery that comes in 12 spectral bands from visible, infrared and short-wave infrared part of the spectrum at the highest spatial resolution of 10 m [40][41][42].Band 10 was discarded from this product, so the remaining 12 bands were used in this study.To deal with the problem of different resolutions, the bands with resolutions of 20 m and 60 m were upsampled to the resolution of 10 m so that all channels could be concatenated with aligned pixels.New images are generated every 5 days, but the majority were discarded in this study due to high cloud coverage.For every year, we chose 3 cloud-free images: one for each month from June to August.The dates are specified in Table 2, along with the relevant soya growth stage.The dates are not identical in the same day for each season due to the appearance of the cloud coverage, but it was considered to take the approximate date of image acquisition in the required period.In Table 2, mark V denotes the vegetation stages with the number of nodes (5) on the main stem with fully developed leaves, while mark M implies reproductive stages.During the R2 stage, soya opened its flower at one of the two uppermost nodes on the main stem with a fully developed leaf.In stage R2, a soya flower formed.R3 and R4 stages include pod formation, while R5 and R6 involved cover seed formation [43].

2019 2020 Soya Growth Stage
Vegetation indices are widely used in the domain of agriculture.Some of those are used to continuously monitor the condition of crops.Vegetation indices are mathematical combinations of two or more spectral bands designed to highlight a particular property of vegetation.Here, several vegetation indices [44] were calculated to facilitate tracing crop variance within the tested fields: • Normalized Difference Vegetation Index (NDVI) is one of the most commonly used vegetation indices, which combines two spectral bands to estimate the density of green vegetation and health of the plant [45,46].
• Enhanced Vegetation Index (EVI) was developed to quantify the vegetation signal with improved sensitivity in areas with dense vegetation and improved vegetation monitoring by de-coupling the canopy background signal and a reduction in atmosphere influences [47,48].
• Atmospherically Resistant Vegetation Index (ARVI) provides a self-correction process to correct radiance for the atmospheric effect on the RED band [49].
• Soil-Adjusted Vegetation Index (SAVI) is presented to minimize soil brightness influences from spectral vegetation indices [50].
• Normalized Difference Vegetation Index Red-edge (NDVIRE) is the modification of NDVI where the RED band was replaced with RedEdge [51,52].
• Visible Atmospherically Resistant Index (VARI) is used to estimate the share of vegetation with low sensitivity to atmospheric effects [53].
• Normalized Difference Water Index (NDWI) is used to determine the water content in vegetation [54].
• Modified Normalized Difference Water Index (MNDWI) is a modified version of the NDWI index and it is also used to detect water content [55].
• Visible-Band Difference Vegetation Index (VDVI) plays a role in the extraction of vegetation information in visible bands only and it is used to estimate vegetation coverage rate [56].
• Non-linear Index (NLI) is developed using intuition in the physics of interaction between optical radiation and vegetation canopy and using some results of analytical models.This index can minimize the effects of "disturbing" factors and view azimuth as well as soil brightness [57].
• Modified Non-linear Index (MNLI) is modification of NLI, which has an added a soil factor reduction [58].
• Normalised Multi-Band Drought Index (NMDI) is proposed for monitoring soil and vegetation moisture with satellite remote sensing data [59].It is used for drought detection.
• Green Leaf Index (GLI) is an important determinant of canopy photosynthesis, evapotranspiration and competition among crop plants and weeds [60].
• Excess Green (ExG) vegetation index is provided to determine a near-binary intensity image, which outlines a plant region of interest [61].
• Color Index of Vegetation Extraction (CIVE) is created to separate and emphasize the green plant portion from the background [62].
• Automated Water Extraction Index (AWEI) has a role to increase the contrast between water and other dark surfaces.Moreover, the aim of AWEI is to maximize the separability of water and nonwater pixels through band differencing, addition and applying different coefficients [63].
• Green-Red Vegetation Index (GRVI) is evaluated as phenological indicator based on multiyear stand-level observations of spectral reflectance and phenology [64].
• Green Atmospherically Resistant Index (GARI) is developed and expected to be as resistant to atmospheric effects as ARVI but more sensitive to a wide range of Chlorophyll concentrations [65].
• Difference Vegetation Index (DVI) is used to evaluate and quantify the difference between NIR and RED bands [66].
• Leaf Area Index (LAI) is related to canopy light absorption.It is used to characterise plant canopies [26].
For ML regression model training, 20 vegetation indices were used as additional features, which are derived from 12 bands of Sentinel-2 multispectral images for each of the three dates.These indices were used to estimate vegetation characteristics as they mostly serve as indicators for crop dynamics and overall changes in biomass quantity and properties.Moreover, some indices are used to monitor changes in the water content of leaves, while others are able to suppress the influence of the soil or eliminate the influence of the atmosphere.
Another set of inputs was the soil features retrieved via ISRIC's (International Soil Reference and Information Centre) SoilGrids platform [67].Out of a large number of soil chemical and physical features, the eleven most quantifiable ones at a 15-30 cm soil depth (where soy's root system expands) were selected for the analysis.The features are listed in Table 3.
All parcels are located within a radius of 15 km.Due to the proximity of fields, which were all located in the Upper Austria region, we considered the fields to have been influenced by similar weather conditions.Therefore, weather variables were not considered.The most straightforward approach for producing a map from a point-vector geospatial layer (e.g., a shapefile) is interpolation.However, it poses several problems.The first is that in the regions of slower combine movement, yield measurements are denser and hence lower.For example, let us assume that, for one strip of the field, a combine took 20 yield measurements, each showing 2 t/ha.Had it driven twice as fast, it would have made 10 yield measurements, but now with 4 t/ha each.Because of this difference in the measurements, the standard interpolation method (Simple Moving Average-SMA) [68] provides a higher yield value than the measurement in reality.Therefore, a novel methodology, namely the Polygon-Pixel Intersection (PPI), for fitting yield monitor and satellite data are proposed in this research.The new method is compared with SMA interpolation, and the findings are presented in the Results section.
The proposed PPI method is based on the interconnection of yield monitor and satellite data.Yield data from combine harvesters came in the form of georeferenced point measurements (Figure 3).The measurements were taken at regular time intervals but at a variable combine speed, resulting in irregularly spaced points within the rows.Furthermore, due to the irregular shape of the fields, the combine passes stand out from a straight line.
The original data from the yield monitor device contain information about the longitude (x), latitude (y), swath width (sw) (m), angle (a) (rad), distance (d) (m) and yield (y) (kg).All the data used for further analysis are shown in Figure 4.The green square signifies the Sentinel-2 pixel, while the purple rectangle signifies the area harvested by the combine in one interval, i.e., the area, which corresponds to one yield measurement.The point P c represents the location of combine at the end of the scanning interval.The combine is moving in the direction of the yellow arrow, and it passed distance d.Points P 1 to P 4 that represent the corners of the polygon are expressed in the following form.
Their coordinates are calculated using the aforementioned parameters as follows.In order to obtain the total amount of yield for a particular Sentinel-2 pixel, it was needed to sum up the contribution of each sensory reading to that pixel.The procedure includes several steps.First, a rectangle is drawn for each yield measurement using the four corners.It corresponds to the area from which soya was harvested (Figure 5).The width of the rectangle is recorded by combine's sensors and extracted from the shapefile, along with the orientation of the movement.Next, the Sentinel-2 pixel grid is drawn on top, as in Figure 6.The satellite image was clipped to the parcel size using the shapefile with the boundaries of the parcel.For each Sentinel-2 pixel, yield was calculated as the overlap between the pixel and all overlapping yield polygons.In this way, a yield map was derived for the Sentinel-2 pixel grid, allowing for seamless data fusion.The percentage of overlap (O p ) between a yield monitor polygon (A ymp ) and the Sentinel-2 pixel's polygon (A pp ) is expressed as follows.
The final output of the model is in the amount of yield in kg per pixel (Y pp ).
where n is the number of polygons that overlap with the pixel, and Y ymp is the contribution of a yield monitor polygon to the satellite pixel.The only issues were observed in border pixels, which, in addition to soya plants, also contain roads, dirt, forests and other pieces of land not belonging to the field (Figure 7).Therefore, border pixels were left out of further analysis.This procedure reduced the dataset by removing contaminated and irregular data.
In order to create a complete dataset from different sources, the data must be aligned to the same grid.Satellite image grid was chosen as the reference, as it has the highest spatial resolution of 10 m, compared to the 250 m wide SoilGrids pixels.In this manner, each pixel is associated with the appropriate values of the soil parameters.

Machine Learning
Predicted yield amount served as the output of the MLS regression model, while yield monitor data, raw Sentinel-2 pixel values, vegetation indices and soil parameters represented the input.
The input dataset to the MLS consisted of 107 features in total-three 12-band satellite images from three different dates, additional twenty vegetation indices for the relevant dates and eleven soil parameters.Soil parameters are considered immutable during the observed period.
Outliers were removed from the dataset where the lower limit was 1 kg/pixel while the limit upper was 70 kg/pixel.The entire dataset contained 28,111 samples, which in this case, after resampling and data preprocessing, represented the total number of pixels.Each pixel was denoted with unique identification number (Pixel ID).
For the purpose of model testing 10% of parcels were randomly selected for the dataset to achieve similar probability density function on the training and testing set.The rest of the parcels were included in the training set.The separation into training and test set was performed on the parcel and not on a pixel level to avoid the problem of data leakage.Namely, pixel values from the same parcel are very similar and highly correlated to one another.Therefore, the random separation on the parcel level was performed to create subsets.Performance metrics, however, were evaluated at the pixel level.
Different ML algorithms were tested on these fields and their performance was evaluated and compared.The following ML algorithms were used to model the relationship between the input features and the yield: Multiple Linear Regression (MLR); 2.
All algorithms were implemented using the Scikit Learn Python library [69].This library is free and widely used in everyday challenges for the implementation of artificial intelligence in various decision-support systems.
The MLR is a technique with general purpose to seek for the linear relationship between a dependent variable and several independent variables.Multiple linear regression is a generalization of simple linear regression to the case of multiple independent variables [70].
The SVM is a supervised ML algorithm used for binary classification or regression.The goal is to find a hyperplane that separates data into predefined number of classes.For that purpose, SVM uses kernel function to separate feature space [71].
The XGB is one of the algorithms of gradient boosting machines with the ability to be utilized for supervised learning.It is an ensemble ML algorithm constructed from decision tree models.Each tree is used once to fit prediction errors made by prior models.This technique uses a gradient descent optimisation algorithm to fit a model with the goal to minimise the loss gradient [72].
The SGD algorithm is an iterative method for optimising objective functions.After each iteration, the gradient is updated using the stochastic approximation of gradient descent.The SGD implements a plain stochastic gradient descent learning algorithm that supports different loss functions and penalties to fit linear regression models [73].
The RF is a method of ensemble learning, where the basic unit in the ensemble is the decision tree.It is based on the idea that several "weak regressors" can be combined to form a "strong regressors".Each decision tree within the ensemble is trained on a bootstrap set of samples and has its own prediction.This system of creating an ensemble and the final decision made by a majority vote is called bagging (bootstrap aggregation) [74].
Root-Mean-Squared Error (RMSE), Mean Absolute Error (MAE), Coefficient of determination (R 2 ), Pearson Correlation Coefficient (PCC) and Spearman's Correlation Coefficient (SCC) were used to evaluate and rank different models.Sensitivity analysis was performed using an open-source Python package SALib [75].

Processing Speed-Up Using High Performance Computing
For a small number of fields, the execution of the system presented herein on a regular computer (PC) with Intel(R) Core(TM) i7-7800X CPU with six cores (3.5 GHz) and 32 GB of RAM was sufficient to produce results in a matter of hours.However, the system was envisaged to work either within a large farm management information system with thousands or tens of thousands of users, or as a tool for regional yield prediction, both of which are highly computationally demanding tasks.Soya is grown on 5.65 Mha in Europe [1] and due to the intensive digital transformation of agriculture, there is potentially a huge demand for the service.As most soya plants are grown in the same period of the year throughout Europe, with some deviations due to climate and the relative maturity group of the varieties, producing the results could be time-critical.For this reason, a system was implemented on the High Performance Computing Cluster (HPC), which provided more powerful hardware infrastructure [76].The algorithms were adapted using the Message Passing Interface (MPI) model [77] in order to scale on an HPC infrastructure, where the program execution time was significantly reduced.More specifically, the application was encapsulated in a Singularity container [78] together with all its Python libraries, such as the mpi4py library that enables implementation of MPI in Python.Cluster system configuration had 2 x Intel(R) Xeon(R) Gold 6226 CPU with 48 cores (2.7 GHz) and 512 GB of RAM.

Results and Discussion
The results of all regression models with optimised parameters related to the given models are presented in Table 4.The result with the smallest MAE (4.36 kg/pixel) and RMSE (5.53 kg/pixel) error was achieved by SGD regression model.In terms of easier comparison with the prediction on the parcel-level, this error was 0.436 t/ha.The coefficient of discrimination R 2 was 0.68, which is compared to some relevant scientific papers [18] (R 2 = 0.82) much smaller.The reason lies in the fact that we had included much greater spatial variability, thus the yield prediction error increases cumulatively.Compared to regional-level soya yield prediction, we also achieved a weaker result.Method based on satellite images soya yield prediction at the municipal (regional) level achieved an MAE of 0.24 t/ha [24].It should be taken that at a higher spatial level, the error is hidden behind the yield approximation over that large area and a high possibility of discrimination and precision is lost.The SGD model presents a very efficient approach to dealing with nonlinearity in the yield monitor data.The SVM model achieved similar results as the SGD, which confirms the similarity of these two algorithms.Ensemble learning models (RF and XGB) achieved slightly lower performance for yield prediction where MAE was greater than 5 kg/pixel.If we take into account the average yield of 4 t/ha, the prediction error of our model was 10.9%, which is two-times lower than the result achieved in the other relevant study (yield prediction error > 20%) [39].The linear regression method (MLR) did not provide satisfactory results (MAE = 5.48 kg/pixel); therefore, this regression problem can be considered very difficult to solve using linear determination.
Sobol' sensitivity indices represent the first-order indices that are used for estimating the influence of individual factors and the second order indices, which correspond to pairwise interaction between factors [79].The visualisation of interactions is shown in Figure 8.For the given fifteen most important parameters, the Sobol' method determined seven soil parameters (silt, cec, phh2o, cfvo, ocd, nitrogen and sand), four vegetation indices (EVI.2,EVI.3, GRVI.2 and NDWI.3) and four raw bands of Sentinel-2 images (Band4.3,Band9.1,Band8.2 and Band2.2). Results of Sobol' method reveal that silt and EVI (for the second date) were the most important individual factors for the model output.Silt content proved to be the most indicative of yield variation [80].High importance of EVI and GRVI indices are justifying the usage of vegetation indices in this study.The EVI well reflects the state of vegetation, and it is also resistant to saturation for high dense biomass.The NDWI is of great importance in determining the water content, which is a very important factor for better yield.Selected blue, red and green channels are also highly correlated with the chlorophyll content in the plant and, therefore, make them very important features for yield prediction.  2.
The visualisation of model performance is shown in Figure 9 where SGD showed the best ability to model the yield.Moreover, Figure 9 shows that the highest concentration of samples was between 30 kg/pixel and 50 kg/pixel, which corresponds to a yield of 3 t/ha to 5 t/ha.This range of yields corresponds to the average soya yield per hectare, which is an indicator of the correctness of our predictions.
Advantages of our Polygon-Pixel Intersection (PPI) method over the classical interpolation (Simple Moving Average-SMA) are presented by the metrics themselves in Table 5.The SMA interpolation method provided a higher prediction error (MAE > 8 kg/pixel) than the proposed PPI method.The advantage of the PPI method lies in the fact that the harvester does not have to move in a straight line and in a normal orientation, which is related to imaginary X and Y coordinates.In most cases, yield polygons are covered by several Sentinel-2 pixels, which leads to the partial redistribution of yields.An additional experiment was conducted to determine dependence of the yield prediction error and the size of the pixels-polygon overlap.The results are shown in Figure 10.Yield error represents the difference between real and predicted values.The orange histogram shows that, in most cases, the estimation error was 0. Since the highest spatial resolution for Sentinel-2 pixel is 10 m × 10 m, the area of 100 m 2 is taken as the reference value of the estimation per pixel.According to the green histogram in the top part of Figure 10, it is clear that most pixels have an overlap of 100 m 2 .Moreover, the pixel coverage in some cases was larger than 100 m 2 due to the overlap in the trajectory of the combine harvester over the same area.Based on the analysis and presentation, the conclusion is that the estimation does not depend on the total overlap of pixels with yield monitor polygons.The synergy of the PPI and SGD regression model provides an opportunity to clearly calculate the proportional part of the crop yield that belongs to a single Sentinel-2 pixel.This ensures the robustness of the SDG regression model in the case when the movement of the combine is not optimal (straight line) or when sample distance and swath width are not uniform.
The runtimes on PC and HPC infrastructures are shown in Figure 11.The results show that implementation of the code on the HPC cluster infrastructure significantly reduces the execution time, especially with preprocessing, which is the most time-consuming and complex operation in the pipeline.Compared to a single-core PC, the HPC cluster saved 75% of the time.Processing time was reduced by a factor of 2 using only 3 cores.With each subsequent increase in the number of cores, negligible processing time savings were achieved.Employing more than nine cores did not result in any improvement.
Due to its lower prediction error than other tested regressors, SGD was chosen as the basis of the system for yield prediction.The final output of the model is a 10 m resolution map as shown in Figure 12.This representation of results is possible to embed in a web platform as a tool for yield prediction.Using these yield maps, variability within the parcel is easily observed, and it gives us an opportunity to perform variable-rate application, within the paradigm of precision agriculture.

Conclusions
ML has a wide application in agriculture with a great research interest in this field as there is a growing need for practical solutions for increasing the efficiency of agricultural production.Crop yield is a very complex trait that is determined by several factors such as the genotype, environment, management and their mutual interaction.This study presents a method for developing a yield prediction tool, which has the potential to be widely applied in locations where soya is grown.It uses global/open-source satellite and soil data as inputs making it scalable across the globe.Accurate yield prediction requires a thorough understanding of the problem, comprehensive datasets, and powerful algorithms.It is a very demanding challenge to provide crucial information for optimising management decisions concerning fertiliser application, harvest, silage, logistic and sales.This research belongs to the field of data analytics that combines image processing and ML for yield prediction.Although the research covers three full crop seasons, the database turned out to be insufficiently large for more complex algorithms such as DNN or CNN that possess many more parameters than conventional ML models and, thus, require more data for training.We showed that our approach of data preprocessing provides the opportunity to train an ML model (SGD) that makes decisions at the 10 m resolution with the prediction error of 4.36 kg/pixel.
The PPI method provides information about the spatial distribution of yield, which is incorporated into the Sentinel-2 pixel grid.This information has the potential to help farmers create appropriate decisions in terms of precision agriculture and field operations.The biggest advantage of PPI method is the possibility to obtain high-resolution prediction maps despite the large variability of data.The achieved results provide opportunities for farmers regardless of whether or not they own yield monitor device to have an accurate insight into the condition of their parcels.These yield maps improve the quality of farmer's decision-making solutions on a very high resolution.A simple method such as classical interpolation (SMA) provides worse results than the PPI method due to irregular spatial overlap of the satellite image pixels and yield monitor polygons.Furthermore, the size of overlap between satellite pixels and yield monitor polygons has no influence on the quality of yield prediction.
Combining data from several sources takes time, as it involves executing queries, preprocessing steps and stacking the data sets.Therefore, the HPC infrastructure is employed as a method of reducing processing time.The HPC infrastructure is a method of processing huge volumes of data at very high speeds for large numbers of customers simultaneously.Code implementation on the HPC infrastructure significantly reduced data processing times by more than 75%.This implementation would be necessary for an MLS that provides a yield map with a resolution of 10 m processed in real time.
Future work will be centered around improving the regression algorithms by using the additional information and data from other regions in order to establish a more robust algorithm for soya yield prediction.These forms of data have the potential to provide the appropriate setup for research in the field of in-season yield prediction, in which the system performance would be analysed throughout the growing season-from sowing to harvest.One of the important things is that HPC implementation enables the integration of a practical system to install it on a web platform and run in real-time.Another challenge would be to develop a multi-class regressor that could predict the yield not only for soya but also for other similar arable crops from around the world.However, we believe that this research provides a valuable contribution to the digital transformation of agriculture in the humanity's efforts to optimise farming and produce more with less.

Figure 1 .
Figure 1.A flowchart overview and example walk-through of the methods presented in this paper.Datasets are shown in yellow, purple denotes the preprocessing operations, and modules for ML are shown in blue, while the resulting model and outputs (prediction maps and model performance) are shown in green and red, respectively.Black arrows indicate the flow of data.The segments belonging to the preprocessing part are framed by a red dashed line, while the ML components are bordered by blue dashed lines.

Figure 2 .
Figure 2. Location of the fields from the dataset.(A) The selected study region in the Upper Austria denoted by the black box.(B) Spatial distribution of fields within the study region.

Figure 3 .
Figure 3. Original data from the yield monitor device in the points form.

Figure 4 .
Figure 4. Graphic representation of polygon and pixel overlap for PPI method.

Figure 5 .
Figure 5.A polygons correspond to the area from which soya was harvested for each single yield measurement.

Figure 7 .
Figure 7. Graphic representation of Sentinel-2 pixel grid.Border pixels excluded from the analysis are noted in red while the pixels for further processing are marked in gray.

Figure 8 .
Figure 8. Sobol' factors interaction.Red circles denote first order index, while blue lines represent second order index.The larger diameter of the circles (first order index) corresponds to the greater influence of the parameter on the model output.Second order index between two factors is proportional to the width of the blue line that connects them.The first number in the band name denotes a channel number, and the second number indicates one of the three selected dates in the Table2.

Figure 9 .
Figure 9. Graph visualization of real and predicted yield on test set.

Figure 10 .
Figure 10.Graph visualization of dependence between yield prediction and overlap area per pixel.

Figure 11 .
Figure 11.Execution time on a PC (cyan) and HPC cluster (magenta) for a different number of cores.

Figure 12 .
Figure 12.Yield map-the output of the pre-trained model in the form of a shapefile.

Author Contributions:
Conceptualization, B.P. and O.M.; methodology, P.L., M.P. (Marko Panić) and M.P. (Miloš Pandžić); software, B.P., P.L., A.A., E.A., P.M. and N.Z.; writing and editing, B.P. and O.M.; supervision, V.C.; funding acquisition, O.M. and V.C.All authors have read and agreed to the published version of the manuscript.Funding: Research on satellite image processing was supported by ANTARES project founded from the European Union's Horizon 2020 research and innovation programme under grant agreement No. 739570, while research on machine learning and HPC implementation was supported by CYBELE project funded from the European Union's Horizon 2020 research and innovation programme under grant agreement No. 825355.The authors acknowledge the financial support of Provincial Secretariat for Higher Education and Scientific Research of Vojvodina through the project Development of decision support system for agricultural production using data fusion and artificial intelligence (Grant No. 142-451-2698/2021-01).Data Availability Statement: Protected by the NDA.

Table 2 .
Dates of Sentinel-2 image acquisition with the corresponding soya growth stage.

Table 4 .
Performance of ML models with the new PPI methodology for preprocesing.

Table 5 .
Performance of ML models with SMA interpolation method for preprocessing.