Grape Yield Prediction Models: Approaching Different Machine Learning Algorithms

: Efﬁcient marketing of winegrapes involves negotiating with potential buyers long before the harvest, when little is known about the expected vintage. Grapevine physiology is affected by weather conditions as well as by soil properties and such information can be applied to build yield prediction models. In this study, Partial Least Squares Regression (PLSR), Cubist (CUB) and Random Forest (RF) algorithms were used to predict yield from imputed weather station data and soil sample analysis reports. Models using only soil variables had the worst general results (R 2 = 0.15, RMSE = 4.16 Mg ha − 1 , MAE = 3.20 Mg ha − 1 ), while the use of only weather variables yielded the best performance (R 2 = 0.52, RMSE = 2.99 Mg ha − 1 , MAE = 2.43 Mg ha − 1 ). Models built with CUB and RF algorithms showed signs of overﬁtting, yet RF models achieved the best average results (R 2 = 0.58, RMSE = 2.85 Mg ha − 1 , MAE = 2.24 Mg ha − 1 ) using only weather variables as predictors. Weather data imputation affected RF and CUB models more intensely while PLSR remained fairly insensitive. Plant age, yield level group, vineyard plot, May temperatures, soil pH and exchangeable concentrations of Zn, Cu, K and Mn were identiﬁed as important predictors. This exploratory work offers insights for future research on grape yield predictive modeling and grouping strategies to obtain more assertive results, thus contributing to a more efﬁcient grapevine production chain in southern Brazil and worldwide.


Introduction
In 2021, 7.3 million hectares of land were occupied by vineyards worldwide.Total grape production reached 77.8 Mt with over 57% destined to the wine industry.World wine production in 2021 was estimated in 26 GL, excluding juices and musts [1,2].In Brazil over 75 thousand hectares of vineyards were harvested in 2022 [3] and the socioeconomic importance of viticulture is highlighted by a national wine market of BRL 20.3 billion (~USD 4 billion), which employed around 200 thousand people in 2019 [4].Despite being the second largest market in South America after Argentina [2], 86 of every 100 bottles consumed by Brazilians in 2019 were imported [4], revealing the enormous growth potential for national wineries.
Efficient marketing of wine grapes involves negotiating with potential buyers long before the harvest, when little is known about the expected vintage.In addition, grape fertilizing practices also rely on yield expectations for the definition of application doses [5].
To anticipate yields, one must inevitably rely on some conceptual or numerical model of how crops respond to surrounding conditions.The use of predictive models in fruit growing can improve the effectiveness of decision-making processes and, as a result, there has been substantial industry and academic interest in building predictive models for grapevines [6][7][8][9][10].
Soil chemical properties and nutrient levels have a decisive impact on vineyards' production [11][12][13].Grapevine mineral nutrition is one of the most important factors in fruit production, especially in high-density orchards [14], as minerals are responsible for several functions in plants, such as tissue constitution, enzymatic activation, osmotic regulation of membranes and intermediation of energetic processes [15,16].Unbalanced grapevine nutrition can compromise yields and the quality of grapes, affecting their appearance, color, flavor, size, aroma, post-harvest storage capacity and tolerance to pests and diseases [17,18], which is later reflected in the characteristics of the resulting must and wines [19][20][21].Farmers generally rely on soil analysis to assess properties such as texture and organic matter content, pH and exchangeable nutrient levels in order to guide fertilizing management [22,23] but this type of data can be of limited availability due to cost constraints.Recent advances in soil sensing, however, may lead to more affordable, applicable and shareable soil information that can help circumvent current limitations in the near future [24][25][26][27][28].
Aside from nutrition, plant productivity is strongly driven by environmental factors.Meteorological conditions play a predominant role in grapevine physiology [29], influencing its growth, phenological development, yield and must properties [12,13,30], thus affecting wine quality [31].Sunlight exposure is a key component of development, as solar radiation is captured by the plants and transformed into biomass [32].Incoming solar energy is also converted to temperature with direct effects on grapevine production.High temperatures affect photosynthesis, transpiration and grape berry composition [33], whereas low temperatures contribute to the accumulation of chill units and have effects on bud dormancy [34,35].Water availability is arguably the most important environmental factor limiting crop growth and productivity.Despite the growth of irrigated viticulture worldwide, precipitation remains of paramount importance, making grapevines susceptible to altered water regimes and, while grapevine drought tolerance mechanisms are not fully understood [36], high precipitation is known to increase the risk of downy mildew (Plasmopora viticola) occurrence [37].Fungal spores can be dispersed by wind, spreading infestation and compromising vineyard health [38].High wind speed, in turn, can reduce grape berry susceptibility to sunburn [39].Along with humidity and temperature, wind can be used to estimate grapevine canopy evapotranspiration, which reflects water usage by plants [40].The sheer importance of weather factors to grapevine production raises concerns over climate change impacts on wine-growing regions [31,[41][42][43][44][45][46].
Climate data is now abundant, relatively cheap and can be found in different resolutions, as weather stations across the world continuously record and monitor various parameters for climate classification, planning, modeling and management purposes [47].However, measuring instruments are subject to recording errors, malfunctioning, maintenance, network transmission and storage failures among other events that can generate data gaps and result in incomplete datasets [48][49][50].Data missingness adversely impacts the analysis carried out afterwards and may lead to erroneous findings, false conclusions and inaccurate predictions [51].
The choice of the best strategy to handle missing data is not straightforward but most studies conclude that imputation is more advisable than removing data in order to reduce the risk of introducing bias in datasets and subsequent analyses [50,52,53].Consequently, data imputation is an important preprocessing task in modeling.The simplest imputation methods consist in replacing the missing values with some measure of central tendency calculated from the non-missing values, such as the mean, mode or median or by randomly selecting values within the entire data or subset (e.g., hot deck imputation) [50,54].Despite the simplicity, these methods can reduce overall data variance and are bound to the non-missing data range.Climatic data properties such as autocorrelation between time lags, seasonality, periodic trends and cycles can be useful for the development of imputation strategies [47,55].In addition to leveraging univariate patterns, multivariate approaches can be applied to estimate missing data by means of predictive modeling, where non-missing variables are used as predictors.As a result, different statistical and machine learning methods have long been used to address the data missingness issue in different knowledge domains [51,56,57], including climate data [47,49,50,58,59].In advanced multiple imputation schemes, the process of generating replacement values for missing data is repeated many times, resulting in m complete datasets that are further analyzed.The literature recommends the number of imputed datasets (m) ought to be between 5 and 10 [60].These datasets are then used in subsequent analysis and the outcomes are pooled in order to obtain robust results, reducing uncertainty and bias [50,60,61].
Machine learning (ML) is a segment of artificial intelligence that has thrived in the recent context of big data technologies and high-performance computing [62].ML represents an alternative to statistical models where deterministic processes take precedence over probability or likelihood measures in accomplishing estimation tasks [63].As such, forecasting models can be built based on nonparametric and semi-parametric structures with validation relying on prediction accuracy.These empirical models are data-driven and do not require deep knowledge of the biophysical mechanisms that produced the data nor a predefined structure of the model, making such techniques inexpensive and relatively easy to apply [10].Moreover, some ML algorithms can work with a large number of predictors from both categorical and numerical data types all at once, requiring little data preprocessing.These features have led to a prominent adoption of ML methods in agriculture [62] and crop yield prediction [63,64].Previous studies attempted to model grape yields in wine-growing regions [6,7,10], but to date there have not been such attempts focusing on the well-established viticulture region of southern Brazil.Furthermore, studies have generally not evaluated the effect of imputing missing data on the predictive modeling of grape yields.
The aims of this study were to assess the suitability of different machine learning algorithms in grape yield forecasting; evaluate the effects of data imputations on model performance; and investigate the importance of soil analysis and weather station data in grape yield predictions.

Study Region and Grapevine Data
The study area lies in Santana do Livramento (latitude 30 • 53 27 S, longitude 55 • 31 58 W), located in the Campanha Gaúcha region of Rio Grande do Sul state, in southern Brazil (Figure 1).It is a traditional winegrape region with a humid subtropical climate with hot summers and no dry seasons (Köppen-Geiger classification Cfa) [65].Data was obtained from a commercial vineyard grown on sandy textured Alisol [66].Grapevines were grafted onto SO4 (Selection Oppenheim 4) rootstocks and grown in an espalier system.
Yield records (n = 534) ranged from 0.11 to 26.13 ton ha −1 , with a mean of 9.10 ± 4.81 Mg ha −1 , comprising 14 harvests (1999:2007, 2009, 2011, 2013, 2018, 2019)  Yield observations were sorted by harvest.In each harvest, yield distribution was divided into three groups ("high", "medium" and "low" yield) using Jenks Natural breaks optimization [67], implemented via the BAMM tools [68] package in R. Jenks Natural breaks method is borrowed from the field of cartography and seeks to minimize the variance within categories while maximizing the variance between categories.Observations were then assigned to each group based on the yield recorded at a given harvest and the respective defined breaks.After classification, the data was grouped by cultivar and the frequency in which they were classified as "high", "medium" or "low" was calculated.This information was used along with the general yield distribution and the number of observations by cultivar to determine the final yield class of each cultivar.Clustering results (Figure 2) were validated by expert knowledge provided by the agronomist responsible for the studied vineyard.Yield observations were sorted by harvest.In each harvest, yield distribution was divided into three groups ("high", "medium" and "low" yield) using Jenks Natural breaks optimization [67], implemented via the BAMM tools [68] package in R. Jenks Natural breaks method is borrowed from the field of cartography and seeks to minimize the variance within categories while maximizing the variance between categories.Observations were then assigned to each group based on the yield recorded at a given harvest and the respective defined breaks.After classification, the data was grouped by cultivar and the frequency in which they were classified as "high", "medium" or "low" was calculated.This information was used along with the general yield distribution and the number of observations by cultivar to determine the final yield class of each cultivar.Clustering results (Figure 2) were validated by expert knowledge provided by the agronomist responsible for the studied vineyard.Yield observations were sorted by harvest.In each harvest, yield distribution was divided into three groups ("high", "medium" and "low" yield) using Jenks Natural breaks optimization [67], implemented via the BAMM tools [68] package in R. Jenks Natural breaks method is borrowed from the field of cartography and seeks to minimize the variance within categories while maximizing the variance between categories.Observations were then assigned to each group based on the yield recorded at a given harvest and the respective defined breaks.After classification, the data was grouped by cultivar and the frequency in which they were classified as "high", "medium" or "low" was calculated.This information was used along with the general yield distribution and the number of observations by cultivar to determine the final yield class of each cultivar.Clustering results (Figure 2) were validated by expert knowledge provided by the agronomist responsible for the studied vineyard.

Soil Data
Soil data consisted of routine laboratory analysis results obtained from soil top layer samples (0-20 cm) collected under canopy projection and contained information on soil exchange capacity (potential and effective), acidity (potential, pH and SMP index), base and aluminum saturation, clay and organic matter contents, as well as exchangeable concentrations of aluminum (Al), phosphorus (P), potassium (K), calcium (Ca), magnesium (Mg), sulfur (S), copper (Cu), zinc (Zn), boron (B) and manganese (Mn) (Table 1 and Figure 3).Soil analyses followed methodologies recommended by [69,70].

Weather Data
Data from the Santana do Livramento ground meteorological station, the closest to the vineyards, was retrieved from the Brazilian National Institute of Meteorology database (https://bdmep.inmet.gov.br/,accessed on 26 July 2023) for the period between 1990 and 2020.Due to the homogeneity of the climate and relief of the study region, data from the nearby weather station of Bagé (~160km apart) was also retrieved in order to provide auxiliary information and aid missing data imputation [60].Monthly average values of potential and real evaporation, evapotranspiration, sunlight exposure, cloudiness, temperatures (maximum, average and minimum), relative humidity and average wind speed were available, as well as the total number of rainy days and precipitation volumes (Table 1).
Soil data consisted of routine laboratory analysis results obtained from soil top layer samples (0-20 cm) collected under canopy projection and contained information on soil exchange capacity (potential and effective), acidity (potential, pH and SMP index), base and aluminum saturation, clay and organic matter contents, as well as exchangeable concentrations of aluminum (Al), phosphorus (P), potassium (K), calcium (Ca), magnesium (Mg), sulfur (S), copper (Cu), zinc (Zn), boron (B) and manganese (Mn) (Table 1 and Figure 3).Soil analyses followed methodologies recommended by [69,70].

Data Imputation
In this study, missing data from the Santana do Livramento weather station was partially missing within the period covered by soil sampling data (Figure 4) and was imputed using either uni-or multivariate methods, depending on the gap size.Gaps equal to or shorter than nine consecutive observations (i.e., 9 months or 75%) were imputed using a univariate approach, which consisted of splitting the times series into seasons and performing imputation separately for each of the resulting subsets using interpolation.In this study, the univariate imputation of meteorological data was performed using the imputeTS package [71] with a 12-month season.
Univariate methods may fail to provide reasonable imputations for a variable when periods of missing values are large [51], so for gaps larger than 9 months Multivariate Imputation by Chained Equations, or MICE [61] was used.MICE turns the imputation problem into a series of estimations, where each variable has its own imputation model built using the other remaining variables of the dataset.First, only the complete data is used to estimate values for the variable with the smallest number of missing observations.Next, the recently imputed variable is used along with the originally complete data to estimate the variable with the second smallest number of missing observations and so on.The first iteration is finished when all missing values are estimated once.In the second iteration, the order of imputation remains but the imputed values are now updatedconsidering all estimates generated in the previous steps.This process is repeated through a number of iterations to achieve a stable imputation.A single imputed dataset is obtained in the last iteration [72].To attain robust results, it is recommended that many imputed datasets (m) are used for the desired analysis so that the pooled results can be evaluated [53].The MICE algorithm was implemented using the mice [61] package to generate five different imputed datasets after 50 iterations each, produced with random forest models containing 500 trees.Univariate methods may fail to provide reasonable imputations for a variable when periods of missing values are large [51], so for gaps larger than 9 months Multivariate Imputation by Chained Equations, or MICE [61] was used.MICE turns the imputation problem into a series of estimations, where each variable has its own imputation model built using the other remaining variables of the dataset.First, only the complete data is used to estimate values for the variable with the smallest number of missing observations.Next, the recently imputed variable is used along with the originally complete data to estimate the variable with the second smallest number of missing observations and so on.The first iteration is finished when all missing values are estimated once.In the second iteration, the order of imputation remains but the imputed values are now updatedconsidering all estimates generated in the previous steps.This process is repeated through a number of iterations to achieve a stable imputation.A single imputed dataset is obtained in the last iteration [72].To attain robust results, it is recommended that many imputed datasets (m) are used for the desired analysis so that the pooled results can be evaluated [53].The MICE algorithm was implemented using the mice [61] package to generate five different imputed datasets after 50 iterations each, produced with random forest models containing 500 trees.
In addition to weather station data, some of the soil analysis reports also hosted missing values, particularly regarding soil micronutrients (Figure 4).Since no seasonality can be attributed to these soil features and given the fact that the missingness mechanism did not seem to be random, a more simple approach was chosen and median imputation was performed to fill soil data gaps.

Modeling
The underlying assumption of this study is that weather and soil data hold information that is useful to forecast yields and, to investigate its contribution to estimates, models were built using different sets of predictors (p), namely "Soil" (p = 24: four categorical, 20 numeric), "Weather" (p = 149: four categorical, 145 numeric) and "Weather + Soil" (p = 168: four categorical, 164 numeric) (Table 1).In order to handle the high number of predictors and the possible non-linear relationships between variables, three In addition to weather station data, some of the soil analysis reports also hosted missing values, particularly regarding soil micronutrients (Figure 4).Since no seasonality can be attributed to these soil features and given the fact that the missingness mechanism did not seem to be random, a more simple approach was chosen and median imputation was performed to fill soil data gaps.

Modeling
The underlying assumption of this study is that weather and soil data hold information that is useful to forecast yields and, to investigate its contribution to estimates, models were built using different sets of predictors (p), namely "Soil" (p = 24: four categorical, 20 numeric), "Weather" (p = 149: four categorical, 145 numeric) and "Weather + Soil" (p = 168: four categorical, 164 numeric) (Table 1).In order to handle the high number of predictors and the possible non-linear relationships between variables, three algorithms-Partial Least Squares Regression (PLSR) [73], Cubist (CUB) [74,75] and Random Forest (RF) [76]-were used to predict grape yields.PLSR seeks to produce uncorrelated predictors by resorting to orthogonal projections.As such, predictors are expected to be numeric, therefore, categorical variables in the dataset were one-hot encoded prior to running this algorithm while numeric ones were centered and scaled.Moreover, the "plot" variable, which refers to a vineyard parcel, was removed since it contains 135 possible values and can be considered high cardinality data, yielding too many new variables after one-hot encoding.To evaluate model performance, the coefficient of determination (R 2 ) (Equation (1)) was used as a measure of variance explained by the predictors while Root Mean Squared Error (RMSE) (Equation ( 2)) and Mean Absolute Error (MAE) (Equation (3)) were used as a measure of spread in predictions [77].
All models were fit using the caret package [78] ("pls", "ranger" and "cubist" engines) on a training set (n = 399, 75%) with 10-fold cross-validation and five repetitions.Model tuning was performed using grids methods, optimized to reduce Root Mean Squared Error.For PLSR tuning, the number of principal components ranged from 1 to 20.In CUB, 100 committees were used and five different numbers of neighbors were tested (1, 3, 5, 7 and 9).The number of random variables (mtry) used in each RF tree depended on the dataset used for predictions.For "Weather" and "Weather + Soil", four values were tested (50, 65, 80 and 95), whereas the "Soil" dataset was tuned using five values (9, 12, 15, 18 and 21).The same minimum node sizes were tested for all datasets (3, 5, 7 and 10).All RF models were built using 100 trees.Model best fit parameters were applied in the validation model and performance was assessed on a hold-out test set (n = 135, 25%).Data handling, imputing and modeling were performed in R language [79] using RStudio IDE [80].

Model Performance and Imputation Effects
Model outcomes indicate that the choice of predictors had a considerable impact on grape yield forecasting.Regardless of the algorithm used, models built with the "Soil" dataset had the worst average results (R 2 = 0.15, RMSE = 4.16 Mg ha −1 , MAE = 3.20 Mg ha −1 ), while the use of only the "Weather" dataset yielded the best performance (R 2 = 0.52, RMSE = 2.99 Mg ha −1 , MAE = 2.43 Mg ha −1 ).The combination of soil and weather predictors had slightly worse results than weather variables alone (R 2 = 0.50, RMSE = 3.05 Mg ha −1 , MAE = 2.43 Mg ha −1 ), probably as a consequence of the higher number of variables with low predictive power.On the other hand, as the number of predictors increased, the difference between the three algorithms (i.e., the spread) decreased, suggesting a trade-off between higher precision and lower accuracy (Figure 5 and Table 2).In general, the RF model had the highest R 2 and the lowest RMSE and MAE, followed by CUB and PLSR, regardless of the predictors used (Table 3).However, CUB and RF were the two algorithms most affected by overfitting, presenting very sharp changes in metrics when training and test sets are compared, especially when soil-related predictors were used.In addition, CUB and RF were also the most affected by different imputation sets and, once more, the presence of soil-related predictors contributed to higher variance in model metrics (Figure 6).The impact of imputation sets on model performance arises f predictor distribution and range after filling in the missing values importance of predictors can also be indirectly assessed by investigatin between the imputed datasets.Imputation sets were fairly uniform and The impact of imputation sets on model performance arises from changes in predictor distribution and range after filling in the missing values, therefore, the importance of predictors can also be indirectly assessed by investigating the differences between the imputed datasets.Imputation sets were fairly uniform and shaped similarly to the original observed data for most predictors except the mean wind speed and, to some extent, potential evapo-transpiration.The number of rainy days, precipitation and relative humidity had at least one imputation set whose distribution appeared different from the others (Figure 7). to the original observed data for most predictors except the mean wind speed and, to some extent, potential evapotranspiration.The number of rainy days, precipitation and relative humidity had at least one imputation set whose distribution appeared different from the others (Figure 7).

Variable Importance
The assessments of variable importance can enable insights about the nature of the most important predictors and point to the direction for data acquisition and prediction improvements.Across all algorithms and all predictor datasets, plant age and yield class were the most important variables.May temperatures, especially the minimum, were the most important weather-related variable.Soil pH and concentrations of Zn and Cu were the most important soil-related predictors, followed by soil K and Mn.In addition, plot and cultivar were important predictors in CUB and RF models.The importance of other predictors varied among dataset x model combinations (Figure 8).

Variable Importance
The assessments of variable importance can enable insights about the nature of the most important predictors and point to the direction for data acquisition and prediction improvements.Across all algorithms and all predictor datasets, plant age and yield class were the most important variables.May temperatures, especially the minimum, were the most important weather-related variable.Soil pH and concentrations of Zn and Cu were the most important soil-related predictors, followed by soil K and Mn.In addition, plot and cultivar were important predictors in CUB and RF models.The importance of other predictors varied among dataset x model combinations (Figure 8).

Discussion
The effects of data imputation on model performance vary according to the algorithm used.The different sensitivities between PLSR, CUB and RF arise due to how each of these algorithms work.The PLSR acts as a supervised Principal Component Regression (PCR) producing new features (PCs) which are linear combinations of the original predictors that maximize the explained variance in the response [73].Therefore, changes in particular variables due to imputation may have their effects diluted or even ignored given this optimization constraint.
Building on Quinlan's original M5 model [74,75], its modern version called Cubist (CUB) is a rule-based algorithm that makes predictions by building regression models on its terminal nodes [81].This algorithm works by creating a split in the dataset and fitting a regression with every predictor available.In the next node, data is split again and a new regression is fit using the new subset of predictors.Moreover, CUB includes an ensemble method for predictions called committees and a nearest-neighbor adjustment that occurs after the model predictions [81].In committee ensembles, the outcome values are modified for each iteration in an effort to reduce over-and under-predictions, whereas the final result is obtained by averaging over the committee model estimates.After model

Discussion
The effects of data imputation on model performance vary according to the algorithm used.The different sensitivities between PLSR, CUB and RF arise due to how each of these algorithms work.The PLSR acts as a supervised Principal Component Regression (PCR) producing new features (PCs) which are linear combinations of the original predictors that maximize the explained variance in the response [73].Therefore, changes in particular variables due to imputation may have their effects diluted or even ignored given this optimization constraint.
Building on Quinlan's original M5 model [74,75], its modern version called Cubist (CUB) is a rule-based algorithm that makes predictions by building regression models on its terminal nodes [81].This algorithm works by creating a split in the dataset and fitting a regression with every predictor available.In the next node, data is split again and a new regression is fit using the new subset of predictors.Moreover, CUB includes an ensemble method for predictions called committees and a nearest-neighbor adjustment that occurs after the model predictions [81].In committee ensembles, the outcome values are modified for each iteration in an effort to reduce over-and under-predictions, whereas the final result is obtained by averaging over the committee model estimates.After model prediction, it is possible to conduct a post-model K-nearest neighbor adjustment where the closest observations in the training set (along with their original predictions) are used to correct prediction errors and trends in the dependent variable [74,81].In the current study, the number of committees and neighbors used was high and, while effective in improving performance metrics, it is likely that it caused overfitting in the training set, hampering predictive ability during the validation stage.
Random Forest is an ensemble of decision trees that do not build regression models on terminal nodes, but rather use average values and, as a consequence, may be more sensitive to variations caused by imputation, given the changes in imputed data distributions.The number of variables randomly sampled as candidates at each split (mtry) is a tuning parameter that impacts the likelihood of selecting important features that are actually related to the outcome variable for most of the splits that are made.Too large mtry values can reduce randomness in the model, diminishing the benefits of building multiple independent decision trees.Minimum node size, in turn, defines how deep a tree can grow, that is, the minimum number of observations a node requires to proceed with a split.Our results present clear signs of overfitting, suggesting that the combination of large mtry values and small node size (Table 3) is detrimental to modeling and should be reconsidered.Nevertheless, the pooled results of RF yielded the best predictive performance, indicating that some of the overfitting issues were addressed by the averaging effect of multiple trees (a.k.a. the law of large numbers) [76], although higher values should also be tested.
Our results corroborate with the well-established importance of meteorological variables in resulting grape yield [12,13,30].However, our models highlighted end-of-cycle temperatures (May) as being important predictors, as opposed to the most commonly reported temperatures at bud break, flowering, fruit-set and berry development stages [8,10,31,82,83].Post-harvest temperature might contribute to greater carbohydrate accumulation in plant tissues, which are translocated in the next cycle to the leaves and branches at the beginning of vegetative growth [84].
Relative humidity and precipitation were previously reported as relevant predictors for grape yield models [10,46] but did not show consistent predictive power in the present study.This may arise from the inherit difficulty in capturing rain records.Since rain and humidity can be very heterogeneous across landscape, even nearby stations may not represent the water regime faithfully, let alone in imputed datasets [58], an issue of known concern [58,85].Imputation of time series and its impact on modeling results is still an active field of research [47,51,53,56] and further studies need to be carried out to avoid erroneous conclusions and inaccurate predictions.
Despite the undeniable importance of plant nutrition to obtain proper grapevine yields, the predictive power of soil variables was considerably low.Grape root systems can grow over a meter deep, therefore, the dynamics of nutrient uptake are not fully captured by top layer soil samples [18].Moreover, adult grapevines reach nutritional stability and are able to mobilize reserves [86], hindering the relationship between soil nutrient concentrations and yield parameters.This is also reflected in the ubiquitous presence of plant age as the most important yield predictor.
Soil pH plays a key role in plant development due to its effects on nutrient availability and toxicity [87] and such importance is promptly captured by all three models.The importance of soil Cu and Zn as yield predictors can, however, be related to phytosanitary control, since they are components of fungicides largely used against crop diseases, ensuring higher yields but also leading to their accumulation in vineyard soils [88][89][90].Grape berries are great sinks of K [91] and clusters can account for over 60% of the total content in the above-ground organs [92], so it is sound that soil K figures as an important yield predictor, which is supported by recent modeling attempts [10].Grapevines are particularly susceptible to Mn deficiency and shortage of supply can lead to small and poor quality yields [93].In addition, recent studies suggest that Mn plays a role in enhancing plant resistance to water stress [94], so despite being a micronutrient, Mn soil concentrations seem to play key role in grape production.It is worth noting that Mn was the single most imputed variable in soil-related datasets (Figure 4) and, while having artificially reduced variance, it figured as a relevant predictor when soil variables were considered.
Categorical variables were also important to model estimates.The high importance of vineyard plot information suggests spatial dependency, consistent with recent studies on the effects of intra-vineyard variability and soil heterogeneity on vine performance, dry matter and nutrient partitioning [95], where, based on the Normalized Difference Vegetation Index NDVI, authors divided the vineyard into high, medium and low vigor zones.The availability of georeferenced data can improve yield response estimates by enabling the use of covariates in modeling [96] and imputations [97] but such information was unavailable in the current study.Instead, the cultivars were grouped into low, medium or high yield levels based on numerical clustering, historical yield levels (Figure 2) and expert knowledge.This feature engineering improved model performance by better partitioning the variability while retaining agronomic sense.Similarly, other groupings (or clusterings) could enhance the predictive ability of models.Research has shown the importance of phenological stage grouping for modeling [6,10,98].Moreover, advances in fruit tree fertilizing indicate different nutrition profiles for each cultivar [12,99,100] or site-specific conditions [101] and, while requiring further elucidation, have the potential to leverage soil data predictive power.
The present study showed that the combined use of databases and robust machine learning methods has the potential to estimate grapevine productivity, helping decision makers to be more assertive not only during commercialization stage but also in fertilizer recommendations and phytosanitary management of vineyards.The herein adopted strategies can be extended for other relevant applications in orchards worldwide, estimating the performance of fruit trees under specific conditions (culture, site, management, etc.), while keeping in mind multiple factors such as increased yield and economic viability, higher visual and bromatological quality of fruits, plant nutrition for proper development, rational use of fertilizers and adaptation within a climate change scenario.

Conclusions
In this study, Partial Least Square (PLSR), Cubist (CUB) and Random Forest (RF) algorithms were used to predict grape yields from weather and soil data separately and in conjunction.Overall, the results were best when using only weather predictors, followed by weather and soil data combined and, lastly, soil data alone.Weather data from ground meteorological stations contained observations with missing values and these were imputed by uni-or multivariate methods.The RF and CUB algorithms were the most affected by weather data imputations, while PLSR remained fairly insensitive.Parameter tuning optimized to reduce Root Mean Squared Error led to overfitting in CUB-and RF-based models and (hyper)parameter values should be reconsidered to improve performance.Nevertheless, RF achieved the best metrics, followed closely by CUB, while PLSR yielded the poorest yet most stable results.Plant age, May temperatures, soil pH and concentrations of Zn, Cu, K and Mn were identified as important predictors, even though 57% of soil Mn observations were missing and were imputed using a median method.Yield level groups had high importance in predictions, indicating that clustering can be a useful strategy, whereas the importance of vineyard plots suggests spatial dependencies that can be further explored by using georeferenced data.This exploratory work offers insights for future research on grape yield predictive modeling and highlights the importance of high resolution data and grouping strategies to obtain more assertive results, thus contributing to a more efficient grapevine production chain in southern Brazil and worldwide.

Figure 2 .
Figure 2. Number of observations (a), yield distribution (b) and Jenks Natural breaks classification frequency (c) by cultivar.Vertical red line represent cultivar average yield and gray bullets represent outliers.

Figure 2 .
Figure 2. Number of observations (a), yield distribution (b) and Jenks Natural breaks classification frequency (c) by cultivar.Vertical red line represent cultivar average yield and gray bullets represent outliers.

Figure 2 .
Figure 2. Number of observations (a), yield distribution (b) and Jenks Natural breaks classification frequency (c) by cultivar.Vertical red line represent cultivar average yield and gray bullets represent outliers.

Figure 3 .
Figure 3. Distribution of yield and soil-related variables used in grapevine prediction models.

Figure 3 .
Figure 3. Distribution of yield and soil-related variables used in grapevine prediction models.

Figure 4 .
Figure 4. Rate of missingness in Santana do Livramento weather station data, by variable and growing season (a).Number of yield observations by harvest (b).Rate of missing observations by variable in soil dataset (c).

Figure 4 .
Figure 4. Rate of missingness in Santana do Livramento weather station data, by variable and growing season (a).Number of yield observations by harvest (b).Rate of missing observations by variable in soil dataset (c).

Figure 5 .
Figure 5. Coefficient of determination (R 2 ), Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE) of all combined algorithms by dataset.

Figure 6 .
Figure 6.Impact of weather data imputation on coefficient of determination Squared Error (RMSE) and Mean Absolute Error (MAE) of grapevine yield predic Least Squares Regression (PLSR), Cubist (CUB) and Random Forest (RF) algorit sets of predictors.

Figure 6 .
Figure 6.Impact of weather data imputation on coefficient of determination (R 2 ), Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE) of grapevine yield predictions using Partial Least Squares Regression (PLSR), Cubist (CUB) and Random Forest (RF) algorithms with different sets of predictors.

Figure 7 .
Figure 7. Distributions of weather variables in observed data (obs, in white) and the five different complete datasets produced after multivariate chained equation (MICE) imputations (1-5, coloured).Gray bullets represent outliers.

Figure 7 .
Figure 7. Distributions of weather variables in observed data (obs, in white) and the five different complete datasets produced after multivariate chained equation (MICE) imputations (1-5, coloured).Gray bullets represent outliers.

Figure 8 .
Figure 8.The 15 most important variables by algorithm, predictor and imputation set.For each imputation set, variable importance is given on a 0 to 100 scale and x-axis values are the sum of the importance in m imputation sets.Soil dataset was only imputed using a median method, hence the results are the same for all imputation sets.

Figure 8 .
Figure 8.The 15 most important variables by algorithm, predictor and imputation set.For each imputation set, variable importance is given on a 0 to 100 scale and x-axis values are the sum of the importance in m imputation sets.Soil dataset was only imputed using a median method, hence the results are the same for all imputation sets.

Table 1 .
Grapevine, soil and weather data used in yield forecasting models, grouped by predictor set.

Table 1 .
Grapevine, soil and weather data used in yield forecasting models, grouped by predictor set.

Table 2 .
Average coefficient of determination (R 2 ), Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE) values for Partial Least Squares Regression (PLSR), Cubist (CUB) and Random Forest (RF) algorithms by predictor dataset.

Table 2 .
Average coefficient of determination (R 2 ), Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE) values for Partial Least Squares Regression (PLSR), Cubist (CUB) and Random Forest (RF) algorithms by predictor dataset.

Table 3 .
Grape yield prediction coefficient of determination (R 2 ), Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE) by algorithm, predictor dataset and imputation set.Best fit refers to optimized tuning parameters for Partial Least Squares Regressions (PLSR: number of principal components), Cubist (CUB: number of neighbors) and Random Forest (RF: random predictors subset/minimum node size).