Potato Yield Prediction Using Machine Learning Techniques and Sentinel 2 Data

: Traditional potato growth models evidence certain limitations, such as the cost of obtaining the input data required to run the models, the lack of spatial information in some instances, or the actual quality of input data. In order to address these issues, we develop a model to predict potato yield using satellite remote sensing. In an e ﬀ ort to o ﬀ er a good predictive model that improves the state of the art on potato precision agriculture, we use images from the twin Sentinel 2 satellites (European Space Agency—Copernicus Programme) over three growing seasons, applying di ﬀ erent machine learning models. First, we ﬁtted nine machine learning algorithms with various pre-processing scenarios using variables from July, August and September based on the red, red-edge and infra-red bands of the spectrum. Second, we selected the best performing models and evaluated them against independent test data. Finally, we repeated the previous two steps using only variables corresponding to July and August. Our results showed that the feature selection step proved vital during data pre-processing in order to reduce multicollinearity among predictors. The Regression Quantile Lasso model (11.67% Root Mean Square Error, RMSE; R 2 = 0.88 and 9.18% Mean Absolute Error, MAE) and Leap Backwards model (10.94% RMSE, R 2 = 0.89 and 8.95% MAE) performed better when predictors with a correlation coe ﬃ cient > 0.5 were removed from the dataset. In contrast, the Support Vector Machine Radial (svmRadial) performed better with no feature selection method (11.7% RMSE, R 2 = 0.93 and 8.64% MAE). In addition, we used a random forest model to predict potato yields in Castilla y Le ó n (Spain) 1–2 months prior to harvest, and obtained satisfactory results (11.16% RMSE, R 2 = 0.89 and 8.71% MAE). These results demonstrate the suitability of our models to predict potato yields in the region studied.


Introduction
The world population has been increasing exponentially since the mid-1920s, and stood at 7.7 billion people in October 2018 (www.worldometers.info), a figure which is projected to increase by a further three billion over the next five decades [1]. Global food demand will rise accordingly, and competition is expected for the fertile land and water resources required to produce more agricultural food products [2]. Rijsberman and Molden [3] point to the need to increase total food production by about 40%, while reducing the water resources used in agriculture by 10-20%. Yet these premises must face up to the prospect of certain anticipated climate change effects which may negatively impact crop production as well as other indispensable resources for agriculture, such as water availability [4]. Land, fossil energy and nutrients are other important resources that ensure food production, although their current consumption exceeds their global regeneration rate [5,6]. Precision Agriculture (PA) has emerged in an effort to meet major global challenges such as food security [7], the depletion of natural resources [8], and anthropogenic climate change [9]. The primary goal of PA is to optimize returns while reducing the potential impact of farming on the environment [10].

Study Area
In this study, a total of 33 different sites were studied over three years in the province of Segovia, Spain ( Figure 1). We selected these sites because they offer sub-field information on yield production and similar agricultural practices. In terms of management, agricultural practices were similar in all the areas under study; potato tubers of three different cultivars of medium-late maturity were sown following a ridge distribution with each plant spaced 0.33 m along the ridge. The distance between ridges was 0.75 m and the furrow depth was around 0.15 to 0.20 m. In the study area, potato crops are usually sown between mid-March and the end of April, and the harvest period generally spans from mid-September to late October. Sprinkler irrigation was used to supply water to the crop. According to the Koppen classification, the study area has a Mediterranean "cool dry-summer" climate (classified as Csb) [48], characterized by rainy winters and dry summers. Annual average precipitation across years is 430 mm, with July and August being the driest months during the year. Mean temperature is 11.9 • C, with January being the coldest month, and July and August the hottest months (https://es.climate-data.org).

Materials
Crop yield information corresponds to the total commercial weight obtained in each studied field-area during 2016, 2017 and 2018, and was provided by the potato producer (Table S1). Field geo-location was obtained using the "Sistema de Información Geográfica de parcelas agrícolas" (SIGPAC) from the regional government of Castilla y Leon [49], which delimits agricultural fields in digital format. These were then subsetted into smaller areas in accordance with the crop yield information provided, using ArcMap 10.4 software [50]. We were thus able to derive the following ratio: fresh matter (FM) yield in ton/ha. Potatoes presenting green parts, with a size below 28 mm (in diameter), deformed structures or physical damage were not harvested and were not taken into account as crop yield. Field evaluations reported that approximately 3-5% of the total production presented some kind of defect, such that those potatoes were not harvested. Henceforth, the term crop yield refers only to "commercial yield" (good quality potatoes that have passed the visual evaluation filter during harvest). For a total of 33 samples across three years, the minimum yield sample was 17.547 ton/ha, the maximum was 85.678 ton/ha, and the mean potato yield was 57.950 ton/ha, with a standard deviation of 16,747.
We downloaded 44 Sentinel-2 L1C images in an effort to cover the periods between tuberization and senescence (Table S2). The Sentinel satellites are twin-polar orbiting satellites that provide a revisit time of five days. They carry a Multi-Spectral Instrument which has 13 spectral bands: four bands at 10 m, six bands at 20 m, and three bands at 60 m spatial resolution [45]. Field observations were carried out to determine the beginning of tuberization, which occurred typically during July in the study area, while senescence occurs in September under normal circumstances. It was therefore deemed appropriate to select the study time between early July and late September for 2016, 2017 and 2018. The downloaded 44 Sentinel-2 images within this time range were atmospherically corrected using the SEN2COR algorithm from Top of Atmosphere (TOA) to Bottom of Atmosphere (BOA) reflectance [51]. Image bands were resampled using the nearest neighbour technique with the "Resample" function of the Raster package [52] in R software [53] in order to have 10 m pixel resolution in each band. Cloud cover was then automatically removed using the cloud mask layer provided for each Sentinel 2 image for all the available bands. In addition to the information of each image band, we computed seven vegetation indices to assess crop status. The Anthocyanin Reflectance Index "ARI2" [54] is sensitive to anthocyanin in plant foliage, the Carotenoid Reflectance Index "CRI2" [55] evaluates the carotenoid concentration relative to the chlorophyll content, the Inverted Red-Edge Chlorophyll Index "IRECI2" [56] estimates the canopy chlorophyll content, the Leaf Chlorophyll Content "LCC" [57] is a non-destructive assessment of chlorophyll content expressed at unit leaf area, the Normalized Difference Vegetation Index "NDVI" [58] is a widely used vegetation index that quantifies green vegetation and phenology, the Plant Senescence Reflectance Index "PSRI" [59] quantifies plant senescence, and the Weighted Difference Vegetation Index "WDVI" [60] is a proxy of the Leaf Area Index "LAI" of green vegetation. The equations of these indices can be seen in Table S3.

Data Preparation
We used the R software [53] and ArcMap 10.4 [50] to pre-process the data and to build the models. We extracted the mean value of each Sentinel 2 band (vegetation indices included) for each field per year. Given the observed crop evolution on the field, we considered merging the values of the bands and vegetation indices into three larger groups, thereby giving the average values for July (beginning of tuberization), August (early senescence under non-normal conditions such as pests) and September (senescence). Sentinel 2 bands and indices for July, August and September were Band 2, Band 3, Band 4, Band 5, Band 6, Band 7, Band 8, Band 8a, Band 9, Band 11, Band 12, PSRI, CRI2, LCC, IRECI2, NDVI, WDVI and ARI2 (54 in total). After computing the average per phase, no missing values were present in the dataset and the dependent variable was crop yield.

Model Building
In machine learning, there is no single algorithm or solution that fits all data. As a result, it is quite common to work iteratively in order to find the best algorithms, hyper-parameters and solutions for machine learning problems [61]. In this context, we fitted the following machine learning algorithms for regression problems (Generalised Linear Model "glm" [62], Linear Regression with Backwards Selection "LeapBack" [63], Quantile Regression with LASSO penalty "rqlasso" [64], Support Vector Machine Linear "svmLinear" [65], Support Vector Machine Radial "svmRadial" [66], Random Forest "rf" [67], Multivariate adaptive regression splines "MARS" [68], k-Nearest Neighbours "kknn" [69] and Model Averaged Neural Network "avNNet" [70]) using the CARET package [71], and ran them with four different pre-processing options (1: Scale, Center and Principal Component Analysis; 2: Scale and Center; 3: Center; 4: None). Additionally, we used the algorithms Cubist "cubist" and ensemble Cubist "cubist_committee" from the Cubist package [72] without any pre-processing. In all, we built 38 models, named as follows: Algorithm name_number (corresponding to the pre-process method, as described before). In order to reduce collinearity, we removed predictor variables which presented an absolute correlation coefficient > 0.75 [73,74] using the "FindCorrelation" function of the caret package. The k-fold cross-validation resampling technique was used to evaluate each model (k = 10) due to the limited samples in our dataset [75]. RMSE and MAE were used to measure model accuracies and both metrics were converted to percent RMSE (% RMSE) and percent MAE (% MAE) by dividing the RMSE or MAE by the mean of the observed yield (57.950 ton/ha) across years [76]. We only selected models with RMSE < 9 ton.FM/ha (<15% RMSE) and R 2 > 0.80 [17,41,77]. The most adequate algorithms and pre-processing steps were thus identified.
Second, we ran the best performing models using different feature selection scenarios to address collinearity among predictors. These were fitted using a k-fold cross-validation technique over 80 % of the dataset. In order to provide an unbiased evaluation of the final models, the holdout dataset was set at 20%. We thus ensured that data samples used over the training-testing phase were independent from the holdout set of the evaluation phase. This process was repeated ten times using arbitrarily chosen seeds (ensuring repeatable results) to average the evaluation results. We explored the influence of collinearity among variable predictors in terms of correlation coefficients: 0.5 (Scenario A), 0.75 (Scenario B), 0.90 (Scenario C) and without any prior feature selection process (Scenario D). Finally, we addressed the statistical significance of each variable in the final models.

Crop Yield Prediction One Month Prior to Harvest
We simulated yield prediction one month prior to harvest time (end of August) removing September predictors from the original dataset. The same step-wise method as in Section 2.3.2 was followed, and only the best performing model was selected. Even though the optimal timing for crop yield forecasting would be two months prior to harvest [39], the same authors acknowledge that only one month before provides more realistic results given that uncertainty tends to decline towards the end of the season.

Results
We compared 38 models with different pre-process options to evaluate the performance of ten machine learning algorithms. The k-fold cross-validation technique was used to evaluate the algorithms' predictive performance. In general, most of the proposed models obtained RMSE values <12 ton/ha, which is less than 20% error compared to mean yield. MAE scores presented values between 6.5 and 9.5 ton/ha, representing error values <16 % of the yield average. The rqlasso_2, LeapBack_2-3 and svmRadial_3 algorithms proved to be the best approaches for modelling our potato yields (Table 1). Hyper-parameter tuning was performed by means of cross-validation. The minimum RMSE value was used to select the optimal hyper-parameters so that each model was automatically optimized to provide maximum model accuracy.
We selected the models which met the proposed criteria (RMSE < 9 ton/ha and R 2 > 0.80): rqlasso_2, LeapBack_3, LeapBack_2 and svmRadial_3. These were run again, this time using 80 % of the original dataset to train and test the models by k-fold cross-validation (k = 10), and the remaining 20% to independently evaluate them. Table 2 summarizes these four model performances for different feature selection scenarios. Given the ability of the rqlasso and LeapBack algorithms to carry out feature selection by their own methods, we also explored a no-feature-selection scenario in which the original dataset was included (scenario D).   Table 2 shows the performance obtained by each model following a different feature selection process in terms of the correlation coefficient. LeapBack_2 and LeapBack_3 presented similar values, indicating that the pre-process action of "Scale and Center" or just "Center" over predictors made no difference for this algorithm (henceforth, we will only refer to LeapBack_2, although the same comments could be applied to LeapBack_3). The average RMSE across models indicated that the best feature selection was scenario B, which involved using ten variables out of 56 ( Figure 2 and Table  S4). Table S5 shows variable importance for the rqlasso_2 model (the caret package does not support variable importance for the svm or Leapbackwards algorithms). The second and third best scenarios were A with six predictors and D with 56 predictors, respectively. The worst performing scenario was C with 18 predictors. Under scenario B, the best model performance in terms of RMSE was obtained by LeapBack_2 model (nvmax = 3) with values of RMSE = 6.866 ton/ha (11.84 % RMSE), R 2 = 0.89 and MAE = 5.512 ton/ha (9.51% MAE), closely followed by the rqlasso_2 model (lambda = 0.1) with RMSE = 7.093 ton/ha (12.24% RMSE), R 2 = 0.90 and MAE = 5.653 ton/ha (9.75% MAE). The most relevant variables in the LeapBack_2 model were band 8 (July), LCC index (July) and the wdvi index (August). As regards the rqlasso_2 model, these were band 8 (July), band 6 (September) and the LCC index in September, July and August (decreasing order of importance). Even though svmRadial_3 presented worse results than the aforementioned models, its performance greatly improved as the predictor number increased. In scenario D, the best results across scenarios and models were obtained by svmRadial_3 with RMSE = 6.781 ton/ha (11.7% RMSE), R 2 = 0.93 and MAE = 5.012 ton/ha (8.64% MAE). The optimal tuning parameters were sigma = 0.01085 and C = 1. To jointly represent predicted yield versus actual yield (Figure 3), we took the overall of individual predictions per model in their best scenario and in terms of R 2 ( Table 2).
We used the svmRadial_3 model to generate the predicted yield maps of the studied potato fields across 2016, 2017 and 2018 on a pixel basis (Figure 4). According to Table 2, this model obtained the highest performance in terms of R 2 (0.93). These maps showed a variation in yields across fields as represented by the eight classes. In general, the predicted yield per pixel ranged from 43 to 80 Ton/ha, with mean values around 59 Ton/ha. It can be observed that the low-yield class (0 Ton/ha) was mainly distributed across some field boundaries given the influence of other crop types or roads in the pixel reflectance.
Finally, we made a pre-harvest prediction of potato yield using only feature variables corresponding to July and August. The best trade-off in terms of RMSE and R 2 was obtained by the random forest algorithm with different pre-processing: rf_3 (RMSE = 8.751 ton/ha, R 2 = 0.84 and MAE = 7.399 ton/ha), rf_4 (RMSE = 8.765 ton/ha, R 2 = 0.83 and MAE = 7.046 ton/ha) and rf_2 (RMSE = 8.916 ton/ha, R 2 = 0.84 and MAE = 6.717 ton/ha). The latter models were fitted with 80 % of the original dataset using k-fold cross-validation (k = 10), and then evaluated against the remaining 20 % holdout dataset. In this case, the pre-processing options applied displayed minimal influence across these three model results, such that we selected rf_3 due to its having shown the lowest RMSE score during the selection process. In addition, the number of variables selected in each scenario proved to be critical ( Figure 5). In general, predicted results were less close to actual yields compared to those involving September data (Table 2). Nevertheless, rf_3 offered promising results in scenario B with RMSE = 6.470 ton/ha (11.16% RMSE), MAE = 5.052 ton/ha (8.71% MAE) and R 2 = 0.89. Scenarios A and C provided quite similar scores, whereas scenario D was the worst performing one. Therefore, the best choice to predict potato yield prior to harvest was the random forest with "center" pre-processing (rf_3) and the "mtry" hyper-parameter = 5.  In order to graphically display the fitness of our methodology using only variables from July and August, we represented the actual and modelled yields. Figure 6 reveals the overall amount of individual predictions modelled by rf_3 after each iteration under the best scenario ( Figure 5, scenario B). Table S5 shows the variable importance of each variable involved in the model.

Discussion
Crop yield prediction is of major importance in global and local markets as it enables early decision-making, improves agricultural commodity practices and allows market prices to be modelled [78,79]. In this work, we first evaluated ten different machine learning algorithms with different pre-processing options to compare model performances in potato crops. These results highlight the advantages of pre-processing methods such as "scale" and "center" when the ranges of values or magnitudes differ greatly across predictors [80]. At this initial stage, we used the cross-validation resampling method due to the limited number of observations [81]. In general, linear regression algorithms (rqlasso, LeapBack) obtained better results than non-linear ones (svmRadial or random forest). This may explain the use of linear models for crop yield prediction in previous works [41], although crop yield differs spatially and temporally with non-linear behaviour [82]. Nevertheless, non-linear models such as svmRadial also obtained satisfactory results in terms of R 2 (0.85) and RMSE (8.949 ton/ha). We agree with other authors [83,84] that crop yield prediction requires testing both linear and non-linear approaches, since model predictive ability depends upon sample number and data quality.
Second, we selected the best performing models (rqlasso_2, LeapBack_2-3 and svmRadial_3) and compared their scores against various scenarios. It was found that removing all the predictors whose correlation coefficient was > 0.50 improved the model performance for rqlasso and LeapBack in terms of RMSE. In contrast, svmRadial in scenarios C and D outperformed its results in scenarios A and B, demonstrating its improvement when more predictors are included in this model. All of these models were evaluated against a hold-out dataset, which was not included in the train-test phase in order to avoid overfitting in the model [85,86]. Since the rqlasso and LeapBack algorithms can inherently perform feature selection methods identifying the best model that contains a given number of predictors [87], we also evaluated the models without any kind of feature selection ( Table 2, Scenario D). These results showed very good performance for svmRadial_3 in terms of R 2 = 0.93, MAE = 5.012 ton/ha (8.64% MAE) and RMSE = 6.781 ton/ha (11.70% RMSE). The rqlasso_2 and LeapBack_2 models performed more poorly as the number of predictors increased. Based on the results presented, we suggest the use of correlation coefficients to remove collinearity among variables when using rqlasso and LeapBack. Although these models have inherent methods to automatically reduce variables, they performed worse when the number of predictors was larger and no prior feature selection was applied. These algorithms tend to select a useful set of predictors depending on the penalty term they use, but not necessarily the most important variables to explain our data [88]. In contrast, svmRadial_3 performed better when the number of predictors was larger (scenario D), which concurs with the results obtained by Joachims [89]. Although in this case, this model may not be suitable for potato yield estimation/prediction since the multi-collinearity is not addressed properly. According to Joachims [89], support vector machine models use overfitting protection that can be tuned through the regularisation parameter C. This is a penalty parameter of the error term that determines the trade-off between smooth decision boundary and classifying the training points correctly. Although some works describe the benefits of applying feature selection procedures to improve predictive performance [90,91], we did not observe it for svmRadial_3. This may be explained by the sample size of the dataset, as suggested by Jain and Zongker [92], or by the inner regularization parameters that support-vector machine algorithms have [89]. Typically, prediction errors around 10-15 % RMSE are found in crop yield prediction literature [76,93,94]. Focusing more specifically on potato crop modelling, Hartz and Moore [95] tested a multiple linear model based on temperature and insolation in the laboratory with an accuracy of R 2 = 0.93. Bala and Islam [40] used vegetation indices derived from TERRA satellite imagery to predict potato yield, and obtained the following correlation coefficients (R 2 ): 0.84 (NDVI), 0.72 (LAI) and 0.80 (fPAR-the fraction of photosynthetically active radiation). More recent approaches, such as Al-Gaadi et al. [41], predicted potato yield using Sentinel 2 and Landsat 8 satellites, and obtained R 2 values between 0.39 and 0.65. Some of our fitted models outperform previous approaches based on satellite imagery for predicting potato yield, such as svmRadial_3 (R 2 = 0.93 and RMSE = 6.781 ton/ha).
Pre-harvest results emphasized the need to build several machine learning models with different pre-processing and feature selection methods to optimize model results. The rf_3 performed very well under scenario B: R 2 = 0.89, RMSE = 6.470 ton/ha (11.16 % RMSE) and MAE = 5.052 ton/ha (8.71% MAE). Nevertheless, some extreme events or abnormal crop conditions in September such as diseases, pest outbreaks or water stress, can influence modelled yields causing yield overestimation, as already stated by other authors [76]. To the best of our knowledge, no one has yet attempted to make pre-harvest predictions in potato crops using satellite data. In addition, there is a demand for crop modelling tools which are able to offer a better knowledge of crop productivity. These techniques can help to alleviate extreme weather-related events that trigger food insecurity since such events may reduce food supply and the incomes of households working in the agricultural sector, particularly in food insecure regions [96]. Table S5 shows the variable importance for the rqlasso_2 (July, August and September variables) and rf_3 models (July and August variables) under scenario B. The most important variables were Band8_July, Band6_Sept, LCC_Sept, LCC_July and LCC_Aug for rqlasso_2; while Band5_Aug and LCC_Aug were the most informative for rf_3. In both models, the LCC index proved to be key given its capacity to retrieve an approximation of the chlorophyll content at leaf level. As a result, it provides information about photosynthetic capacity and plant functioning [97]. Band5_Aug and Band6_Sept also had high scores given these spectral bands' sensitivity to chlorophyll in the red-edge, which shows them to be convenient proxies for operationally estimating biophysical parameters from Sentinel-2 [98]. Band8_July achieved the highest score of influence in rqlasso_2. Near-infrared leaf reflectance is not affected by changes in chlorophyll content, with very low leaf absorption and transmittance reaching maximum values [99]. Nevertheless, near-infrared reflectance values can detect significant differences between healthy and diseased potato plants [100]. We did not find any clear evidence to attest which month was more informative, either in the rqlasso_2 or in the rf_3.
This study confirms the feasibility of our machine learning models based on Sentinel 2 imagery and how it outperforms previous efforts in potato yield prediction. We have developed a step-wise methodology to find and build the best performing models for our study region over a three-year period. The proposed Sentinel 2 bands and indices can be used effectively to model potato crops for predicting yields in the study area, and the method can be extended to other sites in Castilla y Leon that have a similar phenological cycle for the potato. The use of this methodology in other parts of the world would require a deep understanding of tuberization and senescence dates in order to adequately build and fit the proposed models.

Conclusions
This study has shown the possibility of predicting potato yields using Sentinel 2 imagery and machine learning techniques for three different growing years. The use of Sentinel 2 imagery provides high spatial and temporal resolutions when compared to previous approaches, and our fitted machine learning models have proved their usefulness for modelling potato yield. In general, pre-processing techniques, such as "centering" and "scaling", improved the model results, while the impact of feature selection methods differed depending on the algorithms. Regression quantile lasso (rqLasso: 11.67% RMSE, R 2 = 0.88 and 9.18% MAE) and Leap Backwards (LeapBack: 10.94% RMSE, R 2 = 0.89 and 8.95% MAE) performed better when highly correlated predictors (correlation coefficient > 0.5) were removed from the dataset. In contrast, the support vector machine radial obtained higher scores when feature selection methods were not applied at all (svmRadial: R 2 = 0.93, 8.64% MAE and 11.7% RMSE). In addition, we developed a model to predict potato yield prior to harvest using the rf_3 model (R 2 = 0.84, 13.55% RMSE and 10.31% MAE). The latter method evidences greater uncertainty, given the possible occurrence of certain extreme events or abnormal crop conditions in September such as diseases, pest outbreaks or water stress that cannot be included in the model and would overestimate crop yields.
The study results improve the current state of the art for potato yield modelling using satellite imagery at very high spatial and temporal resolution. More robust results may be obtained by using a larger number of samples in the original dataset. In addition, cloud cover remains an obstacle in passive remote sensing, and entails the loss of information for some areas and days. Future attempts should aim to increase the number of samples, widen the geographical area and extend the number of years studied so that models have better generalization capacity.
Supplementary Materials: The following are available online at http://www.mdpi.com/2072-4292/11/15/1745/s1. Table S1: list of survey fields with total yield (ton), area (ha), yield per area (ton.FM/ha), harvesting and sowing dates; Table S2: list of satellite images downloaded for July, August, and September from 2016 to 2018; Table S3: formulations used to obtain vegetation indices: Anthocyanin Reflectance Index, Carotenoid Reflectance Index, Inverted Red-Edge Chlorophyll Index, Leaf Chlorophyll Content, Normalized Difference Vegetation Index, Plant Senescence Reflectance Index and Weighted Difference Vegetation Index; Table S4: selected predictor variables included in each scenario; Table S5: variable importance for the rqlasso_2 (July, August and September variables) and rf_3 models (July and August variables) under scenario B. All measures of importance are scaled to have a maximum value of 100, such that the highest scores represent the most important variables in the model. Funding: This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.