Estimation of Potato Yield Using Satellite Data at a Municipal Level: A Machine Learning Approach

: Crop growth modeling and yield forecasting are essential to improve food security policies worldwide. To estimate potato ( Solanum tubersum L. ) yield over Mexico at a municipal level, we used meteorological data provided by the ERA5 (ECMWF Re-Analysis) dataset developed by the Copernicus Climate Change Service, satellite imagery from the TERRA platform, and ﬁeld information. Five di ﬀ erent machine learning algorithms were used to build the models: random forest (rf), support vector machine linear (svmL), support vector machine polynomial (svmP), support vector machine radial (svmR), and general linear model (glm). The optimized models were tested using independent data (2017 and 2018) not used in the training and optimization phase (2004–2016). In terms of percent root mean squared error (%RMSE), the best results were obtained by the rf algorithm in the winter cycle using variables from the ﬁrst three months of the cycle (R 2 = 0.757 and %RMSE = 18.9). For the summer cycle, the best performing model was the svmP which used the ﬁrst ﬁve months of the cycle as variables (R 2 = 0.858 and %RMSE = 14.9). Our results indicated that adding predictor variables of the last two months before the harvest did not signiﬁcantly improved model performances. These results demonstrate that our models can predict potato yield by analyzing the yield of the previous year, the general conditions of NDVI, meteorology, and information related to the irrigation system at a municipal level.


Introduction
Potato (Solanum tuberosum L.) is a crop that originated from the Andean region [1]. It is the most important non-grain crop worldwide [2] and the fourth-largest food crop after rice, wheat, and maize [3,4]. Potato produces more nutritious food, quicker, on less land and in harsher climates than any other major crop [2]. It also plays a key role in food security strategies being grown and consumed in underprivileged areas [5]. In this sense, millions of farmers rely on potatoes as a source of food and economical support. Potato global trends indicate that developed countries are gradually reducing their potato production, whereas developing countries are increasing it [5,6].
Mexican potato production increased with respect to global production until approximately the year 2000, then its production has remained stable to date [7]. In general, 8700 producers are dedicated to the cultivation of potatoes in Mexico. Approximately 77800 families directly depend on this crop, generating 17500 direct jobs and more than 50 thousand indirect jobs [8]. The latter report noted that potatoes are grown over 23 states of the Mexican republic. By states, Sonora is the main producer with 24.5% of the total, followed by Sinaloa (17.0%), Puebla (9.8%), and Veracruz (8.3%). According to its end use, 29.0% is produced for industrial purposes, 56.0% for fresh consumption, and 15.0% for seed production. Concerning the seasonality of the crop, two main growing periods can be identified: autumn-winter and spring-summer. Autumn-winter periods generally present

Study Area
The study area is located in Mexico. This country has over 124 million people and covers a total area of 1,964,375 km 2 , with its territory divided into 32 estates and 2458 municipalities. Given its vast extension, Mexico is a country with a great climatic diversity with two distinct areas separated by the Tropic of Cancer: the tropical zone and the temperate zone ( Figure 1). According to Köppen-Geiger classification, the most frequent climate is hot semi-arid climate (BSh) and hot desert climate (BWh) [26]. These climatic conditions allow farmers to exploit the two cycles of potato, one during spring-summer and the other during autumn-winter.

In Situ Data
The land cover map v. 6.0 (1:25:000) from Instituto Nacional de Estadística y Geografía de México (INEGI) was used to identify those polygons categorized as crops to extract a representative vegetation vigor at a municipal level [27]. Its main advantage is the capacity to differentiate between irrigated and rain-fed crops. In Mexico, there is an approximate area of 34 Mha dedicated to agricultural use which represents 17% of the total area (23.7Mha rain-fed, 10.4 Mha irrigated). The crop yield data was downloaded from the Servicio de Información Agroalimentaria y Pesquera (SIAP) of the Mexican government (http://infosiap.siap.gob.mx/gobmx/datosAbiertos.php). This data comprises information for each year and municipality for different crops such as sown and harvested areas, production volume, total value, irrigated or rain-fed system, seasonality (spring-summer or autumn-winter), or crop yield since 1980. The boundaries of Mexican municipalities were downloaded from the webpage https://www.diva-gis.org/Data, site of the freely available DIVA-GIS software [28].

Remote Sensing Data
We used the MOD13Q1 product to represent the vegetation vigor of each Mexican municipality. This product compiles the NDVI (obtained by equation 1) of MODIS (Moderate Resolution Imaging Spectroradiometer) [29].
The MODIS sensor, on board the Terra platform from the EOS constellation (Earth Observing System), has 36 channels ranging in wavelength from 0.4 to 14.4 µm. The spatial resolution varies from 250 m to 1 km. More information about technical aspects of MODIS can be found in Ranson et al. [30]. The MOD13Q1 is a 16-day global composite of NDVI with a spatial resolution of 250 m, downloaded from the EarthData website for the following tiles: h09v06, h08v06, h08v07, h07v06, h09v07, and h08v05 [31]. We calculated a monthly composite image with the maximum value of NDVI. To do that, a mosaic image was created every 16 days by joining the corresponding six tiles and then, the maximum value per month was selected in the final monthly composite for each pixel mitigating, this way, the cloud cover related problems. Although the enhanced vegetation index

In Situ Data
The land cover map v.6.0 (1:25:000) from Instituto Nacional de Estadística y Geografía de México (INEGI) was used to identify those polygons categorized as crops to extract a representative vegetation vigor at a municipal level [27]. Its main advantage is the capacity to differentiate between irrigated and rain-fed crops. In Mexico, there is an approximate area of 34 Mha dedicated to agricultural use which represents 17% of the total area (23.7Mha rain-fed, 10.4 Mha irrigated). The crop yield data was downloaded from the Servicio de Información Agroalimentaria y Pesquera (SIAP) of the Mexican government (http://infosiap.siap.gob.mx/gobmx/datosAbiertos.php). This data comprises information for each year and municipality for different crops such as sown and harvested areas, production volume, total value, irrigated or rain-fed system, seasonality (spring-summer or autumn-winter), or crop yield since 1980. The boundaries of Mexican municipalities were downloaded from the webpage https://www.diva-gis.org/Data, site of the freely available DIVA-GIS software [28].

Remote Sensing Data
We used the MOD13Q1 product to represent the vegetation vigor of each Mexican municipality. This product compiles the NDVI (obtained by equation 1) of MODIS (Moderate Resolution Imaging Spectroradiometer) [29].
The MODIS sensor, on board the Terra platform from the EOS constellation (Earth Observing System), has 36 channels ranging in wavelength from 0.4 to 14.4 µm. The spatial resolution varies from 250 m to 1 km. More information about technical aspects of MODIS can be found in Ranson et al. [30]. The MOD13Q1 is a 16-day global composite of NDVI with a spatial resolution of 250 m, downloaded from the EarthData website for the following tiles: h09v06, h08v06, h08v07, h07v06, h09v07, and h08v05 [31]. We calculated a monthly composite image with the maximum value of NDVI. To do that, a mosaic image was created every 16 days by joining the corresponding six tiles and then, the maximum value per month was selected in the final monthly composite for each pixel mitigating, this way, the cloud cover related problems. Although the enhanced vegetation index (EVI) is also computed inside MOD13Q1 product, it has been dismissed because of its high correlation with NDVI [32,33].

Meteorological Data
The meteorological data was retrieved from ERA5 (ECMWF Re-Analysis) developed by the Copernicus Climate Change Service (C3S). ERA5 is the fifth generation ECMWF (European Centre for Medium-Range Weather Forecasts) reanalysis for the global climate and weather of the last four to seven decades under a free of charge license, worldwide, non-exclusive, royalty free, and perpetual. Data since 1979 is available. The reanalysis combines model data with world-wide observations into a globally complete and consistent dataset using model physics, core dynamics and data assimilation [34]. The input variables of the model (Table 1) were downloaded in a monthly mean basis, with a horizontal resolution of 0.25 degrees in a global coverage, from the following webpage: https://cds.climate. copernicus.eu/cdsapp#!/dataset/reanalysis-era5-single-levels-monthly-means?tab=overview.  Figure 2 shows, as an example, the temporal evolution of the NDVI and meteorological variables in Comondú municipality, Baja California, from October 2017 to September 2018. This state corresponds to irrigated farmer system for both cycles. The yield production for this year was 38 ton/ha for the winter cycle and 31 ton/ha for the summer cycle. Aiming to represent all the variables in one figure, skt and t2m have been divided by 1000, stl1 was measured in MPa, and tp, e, pev, and tp were measured in cm.

Data Preparation
The yield information was extracted at municipal level for the time period 2003-2018. The variable predictors were the farmer system (irrigated or rain-fed crop information derived from SIAP database), the yield production during the year before (ybYield) which represents the simplest prediction, NDVI and meteorological variables (Table 1). To generate these predictors, we used the RSI's software ENVI/IDL [35,36]. We extracted the mean NDVI value per month and municipality from those pixels classified as crop over six months (between sowing and harvest). It is worth noting that MOD13Q1 is a 16-day global composite, therefore while some months had only one value of NDVI per pixel, others had two values of NDVI per pixel. Thus, when a pixel had two values within a month, the maximum value was selected. The months from October to March were included in the winter cycle, and from April to September in the summer cycle. Given the low resolution of the meteorological data, these predictors were retrieved at municipal level, on a monthly basis, between the sown and harvest time for the six months before the harvesting. Hence, the mean value of each variable included in Table 1 was retrieved. Therefore, the dataset compiled one yield record per municipality and year, with its corresponding monthly mean NDVI and meteorological variables of the six months prior harvesting. Data samples with missing values were removed from the dataset. In addition, field data with yields lower than 10 ton/ha were considered as outliers and removed. The final dataset comprised 838 samples.

Data Preparation
The yield information was extracted at municipal level for the time period 2003-2018. The variable predictors were the farmer system (irrigated or rain-fed crop information derived from SIAP database), the yield production during the year before (ybYield) which represents the simplest prediction, NDVI and meteorological variables (Table 1). To generate these predictors, we used the RSI's software ENVI/IDL [35,36]. We extracted the mean NDVI value per month and municipality from those pixels classified as crop over six months (between sowing and harvest). It is worth noting that MOD13Q1 is a 16-day global composite, therefore while some months had only one value of NDVI per pixel, others had two values of NDVI per pixel. Thus, when a pixel had two values within a month, the maximum value was selected. The months from October to March were included in the winter cycle, and from April to September in the summer cycle. Given the low resolution of the meteorological data, these predictors were retrieved at municipal level, on a monthly basis, between the sown and harvest time for the six months before the harvesting. Hence, the mean value of each variable included in Table 1 was retrieved. Therefore, the dataset compiled one yield record per municipality and year, with its corresponding monthly mean NDVI and meteorological variables of the six months prior harvesting. Data samples with missing values were removed from the dataset. In addition, field data with yields lower than 10 ton/ha were considered as outliers and removed. The final dataset comprised 838 samples.

Machine Learning Modeling
During the modeling process, we used R software [37] and the caret package [38]. We set up six

Machine Learning Modeling
During the modeling process, we used R software [37] and the caret package [38]. We set up six different scenarios based on which months were used (only predictors from the sixth month before harvest, then sixth and fifth month before harvest, and so on until all the available predictors from the six months before harvest were included). In addition, each model was built with five different ML algorithms: random forest (rf) [39], support vector machine linear (svmL) [40], support vector machine polynomial (svmP) [40], support vector machine radial (svmR) [41], and general linear model (glm) [42].
The dataset was split into a training and validation set (85%, 712 samples), comprising data from 2004 to 2016; and an independent test set (15%, 126 samples) with data from 2017 and 2018. Thus, the percentage of data assigned to each subset was conditioned by the amount of samples available from 2017 and 2018. This method intends to simulate the performance of the models in a more 'real-world' scenario, finding a compromise between number of samples included in the test set and selecting only the most recent data. We used the bootstrapping resampling method (25 repetitions) to train the model and optimize the hyperparameters.
In order to assess the variability of this methodology, we replicated the previous approach over 50 iterations, but this time randomly selecting 90% of the data at each iteration. The best performing models were then evaluated with the latest data records (2017 and 2018) to assess their capacity in terms of operability (simulating a real-case scenario). The final and operative model would therefore be built with the entire dataset (2004-2018).

Results
The model performance of each ML algorithm per scenario and cycle using the hold-out data (2017-2018) using the bootstrapping method with 25 iterations is shown in Table 2. The accuracy of the models were presented in terms of coefficient of determination (R 2 ) and percent root mean squared error (%RMSE).  Table 3 shows the model performance of each ML algorithm per scenario and cycle using the bootstrapping method (25 repetitions) across 50 iterations, expressed as the mean and standard deviation of R 2 and %RMSE scores. This method allows us to compare the variance of the model resulting from each iteration (50 in total), and thereby the overall influence of the random sampling in the training phase.
In terms of %RMSE, the best ML algorithm during the winter cycle was the rf using the predictors corresponding to the 3-month scenario. In the summer cycle, the best results were obtained by svmP algorithm when using the predictors corresponding to the 5-month scenario (Table 3). In the winter cycle, the models after the 3-month scenario did not considerably improve R 2 and %RMSE scores, whereas model performances increased until the 5-month scenario in the summer cycle.
The crop cycle variable affected differently depending on the ML model used. In the summer cycle, the best svmP performance for the 5-month scenario (R 2 = 0.858) was substantially better than the rf performance (R 2 = 0.757) in winter for the 3-month scenario. Similar results were observed for other ML models and their best scenario: svmR (R 2 = 0.731 -0.837), svmL (R 2 = 0.692 -0.860), and glm= (R 2 = 0.612-0.834). Given the difference performance observed between summer and winter cycles, we did not build a unique model that would comprise data from both cycles. Some differences were also observed depending on whether the crops were irrigated or non-irrigated. In general, the models were better in the irrigated crops (%RMSE summer =8.9, %RMSE winter = 17.0) than in non-irrigated ones (%RMSE summer =22.5, %RMSE winter = 22.8).
As stated in the methodology section, we trained and optimized the models using the bootstrap method (25 repetitions) with 50 iterations. The best number of variables tried at each split (mtry) for the rf model (winter cycle, 3-month scenario) was 20. This hyper-parameter gave the best results in terms of RMSE for 47 out of 50 iterations. For the svmP model (summer cycle, 5-month scenario) the best hyper-parameters were degree = 3, scale = 0.01 and offset = 1 (49 out of 50 iterations). The support vector number was 306. We evaluated the statistical significance of each variable in the best performing models and cycle. The variable importance was calculated using the varImp function included in the caret package [38] and scaled to 100 for the sake of visibility. Figure 3 and 4 show the 20 most important variables for the rf model using the 3-month scenario (winter cycle), and svmP model using the 5-month scenario (summer cycle).  The variable importance shows in Figures 3 and 4 is only representative and depend on the random selection of the data. In order to perform a deeper analysis of the variable importance, the results of the 50 iterations have been calculated. In the winter cycle, the most important variable for the best performing model (rf, 3-month scenario) was the ybYield in 48 out of 50 iterations. In decreasing order of importance, NDVI1 was also a very relevant variable. It was at the top of the rank (first to third variable importance) in 25 out of 50 iterations. While NDVI2 was the second or third most important variable in 37 out of 50 iterations. In the summer cycle, the most important variables for the svmP model (5-month scenario) concur with the results presented in Figure 4. The most important variables were laiLv 1-5, laiHv 1-5, ybYield, farmer system, NDVI4, and NDVI5 in 47 out of 50 iterations. Although ybYield was in the eleventh place, its level of importance is similar to laiLv or laiHv in terms of % importance. We analyzed the prediction capacity of the ybYield to obtain a baseline-value of the predicted yield. This is one of the simplest models we can get (only one predictor), it is easy to implement and can be used to test and compare our model results. The baseline-value obtained model performances of R 2 =0.767 and %RMSE= 20.5 in the summer cycle, while R 2 = 0.515 and %RMSE = 25.0 in the winter cycle. Comparing with model results provided in Tables 2 and 3, we thus prove the important information added by the remote sensing and meteorological variables. To further inspect and verify the predictive capacity of the latter variables (remote sensing data and meteorology), we dropped off the ybYield variable from the rf model (3-month scenario, winter cycle) and svmP model (5-month scenario) with %RMSE scores of 19.9 and 23.3, respectively.  The variable importance shows in Figures 3-4 is only representative and depend on the random selection of the data. In order to perform a deeper analysis of the variable importance, the results of the 50 iterations have been calculated. In the winter cycle, the most important variable for the best performing model (rf, 3-month scenario) was the ybYield in 48 out of 50 iterations. In decreasing   Tables 2-3, we thus prove the important information added by the remote sensing and meteorological variables. To further inspect and verify the predictive capacity of the latter variables (remote sensing data and meteorology), we dropped off the ybYield variable from the rf model (3-month scenario, winter cycle) and svmP model (5-month scenario) with %RMSE scores of 19.9 and 23.3, respectively. Navy blue color represents rain-fed crops, green color represents irrigated crops, diamond-shaped objects represent prediction and actual yield for the rf model (3-month scenario, winter cycle) and triangle-shape objects represent prediction and actual yield for the svmP model (5-month scenario, summer cycle). Figure 5 shows the prediction yield versus the actual measured yield for the years 2017 and 2018 (hold-out dataset) for the best performing models at each cycle. A regression has been calculated for this scatter plot, and it resulted in a slope of 0.711 for winter cycle and 0.878 for summer cycle. The intercepts at 4.711 (winter cycle) and 1.482 (summer cycle) indicate the similarity of the linear relation with the 1:1: line. Navy blue color represents rain-fed crops, green color represents irrigated crops, diamond-shaped objects represent prediction and actual yield for the rf model (3-month scenario, winter cycle) and triangle-shape objects represent prediction and actual yield for the svmP model (5-month scenario, summer cycle).

Discussion
Monitoring crop condition and production estimates at state and regional level has been of great interest during the last decades for public authorities and decision makers [43]. In this sense, we built several ML models to estimate potato yields at municipal level in Mexico. One of the most obvious limitations in ML is the lack of data and it can be usually found in the agricultural sector, in which high quality data at finer spatial resolutions sometimes does not exist or it is very limited [44,45]. Given the lack of data at field scale in our study area (phenological information, sown and harvest date), our models used the monthly mean NDVI of the agricultural irrigated or rain-fed fields (depending on the sample), the ybYield and the weather conditions in each municipality as predictive variables. The NDVI used in this work comes from all crops, regardless they are potato cultivation or not. The specific location of the potato crops would provide more precise information to the models with respect to their vigorous state. In addition, the inclusion of other vegetation indexes with the same or higher spatial resolution [46], and not correlated with the information provided by the NDVI, could improve model performances.
The svmP model obtained the best performance in summer cycles using 5-month scenario to create model predictors (R 2 = 0.858 and %RMSE = 14.9). Nevertheless, the inclusion of data from the last months before harvest did not substantially improve yield predictions. The NDVI has proved to be an informative indicator of the general agricultural conditions at municipal level during the winter cycle, although the most important variable was the ybYield (Figure 3). Despite NDVI not being retrieved at a field level, these results reaffirm the potentiality of this vegetation index to characterize the eco-climatic conditions of an area [23,[47][48][49][50]. In the summer cycle, the most important variables were layhv, lailv, and ybYield ( Figure 4). In both models, rf and svmP, the ybYield variable played an important role to predict next year yield; however, our results suggest that its effects are modulated by NDVI and ERA5 data. The prediction capacity of the models using only ybYield (R 2 = 0.515/0.767 and %RMSE = 25.0/20.5) were much lower than the results obtained using all the variables (Tables 2-3). Thus, the ybYield variable can be considered as a very informative variable in those cases when standalone remote sensing is not sufficient to reach the desired accuracy. The irrigated and rain-fed systems play different role during both cycles as the rainfall pattern varies substantially in both periods. From 2014 to 2018, the accumulated rain during the summer cycle was about 74% of the total precipitation in a year in average [51]. During the summer cycle, it plays an important role (73% of importance) while during the winter cycle it is not as relevant (24%).
Regardless of the applied ML algorithm, the models performed far better when potato yields were estimated during summer rather than in winter cycles (Tables 2-3). Mean NDVI lailv and laihv values may just represent well potato cultivars during the summer season since the overall value of the evaluated crops is mixed (agricultural crops according to the land cover map). Our results go in accordance with previous studies which estimated potato yield using satellite imagery, but this comparison should be taken carefully as these results were obtained at field scale, and not at municipal level. For instance, Newton et al. [52] addressed the relation between NDVI from Landsat 5-7-8 and potato crop yield using the maximum and mean NDVI values at field-scale obtaining R 2 = 0.35 and R 2 = 0.81, respectively. Bala and Islam [53] developed a potato yield model using NDVI from MODIS with an estimate error of 15 %RMSE, obtained from data of 50 fields. Our study used 838 field samples to build the models, which improves the robustness of previous approaches.
When the study area gets larger, for instance at regional or country level, crop modeling or forecasting becomes more challenging due to misinformation about the geolocation of the cultivated fields per crop type. Furthermore, public institutions do not usually own such data until mid-season or later. Therefore, crop-based or remote sensing-based models applied over large areas may estimate the yield per hectare (Ton/ha) but not the total yield, at least at early stages of the crop season [54]. The main limitation of the methodology exposed in this work is that the retrieval of NDVI comes from the entire irrigated or rain-fed crop area at municipal level, not discerning potato fields (not available information), but using mean NDVI and lai values to assess the eco-climatic conditions for each date and municipality, in addition to climatic data. Similar approaches were seen in the literature for other crops [55,56]. The main advantage of remote sensing-based models is the capacity to allocate spatial information with respect to the typical crop growth models [57]. Nevertheless, the strength of the relation between remote sensing data and production forecasts is not universal and varies geographically [58]. Currently, there are some efforts to provide independent and timely information on crop areas and yields using satellite imagery. The Joint Research Center -European commission has been running a program since 1988 named as Monitoring of Agriculture using Remote Sensing (MARS) to use satellite data as well as crop-based models to provide independent and timely information on crop areas and crop yield at large scale. The performance of the MARS-crop yield forecasting system for the European Union during 1993-2015 reported a median error value across the member states of 6.3% for potato crops, although for some individual countries this error rose up to 16.69% (no R 2 available) [54]. The results obtained in our study (R 2 = 0.757/0.858, %RMSE = 18.9/14.9) are in accordance with the aforementioned errors presented in other works. This type of regional estimate of crop yield is desirable for managing large agricultural areas and determining food pricing and trading policies [59,60].

Conclusions
An Earth observation driven approach has been applied to estimate potato yield at municipal level in Mexico. The predictors were created based on, previous yield, meteorological variables, NDVI, and farmer system information. We evaluated five machine learning algorithms (rf, svmL svmP, svmR, and glm) and six different time scenarios to carry out model predictions. The best model performance was obtained by the svmP algorithm using variables from the first five months after sowing for summer cycle (R 2 = 0.858 and %RMSE = 14.9), and rf algorithm using variables from the first three months after sowing for winter cycle (R 2 = 0.757 and %RMSE = 18.9). However, the last few months did not considerably add key information to the models. The most important variables were the ybYield and NDVI variables in the winter cycle, whereas ybYield, lailv, and laihv were the most relevant in the summer cycle. These results are supported by the large dataset used to calibrate and validated the models; and the extent of the study area, which comprises a wide range of meteorological and soil patterns. The proposed methodology can predict potato yield before harvest, which can be of great interest to establish food security strategies.