Towards Predictive Modeling of Sorghum Biomass Yields Using Fraction of Absorbed Photosynthetically Active Radiation Derived from Sentinel-2 Satellite Imagery and Supervised Machine Learning Techniques

Sorghum is grown at tropical and temperate latitudes for several purposes, including the production of health-promoting food from the kernel and of forage and biofuels from aboveground biomass. One concern of policy-makers and sorghum growers is to predict biomass yields cost-effectively and early in the cropping season in order to improve biomass and biofuel management. The objective of this study was to investigate whether Sentinel-2 satellite images could be used to predict within-season biomass sorghum yields in the Mediterranean region. Thirteen machine learning algorithms were tested on fortnightly Sentinel-2A and Sentinel-2B estimates of the fraction of Absorbed Photosynthetically Active Radiation (fAPAR) in combination with in situ aboveground biomass yields from demonstrative fields in Italy. A gradient boosting algorithm implementing the xgbtree method was the best predictive model, as it performed satisfactorily at any prediction time from May to July. The best prediction time was the month of May, followed by May–June and May–July. To the best of our knowledge, this work represents the first use of Sentinel-2-derived fAPAR in sorghum biomass predictive modeling. The results of this study will help farmers improve their sorghum biomass business operations, and will help policy-makers and extension services improve energy planning and avoid energy-related crises.


Introduction
Sorghum (Sorghum bicolor (L.) Moench) is a cereal with C4 carbon fixation (the Hatch-Slack pathway) cultivated mainly for food, feed, forage, and fuel [1]. Sorghum grain was historically used for human consumption in developing countries but, because it is gluten-free, with a low glycemic index and high contents of macronutrients and antioxidants, its utilization as food has extended worldwide.
There are several types of sorghum. Grain sorghums are generally shorter (usually having recessive alleles at three of the four Dw genes) than biomass sorghums (having recessive alleles at two Dw genes at most), and have been selected to have the grain as the primary sink for photosynthates.
In most studies, however, remote sensing-based biomass yield estimation or prediction makes use of low- or medium-resolution satellite images from sensors such as SPOT-VEGETATION [14,15,23,24] or MODIS [16]. These satellite products have a coarser spatial resolution (250 to 1000 m) than the data collected from the two Sentinel-2 satellites in this work (10-m spatial resolution). With the launch of the Sentinel-2 constellation, whose overpass frequency is five days (and locally even two to three days), the temporal resolution is nearly as good as that of the SPOT-VEGETATION and MODIS satellites (one to two days). The high spatial resolution of the Sentinel-2 images is an important asset when monitoring crops in agricultural regions characterized by many small fields. To our knowledge, no previous study has assessed the efficiency of high-resolution Sentinel-2-derived fAPAR data in predicting within-season biomass sorghum yields, and this paper is therefore aimed at addressing this gap.
Deriving yield information from satellite imagery has shown promising results, but this technology is not extensively applied across farmers and crop species worldwide [27][28][29]. In this work, we developed models for within-season prediction of annual and perennial sorghum biomass yields in Emilia-Romagna, Italy, based on fAPAR measurements from Sentinel-2A and Sentinel-2B satellite images on 42 mostly full-fledged commercial sorghum fields. We used machine learning algorithms to create yield prediction equations. These equations can be implemented in decision support systems to allow farmers and/or farming stakeholders to predict biomass yields from sorghum fields of interest early in the cropping season. This information is very helpful for efficiently scheduling fleets of harvesting machinery, transport vehicles, and storage facilities. The fAPAR-derived predictive models for biomass yields can also be implemented by extension services and policy-makers for several purposes, including anticipating potential biomass availability and planning ahead to avoid specific crises such as fuel shortages.

Trial Set-Up
Forty-two demonstration trials were run in this work, 23 and 19 of which were evaluated in 2017 and 2018, respectively. In 2017, the experimental sites were located in Conselice, Nonantola, Mirandola, and Anzola dell'Emilia, in the Italian region of Emilia-Romagna (Table 1, Figure 1), while in 2018 the sites were established in Anzola, Mirandola, and Conselice (Figure 1, Table 1). The experimental sites were strategically selected to maximize extension impact by conducting most of the trials in farmers' fields. The experimental fields in Mirandola and Conselice belonged to two large farming cooperatives, respectively, each with more than 2000 members. In Nonantola, the fields belonged to individual farmers, while in Anzola the fields were established in the experimental station of the Council for Agricultural Research and Economics (CREA). The fields were generally large relative to the plot sizes commonly used under standard experimental settings [1], so that they could serve as demonstrative pilots for transferring the technology of sorghum crop monitoring using satellite imagery into the production environment. The field areas ranged from 0.06 ha to 50.00 ha, with a mean and median of 5.70 ha and 1.10 ha, respectively. All the fields were planted with biomass sorghums, including biomass per se (high tonnage), sweet, forage, and dual purpose types. One grain sorghum trial was established in Anzola in 2017, but it was not included in this work because its experimental management and the market for grain sorghum produce differ from those of biomass sorghum. Thirty-five of the 42 trials were sown with a single genotype of Sorghum bicolor (annual), while the 17IT_mat trial was sown with a diversity panel of 228 biomass Sorghum bicolor genotypes, and six of the trials installed in Anzola (15R17, 16R17, 16R18, 15R18, 17R18, and 17US_mat) were made up of a diversity panel consisting of advanced perennial interploid biomass hybrids deriving from S. bicolor × S. halepense (SB × SH) crosses. The original SB × SH materials originated from The Land Institute (Salina, KS, United States of America). The S. bicolor × S. halepense breeding strategy was amply detailed in Piper and Kulakow [33] and in Habyarimana et al. [1]. The 15R, 16R, 17R, and 17US_mat trials were sown in 2015, 2016, and 2017, respectively, meaning that regrowth-derived biomass was evaluated for the 15R, 16R, and 17R trials, while for the 17US_mat trial, the biomass evaluated in this study was produced from direct sowing. Crop management followed local extension services guidelines and was described in Habyarimana et al. [1]. Planting density was 26 plants per square meter (0.75 m spacing between rows; 0.052 m spacing of hills within the row) for most (35) trials, and 13 plants per square meter (0.75 m spacing between rows; 0.10 m spacing of hills within the row) for the 15R, 16R, 17R, 17IT_mat, and 17US_mat trials. In terms of weather (Figures S1–S4), summer was generally dry across years and locations, as expected.
In 2017, all sites had a relatively wet spring except Anzola, while in 2018 spring was relatively wet in Anzola and Conselice but dry in Mirandola.
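The planting densities quoted above follow directly from the row and hill spacings. A minimal sketch of that arithmetic, assuming one plant per hill (an assumption of this example; the function name is illustrative):

```python
def plants_per_m2(row_spacing_m, hill_spacing_m):
    """Planting density in plants per square metre.

    Assumes exactly one plant per hill, so each plant occupies a
    rectangle of row_spacing_m x hill_spacing_m.
    """
    if row_spacing_m <= 0 or hill_spacing_m <= 0:
        raise ValueError("spacings must be positive")
    return 1.0 / (row_spacing_m * hill_spacing_m)


# Spacings reported for the 35 single-genotype trials and for the panel trials.
dense = plants_per_m2(0.75, 0.052)
sparse = plants_per_m2(0.75, 0.10)
print(round(dense), round(sparse))
```

The two results round to the densities reported in the text.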

Biomass Data Collection
Trials in Nonantola, Mirandola, and Conselice were harvested at industrial scale from the end of August to late November, while all trials in Anzola were harvested at the end of November using a single-row chopper harvester. Our experience showed that postponing harvest may increase the likelihood of lodging, which may lead to the crop touching the ground, adding grit to the biomass material and possibly reducing biomass quality; delayed harvest also leads to kernel loss and kernel quality deterioration, particularly due to mold, insect, and bird damage. The trials were harvested according to two machinery options: forage chopper, or swathing the material into windrows and then baling it in large square or large round bales. Chopped biomass was weighed immediately at harvest, while baled biomass was weighed when the bales were transported to the bioreactor. Chopped and baled biomasses were supplied to private biogas and combustion bioreactors. From each field, a 1-kg composite sample was taken from the sold biomass at the time of shipment to the end user in order to determine the dry mass content of the commercialized produce and to calculate the dry biomass yield of each entire field, which was used in modeling. For the diversity panels, samples were taken from each genotype. Fresh samples were weighed and dried at 80 °C to constant weight in a forced-air oven. The fresh and dry weights of the samples, and the fresh weight of the entire field's harvest, were used to derive the dry mass fraction of the fresh material and the dry biomass yield of the entire field. For the diversity panel fields, the final yields integrated the contributions of the component genotypes.
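The dry-yield calculation described above can be sketched as follows. This is an illustration only: the function names and the numbers in the example are hypothetical and are not taken from the trials.

```python
def dry_mass_fraction(sample_fresh_kg, sample_dry_kg):
    """Fraction of the fresh sample weight that remains after oven drying."""
    if sample_fresh_kg <= 0:
        raise ValueError("fresh sample weight must be positive")
    return sample_dry_kg / sample_fresh_kg


def dry_yield_t_per_ha(field_fresh_t, sample_fresh_kg, sample_dry_kg, area_ha):
    """Dry biomass yield (t/ha) of a whole field.

    The field's fresh harvest is scaled by the dry mass fraction of the
    composite sample and divided by the field area.
    """
    if area_ha <= 0:
        raise ValueError("field area must be positive")
    fraction = dry_mass_fraction(sample_fresh_kg, sample_dry_kg)
    return field_fresh_t * fraction / area_ha


# Hypothetical field: 55 t fresh harvest on 1.1 ha, 1 kg sample drying to 0.30 kg.
example_yield = dry_yield_t_per_ha(55.0, 1.0, 0.30, 1.1)
print(f"{example_yield:.2f} t/ha dry biomass")
```

For a diversity panel field, the same calculation would be applied per genotype before combining the contributions, which this sketch does not cover.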

Satellite Data Acquisition
For this study we used Sentinel-2 optical satellite imagery. The Sentinel-2 mission is based on a constellation of two satellites, Sentinel-2A and Sentinel-2B, both orbiting Earth at an altitude of 786 km but 180° apart to optimize coverage and global revisit times. The swath width, i.e., the image width across the satellite path when scanning the Earth, is 290 km. As a constellation, the revisit time is 5 days. This means that the same spot over the equator is revisited every five days, and even more frequently at higher latitudes. Sentinel-2 data are acquired in 13 spectral bands in the VNIR (visible and near-infrared) and SWIR (short-wave infrared) range: four bands with a spatial resolution of 10 meters (blue, green, red, and near-infrared (NIR)), six bands at 20 meters (three red edge bands, a narrow NIR, and two SWIR bands), and three bands at 60 meters (a coastal aerosol, a water vapor, and a cirrus band). Spatial resolution refers to the surface area measured on the ground and represented by an individual pixel. Once the Sentinel data are acquired on board, they are sent to the ground and processed by a network of Processing and Archiving Centers. Next, all data products are united, archived, and disseminated online to the users by ESA's Copernicus Space Component (CSC) Ground Segment via the CSC Data Access Coordinated System. To facilitate image transfer and use, the projected Sentinel-2 images are converted to tiles with a fixed size of 100 km × 100 km, each of which is approximately 500 MB.
For this study, Sentinel-2A and Sentinel-2B images from tiles 32TQQ (including pilots from Conselice) and 32TPQ (including pilots from Anzola, Mirandola, and Nonantola) were downloaded from ESA and processed by Vlaamse Instelling voor Technologisch Onderzoek N.V. (VITO). Processing included atmospheric correction with iCOR [34] and cloud and shadow detection using Sen2Cor v2.5.5 (ESA-STEP, ESA, Paris, France). The biophysical parameters fAPAR, fCover, and leaf area index (LAI) were calculated from the top of canopy normalized reflectances following the BV-NET (tool for mapping surface and vegetation variables) method described by Weiss and Baret [35]. The BV-NET methodology is based on neural networks which are trained on a synthetic dataset of ~50,000 simulations using the PROSAIL (PROSPECT and SAIL radiative transfer models) model [36]. The BV-NET version used in this study was calibrated with the green, red, and near-infrared bands, all having a spatial resolution of 10 meters. Sen2Cor and BV-NET are publicly available through ESA's SNAP (Sentinel Application Platform, ESA, Paris, France) toolbox.
Previous studies such as Duveiller et al. [15], López-Lozano et al. [24], and Johnson et al. [16] illustrated the good performance of satellite-derived fAPAR for estimating and predicting biomass yields of large field crops, including corn and sugarcane, which, together with sorghum, make up the world's three economically important C4 crops of the Poaceae family with similar growth habits [37]. We therefore decided to use fAPAR for this study as well. The fAPAR estimates generated with BV-NET from Sentinel-2A and 2B top of canopy reflectances over selected tiles in Emilia-Romagna had a spatial resolution of 10 meters and a temporal resolution of 5 days, improving to 2–3 days in those areas where the different satellite overpasses overlapped.
For monitoring the sorghum fields in this study, "WatchITgrow" (Vlaamse Instelling voor Technologisch Onderzoek N.V., Mol, Belgium) was used. WatchITgrow is a web-based application for crop monitoring developed by VITO. It provides information on crop growth and development as well as possible anomalies derived from Sentinel-2 satellite images and weather data, and it allows the user to store all kinds of collected field data, such as planting and harvest dates and development stages, but also information on crop treatments such as fertilization, spraying, or irrigation. Prior to monitoring, the fields used in this study were geolocalized (Figure 1) using the Field GPS (global positioning system) application for iPhone, with a final field boundary correction using Google Earth. The field polygons were saved as kml files and then imported into WatchITgrow for monitoring. For each field, fAPAR or "greenness" maps were created (see example in Figure 2), and a growth curve was built, showing the evolution of the fAPAR values throughout the cropping season (see example in Figure 3). To build the growth curve, the fAPAR values of all pixels within the field were averaged, applying an inside buffer of ten meters (one pixel) in order to avoid capturing signals from neighboring fields or other objects. To correct for artifacts in the resulting fAPAR curve, such as abnormally low fAPAR values due to undetected clouds, shadows, or haze, and to interpolate fAPAR values between subsequent acquisition dates, a Whittaker smoothing filter was applied to the curve [38,39].
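The smoothing step above can be illustrated with a small weighted Whittaker smoother. The sketch below is not the WatchITgrow implementation: it assumes evenly spaced observations, a second-order difference penalty, and a smoothing parameter `lam` chosen arbitrarily, none of which are specified in the text. Observations flagged as unreliable (for example, undetected cloud) receive weight 0 and are filled in by the smoother.

```python
def _solve(matrix, rhs):
    """Solve a dense linear system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    m = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: need at least two weighted points")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def whittaker_smooth(y, weights, lam=10.0):
    """Weighted Whittaker smoother with a second-order difference penalty.

    Minimises sum_i w_i (y_i - z_i)^2 + lam * sum_i (z_i - 2 z_{i+1} + z_{i+2})^2
    by solving (W + lam * D'D) z = W y.
    """
    n = len(y)
    a = [[0.0] * n for _ in range(n)]
    for i in range(n):
        a[i][i] = float(weights[i])
    stencil = (1.0, -2.0, 1.0)  # one row of the second-difference matrix D
    for r in range(n - 2):
        for p in range(3):
            for q in range(3):
                a[r + p][r + q] += lam * stencil[p] * stencil[q]
    rhs = [weights[i] * y[i] for i in range(n)]
    return _solve(a, rhs)


# Synthetic curve: a linear rise with one cloud-contaminated drop at index 3.
raw = [0.1 * i for i in range(8)]
raw[3] = 0.0
w = [1.0] * 8
w[3] = 0.0  # flag the contaminated observation so it is interpolated
smoothed = whittaker_smooth(raw, w)
print([round(v, 3) for v in smoothed])
```

Because the flagged point carries no weight and the remaining points lie on a straight line, the smoother restores the missing value by interpolation in this toy case; real fAPAR curves would also be smoothed at the weighted points.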

Modeling Total Aboveground Biomass Yields
Thirteen models were assessed in this study to predict sorghum biomass yields. The models included partial least square discriminant analysis (PLS-DA), principal component analysis discriminant analysis (PCA-DA), neural network (NN), random forest (RF), support vector machine (SVM) with linear classifier (SVML), nonlinear kernel (SVML-G), radial basis kernel (SVM-R), radial basis kernel with polynomial basis kernel (SVM-P), neural network (NNET), eXtreme Gradient Boosting-xgbtree method (GBT), eXtreme Gradient Boosting-xgbDART method (GBD), eXtreme Gradient Boosting-xgbLinear method (GBL), simple linear model (LM), and Neural Network-neuralnet method (NLNET). The simple linear model was used as a benchmark to gauge the performance of the models implemented. The models evaluated in this work were selected based on their robustness as reported in previous studies [40].
The field-based daily interpolated fAPAR estimates extracted from WatchITgrow were converted to fortnightly fAPAR averages. In this study, preference was given to fortnightly fAPAR data because major morphophysiological changes in crops also occur fortnightly [41]. In addition, pilots established in experimental stations and in farmers' fields were visited fortnightly at or close to the times the fAPAR images used in this work were acquired. This schedule also accommodated the busy schedules of farmers and scientists.
Six fortnightly fAPAR values registered from May to July (here referred to as six "days of year", DOY; that is, DOY 135 and 150 in May, 165 and 180 in June, and 195 and 210 in July) were used as regressor variables in successive predictive modeling of sorghum biomass yields. May, June, and July are important months for predictive modeling because yield predictions need to be released 1 to 2 months before harvest [42], and biomass sorghum in the Mediterranean region is harvested from August to November.
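A minimal sketch of how the six fortnightly regressors could be built from a daily interpolated fAPAR series. The text does not state how each fortnight is delimited, so the 15-day window ending on each DOY is an assumption of this example, as are the function name and the synthetic daily curve.

```python
# DOY labels used in the text for the six fortnightly regressors.
PREDICTION_DOYS = (135, 150, 165, 180, 195, 210)


def fortnightly_means(daily_fapar, doys=PREDICTION_DOYS, window=15):
    """Average daily fAPAR over the `window` days ending on each DOY.

    daily_fapar maps day of year (int) to the interpolated fAPAR value.
    The window definition is an assumption, not taken from the paper.
    """
    features = {}
    for doy in doys:
        values = [daily_fapar[d] for d in range(doy - window + 1, doy + 1)
                  if d in daily_fapar]
        if not values:
            raise ValueError(f"no daily fAPAR available for DOY {doy}")
        features[doy] = sum(values) / len(values)
    return features


# Synthetic daily curve rising linearly with the day of year (illustrative only).
daily = {d: d / 1000.0 for d in range(100, 220)}
features = fortnightly_means(daily)
for doy in PREDICTION_DOYS:
    print(doy, round(features[doy], 3))
```

Each field then contributes one row of six values, which serve as the regressors for the models listed above.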
The research questions addressed in this work are: (1) How accurately can we predict the yield of a biomass sorghum field based on a Sentinel-2-derived fAPAR profile early in the cropping season? (2) Which months and/or days of year contribute the most useful information for predicting biomass yields in commercial sorghum fields? These questions were evaluated by solving the linear model below for n trials or experimental locations (i = 1, ..., n) and p prediction times or days of year (j = 1, ..., p). This model is represented by

$$y_i = \mu + \sum_{j=1}^{p} x_{ij}\,\beta_j + e_i$$

where $\mu$ is the overall mean, $y_i$ is the phenotypic observation (biomass yield) from field $i$, $e_i$ is the residual comprising all other nongenetic and environmental factors, $x_{ij}$ are the days of year covariates, and $\beta_j$ is the effect of the jth day of year covariate on $y_i$ [43]. Note that identifying and/or predicting within-field yield variability is beyond the scope of this work. In addition, different sorghum types were combined in this study because they all qualified for commercial aboveground biomass production and in order to mimic farming practices in the region of the study. We also assumed that the test region was homogeneous with respect to climatic conditions.

All statistical analyses were carried out using R software [44]. The predictive models were fitted using the caret R package. In this work, the "one standard error" rule of Breiman et al. [45] was implemented to avoid overfitting, and the caret built-in features were invoked to automatically choose the tuning parameters associated with the best performance of the regression routines. During data preparation, zero-variance regressors were removed and those remaining were centered and scaled in order to avoid regressors with zero or near-zero variance, which often constitute a problem as they behave as second intercepts in predictive models [40]. The dataset was randomly partitioned into a training set (80% of the entire dataset; 34 observations) and a testing set (20% of the entire dataset; eight observations). The training set was used to train and assess the models with 5-fold cross-validation (CV) repeated 10 times, rendering a total of 50 estimates of accuracy and prediction error; a large number of repetitions is expected to compensate for the high variance stemming from a reduced number of folds. Models were validated on the testing set, an external validation sample needed so that model performance can be characterized on data that were not used in model training.

The models were evaluated based on the prediction accuracy, the mean absolute error (MAE), and the mean absolute percentage error (MAPE). The MAE obtained within the repeated cross-validation procedure (model calibration) was used to assess the variability (dependability) of the model performance. On the other hand, the MAE, MAPE, and accuracy obtained on the testing set were used to assess the model predictive ability. The MAPE allows the predictions of different dependent variables on different scales to be compared. The MAE measured the average magnitude of the errors in the set of predictions of biophysical variable values produced in this work, without considering their direction. It represented the average over the test sample of the absolute differences between prediction and actual observation, where all individual differences had equal weight. The MAE was chosen for model verification because it provides an unambiguous measure of the magnitude of the average error and is therefore more appropriate than the Root Mean Square Error (RMSE) for dimensioned evaluations of average model performance error [46]. The distribution of the 50 MAE estimates from the optimal cross-validated models was characterized using boxplots, while the comparison of mean accuracies across models and across prediction times was performed using Duncan's test [47]. The importance of the regressor variables (useful prediction times) was determined using a 0 to 100 index, with 0 corresponding to no effect and 100 corresponding to the highest magnitude of the regressor's importance. The accuracy was defined as the Pearson correlation coefficient between the predicted and the observed biomass yield values in the testing set [5]. From the computed accuracy, r-squared values can be derived in order to better compare, for each model, the proportion of the variance in the dependent variable that is predictable from the regressors.
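The evaluation scheme described above can be sketched in plain Python. The analysis itself was run in R with caret, so the code below is only an illustration of the definitions: the function names, the random seed, and the four observed/predicted values are hypothetical.

```python
import math
import random


def mae(observed, predicted):
    """Mean absolute error: average magnitude of the errors, ignoring sign."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


def mape(observed, predicted):
    """Mean absolute percentage error (observed values must be non-zero)."""
    total = sum(abs(o - p) / abs(o) for o, p in zip(observed, predicted))
    return 100.0 * total / len(observed)


def pearson_r(observed, predicted):
    """Pearson correlation, used in the text as the prediction accuracy."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_o * var_p)


def repeated_kfold(n_obs, k=5, repeats=10, seed=0):
    """Yield (train_indices, held_out_indices) for k-fold CV repeated `repeats` times."""
    rng = random.Random(seed)
    for _ in range(repeats):
        order = list(range(n_obs))
        rng.shuffle(order)
        folds = [order[f::k] for f in range(k)]
        for f in range(k):
            train = [i for g in range(k) if g != f for i in folds[g]]
            yield train, folds[f]


# Hypothetical yields in t/ha, for illustration only.
observed = [10.0, 12.0, 14.0, 16.0]
predicted = [11.0, 11.0, 15.0, 17.0]
print(mae(observed, predicted), round(mape(observed, predicted), 2),
      round(pearson_r(observed, predicted), 3))

# 34 training observations, 5 folds, 10 repeats: 50 resampling estimates.
splits = list(repeated_kfold(34))
print(len(splits))
```

Squaring the Pearson coefficient gives the r-squared value mentioned in the text.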

fAPAR Index Pattern Across Sorghum Types
Three fAPAR curve and map patterns were consistently observed, as illustrated in Figures 2 and 3 using data from the 2017 cropping season. In dual purpose and biomass sorghums, a major peak was observed early in July, followed by a drop and then a weak increase at the beginning of the second half of September. For the sweet, forage, and perennial sorghums (SB × SH) grown from seeds, the fAPAR increased significantly in early July and reached a plateau that lasted up to late September/early October, whereas in October the curve decreased sharply to reach its minimum value in early November. On the other hand, in perennial sorghum regrown from rhizomes, two fAPAR peaks (a smaller peak in mid-May and a bigger peak in late September/early October) were observed, separated by a deep drop extending from June to August.
NNET followed by NLNET and GBL. Prediction error (MAE) was lower in SVM-P and GBD, while it was not statistically different in PLS-DA, PCA-DA, RF, SVML, SVML-G, SVM-R, GBT, and LM (Figure 4, Table 2). The MAE values (in t ha⁻¹) calculated using the validation (testing) set and the best prediction time (May) were, in increasing order, 1.87 (13.85%), 2.18 (16.15%), 2.27 (16.81%), 2.34
The mean comparison showed that SVM-R was the least accurate model. The other models showed comparable accuracies, but RF, SVML, SVML-G, SVM-P, NNET, GBT, and GBD showed prediction ability greater than SVM-R. GBT's prediction ability was consistently greater than 0.5 across the prediction times. Apart from GBL, the prediction ability of all models was high (prediction accuracy greater than or equal to 0.76) and/or better in the month of May (Table 2). The mean accuracy across models was high and not significantly different in May, May–June, and May–July. The across-model average accuracy computed in May was significantly superior to the mean accuracy obtained in June, June–July, and July. June, June–July, and July were statistically equivalent and were the worst times for predicting sorghum biomass yields in the Mediterranean region.
Spearman's rank correlation coefficient (Spearman's rho) between model accuracy and MAE values (t ha⁻¹ and %) corresponding to the testing set was −0.40. Spearman's rho assesses how well the relationship between two variables can be described using a monotonic function between ordered sets that preserves or reverses the given order [48]. This approach was selected to account for the small size of the samples, whose pairwise statistical dependences could not be correctly assessed with parametric approaches that require normally distributed data. Indeed, the Shapiro-Wilk test of normality for the vectors of model accuracies and MAE values was very highly significant (p < 0.001), meaning that normality could not be assumed.
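For readers unfamiliar with the statistic, Spearman's rho is the Pearson correlation computed on ranks. A minimal sketch, assuming ties are handled with average ranks; the accuracy and MAE vectors below are hypothetical and are not the values from Table 2.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2.0 + 1.0
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rank correlation: Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical model accuracies and MAE values, for illustration only.
accuracy = [0.60, 0.70, 0.75, 0.80, 0.90]
error = [3.1, 2.9, 2.5, 2.4, 1.9]
rho = spearman_rho(accuracy, error)
print(rho)
```

In this toy case the error falls strictly as accuracy rises, so rho is at its negative extreme; the weaker value reported in the text indicates a much looser monotonic relationship.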
Over the May to July prediction time interval, six days of year corresponding to fortnightly fAPAR indices were used as regressors in this work. Among these regressors, the most important times to predict the aboveground sorghum biomass yields were investigated using the GBT algorithm, as this model showed high and dependable performance that was insensitive to the prediction times across the cropping season. The model showed that DOY 150 was the most important (index = 100), followed by DOY 165 (index = 80), DOY 135 (index = 30), DOY 195 (index = 20), and DOY 210 (index = 10) (Figure 5). DOY 180 was associated with no importance in terms of fAPAR-based prediction of the aboveground biomass yields of sorghum in the Mediterranean environment.
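A 0 to 100 importance index of this kind can be obtained by rescaling raw importance scores. The sketch below assumes a min-max rescaling, which is one common way of producing such an index; the raw scores are hypothetical numbers chosen only so that the rescaled values match the indices reported above, and they are not outputs of the fitted GBT model.

```python
def scale_importance(raw_scores):
    """Min-max rescale raw importance scores to a 0-100 index.

    The least important regressor maps to 0 and the most important to 100.
    """
    low = min(raw_scores.values())
    high = max(raw_scores.values())
    if high == low:
        raise ValueError("all regressors have identical importance")
    return {name: 100.0 * (score - low) / (high - low)
            for name, score in raw_scores.items()}


# Hypothetical raw scores (for illustration only), keyed by day of year.
raw_importance = {135: 0.15, 150: 0.50, 165: 0.40, 180: 0.00, 195: 0.10, 210: 0.05}
index = scale_importance(raw_importance)
for doy, value in sorted(index.items(), key=lambda item: -item[1]):
    print(doy, round(value))
```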

Discussion
The fAPAR biophysical variable used in this work was derived from satellite imagery, which is part of Earth Observation's big data. Big data technology (BDT) is a new technological paradigm that is driving entire economies, including low-tech industries such as agriculture, where it is implemented under the banner of precision farming (PF) [49]. In this work, BDT was built on geocoded maps of agricultural experiment fields and the real-time monitoring of sorghum crops on commercial farms in order to assess the possibility of monitoring sorghum growth and development, with the ultimate aim of predicting aboveground biomass yields. Early prediction of biomass production has positive implications, including increased efficiency in biomass, biofuel, and farming resource management [50], and avoidance of energy crises. Forty-two sorghum pilot trials were evaluated in this work using fAPAR and different sorghum varieties belonging to four biomass-producing sorghum types: dual purpose, sweet, forage, and biomass per se. Combining different types of biomass-producing sorghums in this study was motivated by the need to mimic farming practice in the Mediterranean region. In this region, farmers, farming cooperatives, and third-party biomass harvesting and biodigesting companies manage the above-mentioned sorghum types indiscriminately on a regular basis. It therefore made sense not to treat the biomass-producing sorghum types as sources of variation in the models implemented in this work. Similar investigations were reported in previous studies working on different crop species [16,51]. Important regressors of interest were identified and used in the predictive algorithms as suggested in the literature [14,51].
The fAPAR index produced unique curves and maps that discriminated between the types of sorghums evaluated in this work. The fAPAR profile paralleled the evolution of leaf senescence across sorghum types [52] under the Mediterranean environment. The fAPAR curves presented in this work were purposely derived from sorghum fields established side by side in the same location in Anzola dell'Emilia. These pilots were sown and harvested on the same dates and managed identically, which allows a coherent comparison. The above-described shapes of the curves were generally similar across locations in this study, though with slight discrepancies for some sorghum types. All sorghum trials reported in this work were conducted under a rainfed regime. Given that the Mediterranean region is characterized by a semiarid climate wherein summer crops rely heavily on winter soil-stored moisture and experience postanthesis drought stress, it can be inferred that the fAPAR in dual purpose and biomass sorghum types did not rise during the reproductive growth stage, probably due to a combined effect of sink demand and soil water scarcity in dual purpose sorghum, and mostly soil water scarcity in biomass sorghum. Postanthesis drought stress in sorghum under the Mediterranean environment was amply described by Habyarimana et al. [52–55]. The fAPAR profile in the sweet, forage, and SB × SH grown from seed reflects the reduced importance of the sink and the delayed leaf senescence in these types. In these sorghum types, a slow fAPAR increase toward harvest can be explained by the precipitation recorded in early fall in most locations (Figures S1 and S2), which stimulated the growth of axillary tillers in annual Sorghum bicolor [54,55] and the growth of axillary tillers and ramets in perennial SB × SH sorghum [54–57]. The above explanation also holds in the case of the SB × SH regrown from rhizomes. In these plants, the deep fAPAR drop from mid-June (anthesis) to early fall corresponds to the observed dry summers (Figures S1 and S2) and testifies to the increased susceptibility to drought stress in these plants. The conclusions drawn on the fAPAR profile held particularly for fields established in the same location. Therefore, further investigations with replications in time and space are in order before any generalization is made.
In this work, high levels of model prediction accuracy (r ≥ 0.70 or r² ≥ 0.50) were obtained for 12 out of the 13 models deployed at the best prediction time (May). The models were therefore able to explain at least about 50% of the variability that existed in the sorghum biomass yield data, while the remaining variance can be related to other factors not accounted for in this study, such as the heterogeneity of external environmental and anthropogenic factors, including rainfall distribution, soil types, and planting/tilling practices, that could lead to different yield responses across the farms. The modeling performance metrics achieved in this work are nonetheless comparable to previous findings. For instance, Battude et al. [58], Shafian et al. [31], and Panda et al. [30] reported similar accuracy in their work on maize biomass and grain yields, and sorghum yields, respectively. On the other hand, the accuracy achieved in this work was greater than or equal to the values reported in Gao et al. [51], Diouf et al. [14], and the optimal values in sorghum as presented in Johnson [16]. Linear and nonlinear models performed comparably in terms of accuracy and mean absolute error (MAE), implying that the relationship between fAPAR and biomass yield was mainly linear, which was expected and is also supported by previous findings [14]. At the best prediction time (month of May), the correlation between the model accuracies and the MAE values was negative, denoting the expected inverse relationship between the two metrics of model prediction performance.
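The two performance metrics discussed above can be illustrated with a short sketch. It computes the Pearson correlation r between observed and predicted yields, its square, and the MAE. The function names and the yield values (in t ha⁻¹) are hypothetical and are not data from this study. Expressing MAE as a percentage of the mean observed yield is an assumption, since the text does not define how the percentage MAE was obtained.

```python
import math


def pearson_r(observed, predicted):
    """Pearson correlation between observed and predicted values."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    ss_o = sum((o - mean_o) ** 2 for o in observed)
    ss_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(ss_o * ss_p)


def mean_absolute_error(observed, predicted):
    """Mean absolute error, in the units of the observations."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


def mae_percent(observed, predicted):
    """MAE relative to the mean observed value (assumed definition)."""
    mean_o = sum(observed) / len(observed)
    return 100.0 * mean_absolute_error(observed, predicted) / mean_o


# Hypothetical observed and predicted biomass yields (t/ha).
observed = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 37.0]

r = pearson_r(observed, predicted)
r_squared = r ** 2
mae = mean_absolute_error(observed, predicted)
mae_pct = mae_percent(observed, predicted)

# The accuracy threshold r = 0.70 corresponds to r^2 = 0.49, i.e. about 0.50.
threshold_r_squared = 0.70 ** 2
```

Because r² is the share of variance explained under a linear relationship, a model at the r = 0.70 threshold explains roughly half of the variability in yields.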
The simple linear model was implemented in this work to serve as a benchmark for the more complex models requiring parameter optimization. Since thirteen models were implemented in this study, it is of interest to identify the best algorithms. As biomass sorghum in the Mediterranean region is harvested from the end of August to late November, it is useful to be able to predict biomass production from May to July, allowing the farmer to know the amount to be produced one to six months ahead of harvest [42,51]. SVML, SVML-G, GBT, and LM showed good prediction accuracy (r ≥ 0.50) across the evaluated prediction times, with MAE values (%) of 27.70, 27.70, 19.85, and 33.56, respectively. The GBT model was therefore the best algorithm, as it performed consistently well (r ≥ 0.60) across the prediction times and was associated with the lowest prediction error. The GBT model can therefore be recommended for sorghum biomass yield prediction using Sentinel-2-derived fAPAR as the biophysical variable in the Mediterranean region. This model can be deployed anywhere from May to July without significant loss of accuracy.
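The selection logic described above (keep the models whose accuracy stays above a threshold at every prediction time, then prefer the one with the lowest error) can be sketched as follows. The helper `select_best`, the model names `model_a` to `model_c`, and their values are hypothetical and serve only to illustrate the rule. The second dictionary holds the MAE values (%) reported in the text for the four models that passed the r ≥ 0.50 threshold.

```python
def select_best(results, r_min=0.50):
    """Return the lowest-MAE model among those with r >= r_min at every time.

    `results` maps a model name to (list of r per prediction time, MAE).
    Returns None when no model passes the threshold.
    """
    eligible = {
        name: mae
        for name, (r_by_time, mae) in results.items()
        if min(r_by_time) >= r_min
    }
    if not eligible:
        return None
    return min(eligible, key=eligible.get)


# Hypothetical models and values, for illustration only.
hypothetical = {
    "model_a": ([0.72, 0.41, 0.66], 18.0),  # fails the threshold at one time
    "model_b": ([0.70, 0.62, 0.65], 21.0),
    "model_c": ([0.75, 0.55, 0.58], 30.0),
}
best_hypothetical = select_best(hypothetical)

# MAE values (%) reported in the text for the models with r >= 0.50
# across all evaluated prediction times.
reported_mae = {"SVML": 27.70, "SVML-G": 27.70, "GBT": 19.85, "LM": 33.56}
best_reported = min(reported_mae, key=reported_mae.get)
```

In the hypothetical example the model with the smallest error is rejected because its accuracy drops at one prediction time, which mirrors the emphasis placed above on consistent performance across the season.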
In terms of biomass yield prediction times, June, July, and June-July were the worst times. May, May-June, and May-July showed comparable average accuracies, but accuracy in May was generally high (r ≥ 0.70) across models except GBL. The month of May can therefore be recommended as the best time to predict sorghum biomass yields in the Mediterranean region. In this work, several types of sorghum were used, including high tonnage, sweet, forage, and dual purpose types. The suitability of the month of May for sorghum biomass yield prediction can be partly explained by the fact that in early sorghum growth stages, particularly around the fast growth stage, the four sorghum types exhibit similar levels of growth and development. Furthermore, sorghum as currently grown in the Mediterranean region generally reaches the fast growth stage in the month of May, meaning that predictions run in May are carried out on populations of sorghum types that are mostly at the same stage of growth and development. Overall, days of year 150 and 165 were the most important regressors, followed by days of year 135, 195, and 210 in decreasing order. The two regressors acquired in May (DoY 150 and 135) had important direct effects on sorghum biomass, which justifies the good prediction accuracies obtained in this month. On the other hand, the two regressors corresponding to the month of July were of low importance for biomass yields, while one of the two regressors corresponding to the month of June had no meaningful effect on biomass yields, all of which explains the poor prediction accuracies obtained in June, July, and June-July (Table 2). In the Mediterranean region, sorghum is sown in mid-to-late April. Therefore, being able to perform accurate sorghum biomass yield prediction in May, i.e., up to six months ahead of harvesting, is a remarkable opportunity for farmers and farming cooperatives, who can use this information for several business-related purposes. They can efficiently organize biomass business operations, including the rational mobilization of fleets of harvesting machinery, transport vehicles, and storage facilities. The predictive models developed in this work can also be used by extension services and policy-makers for strategic purposes. Obtaining information on potential within-season biomass availability well before the actual harvest will help assess alternative means of energy supply (domestic production, import, or export), which is expected to help avoid specific crises such as fuel shortages. The findings in this work are limited in scope to one province in Italy, within the Mediterranean region. The prediction equations produced in this work can therefore be safely used in analogous modeling experiments in other Mediterranean areas. However, for these equations to be extended to modeling activity at a global level, the training populations of farms would need to be updated with data from additional latitudes and longitudes relevant for sorghum cultivation.
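The assignment of the six day-of-year regressors to the months of May, June, and July used in the discussion above can be checked with a short sketch. The helper name `doy_to_date` is hypothetical; the conversion uses the two study years, 2017 and 2018, neither of which is a leap year.

```python
from datetime import date, timedelta


def doy_to_date(year, doy):
    """Convert a 1-based day of year to a calendar date."""
    return date(year, 1, 1) + timedelta(days=doy - 1)


# Day-of-year regressors named in the text.
DOYS = [135, 150, 165, 180, 195, 210]


def group_by_month(year, doys):
    """Map month number -> list of DoY values falling in that month."""
    grouped = {}
    for doy in doys:
        grouped.setdefault(doy_to_date(year, doy).month, []).append(doy)
    return grouped


months_2017 = group_by_month(2017, DOYS)
months_2018 = group_by_month(2018, DOYS)
```

In both years, DoY 135 and 150 fall in May, DoY 165 and 180 in June, and DoY 195 and 210 in July, which matches the two regressors per month referred to above.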

Conclusions
The importance of sorghum as a food, feed, and biofuel crop has been amply described in the scientific literature. Biomass sorghum demonstrated higher yields with a better energy balance relative to major crops of agroindustrial interest. As dedicated biomass sorghum crops are steadily increasing and precision farming is driving agricultural economies worldwide, harnessing satellite technology is well-poised to bring about agricultural advantages, including cutting farming operational costs. Sentinel-2-derived fraction of absorbed photosynthetically active radiation was found to satisfactorily explain primary productivity and was used in this study as the biophysical variable in the predictive modeling of aboveground biomass yields in annual and perennial sorghums. Across month combinations from May to July and the thirteen machine learning prediction algorithms used in this work, the gradient boosting machine learning algorithm implementing xgbtree was identified as the best predictive model. The best prediction time for sorghum biomass was the month of May, followed by May-June and May-July, using fortnightly fAPAR indices. To the best of our knowledge, the present work represents the first time Sentinel-2-derived fAPAR is used in predictive modeling of sorghum biomass yields. The outcome of this study can serve several purposes, including enabling farmers to improve their sorghum biomass business operations. Policy-makers and extension services will also benefit from the findings in this work, which give them early within-season information on potential biomass availability; this information is critical to wider energy planning and to avoiding energy-related crises.

Figure 1. Map of Italy (A) with a rectangle inset indicating the geographical location of the experimental sites (red dots) for pilots established in 2017 (B) and 2018 (C).

Figure 2. Greenness (fAPAR) maps derived from Sentinel-2 satellite imagery for five sorghum fields in Anzola (from left to right: T5-grain sorghum, T4-dual purpose sorghum, T3-sweet sorghum, T2-forage sorghum, T1-biomass sorghum) for a selected number of dates in 2017, as available via WatchITgrow. T5-grain sorghum was not included in this study (refer to Section 2.1 for detail).

Figure 4. Visualization of the dispersion of model MAE (t ha⁻¹) using boxplots and fAPAR acquired in May. PLS-DA, PCA-DA, RF, SVML, SVML-G, SVM-R, SVM-P, NNET, GBT, GBD, GBL, LM, and NLNET denote, respectively, partial least squares discriminant analysis, principal component analysis discriminant analysis, random forest, Support Vector Machines with Linear Kernel, Support Vector Machines with Linear Kernel grid search, Support Vector Machines with Radial Basis Function Kernel, Support Vector Machines with Polynomial Kernel, neural network, eXtreme Gradient Boosting xgbtree method, eXtreme Gradient Boosting xgbDART method, eXtreme Gradient Boosting xgbLinear method, Linear model, and Neural Network neuralnet method.

Figure 5. Relative importance of regressors (day of year, D) on sorghum biomass yields in 2017 and 2018, using eXtreme Gradient Boosting xgbtree (GBT) method.
