Machine Learning Applied to Tree Crop Yield Prediction Using Field Data and Satellite Imagery: A Case Study in a Citrus Orchard

: The overall goal of this study is to deﬁne an intelligent system for predicting citrus fruit yield before the harvest period. This system uses a machine learning algorithm trained on historical ﬁeld data combined with spectral information extracted from satellite images. To this end, we used 5 years of historical data for a Moroccan orchard composed of 50 parcels. These data are related to climate, amount of water used for irrigation, fertilization products by dose, phytosanitary treatment dose, parcel size, and root-stock type on each parcel. Additionally, two very popular indices, the normalized difference vegetation index and normalized difference water index were extracted from Sentinel 2 and Landsat satellite images to improve prediction scores. We managed to build a total dataset composed of 250 rows, representing the 50 parcels over a period of 5 years labeled with the yield of each parcel. Several machine learning algorithms were tested with the necessary parameter optimization, while the orthonormal automatic pursuit algorithm gave good prediction scores of 0.2489 (MAE: Mean Absolute Error) and 0.0843 (MSE: Mean Squared Error). Finally, the approach followed in this study shows excellent potential for fruit yield prediction. In fact, the test was performed on a citrus orchard, but the same approach can be used on other tree crops to achieve the same goal.


Introduction
Precision farming is a new field that has used new technologies such as artificial intelligence to improve agriculture around the world [1][2][3][4]. One of the main challenges in this field is yield prediction [5], this information is vital for farmers to have an idea of the orchard's production and for decision makers to compare demand and supply to make the decisions necessary to balance the market [6].
Yield estimation can be conducted for annual or tree crops. In the case of tree crops, the existing works can be split into two parts. The first concerns the estimation of the yield per tree. The idea is to prepare the training dataset by collecting data from each tree individually, which means that the model should take data from only one tree as an input and should be able to estimate the yield of this tree. In this case, the existing articles focus more on counting trees and detecting fruits inside. The idea is to use specific object detection algorithms such as Faster RCNN, Yolo, and retinanet (deep learning algorithms), which take as input an RGB (red, green, and blue) image of a tree in order to count its fruit [7][8][9][10]. The problem with this kind of method is that we have to wait for the fruit to ripen to obtain results, which is not good for farmers and producers who want to predict the yield a few months before harvest. Moreover, this method needs a lot of effort and becomes difficult when we have large orchards with hundreds of parcels with thousands of trees in each parcel.
The second part concerns the estimation of the yield per parcel, which means that the model must predict the total yield of a given parcel. To achieve this goal, a large historical dataset with a maximum of factors correlated with yield is needed. In this regard, some data can be collected directly in the field by parcel and during each year, such as the quantity of water used for irrigation, the fertilization and phytosanitary treatment products used, the root-stock type, and climatic data [11][12][13][14][15]. Other spectral information such as vegetation and canopy size can be obtained from satellite or unmanned aerial vehicles (UAV) images [16][17][18]. Xujun Ye et al. [19] tried to predict the yield of Satsuma tangerine based on aerial hyperspectral images. The data used are composed of some vegetation indices combined with the canopy size of trees. The indices extracted are: Normalized Difference Vegetation Index (NDVI), Simple Ratio (SR), and Photochemical Reflectance Index (PRI). Three images were used to obtain these indices over a period of three years (one image per year). For the prediction model, a multiple linear regression (MLR) and partial least squares (PLS) regression were used, and the result obtained is 0.5355 RRMSE (Predictive Root Mean Squared Error).
Although spectral data and canopy size for predicting tree yield are important, we found that most of the existing works use paid satellite data (high resolution images), which are very expensive for farmers, and even with these images, the results are not convincing. So, as we said at the beginning, the integration of field data is necessary to improve the prediction results. De Ollas et al. [20] conducted a review study on climate change in the Mediterranean basin and its impact on crop productivity in terms of yield and quality. The study included four crops: sweet orange, clementine citrus, olive, and grapevine. They found that the Mediterranean region is experiencing a significant climate change in terms of decreased rainfall and a sharp increase in temperature, which significantly affects tree production.
Vogel et al. [21] used the random forest algorithm to study the impact of climate change on yield. A global dataset containing several agricultural crops was used, and they found that certain climatic variables (temperature, precipitation, frost, hot days, and cold nights) could impact the yield every year with a rate varying between 20 and 49%.
Good irrigation and the amount of water used in each parcel are also vital for yield. Nagaz et al. [22] conducted a deficit irrigation experiment on two orange orchards, and the result shows that the yield is reduced by 24% and 45%, respectively, in the two orchards, which means that the amount of water used for irrigation directly influences the tree yield.
Fertilization and soil analysis are also two vital variables that have attracted many researchers in recent years. These factors are considered the subject of many studies [23][24][25][26], for example, the study of Zhiguo Li et al. [25] demonstrates that balanced fertilization in nitrogen (N), phosphorus (P), and potassium (K) is necessary to have a perfect yield, which means that the dose of each product used in the fertilization will help the prediction model give a high score in the yield prediction.
Despite all these works that demonstrated the importance of field data for yield prediction, collecting these data remains a challenge for precision agriculture. This requires specific sensors with a technical team working in the field to control the quality of the data collected, which is why historical and open source databases are very limited [27,28].
The main objective of this article is to present our approach for citrus yield prediction per parcel based on the combination of field data and spectral information obtained from satellite images.
In the remaining part, we provide an explanation of how we collected our dataset depending on some important factors, the data preparation methods that we used, the machine learning algorithms that we developed with the necessary optimizations, and finally, the validation method with the scores obtained.

Study Region
The study orchard is located in Morocco, in the center of the Marrakech-Safi region. The climate of this region is semi-arid and experiences large temporal fluctuations in rainfall and temperature. The average annual rainfall is 199.6 mm, and the average annual temperature is 18.5°C [29]. This orchard is planted with citrus, with a variety of mandarin Afourer, and the parcels size varies between 1 and 6 hectares.

Data Acquisition
Data collection is a vital part of any data science project [30]. In our case, our dataset includes two parts: field data, which is collected directly from the orchard, and spectral data obtained from satellite images. For the first data category, we have 5 years of historical information about irrigation, fertilization, and phytosanitary treatment. Moreover, a climate station is installed inside the orchard to collect daily data on temperature, precipitation, humidity, wind speed, and solar radiation. In detail, our field dataset is made up of 250 rows, which represent 50 parcels multiplied by 5 years; these data present the quantity of water irrigated in each parcel for each agricultural year, the dose of phytosanitary treatment used, and the different products with the doses that were used to fertilize the parcels, as well as the climatic data. This data is labeled according to the yield of each parcel during each crop year.
The different products used for fertilization and phytosanitary treatment during the 5 years of data are presented in the Table 1.  The second part of the data (spectral information) is composed of two elements: the Normalized Difference Vegetation Index (NDVI) and the Normalized Difference Water Stress Index (NDWI). These indices are obtained from the sentinel-2 [31] and Landsat [32] satellite imagery, which are highly used in the agriculture field [33][34][35].
The NDVI and NDWI formulas are presented in Equations (1) and (2), respectively, of which near-infrared radiation (NIR) is a reflection in the near-infrared spectrum, the Red is a reflection in the red range of the spectrum, and the Short Wave Infrared (SWIR) is the part of the range with wavelengths in the range of 0.841-0.876 nm.

Data Processing
To exploit the maximum field data in order to create a robust prediction model, each part of the data (factor) is prepared according to a specific method as follows: Phytosanitary treatment (C1): It consists of the quantities and doses used over the agricultural year. These products are used against certain specific diseases which attack citrus fruits such as mites, ceratitis, snails, etc. (Table 1). Sometimes, these products are used in certain parcels among all, or in a specific year only. We therefore assigned a 0 to the empty cells to complete our dataset. After having prepared this part of the data, we obtained 52 columns, which contain the quantities and doses used in each parcel during the 5 years.
Fertilization (C2): It consists of the quantities and doses of fertilization products used during each year in each parcel (Table 1). For the preparation method, the same operation used for phytosanitary treatment was followed for the fertilization part; so, we obtained 100 columns including the different quantities and doses of fertilization products used in each parcel during each year.
Climate (C3): It is composed of 5 variables (temperature, precipitation, humidity, wind speed, and solar radiation), which were averaged for each month to obtain 60 columns.
Irrigation (C4): We have just 1 column, which contains the total amount of water used in each parcel during each year.
Parcels information (C5): It consists of some fixed information about each parcel such as the size, the root-stock type, and agricultural year.
Yield (target): It presents the number of crates from each parcel after the harvest period. We normalized it and, of course, this normalization keeps the same distribution and percentage difference.
Finally, we combined all these columns to obtain a large dataset with a size of 216,250 (columns, rows) labeled by the yield of each parcel during the 5 years. Figure 1 summarizes these factors and how we prepared them to construct the field data. For spectral information, we used Google Earth Engine (cloud) to extract NDVI and NDWI indices over a specific period (8 months) from March to October each year. This decision was made based on the life cycle of the Afourer mandarin in Morocco, which begins in March after the harvest period. Then, we took the average of these indices during the 8 months to obtain the NDVI and NDWI matrices per each parcel during each year. Finally, we extracted the mean, max, min, and mode values from those bands to build our spectral data (Figure 2).

Data Exploration
Before starting the prediction phase using machine learning algorithms, data exploration is a vital phase that helps to have a clear view of the variables and make the necessary feature engineering. In our case, the target distribution ( Figure 3), which starts from 2015 to 2019 with 50 parcels, follows a normal distribution, which is very beneficial for machine learning models. In addition, Figure 4 presents the variation of the yield in some parcels during the 5 years of our data. The bar graphs in the figure show four samples taken at random from different locations in the orchard. Additionally, in each graph, the bars presents the normalized yield by year. As you see, we have a considerable variation of yield in each parcel; this means that there are other factors related to field and climate influencing this yield. In order to summarize the dataset, we tried to create a correlation matrix which represents all the information hidden behind the features, the relationship between them, as well as their relationship with the target [36]. Figure 5 presents a correlation matrix (heatmap) of the input variables. As we have many variables, and it is impossible to visualize them in the figure, we tried to reduce the number of variables; for example, we averaged the values for each climatic variable over each year. Moreover, for fertilization and phytosanitary treatment products, we only visualized their quantity.
As you see in the heatmap, a lot of variables are positively correlated (red color), and others are negatively correlated (blue color), which means that there is a good dependency between them. For example, we see that the climatic data are correlated with phytosanitary treatment features. The interpretation of these results is that there are diseases which attack the trees at a specific time of the year (for example Ceratitis capitata, which attacks citrus fruits in winter [37]). Moreover, the correlation between fertilization and climate is strongly demonstrated for the reason that the quantities of fertilization used during the year sometimes depend on high temperature or on water stress, which is correlated with the temperature [38]. This dependency is very important for the prediction algorithms to perform good training, and it can help us to make some feature engineering.
Regarding the correlation between the input variables and the target (Figure 6), we found that some variables such as quantity AMMONITRATE, parcels size, nitrogen, phosphorus, potassium, etc., have a good correlation of up to 0.5 and 0.6, which is a positive indicator before starting the prediction phase. We also found that the variables related to the phytosanitary treatment have a good correlation with the target, because the quantities and the products used depend on the disease that attacked the parcel, which surely influence the yield of this parcel. Generally, there is an agricultural expert who checks the parcels each time, and when they observe a disease or an alteration requiring a specific product, they initiate the treatment operation in the infected area.

Our Approach
Machine learning algorithms have demonstrated good performance in recent years, which are used in most fields [39,40]. In our case, after collecting and preparing our dataset, we decided to use machine learning algorithms to predict the yield of citrus fruits, in particular the Afourer variety.
Technically, we have a regression problem, as we are trying to predict yield quantities. Basically, we have a supervised learning problem with regression prediction. In this case, there are linear algorithms such as linear regression, lasso regression, ridge regression, etc. These algorithms analyze the relationships between the dependent variable Y and the set of independent variables X. This relationship is expressed in the form of an equation that predicts the values of the target variable in the form of a linear combination of parameters [41].
Other algorithms are widely used, such as decision tree regression, which determines the best feature in the training dataset after dividing it into subsets containing the possible values of the best feature. After this, the algorithm recursively generates new decision trees using the created data subsets. When there is no more prediction based on this data, the algorithm stops and returns the final model [42].
In addition, an ensembling learning technique has been used in recent years. This technique builds multiple models with different parameters to increase the prediction score. There are two main ensembling methods, boosting (sequential) and bagging (parallel) [43]. In the boosting ensemble learning method, the models train sequentially, and they try to learn until they make mistakes. Regarding the bagging method, the models are trained simultaneously, each one of them trains on a subsample of data, and the final model builds based on voting among the final predictions [44].
Through ensembling learning techniques, several powerful algorithms have been built, such as CatBoost Regressor, Light Gradient Boosting Machine, Random Forest Regressor, Extreme Gradient Boosting, etc. [45,46]. Finally, a cross-validation technique with a tuning parameter algorithm is needed to obtain the hyper-parameters and avoid overfitting [47].
In our case, we split our dataset into two parts, the training part and the test part. In fact, this split is not random, we tried to imagine that our model could be used to predict the yield of an agricultural year before the harvest period, so we took the data for the years 2015, 2016, 2017, and 2018 to train the models, and we made the test using the year 2019 (we used the past to predict the future). Figure 7 summarizes the steps of our approach.
To avoid overfitting and to select the best algorithm, we used the K-fold technique with 4 folds in the training and validation phase. During each round, a grid-search technique is applied to find the best parameters for each algorithm. After the four iterations, we obtained an average score, which can be considered as the trust validation score. Then, we tested the models on another part of the data (test), which is completely new, in order to have the prediction score. Figure 8 illustrates the cross-validation steps.

Results and Discussion
To choose the best algorithm that fits our dataset, we tested several types of algorithms: linear, decision trees, and some ensembling algorithms.
The metrics used for scoring are MAE and MSE, whose formulas are presented successively in Equations (3) and (4), where y i is the prediction, x i is the real value, and n is the total number of data points.

Cross-Validation and Model Selection
In the first time, we used all the data (field and spectral data) to perform crossvalidation. A k-fold technique with k = 4 is used, and the scores obtained are presented in Table 2.
The results in Table 2 show a good performance in the prediction, with an average score approaching 0.11 (MSE), this indicates that the data really helped the models to train well. Moreover, the Orthogonal Matching Pursuit algorithm [48] gave a good score of 0.10 (MSE CV).
After performing cross-validation and parameter tuning using the grid-search algorithm, we tried to train the new models on the all the training datasets and test them on other new data that they have not really seen before (data corresponding to the year 2019). Moreover, to see the importance of field data and the added value of spectral data, we tested the models on two parts of data, the first part corresponds to the indicators collected in the field (field data), and the second part contains the total data which contain the spectral indicators (field data + spectral information). The test results are presented in Table 3. As you see in Table 3, the scores obtained by the chosen algorithms are very close to each other, but the OMP algorithm gave good prediction scores, which are (0.2489 (MAE) and 0.0843 (MSE)). After adding the NDVI and NDWI indices, the scores improved to (0.2315 (MAE) and 0.0748 (MSE)), which demonstrates the importance of satellite imagery in predicting yield.
The hyper-parameters of the OMP algorithm that we obtained after the grid search, and which were used to train the final model, are: OrthogonalMatchingPursuit( f it_intercept = True, n_nonzero_coe f s = 44, normalize = True, precompute = 'auto , tol = None) where the fit_intercept concerning the include an intercept of the model (Boolean), n_nonzero_coefsint, is the desired number of non-zero entries in the solution. The normalized parameter, which concerns the normalization of input data or not, precomputes to speed up the calculations, and tol is the maximum norm of the residual.
To find out more about the OMP algorithm, the Algorithm 1 contains its learning steps and how it chooses the right variables.

Algorithm 1 Orthogonal Matching Pursuit
Input: y, φ Initialization: r 0 = y, A 0 = , l = 0; Repeat l = l + 1; match step: h l = Φ T r l−1 ; identify step: x l argmin z:supp(z)⊆A l y − φz 2 ; r l = y − φx l ; Until stop criterion satisfied; output : x k ; The key idea of the OMP algorithm is that it tries to reconstruct the support set A of x iteratively, starting with A = . Then, in each iteration l, the inner products between the columns of φ and the residuals r l−1 are calculated, then the absolute value of this inner product is added to A. Here, the residual r l−1 from the former iteration represents the component of the measurement vector y that cannot be spanned by the columns of φ indexed by A. In this way, the columns φ, which are the most relative to y, are iteratively chosen [48].
In general, the OMP algorithm classifies the variables by their correlation with the target before integrating them one by one for learning [49]. This technique is very effective when the dataset contains many variables with a minimum of rows (small dataset).

Discussion of Results
In order to better understand and discuss the results obtained, we calculated the percentage error between the measured value and the actual value for each parcel (Formula (5)).
percentage error = |Measured value -True value| True value (5) Table 4 contains the prediction scores obtained by our model on 50 test parcels. The first column shows the scores obtained using parcel information and climate data only, the second column presents the scores after adding phytosanitary treatment data, the third column shows the scores after adding fertilization data, and finally, the last column contains the scores that are obtained using all field data combined with spectral information (NDVI and NDWI). As you see in Table 4, the parcel information combined with climate data (column 1) does not give a good yield prediction result, the average prediction does not exceed 0.43, and some parcels have scores ranging from up to 100% error. In general, the climatic data are presented globally and are identical for all the parcels, this does not really help the model to exploit the data, because it finds the same input with different output (yields), hence the fact that parcel information gave meaning to the climate data. After adding the phytosanitary treatment data, the scores are improved to 0.30 on average (column 2). We can clearly see that some parcels have experienced a significant improvement (0,1,3,6,7,8,10,13,21,34 . . . ); this reflects that the yield of these parcels is very correlated with phytosanitary treatments products. Moreover, we can conclude that these parcels are more likely to be exposed to the diseases and insects that affect the crop the most. Other parcels' prediction scores did not have much difference (2,12,16,23,25,30 . . . ), which means that these parcels are little affected, or they are more resistant to diseases than others.
In the third step (column 3), we added the fertilization data. We can clearly see that the average score improved to 0.20; this score is very interesting and clearly shows that the combination of climatic, phytosanitary treatment, and fertilization data is very important for yield prediction. In this sense, we found that the score of the most parcels improved positively. We also see that some parcels, which were not improved before, improved after adding the fertilization data, such as parcel 26, which is predicted with a score of 0.08, and parcel 27 with a score of 0.07. Although some parcels failed to improve their prediction scores, such as the parcels 24 and 45, the prediction results using only field data are quite interesting and original.
Finally, after adding the NDVI and NDWI data, the average error rate went from 0.20 to 0.16, which means that the results improved by more than 4%. Thus, the spectral data played an important role in improving the results of many parcels, such as parcel 1, which went from 15% error to 9%, and parcel 12 from 14% to 9%.
In general, some parcels were predicted with a very good score not exceeding 5% error, such as parcels 3, 21, 26, 30, 31, 34, and 44. Moreover, most of the scores do not exceed 20% error, which indicates that the data used is very interesting, and the approach followed in this study showed the ability to generate excellent predictive results.
To conclude, the results obtained in this study are very satisfactory compared with the state of the art (an average error of no more than 0.162). We tried to work with data (irrigation, fertilization, phytosanitary treatments, and climate) easily collectible thanks to the current evolution of agriculture, and we used open-source satellites that provide free images. Therefore, this solution is inexpensive and farmers can use it to obtain a real idea of the yield before the harvest period.
Finally, a bar chart graph was constructed (Figure 9) to show a visual comparison between actual yield quantities (red color) and the predicted ones (green color). As you see, there are some parcels that are very well-predicted (1,3,26,27,30,31,34,44). Other parcels are very close, and others are not bad (0, 11,23,24). However, overall, we are satisfied with these results, and we recommend farmers start collecting field data to obtain a proactive view of their yield.

Conclusions
Predicting tree yield is one of the most difficult tasks in modern agriculture. Most of the existing work attempts to predict the productivity of trees by counting their fruits [18,[50][51][52].
This method is expensive and difficult, especially in the case of large orchards. In addition, farmers need to know their yield at least one month before harvest and in a simple and fast way. The main contribution of our research is to show the importance of field data in predicting the yield of a parcel of trees.
The results obtained are very good and the power of field data was demonstrated. In particular, data such as fertilization and phytosanitary treatment have greatly helped our prediction model. Moreover, the proposed approach requires only data, it is inexpensive, fast, and generates the result immediately.
Finally, even with the good results obtained, our team continues daily data collection, and we are sure that feeding our dataset will improve the prediction scores. Moreover, we trust that this experience can be generalized to other tree crops. Funding: This research was funded by the Hassan II Academy of Science and Technology under the project entitled "multispectral satellite imagery, data mining, and agricultural applications".