# A Machine Learning Framework for Olive Farms Profit Prediction

## Abstract

## 1. Introduction

## 2. Methodology

#### 2.1. Classification vs. Regression

#### 2.2. Causal Analysis and One-Hot Encoding Variables

#### 2.2.1. One-Hot Encoding

#### 2.2.2. The Dummy Variable Trap

#### 2.3. The Problem of Overfitting. Bias and Variance

- The model makes wrong assumptions. We are not referring to an obvious misdiagnosis, such as declaring a patient ill when he is in fact healthy. Consider instead a model that predicts a loan candidate is not trustworthy just because her name is Jane. These examples come from classification problems, but the key concept is also valid for regression: the algorithm has failed to capture the relations between the predictors and the dependent variable or, perhaps more importantly, the group of predictors was badly chosen [47,48]. This is commonly referred to as underfitting [49]. Underfitting usually characterizes overly simple models. Simplicity refers not only to non-complex models but also to the omission of the data manipulation and preparation steps discussed in our proposed framework. Mishandling data likewise includes leaving outlier values untreated and failing to remove useless or highly correlated features, which do not add value to the output variable.
- The model is very sensitive to fluctuations in the observations we use during training. Inconsistency among the data may be due to noise or outliers, but it can also reflect rare but anticipated behavior. The algorithm can become overly complex trying to capture the noise and every inconsistency. Success during training does not indicate that the model will be effective when dealing with new, unknown data. It will become apparent, especially after multiple training executions, that it fails to generalize to new data [50]: its estimates show a wide spread relative to the actual observations. This is referred to as overfitting.
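The two failure modes can be illustrated with a minimal sketch on synthetic data. The data-generating line, the noise level, and the three toy models (a constant mean, a least-squares line, and a 1-nearest-neighbour memorizer) are assumptions chosen for illustration and are not the models used in the case study.

```python
import random


def mse(y_true, y_pred):
    """Mean squared error between two equally long sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def fit_mean(xs, ys):
    """Underfitting candidate: ignore x and always predict the training mean."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y


def fit_line(xs, ys):
    """Ordinary least squares with a single predictor."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    return lambda x: intercept + slope * x


def fit_1nn(xs, ys):
    """Overfitting candidate: memorize the training set (1-nearest neighbour)."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]


def make_data(n, offset, rng):
    """Synthetic data (assumption): y = 2x plus Gaussian noise."""
    xs = [(i + offset) / n for i in range(n)]
    ys = [2.0 * x + rng.gauss(0.0, 0.3) for x in xs]
    return xs, ys


rng = random.Random(0)
x_train, y_train = make_data(40, 0.0, rng)
x_test, y_test = make_data(40, 0.5, rng)

results = {}
for name, fit in [("mean", fit_mean), ("line", fit_line), ("1-nn", fit_1nn)]:
    model = fit(x_train, y_train)
    train_err = mse(y_train, [model(x) for x in x_train])
    test_err = mse(y_test, [model(x) for x in x_test])
    results[name] = (train_err, test_err)
    print(f"{name:5s} train MSE={train_err:.4f}  test MSE={test_err:.4f}")
```

The constant model has a large error on both sets (underfitting), whereas the memorizer reproduces the training targets exactly but does not carry that performance over to unseen points (overfitting).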

#### 2.4. Dataset Splitting. Training and Test Sets

#### 2.4.1. Training and Test Sets

#### 2.4.2. Splitting Strategy

#### 2.4.3. Splitting Timing

#### 2.5. Exploratory Data Analysis

- Descriptive Analysis. During this step, characteristics of the dataset were examined, such as dimensions, types of variables, and statistical summaries to get a view of the data.
- Visualizations. Plotting single and multiple variables values led to a better understanding of each feature and the relations among them.
- Cleaning. This involved duplicates removal, locating missing values, and techniques to fill in for missing data.
- Transforms. Data could be further processed or massaged without altering the quality or the patterns they convey. Altering the scales, examining their distributions, and readjusting were methods to better fit the structure of our information to the algorithms [61].
- Standardizing values was extremely useful because it provided a convenient way to compare values that were part of different distributions. A dataset is standardized when the input features are transformed to have a mean close to zero and a standard deviation close to 1. The effect was that the data were centered and rescaled so that, for roughly Gaussian features, they resembled a standard normal distribution. Standardization assists Machine Learning algorithms like k-nearest neighbors, linear regression, and support vector machines to build more robust models [46]. Standardization was performed by subtracting the mean (μ) from each observation (x) and dividing the result by the standard deviation (σ) of the feature [62]:$$Z=\frac{x-\mu}{\sigma}$$
- Scaling changes the values of a feature down to a specific range, usually [0, 1] [62]. Because the range is defined by the minimum and maximum values, the presence of outliers affects the scaling process [47]. It is most useful when the input variables differ widely in magnitude. Transforming them to a common range can enhance the execution of Machine Learning algorithms [63].
- Regularization is an approach to treat poor performance caused by high collinearity among input variables [46]. The concept applied by regularization methods was to penalize increasing complexity during the modeling process, thus preventing overfitting. Preprocessing like scaling and standardization was highly important for the regularization treatment because it brought the values of the variables to comparable scales and ranges. Approaches include L1 regularization, L2 regularization, and dropout. Regularization is embedded in algorithms like Lasso, Ridge regression, and Elastic Net. Lasso and Elastic Net, due to the nature of their penalizing mechanisms, can be considered methods that also perform automatic feature selection, as described below.
- Feature Engineering and Selection. Having many feature variables participate in the training process of a model is not always a road to success [57]. It may seem logical that the more inputs we possess (no matter the number of observations), the better the prediction we can achieve. This is a misconception that requires attention in Machine Learning projects. Not every feature at our disposal can contribute to the predictive value of a model. The mere fact that we had historical information on a feature does not make it useful. On the contrary, it may have a negative impact by causing, for example, unnecessary bias. Moreover, the collinearity among features was very important [57]. Collinear predictors have a negative impact on the modeling process most of the time [52]. Therefore, it was required that we (a) checked for predictors which did not contribute to the predictive power, (b) eliminated predictors which were highly correlated, and (c) constructed new appropriate ones if needed. A high association among variables was indicated by increased dependency. Both the strength and the direction of the association were important. There are methods for calculating dependency between discrete and continuous variables. Pearson’s χ² Statistic, Cramer’s V Statistic, and Contingency Coefficient C [41] are very popular for examining discrete variables. Pearson’s correlation coefficient, used for continuous variables, is based on the mean and the standard deviation [41]. Therefore, the samples needed to have a Gaussian or near-Gaussian distribution [64]. This is where transforms played a major role in preparing the data for analysis. Transforms were expected to help in the proposed framework because input variables describe different physical measures, which are quite dissimilar in ranges and values.
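The transforms and the correlation check described in this list can be sketched as follows. This is a minimal standard-library illustration and not the code used in the study: the `irrigation` and `profit` values are hypothetical, and the population standard deviation is an assumed choice for the z-score.

```python
import math
import statistics


def standardize(values):
    """Z-score: subtract the mean, divide by the population standard deviation."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return [(v - mu) / sigma for v in values]


def min_max_scale(values, low=0.0, high=1.0):
    """Rescale values linearly so that min -> low and max -> high."""
    vmin, vmax = min(values), max(values)
    return [low + (v - vmin) * (high - low) / (vmax - vmin) for v in values]


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Hypothetical feature and target values, for illustration only.
irrigation = [40.0, 55.0, 70.0, 85.0, 100.0]
profit = [0.62, 0.71, 0.80, 0.86, 0.97]

z = standardize(irrigation)
s = min_max_scale(irrigation)

print("standardized:", [round(v, 3) for v in z])
print("scaled      :", [round(v, 3) for v in s])
print("Pearson r   :", round(pearson_r(irrigation, profit), 4))
```

Both transforms are linear, so they change the scale of a feature but leave its Pearson correlation with any other variable unchanged.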

#### 2.6. Resampling

#### 2.6.1. Measuring Error

#### 2.6.2. Resampling for Model Assessment

#### 2.6.3. K-Fold Cross-Validation (CV)

#### 2.6.4. Repeated K-Fold Cross-Validation (RCV)

#### 2.6.5. Nested Cross-Validation (NCV)

#### 2.6.6. Leave-One-Out Cross-Validation (LOOCV)

#### 2.7. Machine Learning Algorithms

#### 2.7.1. Algorithms in General

#### 2.7.2. Parametric vs. Non-Parametric Modelling

- Linear Regression
- Bayes Ridge Regression
- Ridge Regression
- LASSO Regression

- K-nearest Neighbors
- Regression Trees
- Support Vector Machines Regression

#### 2.8. Hyperparameter Tuning

#### 2.9. Ensembling

- Extreme Gradient Boosting
- Gradient Boosting
- Random Forests
- Extra Trees

#### 2.10. Performance Metrics

- Mean Absolute Error and Mean Squared Error. The mean absolute error (MAE) is a computationally simple regression error metric. It averages the absolute values of the residuals, where a residual is the difference between an observed value and the corresponding predicted value [42,80]. The equation is shown below:$$\mathrm{MAE}=\frac{1}{n}\sum_{i=1}^{n}\left|y_{i}-\widehat{y}_{i}\right|$$
- The Mean Squared Error (MSE) squares those differences instead of taking their absolute values [42,81]. Both MAE and MSE range from 0 to positive infinity [81]. The mathematical formula is [46]:$$\mathrm{MSE}=\frac{1}{n}\sum_{i=1}^{n}\left(y_{i}-\widehat{y}_{i}\right)^{2}$$

- Root Mean Squared Error (RMSE). It is the square root of the MSE [42,80]. It represents the sample standard deviation of the residuals. In practice, it reveals how spread out the residuals are. It is often preferred over MSE because its units are the same as those of the output variable [80].
- R^{2} or coefficient of determination. R Squared and Adjusted R Squared indicate how well the model fits the data [82,83]. They provide great insight when evaluating the training process but are also useful during the testing phase. Adjusted R^{2} improves on R^{2} in that it better guards against overfitting. The value of R^{2} tends to increase as the number of input features increases [84]. Adjusted R^{2} is not affected by this phenomenon, which otherwise poses a challenge when the number of features in a modeling process is high [85]. R Squared values range from 0 to 1. Approaching 1 indicates a better fit [46]. The formula for R^{2} is [46]:$$R^{2}=1-\frac{\text{Sum of Squared Errors}}{\text{Total Sum of Squares}}=1-\frac{\mathrm{MSE}}{\mathrm{Var}\left(y\right)}$$
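The metrics in this list can be computed directly from their definitions. The sketch below is a minimal standard-library version. The sample values are hypothetical, and the `adjusted_r2` helper uses the common textbook formula, which is an assumption since the text does not state it.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, RMSE and R^2 for paired observations and predictions."""
    n = len(y_true)
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(r) for r in residuals) / n
    mse = sum(r * r for r in residuals) / n
    rmse = math.sqrt(mse)
    mean_y = sum(y_true) / n
    ss_res = sum(r * r for r in residuals)           # sum of squared errors
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2}


def adjusted_r2(r2, n_samples, n_features):
    """Textbook adjusted R^2: penalizes R^2 for the number of predictors."""
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_features - 1)


# Hypothetical observed and predicted values, for illustration only.
observed = [3.0, -0.5, 2.0, 7.0]
predicted = [2.5, 0.0, 2.0, 8.0]

metrics = regression_metrics(observed, predicted)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
print(f"Adjusted R2 (1 feature): {adjusted_r2(metrics['R2'], 4, 1):.4f}")
```

Because RMSE is the square root of MSE, it is reported in the units of the output variable, which makes it easier to read next to R^{2}.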

## 3. Case Study

#### 3.1. Area of Study

- Management Practices
- Soil types
- Precipitation
- Relative irrigation (percentage of the optimum irrigation amount estimated by the hydrological models [28])
- Reduction in the number of irrigation trips

#### 3.2. Classification, Regression and Binning Predictors

- It could produce a model with lower performance.
- There would be a loss in prediction precision due to the fixed combinations of the possible outcomes.
- The number of false positives could increase.

#### 3.3. One-Hot and Label Encoding

- Management practice with values of PH and PL (heavy pruning & light pruning)
- The soil type with values of CL, SL, and LS (Clay, Sandy Loam, Loamy Sand)
- Precipitation with values of Dry, Normal, and Wet

- management.M1
- soil_type.CL
- soil_type.SL
- precipitation
- relative_irrigation
- number_of_trips_reduction
- relative_profit_percentage
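A minimal sketch of how categorical inputs of this kind could be encoded is given below. The choice of reference levels (`PL` and `LS`), the ordinal mapping for precipitation, and the generated column names are assumptions made for illustration. Dropping one level per one-hot variable is the usual way to avoid the dummy variable trap.

```python
def one_hot_drop_reference(value, column, levels, reference):
    """Return {column.level: 0 or 1} for every level except the reference one."""
    if value not in levels:
        raise ValueError(f"unknown {column} level: {value!r}")
    return {
        f"{column}.{level}": int(value == level)
        for level in levels
        if level != reference
    }


# Assumed ordinal (label) encoding for the ordered precipitation categories.
PRECIPITATION_ORDER = {"Dry": 0, "Normal": 1, "Wet": 2}


def encode_row(row):
    """Encode one observation given as a dict of raw categorical values."""
    encoded = {}
    encoded.update(
        one_hot_drop_reference(row["management"], "management", ["PH", "PL"], "PL")
    )
    encoded.update(
        one_hot_drop_reference(row["soil_type"], "soil_type", ["CL", "SL", "LS"], "LS")
    )
    encoded["precipitation"] = PRECIPITATION_ORDER[row["precipitation"]]
    return encoded


example = {"management": "PH", "soil_type": "SL", "precipitation": "Normal"}
print(encode_row(example))
```

With k levels, one-hot encoding with a dropped reference level yields k - 1 binary columns, so the reference level is represented by all of them being zero.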

#### 3.4. Splitting

R^{2} (optimal value is 1) and RMSE (optimal value is 0) were used as performance metrics to evaluate the results.

#### 3.5. Data Analysis

#### 3.6. Resampling

#### 3.6.1. Choosing the Appropriate Resampling Method

#### 3.6.2. Cross-Validation Parameters

#### 3.6.3. Sensitivity Analysis on K

- The median value of the cross-validation results. If the mean accuracy and the median are close values, it is a significant indication that the specific execution reflects the central tendency very well and without skewness [97].
- The standard error. It estimates how far the sample mean is likely to deviate from the population mean. It was useful because it reflected the accuracy of the mean value in representing the data [80,98]. We observed the min and max values across the experiments’ executions, and if the preferred scenario also had a low standard error, it was chosen as the recommended one.
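The two summary measures above can be computed from a set of per-fold scores as in the following sketch. The fold scores are hypothetical, and the standard error is taken as the sample standard deviation divided by the square root of the number of folds.

```python
import math
import statistics


def summarize_scores(scores):
    """Mean, median, standard error, min and max of cross-validation scores."""
    n = len(scores)
    return {
        "mean": statistics.fmean(scores),
        "median": statistics.median(scores),
        "std_err": statistics.stdev(scores) / math.sqrt(n),
        "min": min(scores),
        "max": max(scores),
    }


# Hypothetical per-fold RMSE values, for illustration only.
fold_rmse = [0.061, 0.070, 0.066, 0.064, 0.069]

summary = summarize_scores(fold_rmse)
for name, value in summary.items():
    print(f"{name:8s} {value:.6f}")
```

A mean close to the median suggests that the fold scores are not skewed, and a small standard error suggests that the mean is a stable summary of them.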

#### 3.7. Machine Learning Algorithms

- Linear Regression
- Bayes Ridge Regression
- Ridge Regression
- LASSO Regression
- K-Nearest Neighbors
- CART Decision Trees
- Support Vector Machines Regression (SVMR)

- Extreme Gradient Boosting
- Gradient Boosting
- Random Forests
- Extra Trees

#### 3.8. Hyperparameter Tuning

#### 3.9. Performance Metrics Comparison and Choice

R^{2}, which reflected the generalization quality of the model by displaying the variance in the results [105]. It is worth pointing out that relying solely on R^{2} may not be good practice because a high R^{2} value does not necessarily point to well-fit data [106]. Anscombe’s quartet is an excellent reminder of this [107]. For that reason, we observed both RMSE and R^{2} when assessing model performance.

## 4. Results

- Fit the models based on the group of ML algorithms we chose. No data transform or algorithm tuning is performed at this stage. The performance results after evaluating the algorithms on the test set are given in Table 1:

R^{2} train score refers to the coefficient of determination during training. The last two columns display metrics for the validation phase on the holdout set. The best test performance is given by the SVMR algorithm.

- The same process is repeated by performing stratified continuous splitting and the results are shown in Table 2.
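One way to stratify a split on a continuous target is to bin the target into quantile groups and sample the test fraction from each group. The sketch below follows that idea. The number of bins, the test fraction, and the synthetic targets are assumptions, and the exact procedure used in the study may differ.

```python
import math
import random


def stratified_continuous_split(targets, test_fraction=0.2, n_bins=5, seed=0):
    """Split indices so each quantile bin of the target feeds the test set."""
    rng = random.Random(seed)
    # Order the indices by target value, then cut into equally sized bins.
    order = sorted(range(len(targets)), key=lambda i: targets[i])
    bin_size = math.ceil(len(order) / n_bins)
    train_idx, test_idx = [], []
    for start in range(0, len(order), bin_size):
        members = order[start:start + bin_size]
        rng.shuffle(members)
        n_test = max(1, round(len(members) * test_fraction))
        test_idx.extend(members[:n_test])
        train_idx.extend(members[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Synthetic targets (assumption): the values 0..99 in a scrambled order.
targets = [float((i * 37) % 100) for i in range(100)]
train_idx, test_idx = stratified_continuous_split(targets)

print("train size:", len(train_idx), "test size:", len(test_idx))
for b in range(5):
    in_bin = [i for i in test_idx if 20 * b <= targets[i] < 20 * (b + 1)]
    print(f"test samples with target in [{20 * b}, {20 * (b + 1)}):", len(in_bin))
```

Compared with a purely random split, this keeps the distribution of the target in the test set close to that of the full dataset, which matters most for small datasets.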

- Exploration of the nature of data will reveal feature correlations and distributions. Those will point to feature extractions and data transforms. Table 3 displays the Pearson correlation values between feature pairs.

- Scaling, standardization, and a power transform (Box-Cox) are also applied to the dataset features to help the algorithms’ execution. The results are presented in Table 5.

- In this and the following step, cross-validation experiments will be performed. Standardization is applied to the data before assessing cross-validation performance. Initially, cross-validation is executed on the training dataset with a default value of 10 as the number of folds. The results are sets of root mean squared errors. The boxplots in Figure 1 display cross-validation performance assessment for the algorithms under testing.

- A sensitivity analysis on the value of k in k-fold cross-validation is executed. Tests were performed on values of k in the integer range [2, 40]. As described in Section 3.6.3, the red lines display the optimal (baseline) results given by the LOOCV. In each execution, we observe (a) the blue lines, which display the cross-validation root mean squared error (mean accuracy) and must be close to the red line, as well as (b) the yellow lines, which show the median. Blue segments below the red line (ideal case) are considered pessimistic estimates and those above the line optimistic estimates. The second case means overfitting [110]. The characteristics which point to the optimal k are (a) a small distance from the LOOCV mean and (b) a low standard error. From the execution table values and the line plot, which are summarized in Figure 2, the optimal value for k is 38 with a mean accuracy score of 0.06605177. The difference from the LOOCV median is 0.01193512. The standard error ranges from 0.00203894 to 0.00398050 throughout the experiments. For k = 38 the standard error mean is 0.00264908, a value near the minimum of the executions set.
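The sensitivity analysis can be sketched as a loop over k that compares each k-fold estimate with the LOOCV baseline. The example below is a simplified standard-library illustration. The synthetic data, the single-predictor linear model, and the tie-breaking rule are assumptions, so it does not reproduce the reported numbers.

```python
import math
import random
import statistics


def fit_line(xs, ys):
    """Least-squares intercept and slope for a single predictor."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - slope * mx, slope


def kfold_rmse(xs, ys, k, seed=0):
    """Per-fold RMSE of the linear model under shuffled k-fold CV."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    scores = []
    for f in range(k):
        fold = idx[f::k]                      # every k-th shuffled index
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        b0, b1 = fit_line([xs[i] for i in train], [ys[i] for i in train])
        sq_err = [(ys[i] - (b0 + b1 * xs[i])) ** 2 for i in fold]
        scores.append(math.sqrt(sum(sq_err) / len(sq_err)))
    return scores


def summarize(scores):
    """Mean score and its standard error."""
    return statistics.fmean(scores), statistics.stdev(scores) / math.sqrt(len(scores))


# Synthetic data (assumption): a noisy linear relation with 60 observations.
rng = random.Random(42)
xs = [i / 59 for i in range(60)]
ys = [0.5 + 0.3 * x + rng.gauss(0.0, 0.05) for x in xs]

# LOOCV is k-fold with k = n; each fold RMSE is then one absolute error.
loocv_mean, _ = summarize(kfold_rmse(xs, ys, k=len(xs)))

table = {}
for k in range(2, 41):
    mean_k, se_k = summarize(kfold_rmse(xs, ys, k))
    table[k] = (abs(mean_k - loocv_mean), se_k)

# Prefer the smallest distance to the LOOCV mean, then the lowest standard error.
best_k = min(table, key=lambda k: table[k])
print("LOOCV mean RMSE:", round(loocv_mean, 6))
print("selected k:", best_k, "distance, std. error:", table[best_k])
```

LOOCV is used as the reference because it trains on nearly all observations, and the loop then looks for a much cheaper k whose estimate stays close to it.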

- Support Vector Machines Regressor (SVMR) is the best-performing algorithm in the earlier steps. Tuning its hyperparameters under repeated cross-validation will help to improve its performance. An exhaustive grid search will be used to achieve that, exploiting the optimal values of repeated cross-validation. The exploration of values will be executed inside a nested cross-validation of 10 folds. Standardization is applied on both the validation and the test sets. SVMR has three major hyperparameters for tweaking [111,112]: (A) The kernel. The kernel types that we will test are linear, poly, rbf, and sigmoid. (B) The tolerance for the stopping criterion. The set that will be tested is [0.000001, 0.00001, 0.0001, 0.001, 0.01]. (C) The C regularization parameter, which represents how strict the algorithm will be when there are errors in fitting. The range to test is [1, 1.5, 2, 2.5, 3].
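The mechanics of an exhaustive grid search over these three hyperparameters can be sketched as follows. The grid values are the ones listed above. The scoring function `toy_score` is a hypothetical placeholder for the cross-validated RMSE of an SVMR fitted with the given parameters, so the selected combination carries no meaning for the case study.

```python
from itertools import product

# Hyperparameter grid taken from the text: 4 x 5 x 5 = 100 candidates.
PARAM_GRID = {
    "kernel": ["linear", "poly", "rbf", "sigmoid"],
    "tol": [0.000001, 0.00001, 0.0001, 0.001, 0.01],
    "C": [1, 1.5, 2, 2.5, 3],
}


def iter_grid(param_grid):
    """Yield every combination of the grid as a dict of parameter values."""
    names = list(param_grid)
    for combo in product(*(param_grid[name] for name in names)):
        yield dict(zip(names, combo))


def grid_search(param_grid, score_fn):
    """Return (best_params, best_score); a lower score is better (e.g. RMSE)."""
    best_params, best_score = None, float("inf")
    for params in iter_grid(param_grid):
        score = score_fn(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(params):
    """Hypothetical stand-in for a cross-validated RMSE (not a real model)."""
    kernel_penalty = 0.0 if params["kernel"] == "rbf" else 0.1
    return abs(params["C"] - 2.0) + params["tol"] + kernel_penalty


best_params, best_score = grid_search(PARAM_GRID, toy_score)
print("candidates evaluated:", len(list(iter_grid(PARAM_GRID))))
print("best parameters:", best_params, "score:", best_score)
```

In a nested setup, a search like this runs inside each outer fold, so the outer test folds are never used to choose the hyperparameters. With scikit-learn, the same grid could be passed to `GridSearchCV` wrapped around an `SVR` estimator, whose `kernel`, `tol`, and `C` parameters correspond to the three items above.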

- In this step, we will experiment with ensemble methods. Extreme Gradient Boosting, Gradient Boosting, Random Forests, and Extra Trees are evaluated on the standardized dataset with repeated cross-validation (folds = 15, repeats = 15). The results are presented in Table 8.

- Finally, hyperparameter tuning will be executed on Gradient Boosting to investigate whether its performance can be further enhanced. The number of estimators (N estimators) is the hyperparameter to adjust, and its default value is 100 [113]. Usually, increasing this number improves performance (at a computational cost) [113]. We will test the range [50, 200]. This experiment using nested cross-validation showed that 50 was the best value, with a mean accuracy of 0.051271 and a standard error of 0.006015, improving on the scores obtained with the default values.

## 5. Discussion

R^{2} and a 42.88% improvement in RMSE when testing our proposed model on the holdout dataset.

## 6. Conclusions

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Acknowledgments

## Conflicts of Interest

## Appendix A

- Bagging: The term comes from Bootstrap Aggregating because those two techniques are combined. It works by drawing random bootstrapped subsamples from the initial dataset and training a sub-model on each. Afterwards, the predictions of the sub-models are aggregated (averaged for regression, voted on for classification) to form the predictor.
- Random Forests: There is a resemblance to Bootstrap Aggregation. Bagging is somewhat predictable in behavior because, although resampling is done, the algorithm has all the predictors at hand. In a random forest, multiple trees are produced but they are based on different subsets of predictors. This contributes to greater independence among the base models and leads to powerful aggregation and accurate predictions. The bottom line is that the base models must be as efficient and as diverse as possible.
- Boosting: The principle in boosting algorithms is to combine multiple weak learners into a stronger one. Weight values are attributed to the learners depending on their comparative performance. Higher weights are given to incorrectly predicted cases. In the end, the weighted sum is used for the final prediction. Boosting differs from bagging in that it trains the learners sequentially, each one on the reweighted data.
- Stacking: In stacking, weak base models are exploited but processed in parallel. A weakness of this approach is that each base model equally contributes to the ensemble regardless of how well it performs. They are often heterogeneous methods, meaning that the group of base learners consists of different algorithms.
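As an illustration of the first technique, the following is a minimal sketch of bootstrap aggregating for regression, in which each sub-model is fitted on a bootstrap sample and the sub-model predictions are averaged. The single-predictor linear base learner, the number of sub-models, and the synthetic data are assumptions chosen to keep the example short. Tree-based learners are the more common choice in practice.

```python
import random


def fit_line(xs, ys):
    """Least-squares line with one predictor; returns a prediction function."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    intercept = my - slope * mx
    return lambda x: intercept + slope * x


def bagging_fit(xs, ys, n_models=25, seed=0):
    """Fit one base model per bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_models):
        sample = [rng.randrange(n) for _ in range(n)]
        models.append(fit_line([xs[i] for i in sample], [ys[i] for i in sample]))
    return models


def bagging_predict(models, x):
    """Aggregate by averaging the predictions of all base models."""
    return sum(model(x) for model in models) / len(models)


# Synthetic data (assumption): y = 1 + 2x plus Gaussian noise.
rng = random.Random(1)
xs = [i / 29 for i in range(30)]
ys = [1.0 + 2.0 * x + rng.gauss(0.0, 0.2) for x in xs]

ensemble = bagging_fit(xs, ys)
print("bagged prediction at x=0.5:", round(bagging_predict(ensemble, 0.5), 4))
```

Averaging over bootstrap replicates mainly reduces the variance of the base learner, which is why bagging is most helpful for unstable models such as deep regression trees.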

## References

- Lehmann, J.; Coumou, D.; Frieler, K. Increased record-breaking precipitation events under global warming. Clim. Chang. **2015**, 132, 501–515.
- Aquastat FAO’s Information System on Water and Agriculture. Available online: https://www.fao.org/e-agriculture/news/aquastat-faos-global-information-system-water-and-agriculture (accessed on 4 December 2021).
- Brauman, K.A.; Siebert, S.; Foley, J.A. Improvements in crop water productivity increase water sustainability and food security—A global analysis. Environ. Res. Lett. **2013**, 8, 24030.
- Cuevas, J.; Daliakopoulos, I.N.; Del Moral, F.; Hueso, J.J.; Tsanis, I.K. A Review of Soil-Improving Cropping Systems for Soil Salinization. Agronomy **2019**, 9, 295.
- Ali, M.; Talukder, M. Increasing water productivity in crop production—A synthesis. Agric. Water Manag. **2008**, 95, 1201–1213.
- Fischer, G. Transforming the global food system. Nat. Cell Biol. **2018**, 562, 501–502.
- Betts, R.A.; Alfieri, L.; Bradshaw, C.; Caesar, J.; Feyen, L.; Friedlingstein, P.; Gohar, L.; Koutroulis, A.; Lewis, K.; Morfopoulos, C.; et al. Changes in climate extremes, fresh water availability and vulnerability to food insecurity projected at 1.5 °C and 2 °C global warming with a higher-resolution global climate model. Philos. Trans. R. Soc. A Math. Phys. Eng. Sci. **2018**, 376, 20160452.
- WWAP. World Water Development Report Volume 4: Managing Water under Uncertainty and Risk; WWAP: Paris, France, 2012; Volume 1, ISBN 9789231042355.
- Koutroulis, A.; Grillakis, M.; Daliakopoulos, I.; Tsanis, I.; Jacob, D. Cross sectoral impacts on water availability at +2 °C and +3 °C for east Mediterranean island states: The case of Crete. J. Hydrol. **2016**, 532, 16–28.
- Giannakis, E.; Bruggeman, A.; Djuma, H.; Kozyra, J.; Hammer, J. Water pricing and irrigation across Europe: Opportunities and constraints for adopting irrigation scheduling decision support systems. Water Supply **2015**, 16, 245–252.
- Christias, P.; Mocanu, M. Information Technology for Ethical Use of Water. In International Conference on Business Information Systems; Springer: Cham, Switzerland, 2019; pp. 597–607.
- Labadie, J.W.; Sullivan, C.H. Computerized Decision Support Systems for Water Managers. J. Water Resour. Plan. Manag. **1986**, 112, 299–307.
- Gurría, A. Sustainably managing water: Challenges and responses. Water Int. **2009**, 34, 396–401.
- Paredes, P.; Wei, Z.; Liu, Y.; Xu, D.; Xin, Y.; Zhang, B.; Pereira, L. Performance assessment of the FAO AquaCrop model for soil water, soil evaporation, biomass and yield of soybeans in North China Plain. Agric. Water Manag. **2015**, 152, 57–71.
- Foster, T.; Brozović, N.; Butler, A.P.; Neale, C.M.U.; Raes, D.; Steduto, P.; Fereres, E.; Hsiao, T.C. AquaCrop-OS: An open source version of FAO’s crop water productivity model. Agric. Water Manag. **2017**, 181, 18–22.
- Steduto, P.; Hsiao, T.C.; Raes, D.; Fereres, E. AquaCrop—The FAO Crop Model to Simulate Yield Response to Water: I. Concepts and Underlying Principles. Agron. J. **2009**, 101, 426–437.
- Simionesei, L.; Ramos, T.B.; Palma, J.; Oliveira, A.R.; Neves, R. IrrigaSys: A web-based irrigation decision support system based on open source data and technology. Comput. Electron. Agric. **2020**, 178, 105822.
- Mannini, P.; Genovesi, R.; Letterio, T. IRRINET: Large Scale DSS Application for On-farm Irrigation Scheduling. Procedia Environ. Sci. **2013**, 19, 823–829.
- Allen, R.G.; Pereira, L.S.; Raes, D.; Smith, M.; et al. Crop Evapotranspiration-Guidelines for Computing Crop Water Requirements-FAO Irrigation and Drainage Paper 56; FAO: Rome, Italy, 1998; Volume 300, p. 6541.
- Rinaldi, M.; He, Z. Decision Support Systems to Manage Irrigation in Agriculture. In Advances in Agronomy; Elsevier BV: Amsterdam, The Netherlands, 2014; Volume 123, pp. 229–279.
- Car, N.J. Using decision models to enable better irrigation Decision Support Systems. Comput. Electron. Agric. **2018**, 152, 290–301.
- Karipidis, P.; Tsakiridou, E.; Tabakis, N. The Greek olive oil market structure. Agric. Econ. Rev. **2005**, 6, 64–72.
- Mili, S. Market Dynamics and Policy Reforms in the EU Olive Oil Industry: An Exploratory Assessment. In Proceedings of the 98th Seminar, No. 10099, Chania, Greece, 29 June–2 July 2006.
- Fousekis, P.; Klonaris, S. Spatial Price Relationships in the Olive Oil Market of the Mediterranean. Agric. Econ. Rev. **2002**, 3, 23–35.
- Tempesta, T.; Vecchiato, D. Analysis of the Factors that Influence Olive Oil Demand in the Veneto Region (Italy). Agriculture **2019**, 9, 154.
- García-González, D.L.; Aparicio, R. Research in Olive Oil: Challenges for the Near Future. J. Agric. Food Chem. **2010**, 58, 12569–12577.
- Skaggs, R.K.; Samani, Z. Farm size, irrigation practices, and on-farm irrigation efficiency. Irrig. Drain. **2005**, 54, 43–57.
- Christias, P.; Daliakopoulos, I.N.; Manios, T.; Mocanu, M. Comparison of Three Computational Approaches for Tree Crop Irrigation Decision Support. Mathematics **2020**, 8, 717.
- Russell, S.; Norvig, P. Artificial Intelligence: A Modern Approach, 3rd ed.; Pearson: New York, NY, USA, 2010; ISBN 9780136042594.
- Deisenroth, M.P.; Faisal, A.A.; Ong, C.S. Mathematics for Machine Learning; Cambridge University Press (CUP): Cambridge, UK, 2020; p. 391.
- Müller, A.C.; Guido, S. Introduction to Machine Learning with Python: A Guide for Data Scientists, 1st ed.; O’Reilly Media: Sebastopol, CA, USA, 2016; ISBN 1449369413.
- Tsanis, I.K.; Koutroulis, A.G.; Daliakopoulos, I.N.; Jacob, D. Severe climate-induced water shortage and extremes in Crete. Clim. Chang. **2011**, 106, 667–677.
- James, G.; Witten, D.; Hastie, T.; Tibshirani, R. An Introduction to Statistical Learning with Applications in R; Springer Texts in Statistics; Springer: Berlin, Germany, 2013; ISBN 9781461471370.
- Ziegel, E.R. The Elements of Statistical Learning. Technometrics **2003**, 45, 267–268.
- Cook, D.O.; Kieschnick, R.; McCullough, B. Regression analysis of proportions in finance with self selection. J. Empir. Financ. **2008**, 15, 860–867.
- Ruppert, D. Statistics and Finance: An Introduction. Technometrics **2005**, 47, 244–245.
- Hunt, J.O.; Myers, J.N.; Myers, L.A. Improving Earnings Predictions and Abnormal Returns with Machine Learning. Account. Horizons **2021**.
- Huang, J.-C.; Ko, K.-M.; Shu, M.-H.; Hsu, B.-M. Application and comparison of several machine learning algorithms and their integration models in regression problems. Neural Comput. Appl. **2019**, 32, 5461–5469.
- Bary, M.N.A. Robust regression diagnostic for detecting and solving multicollinearity and outlier problems: Applied study by using financial data. Appl. Math. Sci. **2017**, 11, 601–622.
- Leek, J. The Elements of Data Analytic Style; Leanpub: Victoria, BC, Canada, 2015; p. 93.
- Heumann, C.; Schomaker, M.; Shalabh. Introduction to Statistics and Data Analysis: With Exercises, Solutions and Applications in R; Springer International Publishing: Berlin, Germany, 2017; ISBN 9783319461625.
- Chen, L.-P. Practical Statistics for Data Scientists: 50+ Essential Concepts Using R and Python. Technometrics **2021**, 63, 272–273.
- Jin, X.-B.; Yang, N.-X.; Wang, X.-Y.; Bai, Y.-T.; Su, T.-L.; Kong, J.-L. Hybrid Deep Learning Predictor for Smart Agriculture Sensing Based on Empirical Mode Decomposition and Gated Recurrent Unit Group Model. Sensors **2020**, 20, 1334.
- Shetty, S.A.; Padmashree, T.; Sagar, B.M.; Cauvery, N.K. Performance Analysis on Machine Learning Algorithms with Deep Learning Model for Crop Yield Prediction; Springer: Singapore, 2021; pp. 739–750.
- Blankmeyer, E. How Robust Is Linear Regression with Dummy Variables? Online Submiss. Available online: https://digital.library.txstate.edu/handle/10877/4105 (accessed on 4 December 2005).
- Raschka, S.; Mirjalili, V. Python Machine Learning: Machine Learning & Deep Learning with Python, Scikit-Learn and TensorFlow 2, 3rd ed.; Packt Publishing: Birmingham, UK, 2019; ISBN 9781789955750.
- Géron, A. Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems, 2nd ed.; O’Reilly Media, Inc.: Sebastopol, CA, USA, 2019; ISBN 9781492032649.
- Kubben, P.; Dumontier, M.; Dekker, A.L.A.J.; André, L.A.J. Fundamentals of Clinical Data Science, 1st ed.; Springer: London, UK, 2019; ISBN 978-3319997124.
- Fortmann-Roe, S. Understanding the Bias-Variance Tradeoff. Available online: http://scott.fortmann-roe.com/docs/BiasVariance.html (accessed on 4 December 2005).
- Cawley, G.C.; Talbot, N.L.C. On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation. J. Mach. Learn. Res. **2010**, 11, 2079–2107.
- VanderPlas, J. Python Data Science Handbook: Essential Tools for Working with Data; O’Reilly Media, Inc.: Sebastopol, CA, USA, 2016; ISBN 9781491912058.
- Kuhn, M.; Johnson, K. Applied Predictive Modeling; Springer Science Business Media: New York, NY, USA, 2013; p. 600.
- Mohri, M.; Rostamizadeh, A.; Talwalkar, A. (Eds.) Foundations of Machine Learning, 2nd ed.; MIT: Cambridge, MA, USA, 2018; Volume 3, ISBN 9780262039406.
- Sambasivam, G.; Opiyo, G.D. A predictive machine learning application in agriculture: Cassava disease detection and classification with imbalanced dataset using convolutional neural networks. Egypt. Inform. J. **2021**, 22, 27–34.
- De Luna, R.G.; Dadios, E.P.; Bandala, A.A.; Vicerra, R.R.P. Tomato Growth Stage Monitoring for Smart Farm Using Deep Transfer Learning with Machine Learning-based Maturity Grading. AGRIVITA J. Agric. Sci. **2020**, 42, 24–36.
- Balducci, F.; Impedovo, D.; Pirlo, G. Machine Learning Applications on Agricultural Datasets for Smart Farm Enhancement. Machines **2018**, 6, 38.
- Kuhn, M.; Johnson, K. Feature Engineering and Selection: A Practical Approach for Predictive Models; CRC Press: Boca Raton, FL, USA, 2019; ISBN 9781351609470.
- Brownlee, J. Machine Learning Mastery with R; Brownlee Publishing: London, UK, 2016.
- Brownlee, J. Imbalanced Classification with Python: Better Metrics, Balance Skewed Classes, Cost-Sensitive Learning; Brownlee Publishing: London, UK, 2021.
- Datar, R.; Garg, H. Hands-On Exploratory Data Analysis with R; Packt: Birmingham, UK, 2019; ISBN 9781789804379.
- Yegnanarayana, B. Artificial neural networks for pattern recognition. Sadhana **1994**, 19, 189–238.
- Matloff, N. Statistical Regression and Classification: From Linear Models to Machine Learning; CRC Press: Boca Raton, FL, USA, 2017; ISBN 9781498710923.
- Liu, H. Feature Engineering for Machine Learning and Data Analytics; CRC Press: Boca Raton, FL, USA, 2018; ISBN 978-1491953242.
- Brownlee, J. Statistical Methods for Machine Learning; Brownlee Publishing: London, UK, 2018; p. 291.
- Fortmann-Roe, S. Accurately Measuring Model Prediction Error. Available online: https://scott.fortmann-roe.com/docs/MeasuringError.html (accessed on 4 December 2021).
- Brain, D.; Webb, G.I. On The Effect of Data Set Size on Bias and Variance in Classification Learning. In Proceedings of the Fourth Australian Knowledge Acquisition Workshop (AKAW ’99), Sydney, Australia, 5–6 December 1999; University of New South Wales: Sydney, Australia; pp. 117–128. Available online: https://www.bibsonomy.org/bibtex/2eb55c4bdfb45c25cad6b1c613e9ef74f/giwebb (accessed on 26 October 2021).
- Xiang, H.; Lin, J.; Chen, C.-H.; Kong, Y. Asymptotic Meta Learning for Cross Validation of Models for Financial Data. IEEE Intell. Syst. **2020**, 35, 16–24.
- Lin, W.-Y.; Hu, Y.-H.; Tsai, C.-F. Machine Learning in Financial Crisis Prediction: A Survey. IEEE Trans. Syst. Man Cybern. Part C Appl. Rev. **2012**, 42, 421–436.
- López de Prado, M. Advances in Financial Machine Learning: Lecture 7/10. SSRN Electron. J. **2018**, 366. Available online: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3266136 (accessed on 14 October 2018).
- Krstajic, D.; Buturovic, L.J.; Leahy, D.E.; Thomas, S. Cross-validation pitfalls when selecting and assessing regression and classification models. J. Cheminformatics **2014**, 6, 70.
- Tantithamthavorn, C.; McIntosh, S.; Hassan, A.E.; Matsumoto, K. An Empirical Comparison of Model Validation Techniques for Defect Prediction Models. IEEE Trans. Softw. Eng. **2016**, 43, 1–18.
- Rodríguez, J.D.; Pérez, A.; Lozano, J.A. Sensitivity Analysis of k-Fold Cross Validation in Prediction Error Estimation. IEEE Trans. Pattern Anal. Mach. Intell. **2010**, 32, 569–575.
- Varma, S.; Simon, R. Bias in error estimation when using cross-validation for model selection. BMC Bioinform. **2006**, 7, 91.
- Scikit-Learn Developers 3.1. Cross-Validation: Evaluating Estimator Performance. Available online: https://scikit-learn.org/stable/modules/cross_validation.html (accessed on 30 April 2021).
- Machine Learning. Available online: https://en.wikipedia.org/wiki/Machine_learning (accessed on 4 December 2021).
- Wainer, J.; Cawley, G. Nested cross-validation when selecting classifiers is overzealous for most practical applications. Expert Syst. Appl. **2021**, 182, 115222.
- Opitz, D.W.; Maclin, R. Popular Ensemble Methods: An Empirical Study. J. Artif. Intell. Res. **1999**, 11, 169–198.
- Zhou, Z.H. Ensemble Methods: Foundations and Algorithms; Chapman and Hall/CRC Press: Boca Raton, FL, USA; London, UK; New York, NY, USA, 2012; ISBN 9781439830055.
- Kuncheva, L.; Whitaker, C.J. Measures of Diversity in Classifier Ensembles and Their Relationship with the Ensemble Accuracy. Mach. Learn. **2003**, 51, 181–207.
- Matloff, N. Probability and Statistics for Data Science; CRC Press: Boca Raton, FL, USA, 2019; ISBN 9780367260934.
- Pascual, C. Tutorial: Understanding Linear Regression and Regression Error Metrics. (Retrieved: 9 May 2021). Available online: https://www.dataquest.io/blog/understanding-regression-error-metrics/ (accessed on 30 April 2021).
- Swalin, A. Choosing the Right Metric for Evaluating Machine Learning Models—Part 1. USF-Data Science, Medium. Available online: https://medium.com/usf-msds/choosing-the-right-metric-for-machine-learning-models-part-1-a99d7d7414e4 (accessed on 30 April 2021).
- Scikit-Learn Metrics and Scoring: Quantifying the Quality of Predictions—Scikit-Learn 0.24.2 Documentation. Available online: https://scikit-learn.org/stable/modules/model_evaluation.html#regression-metrics (accessed on 30 April 2021).
- Westfall, P.H.; Arias, A.L. R-Squared, Adjusted R-Squared, the F Test, and Multicollinearity. In Understanding Regression Analysis; Chapman and Hall/CRC Press: Boca Raton, FL, USA, 2020; pp. 185–200. [Google Scholar]
- Karch, J. Improving on Adjusted R-Squared. Collabra Psychol.
**2020**, 6, 6. [Google Scholar] [CrossRef] - Grömping, U. Variable importance in regression models. Wiley Interdiscip. Rev. Comput. Stat.
**2015**, 7, 137–152. [Google Scholar] [CrossRef] - Gorgens, E.; Montaghi, A.; Rodriguez, L.C. A performance comparison of machine learning methods to estimate the fast-growing forest plantation yield based on laser scanning metrics. Comput. Electron. Agric.
**2015**, 116, 221–227. [Google Scholar] [CrossRef] - Zhang, Y.; Yang, X.; Shardt, Y.A.W.; Cui, J.; Tong, C. A KPI-Based Probabilistic Soft Sensor Development Approach that Maximizes the Coefficient of Determination. Sensors
**2018**, 18, 3058. [Google Scholar] [CrossRef][Green Version] - Takayama, K. Encoding Categorical Variables with Ambiguity. In Proceedings of the International Workshop NFMCP in conjunction with ECML-PKDD, Tokyo, Japan, 16 September 2019. [Google Scholar]
- Kuhn, M. Comparing the Bootstrap and Cross-Validation. Available online: http://appliedpredictivemodeling.com/blog/2014/11/27/08ks7leh0zof45zpf5vqe56d1sahb0 (accessed on 30 April 2021).
- Kohavi, R. A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection. Int. Jt. Conf. Artif. Intell.
**1995**, 14, 1137–1145. [Google Scholar] - Sujjaviriyasup, T.; Pitiruek, K. Agricultural product forecasting using machine learning approach. Int. J. Math. Anal.
**2013**, 7, 1869–1875. [Google Scholar] [CrossRef] - Thorp, K.R.; Batchelor, W.D.; Paz, J.O.; Kaleita, A.L.; DeJonge, K.C. Using Cross-Validation to Evaluate CERES-Maize Yield Simulations within a Decision Support System for Precision Agriculture. Trans. ASABE
**2007**, 50, 1467–1479. [Google Scholar] [CrossRef][Green Version] - Paul, M.; Vishwakarma, S.K.; Verma, A. Analysis of Soil Behaviour and Prediction of Crop Yield Using Data Mining Approach. In Proceedings of the 2015 International Conference on Computational Intelligence and Communication Networks CICN 2015, Jabalpur, India, 12–14 December 2015; Institute of Electrical and Electronics Engineers Inc.: Piscataway, NJ, USA, 2016; pp. 766–771. [Google Scholar]
- Molinaro, A.M.; Simon, R.; Pfeiffer, R.M. Prediction error estimation: A comparison of resampling methods. Bioinformatics
**2005**, 21, 3301–3307. [Google Scholar] [CrossRef][Green Version] - Kim, J.-H. Estimating classification error rate: Repeated cross-validation, repeated hold-out and bootstrap. Comput. Stat. Data Anal.
**2009**, 53, 3735–3745. [Google Scholar] [CrossRef] - Brownlee, J. Repeated k-Fold Cross-Validation for Model Evaluation in Python. Available online: https://machinelearningmastery.com/repeated-k-fold-cross-validation-with-python/ (accessed on 30 April 2021).
- Fan, J.; Li, R.; Zhang, C.-H.; Zou, H. Statistical Foundations of Data Science; CRC Press: Boca Raton, FL, USA, 2020; ISBN 978-1-466-51084-5. [Google Scholar]
- Storm, H.; Baylis, K.; Heckelei, T. Machine learning in agricultural and applied economics. Eur. Rev. Agric. Econ.
**2019**, 47, 849–892. [Google Scholar] [CrossRef] - Mbunge, E.; Fashoto, S.G.; Mnisi, E.J. Machine learning approach for predicting maize crop yields using multiple linear regression and backward elimination. Int. J. Sci. Technol. Res.
**2020**, 9, 3804–3814. [Google Scholar] - Vinciya, P.; Valarmathi, A. Agriculture Analysis for Next Generation High Tech Farming in Data Mining. Int. J. Adv. Res. Comput. Sci. Softw. Eng.
**2016**, 6, 2277. [Google Scholar] - Chen, Y.-A.; Hsieh, W.-H.; Ko, Y.-S.; Huang, N.-F. An Ensemble Learning Model for Agricultural Irrigation Prediction. In Proceedings of the 2021 International Conference on Information Networking, Jeju Island, Korea, 13–16 January 2021; Volume 2021-Janua, pp. 311–316. [Google Scholar]
- Shahhosseini, M.; Hu, G.; Archontoulis, S.V. Forecasting Corn Yield with Machine Learning Ensembles. Front. Plant Sci.
**2020**, 11, 1120. [Google Scholar] [CrossRef] - Trafalis, T.; Ince, H. Support vector machine for regression and applications to financial forecasting. In Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks, IJCNN 2000, Neural Computing: New Challenges and Perspectives for the New Millennium, Como, Italy, 24–27 July 2000; Volume 6, pp. 348–353. [Google Scholar]
- Miles, J. R Squared, Adjusted R Squared. In Wiley StatsRef: Statistics Reference Online; Wiley: Hoboken, NJ, USA, 2014. [Google Scholar]
- Barrett, J.P. The coefficient of determination-some limitations. Am. Stat.
**1974**, 28, 19–20. [Google Scholar] - Regression Models for Data… by Brian Caffo [PDF/iPad/Kindle]. Available online: https://leanpub.com/regmods (accessed on 5 August 2021).
- Ghojogh, B.; Crowley, M. The Theory behind Overfitting, Cross Validation, Regularization, Bagging, and Boosting: Tutorial. arXiv
**2019**, preprint. arXiv:1905.12787. [Google Scholar] - Chen, D.; Hagan, M. Optimal use of regularization and cross-validation in neural network modeling. In Proceedings of the IJCNN’99, International Joint Conference on Neural Networks, Proceedings (Cat. No.99CH36339), Baltimore, MD, USA, 7–11 June 1992; Volume 2, pp. 1275–1280. [Google Scholar]
- Steyerberg, E. Overfitting and optimism in prediction models. In Statistics for Biology and Health; Springer: Cham, Switzerland, 2019; pp. 83–100. [Google Scholar]
- Sklearn.Svm.SVR—Scikit-Learn 1.0 Documentation. Available online: https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html (accessed on 25 September 2021).
- Koutsoukas, A.; Monaghan, K.J.; Li, X.; Huan, J. Deep-learning: Investigating deep neural networks hyper-parameters and comparison of performance to shallow methods for modeling bioactivity data. J. Cheminformatics
**2017**, 9, 42. [Google Scholar] [CrossRef] [PubMed] - Sklearn.Ensemble.GradientBoostingRegressor—Scikit-Learn 1.0 Documentation. Available online: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html?highlight=gradientboostingregressor#sklearn.ensemble.GradientBoostingRegressor (accessed on 25 September 2021).
- Shakoor, T.; Rahman, K.; Rayta, S.N.; Chakrabarty, A. Agricultural production output prediction using Supervised Machine Learning techniques. In Proceedings of the 2017 1st International Conference on Next Generation Computing Applications, NextComp Mauritius, East Africa, Mauritius, 19–21 July 2017; Institute of Electrical and Electronics Engineers Inc.: Piscataway, NJ, USA, 2017; pp. 182–187. [Google Scholar]
- Treboux, J.; Genoud, D. High Precision Agriculture: An Application of Improved Machine-Learning Algorithms. In Proceedings of the 2019 6th Swiss Conference on Data Science (SDS), Bern, Switzerland, 14 June 2019; pp. 103–108. [Google Scholar]
- Liakos, K.G.; Busato, P.; Moshou, D.; Pearson, S.; Bochtis, D. Machine learning in agriculture: A review. Sensors
**2018**, 18, 2674. [Google Scholar] [CrossRef] [PubMed][Green Version] - Sabu, K.M.; Kumar, T.M. Predictive analytics in Agriculture: Forecasting prices of Arecanuts in Kerala. Procedia Comput. Sci.
**2020**, 171, 699–708. [Google Scholar] [CrossRef] - Yuan, C.Z.; San, W.W.; Leong, T.W. Determining Optimal Lag Time Selection Function with Novel Machine Learning Strategies for Better Agricultural Commodity Prices Forecasting in Malaysia. In Proceedings of the 2020 2nd International Conference on Information Technology and Computer Communications, Guangzhou, China, 23–25 June 2020; pp. 37–42. [Google Scholar]
- Chen, Z.; Goh, H.S.; Sin, K.L.; Lim, K.; Chung, N.K.H.; Liew, X.Y. Automated Agriculture Commodity Price Prediction System with Machine Learning Techniques. Adv. Sci. Technol. Eng. Syst. J.
**2021**, 6, 376–384. [Google Scholar] [CrossRef] - Lebrini, Y.; Benabdelouahab, T.; Boudhar, A.; Htitiou, A.; Hadria, R.; Lionboui, H. Farming systems monitoring using machine learning and trend analysis methods based on fitted NDVI time series data in a semi-arid region of Morocco. In Proceedings of the Remote Sensing for Agriculture, Ecosystems, and Hydrology XXI, Strasbourg, France, 21 October 2019; Volume 11149, p. 31. [Google Scholar]
- Ouyang, H.; Wei, X.; Wu, Q. Agricultural commodity futures prices prediction via long—And short-term time series network. J. Appl. Econ.
**2019**, 22, 468–483. [Google Scholar] [CrossRef] - Tang, F.; Mao, B.; Fadlullah, Z.M.; Kato, N.; Akashi, O.; Inoue, T.; Mizutani, K. On Removing Routing Protocol from Future Wireless Networks: A Real-time Deep Learning Approach for Intelligent Traffic Control. IEEE Wirel. Commun.
**2018**, 25, 154–160. [Google Scholar] [CrossRef] - Abroyan, N. Convolutional and recurrent neural networks for real-time data classification. In Proceedings of the 7th International Conference on Innovative Computing Technology INTECH 2017, Luton, UK, 16–18 August 2017; pp. 42–45. [Google Scholar]
- Lakshmanaprabu, S.K.; Mohanty, S.N.; S., S.R.; Krishnamoorthy, S.; Uthayakumar, J.; Shankar, K. Online clinical decision support system using optimal deep neural networks. Appl. Soft Comput. J.
**2019**, 81, 105487. [Google Scholar] [CrossRef] - Aggarwal, C.C.; Sathe, S. Outlier Ensembles: An Introduction; Springer: Berlin, Germany, 2017; ISBN 9783319547657. [Google Scholar]

ML Algorithm | R^{2} Train Score | R^{2} Test Score | RMSE Test Score |
---|---|---|---|
Linear Regression | 0.5470 | 0.4754 | 0.00432 |
Bayes Ridge Regression | 0.54686 | 0.47553 | 0.00432 |
Ridge Regression | 0.54698 | 0.47548 | 0.00432 |
LASSO Regression | 0 | −0.0016 | 0.00825 |
K-Nearest Neighbors | 0.66823 | 0.39674 | 0.00497 |
CART | 0.8056 | 0.51325 | 0.00401 |
Support Vector Machines Regression | 0.64731 | 0.56272 | 0.0036 |
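
The LASSO Regression row above pairs a training R^{2} of 0 with a slightly negative test R^{2}. That pattern is what a constant prediction equal to the training mean produces. Whether the LASSO coefficients here were in fact all shrunk to zero is an assumption, not something the table states. The sketch below uses invented target values (hypothetical, not the olive farm data) purely to show the arithmetic.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical target values, invented for illustration only.
y_train = [0.010, 0.020, 0.030, 0.040]
y_test = [0.015, 0.025, 0.045]

# Assumption: a fully shrunk model predicts the training mean for every sample.
train_mean = sum(y_train) / len(y_train)

r2_train = r2_score(y_train, [train_mean] * len(y_train))
r2_test = r2_score(y_test, [train_mean] * len(y_test))

print(f"train R2 = {r2_train:.4f}, test R2 = {r2_test:.4f}")
```

On the training set the residual and total sums of squares coincide, so R^{2} is 0. On the test set the training mean differs from the test mean, so the residual sum of squares exceeds the total sum of squares and R^{2} falls below 0.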

ML Algorithm | R^{2} Train Score | R^{2} Test Score | RMSE Test Score |
---|---|---|---|
Linear Regression | 0.49999 | 0.60878 | 0.00362 |
Bayes Ridge Regression | 0.49978 | 0.60663 | 0.00364 |
Ridge Regression | 0.49997 | 0.60816 | 0.00363 |
LASSO Regression | 0.64847 | −0.0014 | 0.00927 |
K-Nearest Neighbors | 0.6485 | 0.42249 | 0.00535 |
CART | 0.79909 | 0.58683 | 0.00382 |
Support Vector Machines Regression | 0.60912 | 0.62566 | 0.00347 |

Variable | Management.M1 | Soil_Type.CL | Soil_Type.SL | Precipitation | Relative_Irrigation | Number_of_Trips_Reduction |
---|---|---|---|---|---|---|
management.M1 | --- | 7.071077 × 10^{−1} | 2.272757 × 10^{−16} | 1.029597 × 10^{−16} | 1.007215 × 10^{−16} | 9.568540 × 10^{−17} |
soil_type.CL | --- | --- | 5 × 10^{−1} | 1.724138 × 10^{−16} | 5.771856 × 10^{−16} | 2.890379 × 10^{−16} |
soil_type.SL | --- | --- | --- | 4.659030 × 10^{−17} | 4.062555 × 10^{−16} | 4.985459 × 10^{−17} |
precipitation | --- | --- | --- | --- | 3.69276 × 10^{−17} | 2.604227 × 10^{−17} |
relative_irrigation | --- | --- | --- | --- | --- | 8.9092 × 10^{−18} |
number_of_trips_reduction | --- | --- | --- | --- | --- | --- |

ML Algorithm | R^{2} Train Score | R^{2} Test Score | RMSE Test Score |
---|---|---|---|
Linear Regression | 0.47358 | 0.58799 | 0.00381 |
Bayes Ridge Regression | 0.47341 | 0.58531 | 0.00384 |
Ridge Regression | 0.47357 | 0.58718 | 0.00382 |
LASSO Regression | 0 | −0.0014 | 0.00927 |
K-Nearest Neighbors | 0.64755 | 0.41146 | 0.00545 |
CART | 0.79909 | 0.58683 | 0.00382 |
Support Vector Machines Regression | 0.60679 | 0.62352 | 0.00349 |

ML Algorithm | R^{2} Train/Test (Scaling) | R^{2} Train/Test (Standardization) | R^{2} Train/Test (Power Transform) | RMSE (Scaling) | RMSE (Standardization) | RMSE (Power Transform) |
---|---|---|---|---|---|---|
Linear Regression | 0.49999/0.60878 | 0.49999/0.60878 | 0.49615/0.61141 | 0.00362 | 0.00362 | 0.0036 |
Bayes Ridge Regression | 0.49971/0.60841 | 0.498789/0.60872 | 0.49579/0.61105 | 0.00363 | 0.00362 | 0.0036 |
Ridge Regression | 0.49998/0.60878 | 0.49999/0.60885 | 0.49613/0.61141 | 0.00362 | 0.00362 | 0.0036 |
LASSO Regression | 0/−0.0014 | 0/−0.0014 | 0/−0.0014 | 0.00927 | 0.00927 | 0.00927 |
K-Nearest Neighbors | 0.74781/0.62416 | 0.72762/0.66027 | 0.71500/0.55056 | 0.00348 | 0.00315 | 0.00416 |
CART | 0.79909/0.58683 | 0.79909/0.58683 | 0.79909/0.58683 | 0.00382 | 0.00382 | 0.00382 |
Support Vector Machines Regression | 0.63347/0.65531 | 0.63978/0.66826 | 0.63010/0.64965 | 0.00319 | 0.00307 | 0.00324 |

ML Algorithm | Mean Accuracy (Standard Error) for Repeated Cross-Validation after Sensitivity Analysis |
---|---|
Linear Regression | 0.066925 (0.008849) |
Bayes Ridge Regression | 0.067004 (0.008873) |
Ridge Regression | 0.067029 (0.008846) |
LASSO Regression | 0.093284 (0.010321) |
K-Nearest Neighbors | 0.059908 (0.007890) |
CART | 0.058579 (0.006713) |
Support Vector Machines Regression | 0.057669 (0.006979) |

Model | R^{2} Training Score | R^{2} Test Score | RMSE Test Score |
---|---|---|---|
SVMR (default hyperparameters) | 0.63978 | 0.66826 | 0.00307 |
SVMR (tuned hyperparameters) | 0.63961 | 0.66833 | 0.00307 |

Ensemble Method | Mean Accuracy (Standard Error) for Repeated Cross-Validation |
---|---|
Extreme Gradient Boosting | 0.057299 (0.0066) |
Gradient Boosting | 0.050332 (0.005987) |
Random Forests | 0.056302 (0.006642) |
Extra Trees | 0.058541 (0.006728) |

Ensemble Method | R^{2} Training Score | R^{2} Test Score | RMSE Test Score |
---|---|---|---|
Extreme Gradient Boosting | 0.79745 | 0.62404 | 0.00348 |
Gradient Boosting | 0.77405 | 0.73282 | 0.00247 |
Gradient Boosting (Tuned) | 0.74533 | 0.72741 | 0.00252 |
Random Forests | 0.7971 | 0.62462 | 0.00348 |
Extra Trees | 0.79909 | 0.58924 | 0.0038 |

Model | R^{2} Training Score | R^{2} Test Score | RMSE Test Score |
---|---|---|---|
Support Vector Machines Regression (Experiment 1) | 0.64731 | 0.56272 | 0.0036 |
Gradient Boosting (Experiment 9) | 0.74533 | 0.72741 | 0.00252 |
Performance improvement on the test set (%) | | 29.27% | 42.88% |
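
The improvement percentages in the last row can be checked from the two rows above it. The sketch below is a reading of the table, not the authors' stated formula. The 29.27% figure matches the relative gain in test R^{2}. The 42.88% figure is close to the RMSE reduction measured relative to the final (Experiment 9) RMSE; the small remaining gap is assumed to come from the RMSE values being rounded in the table.

```python
# Values copied from the Experiment 1 and Experiment 9 rows of the table.
r2_test_baseline, r2_test_final = 0.56272, 0.72741
rmse_baseline, rmse_final = 0.0036, 0.00252

# Relative gain in test R^2, measured against the baseline model.
r2_gain_pct = 100.0 * (r2_test_final - r2_test_baseline) / r2_test_baseline

# RMSE reduction under two possible conventions (which one was used is an assumption).
rmse_drop_vs_final_pct = 100.0 * (rmse_baseline - rmse_final) / rmse_final
rmse_drop_vs_baseline_pct = 100.0 * (rmse_baseline - rmse_final) / rmse_baseline

print(f"R2 gain: {r2_gain_pct:.2f}%")                               # 29.27%
print(f"RMSE drop vs final: {rmse_drop_vs_final_pct:.2f}%")         # 42.86%
print(f"RMSE drop vs baseline: {rmse_drop_vs_baseline_pct:.2f}%")   # 30.00%
```

Measured against the baseline RMSE instead, the same reduction is 30%, so the convention matters when comparing this figure with other studies.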

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Christias, P.; Mocanu, M.
A Machine Learning Framework for Olive Farms Profit Prediction. *Water* **2021**, *13*, 3461.
https://doi.org/10.3390/w13233461