Machine Learning-Based Crop Yield Prediction in South India: Performance Analysis of Various Models

Abstract: Agriculture is one of the most important activities, producing the crops and food that are crucial for human sustenance. In the present day, agricultural products and crops are not only used to meet local demand; globalization has allowed us to export produce to other countries and import it from them. India is an agricultural nation and depends heavily on its agricultural activities. Predicting crop production and yield is a necessary activity that allows farmers to estimate storage, optimize resources, increase efficiency and decrease costs. However, farmers usually predict crops based on the region, soil, weather conditions and the crop itself using experience and estimates, which may not be very accurate, especially under the constantly changing and unpredictable climatic conditions of the present day. To address this problem, we aim to predict the production and yield of various crops, such as rice, sorghum, cotton, sugarcane and rabi, using Machine Learning (ML) models. We train these models on weather, soil and crop data to predict the future production and yields of these crops. We compiled a dataset of attributes that impact crop production and yield for specific states in India and performed a comprehensive study of the performance of various ML regression models in predicting crop production and yield. The results indicated that the Extra Trees Regressor achieved the highest performance among the models examined, attaining an R-squared score of 0.9615 and the lowest Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE) of 21.06 and 33.99, respectively. Following closely behind are the Random Forest Regressor and LGBM Regressor, achieving R-squared scores of 0.9437 and 0.9398, respectively.
Moreover, additional analysis revealed that tree-based models, with an R-squared score of 0.9353, perform better than linear and neighbors-based models, which achieved R-squared scores of 0.8568 and 0.9002, respectively.


Introduction
India is an agricultural nation; it relies on agriculture as a major contributor to its economy. According to estimates released by the Ministry of Statistics & Programme Implementation (MoSPI), the Gross Value Added (GVA) of agriculture and allied sectors was 20.3% in 2020-2021, 19% in 2021-2022, and 18.3% in 2022-2023 [1]. As one of the world's agricultural powerhouses, India is the world's top producer of several spices, cereal crops such as rice and wheat, and fruits and vegetables, along with commercial crops such as tea. It not only produces for its own consumption but is also a key exporter of agricultural goods to several nations, rice and sugar being the major agricultural exports. India's export trade was around Rs 380,000 crore in 2021-2022 (Department of Commerce, Government of India). The majority of India's people live in rural areas, and agriculture is the major source of income for these people. The main contributions of this work are:
• A unique dataset has been compiled by us, encompassing essential factors influencing crop growth, such as meteorological, soil and agricultural factors, for specific states in India.
• A statistical feature analysis was conducted on the influence of specific meteorological, soil and crop features on crop production and yield. This feature selection analysis helped identify the features contributing most to the prediction of crop production and yield, and the impact of these features on the outcome variables was examined.
• A comprehensive analysis has been conducted on the trained ML models, evaluating their performance using selected metrics and comparing the models with each other based on these metrics.


Literature Review
Anakha Venugopal et al. [4] proposed a mobile application which predicts the crop's name and also calculates its yield. The dataset used included some meteorological data, such as temperature, wind speed and humidity, as well as crop production data. However, soil data was not included in the dataset, and its absence limited the analysis of soil-related factors. The classification algorithms used in the paper are Logistic Regression, Naïve Bayes and Random Forest. Among these, the highest accuracy of 92.81% was shown by the Random Forest model, followed by Naïve Bayes with 91.50% and Logistic Regression with 87.80%. Thomas van Klompenburg et al. [5] conducted a Systematic Literature Review (SLR) to gather the algorithms and attributes used in crop prediction studies. Their analysis revealed that the most used features in these studies are rainfall, temperature and soil type, and that the most commonly used algorithm is the Artificial Neural Network (ANN). The paper also discussed the evaluation metrics used in these studies, revealing that Root Mean Squared Error (RMSE) was the most popular choice, and further provided an additional analysis of deep learning-based studies. However, the review does not delve into challenges related to data quality, feature selection or model interpretability. Sonal Agarwal et al. [6] proposed a hybrid approach for crop yield prediction, combining machine learning and deep learning techniques. The enhanced model provided a better accuracy of 97% compared to the existing model's 93%. The Support Vector Machine (SVM) was used as the machine learning algorithm, and Long Short-Term Memory networks (LSTM) and Recurrent Neural Networks (RNN) were used as the deep learning algorithms. The study lacks an in-depth exploration of the trade-offs between different hybrid models or their computational requirements.
A.B. Sarr et al. [7] investigated crop yield prediction methods specifically for Senegal. They used three machine learning models, namely SVM, Random Forest and a Neural Network, and one multiple linear regression method, the Least Absolute Shrinkage and Selection Operator (LASSO), to predict the yield of essential staple food crops in Senegal. They used three combinations of predictors for training the models: vegetation data, climate data and a combination of both. The best performance was shown by models trained with the combination of vegetation and climate data. However, this study overlooked the influence of soil conditions on the prediction of crop yields. S. S. Kale et al. [8] aimed to predict the yield of different crops using neural network regression, with a dataset obtained from Indian government websites for districts in Maharashtra, India. The model predicted with 82% accuracy using a three-layered ANN with the Rectified Linear Unit (ReLU) activation function and the Adam optimizer. N. Bali et al. [9] explored the various machine learning algorithms and techniques used in crop yield prediction, assessed advanced techniques like deep learning in such estimations, and also explored the efficiency of hybridized models. They concluded that factors such as precipitation and temperature were the most influential, along with the agronomic practices adopted by farmers. ANN and Adaptive Neuro-Fuzzy Inference System (ANFIS) models, i.e., hybridized fuzzy and ANN models, showed the best accuracy. In addition to the mentioned studies, various other studies have successfully incorporated neural networks into crop yield prediction, such as those referenced in [10][11][12][13].
Research conducted by Hames Sherif [14] identified the important factors responsible for staple crop production in semi-arid and desert climates in Africa in order to predict their yield. The machine learning models used in the research were a Multiple Linear Regression (MLR) model and a Random Forest regressor. Metrics such as RMSE and the R-squared score were used to compare the performance of the models. It was found that the Random Forest regressor made more accurate predictions than the MLR model; however, its training run time was significantly higher. One limitation of this research is that the accuracy of the data obtained from various sources is difficult to assess, as it is largely collected by member countries and includes imputations for missing data whose accuracy is unknown. H.A. Burhan [15] evaluated regression machine learning methods for yield prediction of major crops in Turkey. The data used to train these models included pesticide use, meteorological factors and crop yield values. The best R-squared scores were shown by the Random Forest and Decision Tree regression methods, while Support Vector regression showed extremely poor performance.
M. Kuradusenge et al. [16] used yield and weather data from a district in Rwanda to predict crop harvests, specifically Irish potatoes and maize. The models used were Polynomial, Random Forest and Support Vector regressors. Among these, Random Forest showed the best performance, with R-squared scores of 0.875 for potatoes and 0.817 for maize. However, the paper did not include other weather-related features such as humidity, wind speed and solar radiation, nor soil data, for training the models, so the predictions do not incorporate the impact of these factors despite their actual significance. F. Abbas et al. [17] used four machine learning algorithms, namely linear regression, elastic net, K-Nearest Neighbors (KNN) and Support Vector Regression (SVR), to predict potato tuber yield from crop and soil data. The best performance was shown by the SVR model, while the KNN model performed poorly. Furthermore, several other studies have discussed the influence of soil factors on crop yield prediction, such as [18,19]. P. Das et al. [20] presented a novel hybrid approach combining a soft computing algorithm, multivariate adaptive regression splines (MARS), for feature selection with SVR and ANN models to predict grain yield. The MARS-based hybrid models showed better performance than the regular models. Y. Shen et al. [21] proposed an architecture combining a long short-term memory neural network and random forest (LSTM-RF) to predict wheat yield using multispectral canopy water stress indices (CWSI) and vegetation indices (VIs) as training data. The combined LSTM-RF model showed a better R-squared score of 0.71 than the LSTM model alone, which scored 0.61. Other research works that utilized LSTM to improve crop yield prediction performance include [22][23][24][25].
N. Banu Priya et al. [26] aimed to predict crop yield using data mining techniques and machine learning algorithms, improving the accuracy of crop production estimates to manage agricultural risk. Random Forest regression, Decision Tree regression and Gradient Boosting regression were used, achieving an 88% R-squared score with Random Forest regression. However, a limitation is that the study uses relatively simple factors like the state, district, crop and season, which may not fully capture the complexity of crop yield variability. The ability of statistical models to predict crop production with respect to changes in mean temperature and precipitation, as simulated by a crop model (CERES-Maize), was examined by David B. Lobell et al. [27]. The results suggested that statistical models, when compared to crop models, represent a useful but imperfect tool for predicting crop production; however, it was also observed that they performed better at broader scales, leading to the conclusion that statistical models will still play an important role in predicting the impacts of climate change. One limitation of this study is that these models rely on historical yields and simplified weather measurements, which may not fully capture the complexity of climate impacts on agriculture. Douglas K. Bolton et al. [28] used county-level data from the USDA (United States Department of Agriculture) and data from NASA's Moderate Resolution Imaging Spectroradiometer (MODIS) to develop empirical models for predicting soybean and maize yields in the Central United States. It was found that the inclusion of phenology data greatly improved model performance; however, crop phenology data may not be readily available, especially when accounting for regional crop varieties, and it can also complicate the interaction relationships and increase model complexity. The paper by Yogesh Gandge et al. [29] aims to predict crop yield to help farmers and the government plan better. It uses data mining techniques to efficiently extract the features and data used to predict crop yield, and finds that there is still scope for improvement, for better unified models, and for a bigger dataset to predict crop yield with greater accuracy.
Keerthana Mummaleti et al. [30] analyzed the usage and implementation of ensemble techniques in predicting crop type from location parameters, retrieving 7 features from various databases with 28,252 instances. The paper concluded that an ensemble of Decision Tree regression and AdaBoost regression gave the best accuracy, providing a recommendation of which crop should be cultivated in a region based on weather conditions. Although this study uses various models to predict the crop type, it does not consider other important climatic factors or soil quality, which can significantly affect the predictions. Shah Ayush et al. [31] suggested the optimal climate factors to maximize crop yield and predicted crop yield using multivariate polynomial regression, support vector machine regression and random forest models, with yield and weather data from the United States Department of Agriculture. The paper found that support vector regression obtained the best results. One limitation of the paper is that it does not consider other factors, such as soil quality, which can also impact yields.
S. Misra Veenadhari et al. [32] predicted the influence of climatic parameters on crop yields in selected districts of Madhya Pradesh, India, but other agricultural parameters were not considered. A prediction accuracy of 76 to 90% was achieved for the selected crops and districts, with an overall prediction accuracy of 82%. The paper's limitation lies in its exclusive focus on climate-related factors, neglecting other crucial agro-input variables that influence crop productivity. V. Sellam et al. [33] analyzed the influence of environmental parameters like Area Under Cultivation (AUC), Annual Rainfall (AR) and the Food Price Index (FPI) on crop yield. This was done using regression analysis, achieving an accuracy of 0.7 (R-squared measure) with a least-squares linear regression model. The limitation of the paper lies in its reliance on a single predictive model and data from a single country, suggesting the need for future studies to explore other machine learning algorithms and expand the scope of the research to other regions. P. Mishra et al. [34] used Gradient Boosting regression to improve the prediction of crop yields for districts in France. The model showed an R-squared score of 0.51, significantly better than the other models, namely AdaBoost, KNN, Linear regression and Random Forest. The limitation of this paper is that it focuses only on predicting maize yields in France and does not consider other crops or regions. Gradient Boosting regression was also used in other studies [35][36][37] to improve the accuracy of crop yield prediction.
Leveraging data mining techniques, particularly KNN, V. Latha Jothi et al. [38] provided research focused on using historical data like rainfall, temperature and groundwater levels to predict future crop production, aiding in analyzing past and predicting future groundwater levels for improved agricultural planning. The limitation of this research is the difficulty of estimating rainfall precisely, which is an important factor for crop yield prediction. Similar research using KNN models for crop yield prediction was done in [39][40][41][42].

Methodology
The steps to predict crop yield and production are: variable identification, data collection, data pre-processing, feature selection, train-test splitting and model training. Variable identification, data collection and pre-processing are among the most important steps when training any Machine Learning (ML) model, as the performance of the model depends greatly on the quality, consistency and accuracy of the data it is trained on. Therefore, it is crucial that these steps are done with diligence. Figure 2 shows the overall methodology flow.
Computers 2024, 13, x FOR PEER REVIEW


Variable Identification
We first identify the variables that will be useful for predicting crop yield and production. The most influential factors are the region under consideration, the meteorological factors, the soil type and soil profile, and the crop itself.
The regions of consideration are then identified. We have considered particular districts from specific states in India to predict crop yield and production. Finally, the meteorological and soil data attributes are identified.

Data Collection
Once the required data was identified, we proceeded to identify the data sources from which to collect it. The data comes from government or other open-access data sources.

Data Source
The dataset was built from the following official and verified websites:
• data.icrisat.org (http://data.icrisat.org/dld/, accessed on 1 November 2023): the crop production data, such as yield, area and crop type, was obtained from the ICRISAT website.
• power.larc.nasa.gov: the meteorological data in the dataset, such as temperature, was obtained from this website.
• geoportal.natmo.gov.in: the district-wise soil data in the dataset was obtained from this website.

Data Pre-Processing
Data pre-processing involves the steps necessary to make the data more usable and put it in the right structure and format for the ML model. This is done by handling missing values, data discrepancies and inconsistencies. First, the independent meteorological and agricultural datasets were merged on the common district and year attributes; the result was in turn merged with the soil data on the district attribute. Merging the datasets created a more comprehensive and diverse set of features by leveraging information from multiple sources, which is much more usable by the ML model. After merging all the datasets, we have a total of 2550 rows (34 districts × 15 years × 5 crops). Figure 3 shows the algorithm for merging the datasets.
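The merge described above can be sketched with pandas; the frames and column names below are illustrative stand-ins for the real source files, not the actual dataset schema.

```python
import pandas as pd

# Hypothetical toy frames standing in for the real source files;
# the column names and values are illustrative only.
weather = pd.DataFrame({
    "District": ["Guntur", "Guntur"],
    "Year": [2010, 2011],
    "Temperature": [28.4, 29.1],
})
crops = pd.DataFrame({
    "District": ["Guntur", "Guntur"],
    "Year": [2010, 2011],
    "Crop": ["Rice", "Rice"],
    "Yield": [512.0, 530.5],
})
soil = pd.DataFrame({
    "District": ["Guntur"],
    "SoilType": ["Black"],
})

# Merge meteorological and agricultural data on district and year,
# then attach the district-level soil profile on district alone.
merged = crops.merge(weather, on=["District", "Year"], how="inner")
merged = merged.merge(soil, on="District", how="left")
print(merged.shape)  # (2, 6)
```

Because the soil attributes do not vary by year, the second merge uses only the district key, repeating each district's soil profile across all of its yearly rows.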
Secondly, predictor variables that were redundant and could potentially lead to overfitting were dropped. For example, Area and Yield were two predictor variables whose product was Production (another existing predictor variable); Production is redundant here, as it is directly dependent on these two variables and contributes nothing unique of its own, and thus it was dropped. This prevents the overfitting that an unnecessarily complex model with redundant predictor variables would have caused. Thirdly, some of the categorical variables must be converted to numeric form, as the models used for training accept only numeric inputs. Variables like Nitrogen content, Phosphorus content, Soil type, Soil depth and pH were converted using one-hot encoding, and Crop Type was converted using label encoding. Finally, the data was normalized, i.e., transformed into standard ranges. This removes the impact of different units and scales, increases the speed of model training, helps give consistent results and handles outliers and skewed data distributions. We standardized the data by removing the mean and scaling to unit variance. Outliers were also removed, after which the target variable ranged between 1 and 1000. This included removing rows with a target value of zero, indicating that the specific crop was not cultivated in the corresponding district and year; a total of 918 rows were thus excluded. The final dataset, after pre-processing, comprised 38 attributes (21 attributes without one-hot encoding) and consisted of 1632 rows. Figure 4 shows the algorithm for encoding and normalization.

Input: Dataset with all the necessary attributes.

Output: Encoded and Normalized Dataset
• The dataset has a number of categorical values, which are not suitable as inputs for machine learning models; therefore, the categorical attributes are encoded.
• Identify all categorical attributes in the dataset.
• Identify the categories (distinct values) of each attribute.
• Add a new column for each category and set its value to 1 or 0 depending on whether or not that instance has that category as its attribute value (one-hot encoding).
• Remove the mean and scale each attribute to unit variance (normalization).
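The encoding and normalization steps above can be sketched as follows; the attribute names mirror the paper's, but the rows and values are invented for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Toy rows standing in for the merged dataset; values are invented.
df = pd.DataFrame({
    "SoilType": ["Black", "Red", "Black"],
    "Crop": ["Rice", "Cotton", "Sorghum"],
    "Rainfall": [890.0, 640.0, 720.0],
})

# One-hot encode a soil-related categorical attribute: one 0/1 column per category.
df = pd.get_dummies(df, columns=["SoilType"])

# Label encode the crop type into a single integer column.
df["Crop"] = LabelEncoder().fit_transform(df["Crop"])

# Normalize the numeric attribute: remove the mean, scale to unit variance.
df[["Rainfall"]] = StandardScaler().fit_transform(df[["Rainfall"]])

print(sorted(df.columns))  # ['Crop', 'Rainfall', 'SoilType_Black', 'SoilType_Red']
```

One-hot encoding avoids imposing an artificial ordering on soil categories, while label encoding keeps the crop type as a single compact column.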

Feature Selection
The next important step in the pipeline is feature selection: the process of identifying and reducing the dataset to the most important features. This includes removing features that do not affect the output variable or may even interfere with and deviate the outputs. It also reduces the dimensionality of the dataset and thus the risk of overfitting. Feature selection can be done in the following major ways: filter methods, wrapper methods, embedded methods and dimensionality reduction techniques.
Filter methods essentially assess feature relevance by performing statistical tests and select features independently of the ML model. Selection can be correlation-based, where features with higher correlation are considered more relevant; chi-squared, where the dependence of categorical data on the target variable is evaluated based on observed versus expected frequencies; or based on information gain, which identifies the amount of information a feature contributes towards the target variable. Figure 5 shows the feature selection process.
We employed a filter-based feature selection method for its simplicity and computational efficiency. This approach allowed us to identify a subset of the best features by evaluating their relevance to the target variable. To accomplish this, we utilized the f_regression score function, which ranks features based on their capability to explain the variance observed in the target variable. To determine the optimal number of features, we examined the relationship between the number of selected features and the average R2 score across all models, as shown in Figure 7. After pre-processing, the dataset contained 38 features available for selection. Through this analysis, we discovered that the highest average R2 score was obtained with a subset of 31 features; hence, we considered this the optimal number of features for further analysis.

Input: All features in the dataset.
Output: Extracted features after feature selection
1. Identify the k value, which is the number of features to be selected.
2. Use ranking based on correlation as the scoring function to determine the most important features.

3. After determining the most important features, we filter them from the dataset.
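A minimal sketch of this filter-based selection using scikit-learn's SelectKBest with the f_regression score function; synthetic regression data stands in for the real 38-feature table.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic stand-in for the pre-processed table: 38 candidate features.
X, y = make_regression(n_samples=200, n_features=38, n_informative=10,
                       noise=5.0, random_state=0)

# Rank features with the f_regression score and keep the top k; the paper
# settled on k = 31 after sweeping k against the average R2 across models.
selector = SelectKBest(score_func=f_regression, k=31)
X_selected = selector.fit_transform(X, y)

print(X_selected.shape)  # (200, 31)
```

`selector.get_support()` returns the boolean mask of retained columns, which can be used to recover the names of the selected features from the original dataframe.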

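The three steps above can be sketched with scikit-learn's SelectKBest and the f_regression score function. This is a minimal illustration on synthetic data; in the paper the features come from the soil, meteorological, and agricultural datasets (38 candidate features, with k = 31 found optimal).

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic stand-in for the pre-processed crop dataset.
X, y = make_regression(n_samples=200, n_features=10, n_informative=4,
                       noise=5.0, random_state=0)

# Rank features by their f_regression score and keep the top k.
selector = SelectKBest(score_func=f_regression, k=4)
X_selected = selector.fit_transform(X, y)

print(X_selected.shape)        # (200, 4)
print(selector.get_support())  # boolean mask over the original features
```

In practice the loop over candidate k values (1 to 38 in the paper) would wrap this selection, retraining the models and recording the average R2 score for each k.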

Train-Test Split
The next step before training the models is to split the dataset into training and testing sets. This is done so that each model can be trained on the training set and evaluated separately on unseen data from the testing set. We chose an 80-20 train-test split for all models.
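As a minimal sketch of the 80-20 split described above (with placeholder arrays standing in for the crop feature matrix and production target):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 100 rows, 5 features.
X = np.random.rand(100, 5)
y = np.random.rand(100)

# 80% of the rows go to training, 20% are held out for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

print(len(X_train), len(X_test))  # 80 20
```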

Model Training
Now the training data is fed as input to the various ML models that we have considered. They include:

Linear Regression
This algorithm aims to fit a straight line by minimizing the deviations of the predicted values from the actual values.
The model is simple, straightforward, and quick to train, and it has applications in a wide variety of fields. However, it also has limitations: it assumes that the variables have a linear relationship, whereas many real-world datasets exhibit complex non-linear relationships. Moreover, the predictor variables should be independent of each other, and the residuals should follow a normal distribution with minimal outliers.
Despite these limitations, linear regression performs well in sales prediction, risk assessment, and problems where there is an identifiable trend.

Y = a + bX + cZ (1)

Equation (1) describes linear regression, where Y is the target variable, which depends linearly on the attributes X and Z.
Ridge Regression is a modification of the linear regression model in which we reduce the variance of a linear regression fit by introducing a bias into the minimization equation.
Ridge regression generally outperforms linear regression when the dataset is small, because minimizing the least squares alone overfits the data, resulting in high variance. It can also be used with discrete data and with logistic regression. Among its limitations, it shrinks coefficients toward (but not exactly to) zero and trades off variance for bias.
Bayesian Ridge regression is another regression model that finds linear regression fits through probability estimates rather than point estimates. It combines Bayesian regression with ridge regression by probabilistically estimating the relationship between attributes and target while also accounting for uncertainty, making it well suited to handling multicollinearity. However, it can be computationally expensive. Figure 8 shows the algorithm for linear regression.
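The three linear models discussed above share the same scikit-learn interface, so they can be compared side by side. This is a sketch on synthetic data; the paper's hyperparameters are not specified, so the defaults (e.g., alpha = 1.0 for Ridge) are assumptions here.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge, BayesianRidge
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=300, n_features=5, n_informative=5,
                       noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Fit each model and report its R-squared score on the held-out data.
for model in (LinearRegression(), Ridge(alpha=1.0), BayesianRidge()):
    model.fit(X_train, y_train)
    print(type(model).__name__, round(model.score(X_test, y_test), 3))
```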


K Neighbors Regression
K Neighbors Regression, unlike linear regression, is a non-parametric model: it estimates Y as the average of the neighborhood of X (determined by the value of k). For k = 1 the model fits the training data perfectly but overfits, whereas at high values of k it performs poorly on both training and unseen data; an optimal k is identified at the elbow point via cross-validation. This model can thus outperform linear regression by allowing more flexibility through neighborhood-based estimation, but it suffers from the curse of dimensionality and is unsuitable for datasets with a large number of predictor variables.
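The cross-validation search for k mentioned above can be sketched as follows (synthetic data; the candidate k values are illustrative, not the paper's):

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor

X, y = make_regression(n_samples=200, n_features=3, n_informative=3,
                       noise=5.0, random_state=0)

# Mean cross-validated R-squared for each candidate neighbourhood size.
scores = {}
for k in (1, 3, 5, 7, 9, 15):
    model = KNeighborsRegressor(n_neighbors=k)
    scores[k] = cross_val_score(model, X, y, cv=5, scoring="r2").mean()

# The elbow point is chosen where the score stops improving.
best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```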

Decision Tree Regression
The decision tree regression model builds a tree to predict the target variable by splitting the data on the features that best predict it (using standard deviation reduction, information gain, or entropy) and creating a tree structure whose leaf nodes contain the predictions.
This model handles multiple predictors well but can overfit; to prevent this, a minimum number of observations per split can be required. It works well with noisy data but is computationally expensive and unstable.
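The overfitting control just mentioned maps directly onto the min_samples_split parameter of scikit-learn's decision tree. A minimal sketch on synthetic data (the threshold of 10 is an illustrative assumption):

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=4, n_informative=4,
                       noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# A node is only split if it contains at least 10 observations,
# which limits tree depth and reduces overfitting.
tree = DecisionTreeRegressor(min_samples_split=10, random_state=0)
tree.fit(X_train, y_train)
print(round(tree.score(X_test, y_test), 3))
```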

Bagging & Boosting Regressions
Bagging is an ensemble learning technique. Ensemble learning techniques combine multiple regression models, mainly to reduce overfitting and improve accuracy. Bagging involves two key processes: bootstrapping and aggregating. Bootstrapping trains different models on random samples of the dataset. Aggregating combines the results produced by the different models into the final result, by voting or averaging. Bagging mitigates overfitting because multiple models are trained over different subsets of the data. It is simple to implement and can handle complex relationships between variables. Boosting, on the other hand, sequentially combines multiple weak base estimators to create a more powerful, robust, and accurate model. Figure 9 shows the decision tree regressor flowchart.


Bagging Regression
Bagging regression uses decision trees as the base estimator and combines their predictions into a final prediction by bootstrap aggregating. This reduces overfitting and improves the model, enabling it to capture different patterns and relationships in the data. Predictions are aggregated by averaging the predictions of each base estimator, which helps minimize the impact of outliers and noise in the crop production dataset. Figure 10 shows the bagging regressor flowchart.
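As a sketch, scikit-learn's BaggingRegressor uses a decision tree as its default base estimator, so the setup above reduces to a few lines (synthetic data; the 50-tree ensemble size is an illustrative assumption):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=300, n_features=6, n_informative=4,
                       noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Each of the 50 trees is trained on a bootstrap sample of the training
# data; the final prediction averages their individual predictions.
bagging = BaggingRegressor(n_estimators=50, random_state=0)
bagging.fit(X_train, y_train)
print(round(bagging.score(X_test, y_test), 3))
```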

Random Forest Regression
This is an extension of bagging regression with additional randomization. It performs feature subsampling, optimizing each split by considering only a random subset of features at each node of the decision tree. Random forest regression performs better as a result.
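The feature subsampling described above corresponds to the max_features parameter in scikit-learn's RandomForestRegressor. A hedged sketch on synthetic data (the "sqrt" setting and 100 trees are illustrative defaults, not the paper's configuration):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=300, n_features=8, n_informative=5,
                       noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# max_features="sqrt" means each split considers only a random subset of
# sqrt(n_features) candidates -- the randomization that distinguishes
# random forests from plain bagging.
forest = RandomForestRegressor(n_estimators=100, max_features="sqrt",
                               random_state=0)
forest.fit(X_train, y_train)
print(round(forest.score(X_test, y_test), 3))
```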

Extra Trees Regression
This is a variation of decision tree-based ensemble models in which additional randomness is introduced during training compared to traditional random forests. The model also performs feature subsampling, and it randomly selects a threshold for each candidate feature instead of searching exhaustively. These extra randomizations further reduce variance in the crop production predictions. The model is robust to noisy data and also performs well on the large number of features present in this crop dataset. Figure 11 shows the Extra Trees regressor flowchart.
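A minimal sketch of the same idea with scikit-learn's ExtraTreesRegressor (synthetic data; ensemble size is an illustrative assumption):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=300, n_features=8, n_informative=5,
                       noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Extra Trees draws split thresholds at random for each candidate
# feature instead of searching for the best threshold, which further
# reduces variance relative to a random forest.
extra = ExtraTreesRegressor(n_estimators=100, random_state=0)
extra.fit(X_train, y_train)
print(round(extra.score(X_test, y_test), 3))
```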


Gradient Boosting Regression
This model utilizes gradient boosting to perform regression. It combines multiple weak regression models to create a powerful one, iteratively building an ensemble of weak regressors in which each new regressor is fitted to the negative gradient of the loss with respect to the previous ensemble's predictions. This enables it to capture the complex relationships and patterns present in crop and climate data and the factors influencing crop production and yield. Figure 12 shows the algorithm for gradient boosting regression.
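A hedged sketch with scikit-learn's GradientBoostingRegressor (synthetic data; the ensemble size, learning rate, and tree depth are illustrative assumptions, not the paper's settings):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=300, n_features=6, n_informative=4,
                       noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Each of the 200 shallow trees is fitted to the residuals (the negative
# gradient of the squared loss) of the ensemble built so far; the
# learning_rate scales each tree's contribution.
gbr = GradientBoostingRegressor(n_estimators=200, learning_rate=0.1,
                                max_depth=3, random_state=0)
gbr.fit(X_train, y_train)
print(round(gbr.score(X_test, y_test), 3))
```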

Light Gradient Boosting Regression (LGBM)
LGBM is an accurate, efficient, and scalable gradient boosting algorithm suitable for large datasets such as crop datasets. It iteratively combines multiple trained weak learners in a boosting manner to make more accurate predictions, uses gradient-based methods to determine the best splits during tree growth, grows trees leaf-wise (focusing on leaves with larger gradients), and can handle categorical features without encoding.
Algorithm-wise, LGBM is very similar to gradient boosting, but it makes several optimizations to speed up training and increase performance. Instead of using every unique feature value as a potential split point, it uses a histogram-based approach: feature values are discretized into bins, and the histograms are used to efficiently find the optimal split points.

The boosting procedure shown in Figure 12 can be summarized as follows:
• Train a weak learner (decision tree) on the residuals by fitting it using the meteorological, crop, and soil data as the input features.
• Update the ensemble's prediction by adding the weak learner's prediction to it.
• Calculate the new residuals as the difference between the decision tree's predictions and the negative gradient values (the ensemble model's predictions), and make these residuals the target variable for the next weak learner in the following boosting iteration.

Model Evaluation
In model evaluation, the predicted values from the trained models are compared against the actual values to calculate relevant performance metrics. These metrics provide valuable insights for further analysis of each model's performance. Figure 13 shows the model evaluation technique. The performance of the trained machine learning models is analyzed based on the evaluation metrics below:

Mean Absolute Error (MAE)
Mean Absolute Error calculates the average absolute difference between the actual and the predicted values.It measures the average magnitude of the errors but does not account for the direction.

Mean Squared Error (MSE)
Mean Squared Error calculates the average of squared differences between the predicted and actual values.It measures average squared magnitude of errors and penalizes larger errors more heavily compared to MAE.

Root Mean Squared Error (RMSE)
Root Mean Squared Error is the square root of MSE, which makes it easier to interpret as it is on the same scale as the target variable.RMSE measures the standard deviation of the residuals, giving a measure of the average magnitude of errors.

RMSE = √((1/n) Σ (y_i − x_i)²)
R-Squared Score
R-squared is a statistical measure of the ability of the independent variables to explain the variation in the target variable. It ranges from 0 to 1: an R² of 1 indicates a perfect fit, while 0 indicates that none of the variance is explained.

Root Relative Squared Error (RRSE)
The Root Relative Squared Error of a model is the square root of the ratio of the average squared difference between its predicted and actual values to the average squared difference between the predictions of a naïve mean model and the actual values.
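The five metrics above can be computed directly with NumPy. The actual/predicted vectors below are small illustrative values, not results from the paper:

```python
import numpy as np

y_true = np.array([10.0, 20.0, 30.0, 40.0])  # actual values
y_pred = np.array([12.0, 18.0, 33.0, 37.0])  # predicted values

mae = np.mean(np.abs(y_true - y_pred))            # average absolute error
mse = np.mean((y_true - y_pred) ** 2)             # average squared error
rmse = np.sqrt(mse)                               # same scale as the target
r2 = 1 - np.sum((y_true - y_pred) ** 2) / np.sum((y_true - y_true.mean()) ** 2)
# RRSE compares the model's squared errors to those of a naive model
# that always predicts the mean of the actual values.
rrse = np.sqrt(np.sum((y_true - y_pred) ** 2) / np.sum((y_true - y_true.mean()) ** 2))

print(mae, rmse, round(r2, 3), round(rrse, 3))
```

Note that for a single model, RRSE² = 1 − R², so the two metrics carry the same information on one dataset.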

Output: Accuracy/Performance Evaluation of the Model
• Feed the testing data as input to the trained model.
• On obtaining the predicted outputs Op from the model, compare them to the actual outputs.
• Obtain the performance metrics from this comparison.

Data pertaining to the fundamental factors influencing crop prediction, encompassing soil-related, meteorological, and agricultural factors, has been collected for 34 districts across South India, covering the states of Tamil Nadu, Andhra Pradesh, Telangana, Karnataka, and Kerala, and spanning the period from 2001 to 2015. The collected data was used to create three primary datasets, each containing the attributes related to one of these key factors: a soil-related dataset, an agricultural dataset, and a meteorological dataset.

Soil Related Dataset
This dataset includes attributes associated with the properties and characteristics of the soil that directly affect crop growth. Table 1 shows a sample of the soil dataset. The attributes are:

• Nitrogen (N): The amount of nitrogen in the soil, classified as low, medium, and high, for the particular district.
• Phosphorous (P): The amount of phosphorous in the soil, classified as low, medium, and high, for the particular district.
• Potassium (K): The amount of potassium in the soil, classified as low, medium, and high, for the particular district.
• Soil Type: The soil types present, categorized district-wise. The values include red, black, mixed red and black, and alluvial soil.
• Soil Depth: The thickness of the soil layer in the particular district, categorized as 0 to 25 cm, 50 to 100 cm, 100 to 300 cm, and above 300 cm.
• pH: The measurement of the soil's acidity or alkalinity level for the particular district, categorized as Slightly Acidic, Slightly Alkaline, Neutral, Strongly Alkaline, and Strongly Acidic.

Meteorological Dataset
This dataset comprises attributes pertaining to the district-wise atmospheric and weather conditions that influence the growth and development of the plants. Table 2 summarizes the attributes of this dataset and Table 3 shows a sample of the meteorological dataset.

PS (Surface Pressure)
Refers to the surface pressure measured in kilopascals (kPa).
TS (Earth Skin Temperature)
Represents the temperature of the surface of the Earth, measured in degrees Celsius (°C).

QV2M (Specific Humidity)
Denotes the specific humidity measured at a height of 2 m above the Earth's surface, expressed in grams per kilogram (g/Kg).

WS2M (Wind Speed)
Indicates the wind speed, the rate at which air is moving horizontally, measured at a height of 2 m above the surface represented in meters per second (m/s).
T2M_MAX (Temperature Maximum)
Represents the maximum temperature recorded at a height of 2 m above the surface, measured in degrees Celsius (°C).

T2M_MIN (Temperature Minimum)
Represents the minimum temperature recorded at a height of 2 m above the surface, measured in degrees Celsius (°C).

ALLSKY_KT (All Sky Insolation Clearness Index)
Refers to the clearness index of insolation, which is the ratio of the actual solar radiation received on the Earth's surface to the maximum possible solar radiation under clear sky conditions.It is a dimensionless index.
CLOUD_AMT (Cloud Amount)
Indicates the proportion of the sky covered by clouds, expressed as a percentage (%).

PRECTOTCORR (Precipitation)
Denotes the corrected precipitation measurement, which represents the average amount of precipitation (rainfall) in millimetres per day (mm/day).

ALLSKY_SFC_UVA (All Sky Surface UVA Irradiance)
Refers to the ultraviolet-A (UVA) irradiance received at the Earth's surface under all-sky conditions, measured in watts per square meter (W/m²).

ALLSKY_SFC_UVB (All Sky Surface UVB Irradiance)
Represents the ultraviolet-B (UVB) irradiance received at the Earth's surface under all-sky conditions, measured in watts per square meter (W/m²).

ALLSKY_SFC_SW_DWN (All Sky Surface Shortwave Downward Irradiance)
Refers to the total shortwave downward irradiance received at the Earth's surface under all-sky conditions, measured in kilowatt-hours per square meter per day (kW-hr/m²/day).
ALLSKY_SFC_PAR_TOT (All Sky Surface PAR Total)
Represents the total surface photosynthetically active radiation (PAR) in watts per square meter (W/m²) under all-sky conditions.

Agricultural Dataset
This dataset includes the range of practices and techniques carried out by farmers to enhance crop production. Table 4 shows a sample of the agricultural dataset. The attributes are:

• Crop Type: The specific type of crop cultivated in the district for the particular year.
• AREA (1000 ha): The total area of land, measured in thousands of hectares, dedicated to crop cultivation in the district.
• IRRIGATED AREA (1000 ha): The extent of land, measured in thousands of hectares, that is irrigated for crop cultivation in the district.
• YIELD (Kg per ha): The crop yield per hectare of cultivated land in kilograms, reflecting the productivity of the district's agricultural output.
• PRODUCTION (in 1000 tons): The total crop production, measured in thousands of tons, achieved in the district. It is derived from the area and yield attributes and is the target variable to be predicted. Given the corresponding area, yield can in turn be calculated from it.

Interpretation of Results
Based on the production data from 2001 to 2015 for districts across South India, including Tamil Nadu, Andhra Pradesh, Telangana, Karnataka, and Kerala, it is evident that rice was the most prominently cultivated crop during that period. This trend is attributed to the region's warmer and wetter climate, which provides optimal conditions for rice cultivation. Additionally, crops such as sugarcane and sorghum had significant production levels. Conversely, crops like rabi had significantly lower production levels, suggesting that their cultivation was notably restricted compared to other crops due to their preference for cold and dry climates. The bar plot showing average production (in 1000 tons) for each crop is shown in Figure 14.
We employed a scatter plot to depict the relationship between irrigated area, measured in thousands of hectares, and production, measured in thousands of tons, as shown in Figure 15. Additionally, to observe the underlying trend, we fitted a linear regression line on the plot. The equation of the fitted line was:

Production (in 1000 tons) = 3.10 × IRRIGATED AREA (1000 ha) + 28.30

This line represents the linear relationship between irrigated area and production, where the slope coefficient (3.10) indicates the change in production per unit increase in irrigated area, and the intercept term (28.30) denotes the estimated production when the irrigated area is zero. Moreover, the correlation coefficient (Pearson's r) for this relationship was calculated to be 0.8931, signifying a strong positive correlation between irrigated area and production.
In order to examine the influence of meteorological factors on crop production, we specifically focused on rice as our chosen crop type. Various histograms were plotted, as shown in Figure 16. The analysis of the histogram depicting average production against surface pressure revealed that the highest average rice production occurred within the range of 97 to 101 kPa. Increased surface pressure benefits rice production by retaining soil moisture and reducing evaporation, which helps regulate temperature. Furthermore, an increase in surface temperature exhibited a positive correlation with average rice production. Regarding specific humidity, average rice production rose with increasing specific humidity until it reached 16 g/Kg, beyond which a decline was observed, suggesting that rice growth was optimal under moderate humidity levels. Average rice production remained relatively consistent up to a cloud cover of approximately 60%; beyond this threshold, increased cloud cover resulted in a slight decrease in average production, possibly due to diminished sunlight availability for photosynthesis, decreased temperatures, heightened humidity fostering moisture stress and fungal diseases, altered microclimates, and reduced solar radiation important for plant metabolism. UVB radiation detrimentally impacts plant physiology by impairing photosynthesis, disrupting nutrient absorption, and stunting overall plant growth. Consequently, these effects contribute to diminished yields and reduced productivity. Furthermore, heightened UVB exposure triggers the generation of reactive oxygen species (ROS) within plant cells, exacerbating cellular damage and hindering growth and development.
Taking sugarcane as an example when analyzing the impact of soil factors on crop yield, it can be observed that sugarcane thrives in areas with slightly acidic soils. This conforms to the fact that sugarcane benefits from the nutrient availability and sugar production advantages of slightly acidic soil within a pH range of 5.5 to 6.5 [43]. It is also observed that regions with black soil, which has a high water-retaining capacity, have relatively higher sugarcane production, as shown in Figure 17.
Table 5 presents the evaluation metrics for various machine learning models trained on the training dataset and then tested on the testing dataset. The training data include the data of all the crops, with the crop type itself given as an input. Comparing the random forest regressor and the bagging regressor, the MAE values of these two models are almost the same, which indicates similar average performance. However, there is a relatively larger difference between their RMSE values: the RMSE of the bagging regressor is higher than that of the random forest regressor. This suggests that the bagging regressor makes larger individual errors than the random forest regressor, despite their similar average absolute differences. This difference arises because RMSE penalizes larger errors more than MAE. We can therefore say that the predictions of the bagging regressor have a more uneven distribution of errors than those of the random forest regressor, although the average magnitude of errors is similar. This can also be observed in Figure 18.
Computers 2024, 13, x FOR PEER REVIEW
Moreover, the difference between the RMSE and MAE values is largest for the decision tree regressor, implying that it has a more uneven distribution, i.e., a higher variance, of errors compared to the other models. On the other hand, the smallest difference is observed for the extra trees regressor, indicating a more even distribution, i.e., a lower variance, of errors. As the difference increases, the variance in the errors of the model's predictions also increases.
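The effect described above — equal average error magnitude but different spread — can be illustrated with a small numeric sketch. The two residual vectors below are invented for illustration and are not taken from the models in Table 5:

```python
import numpy as np

# Two residual vectors with the same mean absolute error but different
# spread: 'even' has uniform errors, 'uneven' concentrates error in one point.
even = np.array([5.0, 5.0, 5.0, 5.0])
uneven = np.array([1.0, 1.0, 1.0, 17.0])

def mae(r):
    """Mean absolute error of a residual vector."""
    return np.mean(np.abs(r))

def rmse(r):
    """Root mean squared error of a residual vector."""
    return np.sqrt(np.mean(r ** 2))

print(mae(even), rmse(even))      # 5.0 5.0
print(mae(uneven), rmse(uneven))  # 5.0 ~8.544
```

Both vectors have MAE 5.0, but the uneven one has a noticeably larger RMSE because the single large residual is squared before averaging — exactly the RMSE-MAE gap used above to compare the bagging and decision tree regressors against the extra trees regressor.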
The highest R-squared score is observed for the extra trees regressor (0.9615), followed by the random forest regressor (0.9437) and the LGBM regressor (0.9398). This suggests that these models are better at explaining the variance in the target variable using the independent variables than the other models. The Bayesian ridge regressor (0.8566) has the lowest R-squared score, indicating that it is less effective in capturing the relationship between the target variable and the independent variables. The RRSE values compare each model to a baseline model, a simple model that always predicts the mean value of the dependent variable. The RRSE values of all the models range between 0.1962 and 0.3787, indicating that the prediction errors made by these models are significantly lower than those of the baseline model.
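As a sketch of how RRSE relates a model to the mean-predicting baseline (the toy target vector here is invented for illustration, not taken from the paper's data):

```python
import numpy as np

def rrse(y_true, y_pred):
    """Root relative squared error: the model's squared error relative to
    a baseline that always predicts the mean of y_true."""
    num = np.sum((y_true - y_pred) ** 2)
    den = np.sum((y_true - np.mean(y_true)) ** 2)
    return np.sqrt(num / den)

y = np.array([10.0, 20.0, 30.0, 40.0])

# A perfect model scores 0; the mean-predicting baseline scores exactly 1.
# Values well below 1 (e.g., the 0.1962-0.3787 range reported above)
# therefore indicate models that clearly beat the baseline.
print(rrse(y, y))                          # 0.0
print(rrse(y, np.full_like(y, y.mean())))  # 1.0
```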
The models were grouped based on their underlying methodologies and techniques and then compared at the group level. The models fall into three categories: linear models, tree-based models, and neighbors-based models, as shown in Table 6. The average MAE, MSE, RMSE, R-squared score, and RRSE of the models, according to their groups, are shown in Table 7. It can be observed that the average MAE values of the tree-based and neighbors-based models are significantly lower than those of the linear models; the same holds for the RMSE values. The average R-squared score of the linear models is lower than that of the tree-based and neighbors-based models, suggesting that they are comparatively less effective in explaining the variability in the target variable using the independent variables. The average RRSE values of all the groups are low, suggesting that the average prediction errors of every model group are significantly lower than those of the baseline model. Overall, the performance of the tree-based and neighbors-based models is better than that of the linear models.
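The group-level averaging described above can be sketched with pandas. The model-to-group mapping follows the categories of Table 6, but the R-squared values below are illustrative placeholders rather than the full Table 5 results:

```python
import pandas as pd

# Illustrative per-model scores; only the three tree-based values are the
# ones quoted in the text, the rest are placeholders for demonstration.
results = pd.DataFrame({
    "model": ["ExtraTrees", "RandomForest", "LGBM", "Ridge", "KNN"],
    "group": ["Tree-Based", "Tree-Based", "Tree-Based",
              "Linear", "Neighbors-Based"],
    "r2":    [0.9615, 0.9437, 0.9398, 0.86, 0.90],
})

# Average each metric within its group, as done for Table 7.
group_avg = results.groupby("group")["r2"].mean().round(4)
print(group_avg)
```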
Table 8 displays how well the different models performed for each crop. Rice had the highest average R-squared score of 0.9119, showing it was predicted most accurately. Linear models did particularly well for rice, with an average score of about 0.9214. Sugarcane also had good accuracy, with an average score of 0.8472; however, linear models did not perform as well for sugarcane, with an average score of only 0.7159. For crops like sorghum and rabi, all types of models did decently, but tree-based models stood out: sorghum had average scores of 0.4350 for linear, 0.4778 for neighbors-based, and 0.8323 for tree-based models, while rabi had scores of 0.5031, 0.6211, and 0.8642, respectively. For cotton, tree-based and neighbors-based models performed well, with scores of 0.7797 and 0.7723, but linear models had negative scores, indicating they were worse than simply guessing the mean yield.

Conclusions
In conclusion, this research paper has compiled data from various sources to analyze the primary factors influencing crop yield in selected districts. Our findings highlight the importance of soil factors, meteorological conditions, and agricultural practices. Each of these factors was thoroughly investigated by compiling a primary dataset for each category, which was later merged into a comprehensive dataset.
The study also examined the influence of specific features within each factor on crop yield. The original dataset was utilized to train various regression machine learning models, and their performance was compared using metrics such as the R-squared score and RMSE. The Extra Trees Regressor achieved the highest R-squared score of 0.9615, indicating its good prediction accuracy. Furthermore, the ML models were categorized into distinct groups based on their underlying techniques and methodologies, specifically linear, neighbors-based, and tree-based models.
Analyzing the average performances of these model groups revealed that the tree-based models demonstrated the highest average R-squared score of 0.9353, followed by the neighbors-based models with a score of 0.9002 and the linear models with a score of 0.8568. Additionally, the study briefly discusses the performance of the models in predicting crop yields for each specific crop, which is presented in tabulated form.
Overall, this research paper provides valuable insights into the factors influencing crop yield and demonstrates the effectiveness of machine learning models in predicting and understanding agricultural outcomes. The findings contribute to the existing body of knowledge and underscore the significance of considering various factors in optimizing crop production.

Figure 1. States considered in the research.

Figure 3. Algorithm for merging the datasets.

Figure 4. Algorithm for encoding and normalization.

Figure 5. Feature selection process. Wrapper methods identify feature subsets by training different combinations of the dataset on the ML model, thus taking into account the performance of the model and the features influencing that performance; here, the feature selection is essentially wrapped around the ML model. They are computationally expensive but assess feature relevance more accurately. Recursive feature elimination (RFE) begins with all features and recursively eliminates unimportant ones by repeatedly fitting the data on a smaller subset of features. Forward feature selection adds the most important features one at a time, while backward feature selection eliminates features based on their contribution to model performance. Embedded methods select features as part of the model's training process while considering the performance of the model at the same time; they can be based on L1 regularization, trees, or neural networks. Tree-based methods assign importance scores to features based on their overall contribution to predicting the output variable, whereas neural network-based methods are used for unsupervised feature selection. Dimensionality reduction techniques like PCA reduce the dimensionality of the dataset by transforming the features into a smaller dimension while encompassing most of the information contained in the original dataset. Figure 6 shows the feature selection algorithm.
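As a minimal sketch of one of the wrapper methods mentioned above, recursive feature elimination can be run with scikit-learn; the synthetic regression data here merely stands in for the merged soil/weather/crop feature matrix:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.feature_selection import RFE

# Synthetic stand-in: 10 candidate features, of which 4 are informative.
X, y = make_regression(n_samples=200, n_features=10,
                       n_informative=4, random_state=0)

# RFE refits the estimator repeatedly, dropping the least important
# feature (by the tree model's feature importances) until 4 remain.
selector = RFE(ExtraTreesRegressor(n_estimators=50, random_state=0),
               n_features_to_select=4)
selector.fit(X, y)
print(selector.support_)  # boolean mask of the retained features
```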

Figure 7. Number of features vs. Average R-squared score.

Inputs: Training data
Outputs: Trained machine learning (linear regressor) model
1. Initialize the model with arbitrary values for the weights and bias. The meteorological, crop, and soil data are fed as inputs to the model.
2. Compute initial predictions of the crop yield from the initialized model.
3. Use a loss function to find out how poorly the model performed in its predictions; here, mean squared error (MSE) is used as the loss function.
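The steps above, together with the gradient-descent update that completes the loop, can be sketched as follows; the learning rate, epoch count, and synthetic data are illustrative assumptions rather than the paper's settings:

```python
import numpy as np

def train_linear(X, y, lr=0.05, epochs=2000):
    # Step 1: initialize weights and bias (zeros here instead of arbitrary values).
    w = np.zeros(X.shape[1])
    b = 0.0
    n = len(y)
    for _ in range(epochs):
        # Step 2: compute predictions from the current model.
        pred = X @ w + b
        # Step 3: MSE loss; its gradient with respect to w and b drives learning.
        err = pred - y
        grad_w = (2.0 / n) * (X.T @ err)
        grad_b = (2.0 / n) * err.sum()
        # Gradient-descent update step.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Recover a known noiseless relationship y = 3*x0 - 2*x1 + 1.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X @ np.array([3.0, -2.0]) + 1.0
w, b = train_linear(X, y)
print(np.round(w, 3), round(b, 3))
```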
Inputs: Training data
Outputs: Trained machine learning model
1. Initialize the model:
• Set the mean crop yield as the initial prediction of the ensemble model.
2. For each weak learner (decision tree) in the ensemble model:
• Calculate the negative gradient of the loss function by computing the residuals with respect to the model's current prediction of the crop yield.
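For squared-error loss, the negative gradient in step 2 is simply the residual, so the ensemble procedure above can be sketched as follows; the tree count, learning rate, depth, and synthetic target are illustrative assumptions:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_fit(X, y, n_trees=50, lr=0.1, depth=3):
    # Step 1: initialize the ensemble's prediction with the mean target value.
    f0 = y.mean()
    pred = np.full_like(y, f0, dtype=float)
    trees = []
    for _ in range(n_trees):
        # Step 2: for squared-error loss, the negative gradient is the residual.
        residual = y - pred
        tree = DecisionTreeRegressor(max_depth=depth)
        tree.fit(X, residual)
        # Shrink each tree's contribution by the learning rate.
        pred += lr * tree.predict(X)
        trees.append(tree)
    return f0, trees

def gradient_boost_predict(X, f0, trees, lr=0.1):
    return f0 + lr * sum(t.predict(X) for t in trees)

rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 3))
y = 10 * X[:, 0] + np.sin(6 * X[:, 1])
f0, trees = gradient_boost_fit(X, y)
train_pred = gradient_boost_predict(X, f0, trees)
```

Each round fits a small tree to what the ensemble still gets wrong, so the boosted prediction steadily improves on the mean-only starting point.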

x_i = Actual value; x̄ = Mean of the actual values; y_i = Predicted value; n = Number of observations/rows.

4. Results and Discussions
4.1. Dataset Description

Figure 14. Bar plots showing average production (in 1000 tons) for each crop.

Figure 16. Bar plots showing influence of meteorological factors on rice crop production: (a) Surface Pressure (kPa) vs. average production of rice (in 1000 tons); (b) Surface Temperature (°C) vs. average production of rice (in 1000 tons); (c) Specific Humidity (g/kg) vs. average production of rice (in 1000 tons); (d) Cloud Amount (%) vs. average production of rice (in 1000 tons); (e) UVB Irradiance (W/m²) vs. average production of rice (in 1000 tons).

Figure 17. Bar plots showing influence of soil factors on sugarcane crop production: (a) pH vs. average production of sugarcane (in 1000 tons); (b) Soil Type vs. average production of sugarcane (in 1000 tons).

Figure 18. Residual plots of the Random Forest regressor and the Bagging regressor.

Table 1. Sample of soil dataset.

Table 2. Summary of the attributes in the meteorological dataset.

Table 3. Sample of meteorological dataset.

Table 4. Sample of agricultural dataset.

Table 5. Performance metrics values for each model. The lowest MAE is observed for the extra trees regressor, indicating that this model has the lowest average magnitude of errors in its predictions compared to the other models, while the highest MAE is shown by the linear regressor, suggesting that its predictions are less accurate than those of the other models. For RMSE, the lowest value is again shown by the extra trees regressor, but the highest value is shown by the Bayesian ridge regressor.
MAE: Extra Trees < Random Forest ≈ LGBM ≈ Bagging < Gradient Boosting < Decision Tree < KNN < Bayesian Ridge ≈ Ridge ≈ Linear
RMSE: Extra Trees < Random Forest < LGBM < Bagging < Gradient Boosting < Decision Tree < KNN < Linear ≈ Ridge ≈ Bayesian Ridge

Table 6. Models grouped based on their underlying techniques and methodologies.

Table 7. Group-wise average performance of models.
MAE: Tree-Based < Neighbors-Based < Linear

Table 8. Performance of the models for each crop.