Abstract
Agriculture is the backbone of a country and plays a vital role in shaping its economic performance. Factors such as natural disasters, extreme weather changes, pests, and soil quality significantly affect productivity and often lead to economic losses. Accurate predictions in agricultural practice, particularly crop recommendations, can substantially improve productivity and resource management. This research aims to develop a robust crop recommendation system using ensemble learning (EL), which combines multiple machine learning (ML) models for improved performance. The study uses two datasets: a real-time dataset available on Kaggle, collected using IoT sensors, and a synthetic dataset generated using CTGAN. These datasets support recommendations for 22 different crops based on key features such as nitrogen, phosphorus, potassium, soil pH, humidity, and rainfall. The performance of various ML models, such as logistic regression (LR), support vector machine (SVM), decision tree (DT), naïve Bayes (NB), K-nearest neighbor (KNN), random forest (RF), extra tree classifier, XGBoost, and gradient boost, is compared with that of EL models, including voting, bagging, boosting, and stacking ensemble techniques. The stacking ensemble model achieved the highest accuracy of all the ensemble techniques, at 99.36%. Further optimizing this model with the Optuna hyper-parameter tuning framework improved the accuracy to 99.43%.
1. Introduction
Agriculture is a vital asset for any country, contributing to its strength and independence. A nation thrives when it can successfully meet its agricultural needs, becoming self-sufficient and reducing reliance on other countries for daily necessities. The wealth of a country lies in its farming industry and its farmers. However, modern agriculture faces numerous challenges, including global warming, wars, infectious diseases, and pests [1]. To combat these issues, AI can be used in crop prediction, weather forecasting, soil health analysis, precision farming, yield prediction, pest control, and many more applications [1,2].
In this research paper, we propose a machine learning (ML)-based crop recommendation system. The dataset used in this study was sourced from Kaggle (https://www.kaggle.com/datasets/atharvaingle/crop-recommendation-dataset, accessed on 19 September 2024) and was collected using IoT sensors. These sensors measure critical soil parameters including moisture, pH, temperature, humidity, and essential soil nutrients such as phosphorus (P), nitrogen (N), and potassium (K). Based on these parameters, our system recommends the most suitable crop to achieve improved yield.
2. Literature Review
Dey et al. (2024) [3] divided their dataset into two parts: one containing 11 agricultural plants and the other containing 10 horticultural plants. The authors implemented five distinct machine learning models—SVM, XGBoost, RF, KNN, and DT—on each of the separate datasets rather than on a combined dataset. The XGBoost model achieved an accuracy of 99.09%.
Islam et al. (2023) [4] proposed an ML-enabled IoT device to monitor soil parameters such as N, P, K, pH, temperature, and soil humidity. The authors employed ML algorithms such as CatBoost, voting, and bagging to predict the recommended crop from the measured soil parameters. CatBoost obtained the highest accuracy, at 97.5%.
Kiruthika et al. (2023) [5] proposed a method based on improved-distribution-based chicken swarm optimization (IDCSO) with weight-based long short-term memory (WLSTM) for crop prediction. The authors achieved an accuracy of 95% by employing the IDCSO algorithm for feature selection.
Ramzan et al. (2024) [6] implemented ML and EL models on two types of data: real-time data and hybrid data (real-time data and manual data). The authors used these models to predict the recommended crop and compared the performance of the ML and EL models.
Nikhil et al. (2024) [7] used ML models to predict crop yield from weather, soil, and crop data. The authors found that the extra tree regressor achieved the highest performance among the ML models, followed by the random forest regressor and the LGBM regressor.
Elbasi et al. (2023) [17] used fifteen different ML algorithms with a new feature combination scheme. The authors achieved 99.59% accuracy using the Bayes net algorithm and 99.46% using the naïve Bayes classifier and the Hoeffding tree algorithm.
Raja et al. (2022) [9] developed a range of feature selection and classification techniques to predict the yield size of plant cultivations. Their study likely involved identifying key features (such as soil quality, temperature, humidity, and other environmental or agronomic factors) that influence crop yield. Using various machine learning or statistical models, they aimed to classify and predict the expected yield based on these selected features.
Sharma et al. (2024) [10] demonstrated how different ML models, such as K-nearest neighbors (KNN) and deep learning algorithms, can achieve high accuracy in crop selection and disease prediction.
Parween et al. (2021) [11] explored the integration of IoT with ML techniques to create a precise crop prediction system, improving decision-making for farmers through real-time environmental monitoring. The system helped to reduce input costs and boosted productivity by recommending the most appropriate crops based on current soil and weather conditions.
Bakthavatchalam et al. (2022) [12] proposed a machine learning-based system to recommend the best high-yielding crops based on a combination of eight different agricultural attributes. Their aim was to improve precision agriculture using supervised learning algorithms implemented in WEKA. The study evaluated different classification algorithms for crop prediction using a multilayer perceptron and rule-based classifier. The performance of the models was evaluated based on accuracy metrics. The results showed that the selected classifiers achieved a high level of prediction accuracy, with a performance rate of 98.2273%.
3. Data Pre-Processing
The dataset used in this study is an IoT sensor dataset available on Kaggle. It includes soil nutrient measurements such as nitrogen (N), phosphorus (P), and potassium (K), together with other parameters such as soil pH, moisture, and rainfall, with the type of recommended crop as the target variable, as shown in Figure 1. The Kaggle dataset has 2200 rows, and its class distribution is shown in Figure 2.
Figure 1.
Sample dataset.
Figure 2.
Class distribution of original dataset.
The Kaggle dataset is relatively small, well-balanced, and contains no missing values. In real-world scenarios, however, sensor data can contain missing values, noise, and errors. To account for this and to expand our dataset, we generated approximately 1000 synthetic rows using CTGAN (conditional tabular generative adversarial network) [13], a deep learning model. CTGAN creates synthetic datasets by learning the distribution of the original tabular data, which helps to preserve its statistical properties. The CTGAN was trained for 200 epochs with a generator learning rate of 0.0002 and a discriminator learning rate of 0.0001. The quality of the synthetic data relative to the Kaggle dataset was evaluated, and the result is plotted in Figure 4. The synthetic data generated by CTGAN were then concatenated with the original data to create a more robust dataset. The class distribution of the combined dataset, with 3200 rows, is shown in Figure 3. The SMOTE [14] technique was then applied to address the class imbalance in the combined dataset. After applying SMOTE, each class contains 161 samples. The details of the dataset are available in Table 1.
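The interpolation idea behind SMOTE can be illustrated with a minimal, standard-library sketch. It is not the implementation used in this study, which is not specified here; the function name `smote_oversample`, the neighbour count `k`, and the sample readings are all hypothetical and serve only to show how new minority-class rows are placed between existing ones.

```python
import random

def smote_oversample(samples, n_new, k=5, seed=0):
    """Create n_new synthetic rows for one minority class.

    Each synthetic row is a random interpolation between a real row and
    one of its k nearest neighbours from the same class.
    Requires at least two rows in `samples`.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(samples)
        # Rank the other rows by squared Euclidean distance to `base`.
        neighbours = sorted(
            (s for s in samples if s is not base),
            key=lambda s: sum((a - b) ** 2 for a, b in zip(base, s)),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic

# Hypothetical minority class with three (N, P, K) readings.
minority = [[90.0, 42.0, 43.0], [85.0, 58.0, 41.0], [60.0, 55.0, 44.0]]
new_rows = smote_oversample(minority, n_new=4, k=2)
balanced = minority + new_rows
```

Because every synthetic row lies on the line segment between two real rows of the same class, the new samples stay inside the region already occupied by that class rather than duplicating existing rows.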
Figure 3.
Class distribution of original + synthetic dataset.
Figure 4.
Evaluation of original and synthetic dataset using CTGAN.
Table 1.
Dataset details.
4. Methodology
In this research paper, we developed a classification model to predict the recommended crop based on seven features: N, P, K, soil pH, temperature, humidity, and rainfall. We implemented various machine learning (ML) models, including logistic regression (LR), support vector machine (SVM), decision tree (DT), naïve Bayes (NB), and K-nearest neighbor (KNN). Additionally, we used ensemble learning (EL) models such as random forest (RF), extra tree, XGBoost, bagging, gradient boosting, voting, and stacking. Ensemble learning methods have provided promising results in many fields [15,16]. The details of the EL models implemented in this work are given in Table 2.
Table 2.
Ensemble model parameters.
These machine learning (ML) and ensemble learning (EL) models were evaluated on two types of datasets: the original dataset sourced from Kaggle and a concatenated dataset comprising both the Kaggle dataset and a synthetic dataset. The synthetic dataset was generated using CTGAN, a generative AI technique. Pre-processing steps were applied to the datasets to normalize the values and convert categorical variables into numerical values, as illustrated in Figure 5. The datasets were then divided into training and testing sets in an 80:20 ratio. The models were trained on the training set and evaluated on the testing set.
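The pre-processing and splitting steps described above can be sketched as follows. This is a minimal illustration using only the Python standard library, assuming min-max scaling as the normalization method (the scaling method actually used is not stated); the helper names and the sample rows are hypothetical.

```python
import random

def min_max_scale(rows):
    """Rescale every feature column to the [0, 1] range."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]

def encode_labels(labels):
    """Map each crop name to an integer code (sorted for reproducibility)."""
    mapping = {name: code for code, name in enumerate(sorted(set(labels)))}
    return [mapping[name] for name in labels], mapping

def train_test_split(X, y, test_fraction=0.2, seed=42):
    """Shuffle row indices and hold out `test_fraction` of them for testing."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(X) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    pick = lambda data, idx: [data[i] for i in idx]
    return pick(X, train_idx), pick(X, test_idx), pick(y, train_idx), pick(y, test_idx)

# Hypothetical rows: N, P, K, temperature, humidity, pH, rainfall.
features = [
    [90, 42, 43, 20.9, 82.0, 6.5, 202.9],
    [85, 58, 41, 21.8, 80.3, 7.0, 226.7],
    [20, 67, 20, 18.9, 21.5, 5.7, 105.9],
    [40, 72, 77, 17.0, 16.9, 7.5, 88.6],
    [23, 72, 84, 19.0, 17.1, 6.9, 79.9],
]
crops = ["rice", "rice", "kidneybeans", "chickpea", "chickpea"]

X = min_max_scale(features)
y, label_map = encode_labels(crops)
X_train, X_test, y_train, y_test = train_test_split(X, y)
```

The sketch scales before splitting, following the order of steps in the text. A common alternative is to compute the scaling bounds on the training split only, so that no information from the test rows influences the transformation.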
Figure 5.
Model of proposed approach.
Our findings indicate that the ensemble learning models outperformed the individual machine learning models. However, the synthetic dataset contains more noise than the Kaggle dataset, which results in lower accuracy on it. The accuracy, recall, and precision of the ML and EL models are given in Table 3.
Table 3.
Performance evaluation of ML and EL models.
5. Model Evaluation
The experiments were conducted in Google Colab Pro using a Python 3 Google Compute Engine backend with 15 GB of GPU RAM and 12 GB of system RAM. The datasets were divided into training and testing sets in an 80:20 ratio. Seven ML models, namely LR, SVM, DT, NB, KNN, RF, and extra tree, were tested on the Kaggle and synthetic datasets. The model performance was improved by implementing EL methods such as bagging, boosting, voting, and stacking [17]. The performance of the ML and EL models on the Kaggle and synthetic datasets is given in Table 3.
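Of the EL methods listed above, voting is the simplest to illustrate. The sketch below assumes hard (majority) voting; whether hard or soft voting was used in this study is not stated. The model names and their predictions are hypothetical.

```python
from collections import Counter

def hard_vote(predictions):
    """Return the label predicted by most models (ties go to the earliest model)."""
    counts = Counter(predictions)
    best = max(counts.values())
    return next(label for label in predictions if counts[label] == best)

# Hypothetical predictions from three base models for two soil samples.
per_model = {
    "decision_tree": ["rice", "maize"],
    "knn": ["rice", "jute"],
    "naive_bayes": ["jute", "cotton"],
}
# Transpose so that each tuple holds all model predictions for one sample.
ensemble = [hard_vote(sample_preds) for sample_preds in zip(*per_model.values())]
```

For the first sample two of the three models agree, so their label wins. For the second sample every model disagrees, and the tie is resolved in favour of the first model in the list.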
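The performance of the models was evaluated using accuracy, precision, and recall. For a class $c$ with $TP_c$ true positives, $FP_c$ false positives, and $FN_c$ false negatives, these metrics are defined as follows:
$$\text{Accuracy} = \frac{\text{number of correctly classified samples}}{\text{total number of samples}}$$
$$\text{Precision}_c = \frac{TP_c}{TP_c + FP_c}$$
$$\text{Recall}_c = \frac{TP_c}{TP_c + FN_c}$$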
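These three metrics can be computed directly from the true and predicted labels, as in the minimal sketch below. It assumes macro averaging of the per-class precision and recall (an unweighted mean over classes); the averaging scheme used for the reported results is not stated, and the labels are hypothetical.

```python
def classification_metrics(y_true, y_pred):
    """Return accuracy plus macro-averaged precision and recall."""
    classes = sorted(set(y_true) | set(y_pred))
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    precisions, recalls = [], []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        # A class that is never predicted (or never present) scores 0.
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
    return accuracy, sum(precisions) / len(classes), sum(recalls) / len(classes)

# Hypothetical labels for six test samples.
y_true = ["rice", "rice", "maize", "maize", "jute", "jute"]
y_pred = ["rice", "jute", "maize", "maize", "jute", "jute"]
accuracy, precision, recall = classification_metrics(y_true, y_pred)
```

In this example one rice sample is misclassified as jute, which lowers the recall of rice and the precision of jute while leaving maize unaffected.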
The stacking ensemble model, built from four base learners (the extra tree classifier, random forest classifier, XGB classifier, and decision tree classifier) and a logistic regression meta-learner, obtained an accuracy of 99.36%, the highest of all the ML and EL models, as shown in Figure 6. The proposed stacked ensemble model was compared with existing models, as shown in Table 4. Its accuracy of 99.36% is higher than that of the existing models in [3,5,6,9,12]. However, the accuracy of the Bayes net algorithm reported by Elbasi et al. [17] is greater, as those authors selected only specific features of the dataset for classification, whereas our proposed method uses the full dataset for prediction. Further, the stacking ensemble model was optimized using Optuna, an automatic hyper-parameter tuning framework [18], which searches for the optimal combination of hyper-parameters for the ML models. After tuning, the accuracy of the stacked ensemble model increased to 99.43%. However, the accuracy on the synthetic dataset, as given in Table 3, is lower than on the Kaggle dataset. On the synthetic dataset, the random forest (RF) model achieved the highest accuracy of 75.66%, outperforming the other ML and EL models.
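The structure of a stacking ensemble can be illustrated with a small standard-library sketch. It is not the model evaluated here: the two base learners (nearest neighbour and nearest centroid) and the lookup-table meta-learner are deliberately simple stand-ins for the tree-based base learners and the logistic regression meta-learner, and the use of out-of-fold predictions to train the meta-learner is an assumption. All names and data are hypothetical.

```python
from collections import Counter

def nearest_neighbour(train_X, train_y):
    """Base learner 1: predict the label of the closest training row."""
    def predict(x):
        dists = [sum((a - b) ** 2 for a, b in zip(x, row)) for row in train_X]
        return train_y[dists.index(min(dists))]
    return predict

def nearest_centroid(train_X, train_y):
    """Base learner 2: predict the class whose mean vector is closest."""
    centroids = {}
    for label in sorted(set(train_y)):
        rows = [r for r, l in zip(train_X, train_y) if l == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    def predict(x):
        return min(centroids,
                   key=lambda l: sum((a - b) ** 2 for a, b in zip(x, centroids[l])))
    return predict

def fit_stacking(X, y, base_learners, k=3):
    """Train base learners, then a meta-learner on their out-of-fold predictions."""
    n = len(X)
    meta_X = [None] * n
    for fold in range(k):
        # Each row is predicted by base models that never saw it during training.
        held_out = set(range(fold, n, k))
        tr_X = [X[i] for i in range(n) if i not in held_out]
        tr_y = [y[i] for i in range(n) if i not in held_out]
        models = [fit(tr_X, tr_y) for fit in base_learners]
        for i in held_out:
            meta_X[i] = tuple(m(X[i]) for m in models)
    # Meta-learner stand-in: most frequent true label per tuple of base predictions.
    votes = {}
    for key, label in zip(meta_X, y):
        votes.setdefault(key, Counter())[label] += 1
    table = {key: counts.most_common(1)[0][0] for key, counts in votes.items()}
    final_models = [fit(X, y) for fit in base_learners]  # refit on all rows
    def predict(x):
        key = tuple(m(x) for m in final_models)
        # Unseen combination: fall back to the first base learner's prediction.
        return table.get(key, key[0])
    return predict

# Hypothetical (N, rainfall) readings for two crops.
X = [[90, 200], [85, 220], [88, 210], [40, 80], [35, 90], [42, 85]]
y = ["rice", "rice", "rice", "chickpea", "chickpea", "chickpea"]
stacked = fit_stacking(X, y, [nearest_neighbour, nearest_centroid])
predictions = [stacked([87, 205]), stacked([38, 88])]
```

The key design point is that the meta-learner is trained on predictions made for rows the base learners did not see, so it learns how far each base learner can be trusted rather than memorizing their training-set behaviour.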
Figure 6.
Performance of ML and EL models.
Table 4.
Comparison of proposed and existing models.
6. Conclusions
In this research, the performance of ML and EL models was compared on both the Kaggle and synthetic datasets. We found that EL models perform well on the Kaggle dataset; in particular, the stacked ensemble model built from four base learners (the extra tree classifier, random forest classifier, XGB classifier, and decision tree classifier) and a logistic regression meta-learner outperformed the other ML and EL models. The performance of the stacked EL model was further improved using the Optuna optimizer. However, on the synthetic dataset generated using CTGAN, random forest achieved the highest accuracy, outperforming the other ML and EL models. This study highlights that EL models may not perform well if the dataset contains noisy data, as demonstrated by the lower accuracy on the synthetic dataset.
Author Contributions
Conceptualization, H.G.; methodology, H.G.; software, R.K.; validation, D.K. and R.K.; formal analysis, H.G. and D.K.; investigation, R.K.; resources, H.G.; data curation, H.G.; writing—original draft preparation, H.G.; writing—review and editing, D.K.; visualization, H.G. and D.K.; supervision, H.G.; project administration, H.G.; funding acquisition, H.G. All authors have read and agreed to the published version of the manuscript.
Funding
This research received no external funding.
Data Availability Statement
Conflicts of Interest
The authors declare no conflict of interest.
References
1. van Klompenburg, T.; Kassahun, A.; Catal, C. Crop yield prediction using machine learning: A systematic literature review. Comput. Electron. Agric. 2020, 177, 105709.
2. Talaviya, T.; Shah, D.; Patel, N.; Yagnik, H.; Shah, M. Implementation of artificial intelligence in agriculture for optimisation of irrigation and application of pesticides and herbicides. Artif. Intell. Agric. 2020, 4, 58–73.
3. Dey, B.; Ferdous, J.; Ahmed, R. Machine learning based recommendation of agricultural and horticultural crop farming in India under the regime of NPK, soil pH and three climatic variables. Heliyon 2024, 10, e25112.
4. Islam, M.R.; Oliullah, K.; Kabir, M.M.; Alom, M.; Mridha, M.F. Machine learning enabled IoT system for soil nutrients monitoring and crop recommendation. J. Agric. Food Res. 2023, 14, 100880.
5. Kiruthika, S.; Karthika, D. IoT-based professional crop recommendation system using a weight-based long-term memory approach. Meas. Sens. 2023, 27, 100722.
6. Ramzan, S.; Ghadi, Y.Y.; Aljuaid, H.; Mahmood, A.; Ali, B. An ingenious IoT based crop prediction system using ML and EL. Comput. Mater. Contin. 2024, 79, 183–199.
7. Nikhil, U.V.; Pandiyan, A.M.; Raja, S.P.; Stamenkovic, Z. Machine Learning-Based Crop Yield Prediction in South India: Performance Analysis of Various Models. Computers 2024, 13, 137.
8. Jhajharia, K.; Mathur, P.; Jain, S.; Nijhawan, S. Crop Yield Prediction using Machine Learning and Deep Learning Techniques. Procedia Comput. Sci. 2023, 218, 406–417.
9. Raja, S.P.; Sawicka, B.; Stamenkovic, Z.; Mariammal, G. Crop Prediction Based on Characteristics of the Agricultural Environment Using Various Feature Selection Techniques and Classifiers. IEEE Access 2022, 10, 23625–23641.
10. Sharma, K.; Kumar, D. ML- and IoT-Based Crop Prediction System. In Innovations in Electrical and Electronic Engineering; Shaw, R.N., Siano, P., Makhilef, S., Ghosh, A., Shimi, S.L., Eds.; ICEEE 2023, Lecture Notes in Electrical Engineering; Springer: Singapore, 2024; Volume 1109.
11. Parween, S.; Pal, A.; Snigdh, I.; Kumar, V. An IoT and Machine Learning-Based Crop Prediction System for Precision Agriculture. In Emerging Technologies for Smart Cities; Bora, P.K., Nandi, S., Laskar, S., Eds.; Lecture Notes in Electrical Engineering; Springer: Singapore, 2021; Volume 765.
12. Bakthavatchalam, K.; Karthik, B.; Thiruvengadam, V.; Muthal, S.; Jose, D.; Kotecha, K.; Varadarajan, V. IoT Framework for Measurement and Precision Agriculture: Predicting the Crop Using Machine Learning Algorithms. Technologies 2022, 10, 13.
13. Xu, L.; Skoularidou, M.; Cuesta-Infante, A.; Veeramachaneni, K. Modeling Tabular Data Using Conditional GAN. Available online: https://github.com/DAI-Lab/CTGAN (accessed on 19 September 2024).
14. Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic Minority Over-sampling Technique. J. Artif. Intell. Res. 2002, 16, 321–357.
15. Gunasekaran, H.; Gladys, A.; Kanmani, D.; Macedo, R.; Wilfred Blessing, N.R. Brain Stroke Prediction Using Stacked Ensemble Model. J. Kejuruter. 2024, 36, 1759–1768.
16. Gunasekaran, H.; Deepa Kanmani, S.; Ebenezer, S.; Blessing, W.; Ramalakshmi, K. Detection of Lung and Colon Cancer using Average and Weighted Average Ensemble Models. EAI Endorsed Trans. Pervasive Health Tech. 2024, 10.
17. Elbasi, E.; Zaki, C.; Topcu, A.E.; Abdelbaki, W.; Zreikat, A.I.; Cina, E.; Shdefat, A.; Saker, L. Crop Prediction Model Using Machine Learning Algorithms. Appl. Sci. 2023, 13, 9288.
18. Akiba, T.; Sano, S.; Yanase, T.; Ohta, T.; Koyama, M. Optuna: A Next-generation Hyperparameter Optimization Framework. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD '19), Anchorage, AK, USA, 1–8 August 2019; Association for Computing Machinery: New York, NY, USA, 2019; pp. 2623–2631.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).