3.1. Dataset Description
The data used in this paper were collected from the Kaggle repository and comprise 880 soil samples, each described by 12 physicochemical soil properties and one dependent variable, soil fertility. The dataset is well structured and has no missing values, so it needs little cleaning before machine learning analysis. Among the input parameters, Nitrogen (N) and Potassium (K) are recorded as integers, whereas Phosphorus (P) and the remaining chemical variables are continuous floating-point values. Soil reaction is measured by pH, and salinity by electrical conductivity (EC). Organic Carbon (OC) indicates the organic matter content of the soil, and Sulfur (S) indicates the availability of secondary nutrients. The dataset also contains five micronutrients: Zinc (Zn), Iron (Fe), Copper (Cu), Manganese (Mn), and Boron (B). The target variable is an integer-coded categorical variable indicating the soil fertility level, subsequently mapped to the qualitative labels Low, Medium, and High. The data are therefore of mixed type: 10 continuous numerical features, 2 discrete numerical features, and 1 categorical target variable.
Figure 1 presents the proposed model for soil fertility prediction.
Most soil parameters are continuous numerical values; only Nitrogen and Potassium are discrete. These physicochemical measurements serve as the input variables of the machine learning models used in this study. Overall, the dataset consists of ten continuous numerical variables, two discrete numerical variables, and one categorical target variable, which makes it suitable for classification with machine learning methods.
3.2. Data Preprocessing
Data preprocessing was performed to improve data quality and ensure compatibility with machine learning algorithms. First, any missing values in the numeric features were handled by mean imputation, in which a missing value is replaced by the mean of the corresponding feature. This method preserves each feature's mean without removing any samples.
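The imputation step can be illustrated with a minimal sketch. The function name `impute_mean` and the pH readings are hypothetical and serve only to show the mechanics; missing entries are assumed to be encoded as `None`.

```python
from statistics import mean


def impute_mean(column):
    """Replace None entries in a numeric column with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    fill = mean(observed)
    return [fill if v is None else v for v in column]


# Hypothetical pH readings with one missing entry (not taken from the dataset).
ph = [6.5, None, 7.5, 7.0]
ph_imputed = impute_mean(ph)
print(ph_imputed)
```

The sample count is unchanged and the feature mean is preserved, although the variance of an imputed feature shrinks slightly because the filled values sit exactly at the mean.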
Next, feature scaling was applied via standardization, which rescales every input feature to zero mean and unit variance. Distance-based and gradient-based models are sensitive to feature scale: without scaling, features with larger ranges would dominate the distance or gradient computations.
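Standardization replaces each value x with z = (x - mean) / standard deviation. The sketch below assumes the population standard deviation is used, as is common in scaling utilities; the function name `standardize` and the Nitrogen values are hypothetical.

```python
from statistics import fmean, pstdev


def standardize(column):
    """Rescale a numeric column to zero mean and unit variance (z-scores)."""
    mu = fmean(column)
    sigma = pstdev(column)  # population standard deviation (an assumption)
    if sigma == 0:
        # A constant feature has no spread; map it to zeros instead of dividing by zero.
        return [0.0 for _ in column]
    return [(v - mu) / sigma for v in column]


# Hypothetical Nitrogen values (not taken from the dataset).
nitrogen = [138, 213, 163, 157, 270]
z = standardize(nitrogen)
print([round(v, 3) for v in z])
```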
After preprocessing and feature analysis, the data were split into training and test sets to evaluate the predictive accuracy of the machine learning models. Of the 880 soil samples, 80% (704) formed the training set and 20% (176) the test set.
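The split sizes can be checked with a short sketch. It assumes a simple shuffled split with a fixed seed; whether the original split was stratified by fertility class is not stated, and the helper name `train_test_split_indices` and the seed value are hypothetical.

```python
import random


def train_test_split_indices(n_samples, test_fraction=0.2, seed=42):
    """Shuffle sample indices and split them into training and test index lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = train_test_split_indices(880)
print(len(train_idx), len(test_idx))
```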
3.5. Machine Learning Models
To provide a comprehensive measure of predictive performance, 16 supervised machine learning classification models were evaluated. These models cover a wide variety of learning paradigms, including linear, non-linear, ensemble-based, boosting-based, and neural-network-based approaches. The evaluated classifiers are Logistic Regression, K-Nearest Neighbors, Support Vector Machine, Decision Tree, Random Forest, Extra Trees, Gradient Boosting, AdaBoost, Naive Bayes, Linear Discriminant Analysis, Multi-layer Perceptron, Ridge Classifier, Stochastic Gradient Descent Classifier, XGBoost (version 2.0.3), LightGBM (version 4.3.0), and CatBoost (version 1.2.5). To ensure a fair basis for comparison, all models were trained with their default hyperparameter settings. This approach allows an objective test of the innate learning ability of each model on the soil fertility prediction problem.
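A comparison of this kind can be sketched as follows, assuming scikit-learn is used. Only four of the listed classifiers are shown, the Gaussian variant of Naive Bayes is an assumption, and the data are a synthetic stand-in with the same shape as the soil dataset, so the printed accuracies bear no relation to the reported results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the soil data: 880 samples, 12 features, 3 classes.
X, y = make_classification(n_samples=880, n_features=12, n_informative=6,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Standardize using statistics estimated on the training split only.
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# Default hyperparameters throughout; only the random seed is fixed.
models = {
    "Logistic Regression": LogisticRegression(),
    "K-Nearest Neighbors": KNeighborsClassifier(),
    "Naive Bayes": GaussianNB(),
    "Random Forest": RandomForestClassifier(random_state=0),
}

accuracies = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    accuracies[name] = accuracy_score(y_test, model.predict(X_test))

for name, acc in sorted(accuracies.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {acc:.3f}")
```

Keeping every model in one dictionary makes it straightforward to add the remaining classifiers without changing the evaluation loop.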
3.8. Minority Class Performance
The low accuracy achieved on Class 2 suggests that class imbalance and overlapping feature distributions exist. These drawbacks could be addressed in future research with techniques such as data resampling, cost-sensitive learning, and class weighting, which may improve predictive accuracy on minority classes.
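As an illustration of class weighting, one common heuristic assigns each class the weight n_samples / (n_classes × class_count), so that rare classes contribute more to the training loss. The sketch below assumes this heuristic; the label counts are hypothetical because the actual class distribution is not given in this section.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * cnt)
            for cls, cnt in sorted(counts.items())}


# Hypothetical label distribution with a small Class 2.
labels = [0] * 40 + [1] * 44 + [2] * 4
weights = balanced_class_weights(labels)
for cls, w in weights.items():
    print(f"class {cls}: weight {w:.3f}")
```

Weights of this form can typically be passed to a classifier's class-weight option, which is one way to realize the cost-sensitive training suggested above.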
Table 1 shows the relative performance of the 16 machine learning models, measured by accuracy, precision, recall, and F1-score, for predicting soil fertility levels. The Random Forest classifier achieved the highest accuracy (90.91%) and the highest precision (0.94), meaning it produced the fewest false positives of all models. Its recall (0.67) and overall F1-score (0.69) are lower, however, which indicates that performance is not uniform across classes. Extra Trees, CatBoost, and XGBoost are other ensemble-based models that also performed well, with accuracies over 88% and fairly balanced precision–recall trade-offs. Gradient Boosting and LightGBM further demonstrated the effectiveness of ensemble learning methods for soil fertility classification. Linear models, such as Logistic Regression, Ridge Classifier, and SGD Classifier, achieved moderate results, indicating that they are less able to capture complex nonlinear relationships among soil parameters.
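A high accuracy can coexist with a much lower recall and F1-score when one class is rarely predicted correctly. The sketch below illustrates this, assuming the reported precision, recall, and F1-score are macro-averaged over the three classes (the averaging scheme is not stated). The confusion matrix is hypothetical and is not the one shown in Figure 8.

```python
def macro_metrics(cm):
    """Return accuracy and macro-averaged precision, recall, and F1.

    cm[i][j] is the number of samples of true class i predicted as class j.
    """
    k = len(cm)
    total = sum(sum(row) for row in cm)
    accuracy = sum(cm[i][i] for i in range(k)) / total
    precisions, recalls, f1s = [], [], []
    for c in range(k):
        tp = cm[c][c]
        predicted = sum(cm[i][c] for i in range(k))  # column sum
        actual = sum(cm[c])                          # row sum
        p = tp / predicted if predicted else 0.0
        r = tp / actual if actual else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f)
    return accuracy, sum(precisions) / k, sum(recalls) / k, sum(f1s) / k


# Hypothetical confusion matrix with a small, poorly recalled third class.
cm = [[45, 3, 0],
      [4, 40, 0],
      [3, 3, 2]]
accuracy, precision, recall, f1 = macro_metrics(cm)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
```

In this example the few predictions made for the third class are all correct, which raises macro precision, while most of its samples are missed, which lowers macro recall and F1.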
Figure 7 compares the accuracy of the different machine learning models, that is, how accurately each model predicts the soil fertility level. As can be observed, ensemble-based models such as Random Forest, CatBoost, and XGBoost are superior to other classifiers, suggesting they can capture complex relationships among soil parameters. Models such as Logistic Regression, SGD Classifier, and LightGBM are also competitive and achieve quite high accuracy. Simpler models like Naive Bayes achieve much lower accuracy, indicating that they are not particularly effective on this dataset. In general, the findings indicate that ensemble learning techniques are more appropriate for predicting soil fertility levels.
Figure 8 presents the confusion matrix, showing the classification performance of the Random Forest model for the three soil fertility classes. The model correctly classified most samples in Classes 0 and 1, with 74 and 83 correct predictions, respectively. The few misclassifications occur mostly between adjacent fertility levels, which is expected given the similarities in soil properties. For Class 2, the model makes fewer correct predictions, indicating difficulty in differentiating this class from the others, likely due to fewer samples or overlapping feature distributions. In general, the confusion matrix shows high predictive power for the Random Forest model, especially for the most prevalent fertility classes.
Figure 9 presents the performance metrics by class. The Random Forest model achieves high precision, recall, and F1-score for Class 0 and Class 1, with all metrics above 0.88, which indicates reliable and consistent classification of soils in these two fertility classes. Class 2, in turn, shows poorer results, indicating difficulty in distinguishing this category, which could be explained by class imbalance and overlapping feature distributions. In general, the findings show that the model has strong predictive power for the majority fertility classes and that there is room to improve its predictions for the minority class.
Figure 10 presents the learning curve of the Random Forest model. The training accuracy is very high, staying close to 1.0 at all training sizes, which shows the model's high learning capacity. The validation accuracy increases consistently with training size and levels off at 0.89–0.90, indicating that generalization improves with additional data. The gap between the training and validation accuracies narrows at larger sample sizes, which suggests limited overfitting and good model stability. Overall, the learning curve indicates that the model benefits from additional training data and performs stably on unseen samples.
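A learning curve of this kind can be computed as in the following sketch, assuming scikit-learn's `learning_curve` utility with 5-fold cross-validation. The number of folds, the training-size grid, the reduced number of trees (chosen only to keep the example fast), and the synthetic data are all assumptions and do not reproduce the settings behind Figure 10.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

# Synthetic stand-in for the soil data: 880 samples, 12 features, 3 classes.
X, y = make_classification(n_samples=880, n_features=12, n_informative=6,
                           n_classes=3, random_state=0)

train_sizes, train_scores, val_scores = learning_curve(
    RandomForestClassifier(n_estimators=50, random_state=0),
    X, y,
    train_sizes=np.linspace(0.2, 1.0, 5),  # fractions of the training folds
    cv=5,
    scoring="accuracy",
)

# Average over the cross-validation folds for each training size.
mean_train = train_scores.mean(axis=1)
mean_val = val_scores.mean(axis=1)
for n, tr, va in zip(train_sizes, mean_train, mean_val):
    print(f"n={int(n):4d}  train={tr:.3f}  validation={va:.3f}")
```

Plotting `mean_train` and `mean_val` against `train_sizes` gives the two curves whose gap is discussed above.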
Figure 11 presents the ROC curves, showing how well the Random Forest model classifies soil fertility under the one-vs-rest approach. The model shows strong discriminative power for Class 0 and Class 1, with AUCs of 0.97 and 0.94, respectively, which means it is very effective at separating each of these fertility classes from the others. In comparison, Class 2 has a lower AUC of 0.75, indicating that the class is more difficult to predict, perhaps due to class imbalance or similarities in soil properties. Overall, the ROC analysis demonstrates the good predictive power of the Random Forest model, especially for the most prevalent soil fertility classes.
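In the one-vs-rest setting, each class is treated in turn as the positive class and all remaining classes as negative. The AUC for a class then equals the probability that a randomly chosen positive sample receives a higher predicted probability than a randomly chosen negative one, with ties counted as one half. The sketch below computes it this way; the predicted probabilities are hypothetical.

```python
def roc_auc(scores, is_positive):
    """AUC as the probability that a random positive outranks a random negative."""
    pos = [s for s, p in zip(scores, is_positive) if p]
    neg = [s for s, p in zip(scores, is_positive) if not p]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    # Count a win as 1 and a tie as 0.5 over all positive/negative pairs.
    wins = sum((sp > sn) + 0.5 * (sp == sn) for sp in pos for sn in neg)
    return wins / (len(pos) * len(neg))


def one_vs_rest_auc(probabilities, labels, n_classes):
    """Compute one AUC per class from a matrix of predicted class probabilities."""
    return {c: roc_auc([row[c] for row in probabilities],
                       [y == c for y in labels])
            for c in range(n_classes)}


# Hypothetical predicted probabilities for six samples (two per class).
probs = [
    [0.8, 0.1, 0.1],
    [0.6, 0.3, 0.1],
    [0.2, 0.7, 0.1],
    [0.3, 0.5, 0.2],
    [0.4, 0.4, 0.2],
    [0.1, 0.3, 0.6],
]
labels = [0, 0, 1, 1, 2, 2]
aucs = one_vs_rest_auc(probs, labels, 3)
print(aucs)
```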
Figure 12 presents the Precision–Recall curves, which show the relationship between precision and recall for each soil fertility category. The Random Forest model achieves strong results for Classes 0 and 1, with PR-AUCs of 0.95 and 0.92, respectively, indicating stable, accurate predictions for these two fertility levels. Conversely, Class 2 has a much lower PR-AUC of 0.32, indicating lower predictive reliability. This tendency can be attributed to class imbalance and overlapping feature distributions. Overall, the PR-AUC analysis reveals the strength of the Random Forest model on the majority fertility classes and the need for a stronger strategy to better represent minority fertility levels.
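One common estimate of the area under a Precision–Recall curve is average precision, the mean of the precision values measured at the rank of each positive sample. The sketch below assumes this estimator (the section does not state how PR-AUC was computed) and breaks score ties by input order, which is a simplification; the scores are hypothetical.

```python
def average_precision(scores, is_positive):
    """Mean of the precision values at the rank of each positive sample."""
    ranked = sorted(zip(scores, is_positive), key=lambda t: t[0], reverse=True)
    n_pos = sum(1 for _, positive in ranked if positive)
    if n_pos == 0:
        raise ValueError("need at least one positive sample")
    hits, total = 0, 0.0
    for rank, (_, positive) in enumerate(ranked, start=1):
        if positive:
            hits += 1
            total += hits / rank  # precision among the top `rank` samples
    return total / n_pos


# Hypothetical one-vs-rest scores for a minority class (2 positives out of 6).
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
is_minority = [False, True, False, False, True, False]
ap = average_precision(scores, is_minority)
print(f"average precision = {ap:.2f}")
```

Unlike ROC AUC, whose chance level is 0.5, the chance level of this measure is roughly the prevalence of the positive class, so a minority-class PR-AUC should be read against the share of that class in the test set.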
Figure 13 presents the distribution of predicted probabilities, showing the confidence of the Random Forest model across soil fertility classes. The Class 0 and Class 1 distributions are well separated and concentrated toward higher probability values, indicating high model confidence in predicting these classes. Class 2, on the other hand, exhibits a sharp peak at lower probability values, indicating lower confidence and greater uncertainty in its predictions. This observation aligns with the previous assessment outcomes and suggests that class imbalance or shared feature attributes may reduce the model’s ability to reliably predict the minority fertility category.
Figure 14 presents the ranking of the machine learning models for soil fertility prediction. The ranking shows that ensemble-based models perform well and that the RF model gives the best results, with an accuracy of 0.90, precision of 0.93, recall of 0.67, and F1-score of 0.69. All ensemble models reach an accuracy of 88% or higher. This trend underscores the importance of ensemble learning for capturing non-linear relationships among soil characteristics, such as nutrient concentrations, pH, and organic matter. Probabilistic models, such as NB, do not perform well on this dataset.
Figure 15 presents the relationship between accuracy and F1-score for soil fertility prediction. This comparison indicates how well each model generalizes and how robust it is to class imbalance. The RF model achieves the highest accuracy (0.90) of all models, but its significantly lower F1-score indicates an imbalance between precision and recall. Even high-accuracy models can thus underperform once class imbalance is taken into account, which matters for soil fertility prediction because minority classes often reflect damaged or nutrient-deficient soils.
Figure 16 presents a comprehensive heatmap of the performance of all machine learning models across the performance metrics. Ensemble models consistently perform well, achieving high scores across all metrics, which indicates that they are robust. The linear and probabilistic models, in contrast, score lower, with the largest drops in recall and F1-score.
Figure 17 presents a box plot of the performance metrics for soil fertility prediction, highlighting their central tendency, dispersion, and outliers. Accuracy exhibits the highest median value and remains stable across all models, whereas precision, recall, and F1-score show greater variability, indicating inconsistent performance across classes. Precision varies considerably, which indicates that several models struggle to maintain consistent confidence when predicting fertile and non-fertile soil classes. The outliers in the F1-score show that some models fail to balance precision and recall, underscoring the inadequacy of accuracy as a standalone evaluation metric and the need for a multimetric evaluation of soil fertility prediction.
Figure 18 compares the F1-score distributions of the ensemble and non-ensemble models, which illustrates how robust each group is for soil fertility prediction. Ensemble-based models obtain the highest median F1-score. Non-ensemble models, on the other hand, exhibit higher variability, a lower median performance, and noticeable low-performing outliers.