Comparison of Machine Learning-Based Prediction of Qualitative and Quantitative Digital Soil-Mapping Approaches for Eastern Districts of Tamil Nadu, India

: The soil–environmental relationship identiﬁed and standardised over the years has expe-dited the growth of digital soil-mapping techniques; hence, various machine learning algorithms are involved in predicting soil attributes. Therefore, comparing the different machine learning algorithms is essential to provide insights into the performance of the different algorithms in predicting soil information for Indian landscapes. In this study, we compared a suite of six machine learning algorithms to predict quantitative (Cubist, decision tree, k-NN, multiple linear regression, random forest, support vector regression) and qualitative (C5.0, k-NN, multinomial logistic regression, naïve Bayes, random forest, support vector machine) soil information separately at a regional level. The soil information, including the quantitative (pH, OC, and CEC) and qualitative (order, suborder, and great group) attributes, were extracted from the legacy soil maps using stratiﬁed random sampling procedures. A total of 4479 soil observations sampled were non-spatially partitioned and intersected with 39 environmental covariate parameters. The predicted maps depicted the complex soil–environmental relationships for the study area at a 30 m spatial resolution. The comparison was facilitated based on the evaluation metrics derived from the test datasets and visual interpretations of the predicted maps. Permutation feature importance analysis was utilised as the model-agnostic interpretation tool to determine the contribution of the covariate parameters to the model’s calibration. The R 2 values for the pH, OC, and CEC ranged from 0.19 to 0.38; 0.04 to 0.13; and 0.14 to 0.40, whereas the RMSE values ranged from 0.75 to 0.86; 0.25 to 0.26; and 8.84 to 10.49, respectively.


Introduction
Conventional methods for soil surveys involve delineating soil polygons based on the subjective decisions made by the surveyors and are usually presented in printed reports or chart format.The lack of digital forms of soil information can limit the efficiency of several applications, in addition to the inefficiency of the traditional methods in representing the within-class variability and current variability of the respective soil attributes [1,2].In the last few decades, there has been an evident shift in soil surveys from soil surveyorbased qualitative soil delineation to data-driven quantitative assessment of the soils [3].The transformation was accelerated mainly due to the increasing need for soil resource information for selecting suitable crops, the area and yield estimation of crops, determining the predominant soil types, delineating the management zones and drainage classes, adopting appropriate land use plans, and identifying potential carbon sequestration zones, among others [4].With the technical advancements in remote sensing, geographical information systems, and data analysis, a cumulative collection of mapping procedures has been implemented and evolved to increase the accuracy of the methodology and the generated maps.In addition to the soil spectral information, extensive spatial coverage and temporal consistency from remote sensing data can help with the mapping of inaccessible locations [5].
Digital soil mapping aims to link the soil responses to environmental variables through the implication of inference and numerical models.DSM can be defined as "the creation, and population of spatial soil information systems (SSINFOS) by the use of field and laboratory observational methods, coupled with spatial and non-spatial soil inference models".Further, the need for a self-updating generic framework for spatial soil inference systems (SSINFERS) has been espoused to derive the data requested by users [6].The processes included in the digital soil-mapping procedures include (1) Generating soil databases for the particular soil attribute of interest; (2) Deriving and selecting the soil environmental covariates based on the SCORPAN factors that better depict the soil attributes; (3) Model calibration, validation, and parameter tuning; (4) Spatial prediction based on the model calibrated; (5) Interpolation or extrapolation of the prediction function if required; and (6) Accuracy assessment based on the independent datasets.
Conventional soil mapping procedures for characterising soil attributes involve destructive soil sampling, aerial photo interpretation, surveying based on vegetation and topography maps, and associated laboratory analyses.These procedures are highly time consuming and expensive when the mapping is performed at national or regional levels in addition to being based on the surveyor's conceptual or mental model [7,8].Some of the machine learning techniques that have been adopted so far based on a literature survey that included comparative studies are multiple linear regression (MLR); regression kriging (RK); random forest (RF); quantile regression forest (QRF); support vector machine (SVM); Bayesian networks; neural networks, e.g., artificial neural networks (ANN) and convolutional neural networks; the generalized additive model (GAM); logistic regression; distance-based learners, e.g., k-nearest neighbour (kNN); decision trees, e.g., Cubist (CB); classification and regression tree (CART); C5.0; principal component regression (PCR); partial least square regression (PLSR); extreme learning machines (ELM); boosted regression trees; and ensemble machine learning (EML) .
The model's predictive ability also depends on the quantity and quality of soil samples, especially when derived from multiple resources.The soil samples can be biased as the surveyor collects the data from areas with better accessibility.Moreover, the associated spatial bias can degrade the statistical relationship between the soil samples and covariates generated, which can impede DSM accuracy.Therefore, attention must be focused on improving the predictive ability of the calibrated DSM model by optimising the sampling procedures, hyperparameter settings, nature of the target and covariate attributes, associated spatial supports, and selected covariate selection procedure [31].
With the advent of many specialized machine learning algorithms, comparison and validation of the algorithms are essential to screen for models that provide unreliable and redundant results when the predictions are further upscaled to the state level.Additionally, the problem inherent with most machine learning algorithms is their lack of detailed interpretability.In such cases, explicit quantification of the covariate information using the global agnostic tools was implemented in several studies including the permutation feature importance analysis.The PFI analysis was considered for its computation efficiency and limited parameterisation [32].The objectives of this study are (1) to generate digital soil class and attribute maps using a suite of six machine learning algorithms; and (2) to compare and validate the digital soil maps based on the visual interpretation and evaluation metrics derived.

Study Area
The study area included four districts of Eastern Tamil Nadu, India: Ariyalur, Cuddalore, Mayiladuthurai, and Perambalur.The districts were selected because of their unpredictable climate conditions with greatly varied geomorphological and hydrological characteristics.Given the extremities of the associated factors, the current study also indirectly assessed the potential of the digital soil-mapping procedures to mitigate the shortcomings of mapping regional-level landscapes and delineating the soil attributes.The study area extended geographically from 11  1).
The extent of the study area is covered adjacently by the various districts of Tamil Nadu, with coastal regions adjoining the Cuddalore and Mayiladuthurai districts.Ariyalur and Perambalur are considered the inland districts of Tamil Nadu, with nlack and red loam soil as the predominant soil types and a semi-arid climate.In particular, the lands of Ariyalur are characterised by the presence of limestone and ferruginous red loam.At the same time, the Cuddalore and Mayiladuthurai districts have tropical climates with alluvial, sandy loam, and sandy clay loam as the predominant soil textural classes.The study area experiences an annual temperature that varies from 26.81 to 28.01 • C.
Similarly, the annual precipitation of the study area varies from 1351 to 1737 mm from west to east, most of which is contributed by the northeast monsoon downpour.Considering the rain-fed irrigation prevalence of Ariyalur and Perambalur, maize and cotton are the most cultivated crops.In contrast, the Cuddalore and Mayiladuthurai districts are situated in the Cauvery River basin, where the major crops are paddy, pearl millet, maize, and pulses.

Soil Data
The soil data were extracted from the legacy soil map obtained from the National Resource Information System of NNRMS [33].A stratified random sampling procedure was used to derive the sampling sites with soil series as a distinctive stratum.A cumulative fraction of each of the 4479 soil observations was sampled, and around 521 sample points corresponding to the habitation, water bodies, and miscellaneous landform elements were included for the soil class delineation.The organic carbon (OC), pH, and cation exchange capacity (CEC) were the continuous soil attributes considered for the digital soil mapping.Similarly, the categorical soil attributes utilised for the delineation were the soil order, suborder, and great group.

Environmental Covariates
A total of 39 environmental covariates representing the climate, relief, organisms, and parent material were derived from remote sensing and DEM products.Spatially interpolated annual mean temperature and annual rainfall data (global cover) available from WorldClim 2.1 (https://www.worldclim.org/data/worldclim21.html,accessed on 16 March 2021) were downloaded at 30 arc seconds with a temporal range of 1970-2000.Satellite data products with their associated NDVI products in addition to the land use and land cover products were essentially considered the parameters that reflected the organisms' covariates.The Landsat-8 product ('LANDSAT/LC08/C01/T1_SR') depicts the surface reflectance of the study area downloaded from the Google Earth Engine.The data collection was limited to the period from March to May for better delineation of soil properties.A 3-month composite with a median filter was adopted to reduce the effects of cloud cover and shadow effects.

Soil Data
The soil data were extracted from the legacy soil map obtained from the National Resource Information System of NNRMS [33].A stratified random sampling procedure was used to derive the sampling sites with soil series as a distinctive stratum.A cumulative fraction of each of the 4479 soil observations was sampled, and around 521 sample points corresponding to the habitation, water bodies, and miscellaneous landform elements were included for the soil class delineation.The organic carbon (OC), pH, and cation exchange capacity (CEC) were the continuous soil attributes considered for the digital soil mapping.Similarly, the categorical soil attributes utilised for the delineation were the soil order, suborder, and great group.

Environmental Covariates
A total of 39 environmental covariates representing the climate, relief, organisms, A total of 22 secondary terrain attributes were derived from the Shuttle Radar Topography Mission (SRTM) exclusively through the hydro geomorphometric indexes of the SAGA GIS software.In the case of the parent material covariate, the spectral indices that depict the mineralogy of the study area were derived from the Landsat-8 images.Further, existing soil maps indicating the land use and land cover (organisms) [34], physiography (terrain), and geomorphology (parent material) [35] were obtained from the National Remote Sensing Centre at a 1:50,000 scale and were implemented as the covariate layers.Finally, the derived covariate spatial layers were resampled and reprojected to a 30 m spatial resolution using the ArcGIS 10.6 software.The environmental covariates implemented for the soil attribute delineation are depicted in Table 1.

Model Development
The model's training, testing, and spatial depiction of the predicted soil attributes were facilitated through the spatial environment of R. The ithir package and its associated functions for soil-related spatial functions were used in the present study.The visual presentation of the predicted soil attributes was facilitated by ArcGIS software.The packages invoked for the calibration of the machine learning algorithms and their parameterisations are specified in Table 2. Multiple linear regression is the most consistently used parametric measure for prediction.MLR determines the linear relationship between the soil attribute and covariates and predicts the outcome of the target attributes by fitting a linear equation with the model parameters (coefficients) [44].The major constraint of the MLR model is that it does not account for the nonlinear relationship among the variables considered.

Multinomial Logistic Regression (MnLR)
Typically, logistic regression models can describe the binary response variables, where the predictions are defined as the probability of the occurrence (0 and 1) [45].Multinomial logistic regression predicts the probability of the category membership of the soil attributes based on the covariate predictors [46].The present study generated logistic regression models for each soil class with default parameter settings [47].

k-Nearest Neighbour (k-NN)
K-nearest neighbour (KNN) is a simple non-parametric classifier based on the known instance to label an unknown instance based on a distance function [48].The values/classes will be assigned based on the majority classes or average of attributes or on the distance measure implemented, i.e., the Euclidean distance [49].The k-NN classifier usually requires the data to be normalised for the model calibration, as distance-based learners require the covariates to be in a similar range.The centring and scaling of the datasets were facilitated via the 'preProcess' function of the caret package.Further, the k-NN model can be tuned by specifying the number of neighbours (k) within the proximity for the model training.The default parameter setting of the train function with the 'knn' method recursively determines the optimal k values by implementing a bootstrap resampling procedure for the model calibration.A k value of 9 was determined from the model calibration for all the soil attributes based on the evaluations of the RMSE (continuous) and overall accuracy (categorical) obtained for other values of k through cross-validation.

Decision Trees (Regression Trees)
The algorithm works by recursively partitioning the datasets and determining the subset that can be further split until the predetermined termination factor is achieved.Decision tree algorithms are generally non-parametric models, and the split associated with the models can be optimised based on passing a control function.The control function can be invoked based on the 'rpart.control'parameter.The arguments that are necessarily passed through the 'rpart.control'include minsplit (minimum observation in a node for a split to be attempted) and cp (complexity parameter-split that does not decrease the overall lack of fit by the factor of "cp" that is not tried).The 'rpart' function was implemented with a 'minsplit' of 50 observations.Typically, the default arguments of the 'rpart' with a cp of 0.01 with splits based on the "Information" or "gini" index have been determined to be successful at pre-pruning and splitting so that the cross-validation excludes one or two surrogate trees.Still, sometimes it may overprune, especially when the 'rpart' is performed on large datasets with a small range [50].Since splitting the dataset with the default parameter yielded crude and incomparable results, the optimal cost-complexity pruning value (cp) was determined by initially setting the cp value with -Inf.The negative infinity value helps to generate all the maximum possible trees for the datasets.Then, based on the minimum cross-validated error generated, the cp value for pruning was assessed and further fine-tuned regression trees were generated.

Decision Trees (C5.0)
The C5.0 algorithm is a non-parametric and decision tree-based machine learning algorithm that fits the classification trees based on Quinlan's C5.0 algorithm.The C5.0 algorithm was implemented with the control function defined within the model definition.In general, the implication of trail parameters can help in the implementation of a boosted classification tree process, with the results cumulated at termination [51].The C5.0 control function was defined in the current study using CF (confidence factor), minCases (minimum number of samples that must be imparted in at least two of the splits), and earlyStopping (a logical that defines whether the internal method for stopping boosting should be used) argument values of 0.95, 20, and FALSE, respectively.

Naïve Bayes (NB) Classifier
A naïve Bayes classifier is a probability-based statistical classifier based on the assumption that the effect of the covariate value on a given class is independent of the values of the other covariates.Referred to as conditional independence, the assumption is made to reduce the associated computational time.Hence, the classifier is termed "Naive" [52].In short, the classifier assumes that the covariates are completely independent even though some dependency exists between the covariates.The classifier calculates the conditional probabilities of each covariate separately and the a priori chances for each class level.

Support Vector Regression/Machine (SVR/M)
Support vector machine is essentially data classification and a non-parametric technique extended for regression predictions.The SVR/M operates by projecting the data points (support vectors) using hyperplanes and further segregates and groups the datasets, i.e., each segment contains only one kind of data.In cases of SVR/M, the algorithm acknowledges the presence of nonlinearity in the datasets [53].It calibrates the model by limiting the error associated with the base value (principle of maximal margin).In addition, SVR/M implies kernel functions to predict nonlinear problems by projecting the nonlinear vector to high-dimensional spaces [54].Based on the object type of the response variable (factor or not), the model instigates the classification or regression of the proposed datasets.With the default parameters included, the number of support vectors is automatically determined for continuous (2788) and categorical (3133) variables, which decides the overall performance of the support vector regression and classification.

Cubist Regression
The Cubist model is the most popular model structure used because of its ability to model the nonlinear relationships associated with the datasets and is an extension of the M5 tree model.Like regression trees (rpart), the parameters are tuned by passing a control function (cubistControl) or control parameter within the model definition [55].In the present study, values of 100 and 15 were inputted into the rules and extrapolation arguments, respectively.2.4.9.Random Forest (RF) Random forest is a boosted decision tree model made by constructing multiple decision trees during training, which are later consolidated to define one single prediction for each observation in the datasets.The average of the individual trees is computed for regression prediction, and for categorical variables, predictions are made on a majority basis [56].Soil prediction studies consider random forest algorithms for their robust performance and limited need for fine-tuning.The present study implemented fine-tuning by adjusting the ntree (number of trees to be constructed) and mtry (number of covariates selected as candidates at each split) hyperparameters.For spatial prediction of the quantitative soil attributes, an ntree value of 1000 with default mtry values were inputted [57].For the spatial prediction of the qualitative soil attributes, an ntree value of 500 and mtry of 5 were inputted for pruning the decision trees.

Model Validation
In order to mitigate the effects of spatial autocorrelation on the validation of the models, spatial partitioning of the datasets was facilitated based on the random holdback procedure.The datasets were partitioned and 70% of the total was sampled for the training dataset and the remaining 30% for validation.We calculated the validation metrics for the quantitative and qualitative soil attributes separately for each predictive model calibrated.

Quantitative Soil Attributes
For the quantitative soil attributes, the coefficient of determination (R 2 ) [a], concordance correlation coefficient (CCC) [b], root mean square error (RMSE) [c], and bias [d] were implemented for determining the quality of the predictions made by each of the machine learning algorithms.The metrics were calculated using the goof function of the ithir package in R.
where p i and o i denote the predicted and observed values of the soil attributes, ρ denotes the correlation coefficient between the observed and predicted attributes, and o i and p i denote the means of the observed and predicted values of the soil attributes.

Qualitative Soil Attributes
The class assigned by the machine learning model in the classified image was compared with the class of the validation dataset to determine the "correctness" of the classification.The confusion matrix was generated based on the [58] accuracy assessment measures.For the qualitative soil attributes, the overall accuracy (OA), kappa, quantity disagreement (Q) [e], and allocation disagreement (A) [f] were based on the generated confusion matrix.Further, the substantiations were based on the total disagreement (TD) calculated [g].The metrics were derived from the goofcat (OA and kappa) and diffTablej (Q and A) functions of the ithir and diffeR packages, respectively.Due to the shortcomings of kappa, as stated by Pontius and Millones [59], disagreement measures were implemented to study the effectiveness of the model's performance.
where N is the cumulative number of items considered for validation.xii is the diagonal entry of the matrix.x i+ and x +i indicate the sum of row i and the sum of column i, respectively.

Variable Importance Measure
A major limitation of the machine learning algorithms implemented is that all models failed to exhibit the functional relationship between the predictor covariates and soil attributes.To facilitate the determination of the percentage influence of the covariates on the spatial model prediction based on the selected machine learning algorithm, the permutation feature importance (PFI) method was implemented.The PFI analysis was presented by Breiman [56] and was subsequently developed by Fisher et al [60].The analysis is a modelagnostic tool for determining the feature importance of almost every machine learning algorithm implemented, irrespective of the number of covariates implemented.The method estimates the variations in the prediction quality of a single covariate vector.Hence a covariate is deemed important if shuffling its values increases the error of the model, as in this case, where the model heavily relied on the covariate.Further, the model evaluation was also facilitated based on the ability of the model to incorporate the continuous and categorical predictor variables (covariates) for spatial predictions.The differences in the contributions of the covariates for spatial modelling can reflect the ability of the model to mitigate bias and explain the nonlinearity associated with the model trained and tested.The current study included the permutation feature importance for each machine learning algorithm, implicated using the "iml" package in R [61,62].
In addition, the calculations included for determining the contribution of the covariates to the predicted soil attributes for each learner (f ) were provided and the error of each algorithm was estimated as follows: For each covariate, the covariate space was extended by randomly permutated covariates to remove their correlation with the soil properties.The error based on the permutated covariates is as follows: By differencing the error derived through the original covariates and permutated covariates, the associated deviations can be derived through the following function and the values can be converted to a percentage for substantiation: PFI = (Error perm − Error ori )

Descriptive Statistics
Descriptive statistics predominantly define the range of variability associated with the soil properties.The variability of the soil properties is usually attributed to soil-forming factors that are closely related [63].The summary statistics of the continuous soil attributes based on the minimum, maximum, mean, and standard deviation parameters are presented in Table 3.
The study area's soils varied in their range of pH values but were typically neutral to slightly alkaline.The pH values showed a moderate spatial variation with a standard deviation of 0.95 and a mean value of 7.2.The minimum and maximum values of the pH were 4.6 and 9.8, respectively, with a coefficient of variation (CV) of 13.12.The organic carbon (OC) was typically low throughout the study area, which may be attributed to tillage activities, high temperatures, and the high erodibility of the soils.The OC values exhibited a standard deviation of 0.26 and a CV of 53%, with a mean value of 0.49%, exhibiting high variability.The minimum and maximum values of the organic carbon were 0.10 and 1.42%, respectively.
The study area soils also showed variability for the cation exchange capacity (CEC) with a standard deviation and CV of 11.24 and 57.61%, respectively, and a mean value of 19.5 meq/L.Further, the CEC's minimum and maximum values were 3.03 meq/L and 58.1 meq/L, respectively.For all soil attributes, the mean and median values were almost comparable, indicating the normality of the data distribution.The independent validation datasets partitioned using the random holdback procedure were utilised to validate the model's performance.Among the machine learning algorithms, the highest R 2 and CCC for the pH were estimated by the random forest (38%; 0.50) and Cubist (31%; 0.53) algorithms.Similarly, the highest R 2 and CCC for the OC were estimated by the random forest (13%; 0.19) and Cubist (12%; 0.29) algorithms.The RMSE values observed for the pH and OC varied slightly among the models.Compared to the other models, the highest R 2 and CCC for the CEC attribute were estimated by the RF (40%; 0.52) followed by the Cubist (29%; 0.52) algorithms, with the lowest RMSE values estimated by the RF and Cubist algorithms at 8.84 and 9.93, respectively.The bias calculated by the models was low for all models.We found that the differences among the metrics of the models were comparable.The metrics assessed for the quantitative soil attributes are presented in Table 4.

Qualitative Soil Attributes
For the qualitative soil attributes, the spatial prediction of the soil order was predicted, with the highest overall accuracy and kappa values estimated by RF (67%; 0.52) and C5.0 (65%; 0.51).Further, from a comparison of the disagreements stated by the algorithms, the random forest (33.87%) and C5.0 (35.53%) algorithms had the comparably lowest total disagreements, with higher allocation disagreements.Similarly, for the soil suborder, the highest overall accuracy and kappa values were estimated by RF (65%; 0.50) and C5.0 (64%; 0.50).On the other hand, the total disagreements were found to be the lowest for RF (35.13%) and Cubist (36.20%), with higher allocation and lower quantity disagreements.Further, for the great group soil predictions, the highest overall accuracy and kappa values were estimated by RF (65%; 0.50) and C5.0 (65%; 0.53%), with the lowest total disagreement of 35.33% for both the RF and C5.0 algorithms.In addition, the classifiers removed the spatial predictions of certain class categories, generally due to low sampling observations concerning the removed categories.The metrics assessed for the qualitative soil attributes are presented in Table 5.

Visual Assessment
The machine learning algorithms for the spatial modelling and prediction of the quantitative and qualitative soil attributes generated through the raster "predict" function were depicted as their respective attribute maps in Figures 2-7.The predicted continuous soil attribute maps were similar to the existing maps, with the added spatial variations explained through the predictions.The comparison of the prediction maps of the pH, OC, and CEC with the existing maps showed that most intricate spatial variations were explained by the Cubist and random forest algorithms, followed by support vector regression, k-NN, decision trees, and multiple linear regression, with an increasing gradient of all soil attributes from west to east.The predicted OC attribute maps depicted a very low concentration, which can be substantiated based on the slope direction, depth, and elevation of the study area.The lower elevation areas were generally estimated as having lower carbon concentrations due to the increased temperatures [64].The variations in the minimum and maximum values for the soil attributes were usually attributed to the bias and indicated the   The lower elevation areas were generally estimated as having lower carbon concentrations due to the increased temperatures [64].The variations in the minimum and maximum values for the soil attributes were usually attributed to the bias and indicated the The lower elevation areas were generally estimated as having lower carbon concentrations due to the increased temperatures [64].The variations in the minimum and maximum values for the soil attributes were usually attributed to the bias and indicated the underfitting and overfitting of the predictions.The predicted categorical soil attribute maps were almost identical to the existing class maps with variations in the shape and size of the grid clusters, except for the map derived using the naïve Bayes classifier, which depicted evident bias and inconsistencies in the predictions of the implicated categorical soil attributes.In addition, a slight ambiguity in the feature space along the areas of great diversity was also observed for the class prediction maps obtained using multinomial logistic regression and k-NN.Further, the speckled results produced by k-NN due to overfitting might be difficult to interpret.
A larger proportion of the study area was covered by the Inceptisols soil order.The random forest and C5.0-derived soil class maps presented the clearest representation of the class elements followed by SVM, k-NN, and multinomial logistics regression.Although C5.0 and SVM removed three class elements of the great group, they were considered for their potential to adhere to other class elements almost identical to the existing maps.In summary, the predicted soil attribute maps depicted the complex spatial organisation of the variations associated with the existing soil maps and confirmed that all the prediction models, except for naïve Bayes (soil class), could digitally map the soil attributes, but the maps were more accurate with the selected models.The RF and Cubist algorithms for the continuous soil attributes were almost identical to the existing maps, with much greater delineation facilitated by the Cubist algorithm for all the continuous soil attributes considered (Figure 8).
underfitting and overfitting of the predictions.The predicted categorical soil attribute maps were almost identical to the existing class maps with variations in the shape and size of the grid clusters, except for the map derived using the naïve Bayes classifier, which depicted evident bias and inconsistencies in the predictions of the implicated categorical soil attributes.In addition, a slight ambiguity in the feature space along the areas of great diversity was also observed for the class prediction maps obtained using multinomial logistic regression and k-NN.Further, the speckled results produced by k-NN due to overfitting might be difficult to interpret.
A larger proportion of the study area was covered by the Inceptisols soil order.The random forest and C5.0-derived soil class maps presented the clearest representation of the class elements followed by SVM, k-NN, and multinomial logistics regression.Although C5.0 and SVM removed three class elements of the great group, they were considered for their potential to adhere to other class elements almost identical to the existing maps.In summary, the predicted soil attribute maps depicted the complex spatial organisation of the variations associated with the existing soil maps and confirmed that all the prediction models, except for naïve Bayes (soil class), could digitally map the soil attributes, but the maps were more accurate with the selected models.The RF and Cubist algorithms for the continuous soil attributes were almost identical to the existing maps, with much greater delineation facilitated by the Cubist algorithm for all the continuous soil attributes considered (Figure 8).Similarly, when comparing the C5.0 and RF predictions with the existing maps, finer and more detailed boundary delineations were attributed to the C5.0 algorithm (Figure 9).The delineation variations may be attributed to the predictor covariates' contributions to the spatial model calibration.
Similarly, when comparing the C5.0 and RF predictions with the existing maps, finer and more detailed boundary delineations were attributed to the C5.0 algorithm (Figure 9).The delineation variations may be attributed to the predictor covariates' contributions to the spatial model calibration.

Permutation Feature Importance
The percentage contribution of each parameter derived through PFI analysis was abstracted cumulatively and these are depicted in Table 6.

Permutation Feature Importance
The percentage contribution of each parameter derived through PFI analysis was abstracted cumulatively and these are depicted in Table 6.
From Table 6, it can be inferred that the highest overall contributions were from the terrain covariate, which indicates the primary effect of the topographical characteristics on the spatial variability of the soil properties.In general, most models had differing contributions from the parent material and terrain parameters in the model calibration.Between the terrain and parent material covariates, the categorical soil attributes were mostly determined using the terrain parameters.Similarly, the parent material covariate was used the most by the continuous soil attributes.This explains the influence of the nature of the covariate parameter on spatial modelling.Further, MLR showed its linear tendency of data fitting by increasing the incorporation of continuous predictors rather than categorical predictors for its model calibration.Attributable to only two climatic parameters implemented, the influence of the climate covariate on all soil attributes sufficiently explains their importance in predictions of soil properties.The effect of the organisms covariate indicates the constant influence of underlying organisms and other anthropogenic activities on the descriptions of the soil attributes [65].Furthermore, several of the machine learning algorithms exhibited no influence of the covariates on the calibrated model, which further substantiates the higher bias associated with the models discussed.In summary, for qualitative soil attributes, the Cubist model had an almost equal contribution from all the covariates for its spatial predictions.The equal influence of the covariates is reflected in the finer delineations of the spatial variations accounted for by the Cubist algorithms.Further, impartial influences were presented by the other machine learning algorithms for each soil attribute.
The terrain covariate facilitated a greater influence of the pH soil attribute, followed by the parent material, organisms, and climatic parameters.The influence of the parent material in addition to the terrain attributes can explain the spatial variability of the pH based on the composition of soil-forming materials.The contribution of the organisms covariate was found to be the second highest for the OC after the terrain attributes, which substantiates the generalised influence of the organisms on the soil carbon and organic matter contents.More specifically, the effects of Landsat band 3, LULC, and NDVI on the OC prediction were higher, which has been reported in several studies [66,67].Similar to those of the OC, the predictions of the CEC showed a stronger influence of organisms than parent material, which contrasts the order of the influence exhibited by the covariates for the pH and CEC soil attributes.In addition, in the cases of the categorical variables, the terrain attributes had a greater influence, followed by parent material and organisms for the C5.0 algorithm.However, the RF algorithm exhibited a contrasting order, with terrain as the most influential, followed by organisms and parent material.Moreover, the inclusion of the parent material in the C5.0 algorithm explains the segregations and almost finer delineations among the classes.This might further substantiate that most soil classifications were based on parent material characteristics [68].

Model Efficiency and Performance
Model performance is generally based on the nature of the datasets implicated; the parameters depicted; and the complexity, consistency, and structure of the model proposed [69].Based on the evaluation metrics assessed for the soil attributes, discussions are herein presented for the comparison of the machine learning algorithms.Besides the metrics specified, the selections were also made by comparing the quality of the visual delineations provided by a particular machine learning algorithm.Since a standardised measure of model efficiency has not yet been determined, several evaluation metrics were assessed to capture the variations in the model training.In general, the comparison results and the resulting predictions of the study might vary from other studies.Although speculating about the reasons behind the differences is difficult, the differences could be because of the varying nature of the study area and the quantity and quality of the soil attributes.

Quantitative Soil Attributes
In general, the lower R 2 and CCC values presented by the algorithms were due to the higher variability and complex interactions depicted by the environmental covariates.The high variability can be explained by the management practices, vegetation, and climatic factors influencing the characteristics of the study area [70].Furthermore, several studies indicated a similar range of R 2 results, stating that R 2 values < 0.50 are common [71], considering the spatial prediction of the continuous soil attributes.Previous studies on continuous spatial soil predictions resulted in R 2 ranges not exceeding 70% [72][73][74][75][76][77].
Other research on digital soil mapping of the continuous soil attributes in India presented similar spatial soil predictions in the watershed regions of Karnataka, India.The R 2 values of the predictions for the pH, EC, and OC using random forest regression (RF) yielded 46%, 16%, and 19%, respectively, for the Aland watershed and 30%, 7%, and 12% for the Guppi watershed [78].Recent works evaluating digital soil-mapping approaches indicated that the variations exhibited by the R 2 measure ranged from 9% to 48% for predicting soil fertility nutrients [79].The low R 2 values were due to the lower range of the soil attributes and non-significant spatial variations of the inputted soil datasets.The differences in the results of the evaluation metrics due to scale and range have also been investigated in other studies [80] and it has been stated that the lower values might indicate the insufficiency of the covariates to explain the soil attributes aside from their range and scale [81].
The lower RMSE measure indicates the reasonable performance of the spatial prediction models, which was expected using the sampled data.Further, when comparing the bias, the pH and OC attributes provided an unbiased prediction, with k-NN and MLR suffering a moderate bias when predicting the CEC attribute.When comparing the evaluation metrics derived using the test datasets, for the spatial prediction of the continuous soil attributes, the RF model consistently made the most accurate predictions (with the highest R 2 and lowest RMSE).Furthermore, the Cubist model also performed efficiently in addition to the RF algorithm.Similar results were found in previous studies [76][77][78][79][80][81][82][83], which stated the RF and Cubist algorithms as being among the most efficient soil organic carbon prediction models.However, these studies showed differences in the R 2 and other evaluation metrics, which might be due to the different sampling designs, scales, and ranges of the soil attributes and covariates incorporated.
Several studies indicated that the random forest approach was considered suitable for spatial soil predictions because of its ability to handle many covariates, limited samples, and low need for hyperparameter tweaking [84].Furthermore, following RF and Cubist, regression trees (DT) reported comparable prediction metrics.The lower accuracy of support vector machine can be attributed to the larger sampling data implemented.Thus, based on the evaluation metrics and visual assessments, among the machine learning algorithms compared, the RF and Cubist are considered efficient models for predicting the spatial variations in the pH, OC, and CEC for the proposed study.

Qualitative Soil Attributes
The evaluation metrics assessed for the qualitative soil attributes indicated that all the prediction models performed sufficiently in explaining the variability of the soil attributes, except for multinomial logistic regression and the naïve Bayes classifier.In addition to estimating the kappa and overall accuracy (OA), disagreement measures adopted in several recent studies were also estimated in the comparison of the machine learning algorithms for their robustness in depicting the variations in the predicted class [85][86][87][88][89].
Based on the evaluation metrics assessed, the random forest and C5.0 algorithm outperformed other models and predominantly reported the most accurate results for every soil class prediction implicated.Previous studies found that the RF [85,90] and C5.0 [91] algorithms provided the most accurate results and similar results were found by Zeraatpisheh et al. [92], which resulted in multinomial logistic regression for higher soil taxonomic units (due to the inclusion of the AIC-based predictor variable selection) and RF for lower taxonomic units as the best-performing models.Other studies [93] also resulted in RF (kappa = 0.55) being a more efficient model than multinomial logistic regression (kappa = 0.33).The prediction of soil class units in Iran also reported similar results [94], with the RF model performing the best at higher taxonomical units with OAs of 0.87 and 0.52 and kappas of 0.57 and 0.38 for order and suborder class predictions, respectively.The study also reported the increased prediction accuracy of RF compared to the proposed ensemble model.
The overall accuracies of the DSM maps selected based on the evaluation metrics for each soil taxonomic level ranged from 65% to 67%, which was the recommended level reported in the previous studies [95,96].The results obtained were generally found to be comparable to or higher than the previous studies.For example, soil class prediction [89] in Brazil resulted in an overall accuracy of 0.54 for the RF algorithm.Similar results to our study area have been reported for predicting soil class units, with overall accuracies of 68%, 63.6%, and 58.8% achieved through RF, classification trees, and multinomial logistic regression, respectively [97].
When comparing the disagreement components between the algorithms, the validated naïve Bayes classifier had a higher number of disagreements.The total disagreement measure was found to be the lowest for the RF and C5.0 algorithms, with a higher allocation disagreement.The major limitation of the soil class prediction associated with the C5.0 and support vector machine was that the fitted model failed to classify some of the soil classes, which was mainly attributed to low sampling frequencies related to the particular soil class element.In general, classes with lower sampling frequencies were predicted less accurately due to the limited observations associated with segregating such classes in the feature space.
Further, it was evident that the increase in the number of class categories had no significant effect on the resulting prediction accuracies.From the assessment of the percentage contributions of the covariates, it was observed that NB and multinomial logistic regression did not consider the parent material covariates (spectral ratios).Although the RF and C5.0 algorithms provided comparable results in observing the evaluation metrics, based on the visual assessments, all the machine learning algorithms derived a comparable interpretation, except for the NB classifier.

Potential Applications at the Farm Level and Policy Decisions
Climate change and an increasing population demand a well-established soil database to increase productivity, reduce emissions, and create a safe environment for future food security.Soil databases are used to assess soil conditions to mitigate global concerns regarding environmental sustainability.In DSM, several studies have addressed the spatial prediction of the contiguous soil attributes viz.pH, CEC, organic carbon, etc., compared to the categorical attributes [98].The block-level soil information with legacy soil maps also lacks detailed soil information, whereas the soil information predicted and downscaled through DSM helps farmers with their day-to-day management decisions [99].The highresolution digital maps can be used to assess crop suitability, soil and land management practices, site-specific fertilizer recommendations, the integration of spatial variability in VRT systems, and irrigation scheduling, subsequently reducing operational costs by optimising the inputs.The proper selection of crops, effective soil and land management practices, and balanced fertilization using DSM will result in reduced input costs and increased farm outputs, thereby improving the net income of farmers in addition to ensuring their livelihoods through crop cultivation [100].
The results from this study depict the strength of digital soil-mapping techniques to generate precise information pertaining to soils.The knowledge can be transferred to farmers through mobile applications and other Information and Communication Technology (ICT) tools on digital platforms.Agro-technology transfer has reached most smallholder farmers in Tamil Nadu, which account for nine million farm holdings, with large variability in the spatial attributes of soil.The further development of soil-based ICT tools in local languages will empower farmers in terms of farm management, resource recycling, and adopting strategies to manage soil constraints [101].The technology has the potential for upscaling across the different geographies of India.

Conclusions
The current study was based on the soil information derived from the existing soil maps, which were based on the stratified random sampling procedure.From the class prediction results, it can be inferred that the strata-based sampling procedure may limit the depiction of the transition variations near the strata boundaries.The effect of the nature and range of the soil and predictor variables had a considerable influence on the resulting predictions.Irrespective of the number of covariates considered, it was observed that the continuous scale predictors had a comparably greater influence than the categorical predictor variables on the model calibration, except for NB and multinomial logistic regression.The key findings of our research are: 1.
From the suite of machine learning algorithms compared for mapping three continuous soil attributes and three soil taxonomical units, it can be inferred from the visual interpretation that all the algorithms provided a reasonable spatial prediction of the soil attributes, except for the NB classifier.2.
Among the ML models, the tree-based ensemble (RF) and rule-based models (Cubist and C5.0) efficiently predicted the soil properties spatially.The efficiency of the models can be further increased by adopting appropriate sampling and tuning methods.

3.
The probability-based machine learning algorithms (NB and multinomial logistic regression) produced biased and crude results, which might have been due to the inclusion of continuous covariate predictors for the model calibration.Hence, the transformation of covariate predictors and the implementation of suitable variable selection techniques can improve the accuracy of the predictions.

4.
The results of k-NN, SVM, and SVR with default parameterisation were used to deduce the potential of the models, which was further increased by the appropriate tuning of parameters.The time required for the computation of SVM/SVR parameteri-sation is tedious for a larger dataset; hence, it is recommended for limited observations at a smaller scale.
This study also revealed that the spatial predictions of the soil attributes at regional levels with contrasting climates and landscapes were substantial.The tree-based models can be further enhanced and utilised for spatial predictions for producing digital soil maps at other regional and state levels.The digital soil maps generated at the higher spatial resolution will help farmers to execute effective farm planning through the choice of efficient crops, the scientific scheduling of irrigation and need-based nutrient application, effective management practices in the case of soil constraints, and the facilitation of the transformation to digital agriculture.

Figure 1 .
Figure 1.(a,b) Locational information of the study area in Tamil Nadu; (c) Study area map; (d) Digital elevation model (DEM); (e) Geomorphology.

Figure 2 .
Figure 2. Existing and predicted maps of pH using different machine learning algorithms.

Figure 3 .
Figure 3. Existing and predicted maps of OC using different machine learning algorithms.

Figure 2 .
Figure 2. Existing and predicted maps of pH using different machine learning algorithms.

Figure 2 .
Figure 2. Existing and predicted maps of pH using different machine learning algorithms.

Figure 3 .
Figure 3. Existing and predicted maps of OC using different machine learning algorithms.Figure 3. Existing and predicted maps of OC using different machine learning algorithms.

Figure 3 .
Figure 3. Existing and predicted maps of OC using different machine learning algorithms.Figure 3. Existing and predicted maps of OC using different machine learning algorithms.

Figure 4 .
Figure 4. Existing and predicted maps of CEC using different machine learning algorithms.

Figure 5 .
Figure 5. Existing and predicted maps of soil order using different machine learning algorithms.

Figure 4 .
Figure 4. Existing and predicted maps of CEC using different machine learning algorithms.

Figure 4 .
Figure 4. Existing and predicted maps of CEC using different machine learning algorithms.

Figure 5 .
Figure 5. Existing and predicted maps of soil order using different machine learning algorithms.Figure 5. Existing and predicted maps of soil order using different machine learning algorithms.

Figure 5 .
Figure 5. Existing and predicted maps of soil order using different machine learning algorithms.Figure 5. Existing and predicted maps of soil order using different machine learning algorithms.

Figure 6 .
Figure 6.Existing and predicted maps of suborder using different machine learning algorithms.

Figure 7 .
Figure 7. Existing and predicted maps of the great group using different machine learning algorithms.

Figure 6 .
Figure 6.Existing and predicted maps of suborder using different machine learning algorithms.

Figure 6 .
Figure 6.Existing and predicted maps of suborder using different machine learning algorithms.

Figure 7 .
Figure 7. Existing and predicted maps of the great group using different machine learning algorithms.

Figure 7 .
Figure 7. Existing and predicted maps of the great group using different machine learning algorithms.

Figure 8 .
Figure 8. Visual assessment on the RF and Cubist predicted continuous variables maps.Figure 8. Visual assessment on the RF and Cubist predicted continuous variables maps.

Figure 8 .
Figure 8. Visual assessment on the RF and Cubist predicted continuous variables maps.Figure 8. Visual assessment on the RF and Cubist predicted continuous variables maps.

Figure 9 .
Figure 9. Visual assessment of the RF and Cubist algorithms' predicted continuous variables maps.

Figure 9 .
Figure 9. Visual assessment of the RF and Cubist algorithms' predicted continuous variables maps.

Table 1 .
List of environmental covariates.

Table 2 .
Machine learning algorithms utilised for comparison along with their hyperparameters.

Table 3 .
Descriptive statistics of the continuous soil attributes.

Table 4 .
Prediction evaluation metrics assessed for continuous soil attributes.

Table 5 .
Prediction evaluation metrics assessed for categorical soil attributes.

Table 6 .
Percentage contributions of the covariates to the soil attribute predictions implied through permutation feature importance.

Table 6 .
Percentage contributions of the covariates to the soil attribute predictions implied through permutation feature importance.