Predicting Soil Textural Classes Using Random Forest Models: Learning from Imbalanced Dataset

: Soil provides a key interface between the atmosphere and the lithosphere and plays an important role in food production, ecosystem services, and biodiversity. Recently, demands for applying machine learning (ML) methods to improve the knowledge and understanding of soil behavior have increased. While real-world datasets are inherently imbalanced, ML models overestimate the majority classes and underestimate the minority ones. The aim of this study was to investigate the effects of imbalance in training data on the performance of a random forest model (RF)


Introduction
Soil texture is one of the most important soil properties that governs many other physical, chemical, and biological properties of the soil.Soil aeration, nutrient and water availability, and heat retention are all controlled by soil texture.These properties in turn all significantly affect plant growth, development, productivity, and quality.In order to manage practices such as tillage, irrigation, fertilization, and pesticide applications, most agricultural soils are categorized into heavy and light soil textures.Heavy soils have a high clay content, whereas light soils have a high content of sand [1].
Determining soil texture class is the most commonly requested laboratory analysis at the US national soil survey laboratories of all soil physical and chemical properties [2] owing to its natural relatively static behavior.Soil textural class is specified quantitatively by determining the proportions of the different sized particles (sand, silt, and clay) of soil mineral fraction in relation to a standard USDA soil textural triangle [3].
Understanding the spatial distribution of soil properties within farming areas is indispensable for agricultural activities and provides valuable information for improving soil management [4].Environmental modeling instantly allows decision-makers to make informed soil management decisions at various scales from local to global, but a necessary pre-requisite is access to an accurate, high-resolution soil map [5].Soil texture has inherently high spatial variation throughout landscapes in both the horizontal and vertical domains [6].Therefore, the spatial patterns of soil texture are informative for various applications and disciplines.
Traditional vector-based or polygon maps of soil types and properties are cost and time-intensive to produce [7].Moreover, the classic sampling techniques and mapping approaches produce soil property maps with a spatial resolution that is not particularly useful for ecosystem management [8].Although the most typical, a frequently used approach to understanding the complex changing behavior of soils even at unvisited sites in a vast area has been geostatistical methods; however, this has changed in recent years.With the increase in the availability of tools which enable researchers to merge GIS and geostatistical operations along with free remote sensing and relief data at the global scale with fine resolution, the pixel-based technique of digital soil mapping (DSM) has become increasingly preferred as an alternative to overcome some limitations of previously used soil mapping approaches.The spatial resolution of soil properties, however, is restricted by the soil sampling approach and the mapping uncertainty becomes largely unknown.Key advances in recent years have occurred due to the increased accessibility to soil survey and pedological big data.Today, a broader variety of soil properties can be mapped in conjunction with new training datasets, as novel digital soil property maps [9,10].
The Global Soil Map project was launched in 2009 as a reaction to the increasing demand for detailed soil property maps [9].It aims to map soil texture and other soil properties at a 90 m spatial resolution.However, only a few countries have tried to produce nationwide fine-resolution soil texture and soil particle size fraction maps using DSM techniques [7,[10][11][12][13][14].
Prediction model effectiveness relies highly on data characteristics such as the number and type of soil properties, size of the sample, and data heterogeneity [15].The variability of soil texture is characterized by its correlation with surrounding environmental conditions namely climate, topography, vegetation cover, parent materials, and soil type as well as soil forming factors that directly and indirectly influence the processes governing soil texture spatial variation.DSM uses easily obtained auxiliary data related to these key environmental conditions to produce soil property maps [16].It attempts to quantitatively fit the relationships between soil properties or soil classes and associated environmental conditions through computer-assisted prediction algorithms using different models such as generalized linear models, decision trees, neural networks, fuzzy approaches, and geostatistics [16] as well as machine learning (ML) approaches.Several ML algorithms have been widely explored for DSM such as artificial neural networks [17][18][19], support vector machines [20][21][22][23], boosted regression trees [10,11,24], random forests [7,8,21,[25][26][27], and hybrid approaches [28].
Real-world datasets such as bioinformatics datasets are reported to be inherently imbalanced.A dataset is imbalanced when the class distribution is not uniform among the classes or there are far more occurrences of some classes and other classes are rare.Machine learning works poorly with imbalanced datasets which affect the accuracy of classification predictions [29].This is because most ML algorithms are biased toward the majority classes and minority classes are ignored.Several solutions such as data-level, algorithm-level, and ensemble approaches have been suggested to handle the imbalanced data problem [30] and thus improve the imbalance ratio.Although each method has its pros and cons, data-level solutions are well known and simpler to apply than techniques that resample the observed data either by random over-sampling of minority classes or undersampling of majority classes to distribute the data evenly between classes [8,31,32].In many cases, soil samples are not equally distributed between textural classes; hence, ML algorithms mostly provide better estimates for the majority class.Therefore, in modeling soil texture classes, it is important to employ a strategy to deal with the imbalanced class issue.
The aim of this study was to (a) investigate the effects of balance and imbalance in the textural classes of the training dataset, and (b) produce an accurate map of soil textural class for agricultural lands in northern Iran using DSM and ML approaches.This study is unique and novel research because while the effects of imbalance on soil type predictions using ML methods have been previously investigated, the effects of imbalance in training data have not been thoroughly investigated on predictions of soil textural class in previous DSM studies.

Study Area
The study area is located in the northern part of Iran and belongs to the provinces of Mazandaran and Guilan (Figure 1a) which cover an area of 37,850 km 2 .The region is covered by dense temperate deciduous forest and crop fields, while the northern part borders the Caspian Sea and the Alborz mountains.The climate is mild and humid with mean annual rainfall ranging from 879 to 1200 mm and reaching as high as 1900 mm in the Southwest of the study area.The precipitation increases with increasing elevation and is highest in the northwest of the study region along the Caspian Sea coast [33], Figures 1  and 2).The region usually does not experience dryness [34].The elevation varies between −27 and 5610 m above sea level.The geomorphology varies from high mountainous areas to lowlands along a south-north trajectory.The population is dense due to diverse natural resources and agricultural activities.

Soil Data
The soil samples were collected from various research projects.Most of these projects focused on agricultural fields under the supervision of the Soil Genesis and Classification Department of the Soil and Water Research Institute of Iran.A total of 6100 soil samples were collected from the surface layer (0-30 cm) and analyzed between 2000 and 2020 (Figure 1b).Sand, silt, and clay size fractions were measured using the hydrometer method [35] and soil textural class was determined based on the USDA textural classification system.
The soil samples represented a wide range of soil textures varying from clayey to very sandy soils (clay: C, clay loam: CL, loam: L, loamy sand: LS, sand: S, sandy clay loam; SCL, sandy loam: SL, silty clay: ZC, silty clay loam: ZCL, silt loam: ZL) with the clay content ranging from 0% to 70%, sand content ranging from 0% to 99%, and silt content ranging from 1% to 93%.Table 1 shows the percentage of each soil texture class, and Figure 2 illustrates the distribution of them in the original dataset.The high variation in soil texture could be related to the heterogeneity of parent materials, topography, morphology, erosion, land-use management, debris flows, and coastal sediments in the study area [36].

Balanced Data
Imbalanced data are often present in classes of many real-world observations.This problem sometimes causes poor performance of ML algorithms because the algorithms are biased to the majority classes and overlook the minority classes [37].As a consequence, minority class instances are often misclassified as the majority ones [37].Moreover, it has been shown most ML models overestimate the majority classes and underestimate the minority ones [38].
Several techniques have been developed to address the class imbalance issue.Among them, resampling techniques have been reported as effective solutions to deal with the problem by obtaining a more balanced data distribution [31,[39][40][41].A well-known and widely used resampling strategy is the synthetic minority oversampling technique, SMOTE [42].The SMOTE algorithm uses an interpolation method to synthesize new examples from the existing examples, thus the minority class is over-sampled.In addition, this technique can be combined with a certain percentage of under-sampling of the majority class.To generate a synthetic sample, first, a random sample is selected from the minority class and its k nearest neighbors are found.Then, one of the randomly selected neighbors is chosen and a synthetic sample is generated at a point that is randomly selected between the two samples in feature space.

Environmental Covariates
In DSM, soil properties and classes are usually represented by environmental variables that are used as surrogates for soil forming factors including climate, relief, parent material, organisms, and time [16].These environmental variables can be extracted from different sources such as remote sensing images, digital elevation models (DEMs), climatic maps, and legacy soil maps.Here, environmental covariates were extracted from remotely sensed images, DEMs, and bioclimatic data (Table 2).Remotely sensed data mostly represent organisms and parent materials while terrain-based data mostly represent the relief factor.Since different soils show different spectral behavior, spectral bands and their combinations have been used successfully in DSM to model and map soil properties and classes.Remotely sensed data including spectral bands and their combination as indices, were extracted from Landsat 8, Sentinel 2, and MODIS images.There are several terrain attributes that can be extracted from DEMs which are directly or indirectly relevant to soil classes.Terrain-based attributes were obtained from a DEM with 30 m spatial resolution.
Climate is one of the most important factors controlling soil and vegetation variations.Several climatic parameters can be used in DSM to model soil classes and/or properties.Bioclimatic variables (Table 2), which are derived from the monthly temperature and rainfall data, are useful variables in ecological modeling.They represent annual and seasonal trends in climatic parameters, as well as extreme climatic conditions.There are 19 bioclimatic variables available at https://www.worldclim.org(accessed on 29 August 2021) which were used in this research.One disadvantage of ML algorithms is that, unlike geostatistical techniques, the spatial trends in the data cannot be included in the modeling process, therefore the predictions may contain spatial bias [25,43].Several solutions have been proposed to deal with this issue.For example, it is proposed that including x and y coordinates and distances to the corners and middle of the study area in ML modeling can help resolve this issue [31,44].Another proposed approach is calculating data layers with buffer distances to each soil observation and then using these layers as covariates [25].Here, to take the spatial trend into account in modeling soil textural classes, we used distance layers (Table 2).All covariate layers were resampled to a grid with a 30 meters resolution.

Machine Learning
To model soil textural classes, environmental covariates were used as inputs for a random forest (RF) model.RFs are ensemble learning algorithms used for both classification and regression tasks, and were first introduced by [45].The algorithm uses a treebased approach for learning and rather than a single decision tree being trained, several decision trees are trained.During the training phase, bootstrap samples are taken from the training data set and each sample is used to create and train a decision tree.Explanatory variables (here environmental covariates) are introduced into each tree and the target variable is then estimated.For regression problems, the estimations of all trees in the forest are averaged to obtain one single estimate of the target variable while, for the classification problems, the class with the majority vote is considered as the final prediction.
One advantage of RFs is that the algorithm provides the relative importance of the environmental covariates used in modeling.In tree-based algorithms, explanatory variables are used to split the decision nodes.The most important variables are used as splitting criteria at the root nodes, while less important variables are used at nodes further from the root node.Therefore, the farther the decision node is from the root node, the less important the variable is and vice versa.In other words, in a forest of decision trees, the most important variables are used at most root nodes as splitting criteria.Accordingly, researchers [46] developed a method called minimal depth (MD) that simply determines variable importance by the position of the variables in the decision trees and thus, is based on the decision tree structures.If the mean minimal depth (MMD) for a variable is small, then a large number of observations are divided into groups on the basis of this variable.

Accuracy Assessment
A 10 k-fold cross-validation strategy was employed to evaluate the RF model performance for modeling soil textural classes on both balanced and imbalanced datasets.To perform the validation, the dataset was divided into k folds and at each run, k-1 folds were used to train and the remaining fold was used to validate the model.This process repeats k times; thus, all folds were used for both training and validating the model.
Five validation statistics, including overall accuracy (OA), the kappa index (K-index), precision, recall, and F-score from a confusion matrix were calculated (Equations ( 1)-( 5)).In brief, OA is the ratio of all soil textural classes which classified correctly compared to all the data used.A higher overall accuracy indicates a high model performance.The kappa index is a robust index that considers the probability that a class is classified by chance [47].It simply measures the proportion of all possible cases of presence or absence that are correctly predicted by a model after accounting for chance predictions.A higher kappa index indicates a high model performance [48,49].Precision is the proportion of correctly classified predicted instances, recall is the proportion of those instances that are correctly classified, and the F-score is the harmonic mean of precision and recall (Equations (1)−( 5)).A confusion matrix is a table that represents the relationship between the positive class and negative class predictions [50].In other words, it determines which of the following categories the predictions fall into: true positive (TP), true negative (TN), false positive (FP), and false negative (FN).where P is the expected agreement between the classifier and the true values.

Balanced and Imbalanced Datasets
The dataset was treated using balanced and imbalanced approaches to investigate how these influenced predictions.The number of samples for each soil textural class varied due to different numbers of observations in each class for the imbalanced dataset (Figure 3a).The data were randomly generated and equally divided (600 samples) within the soil textural classes to produce a balanced dataset (Figure 3b).The majority of classes were L (1529 observations) and CL (1487 observations) and these accounted for more than half of the imbalanced dataset (Figure 3a).Balancing the distribution of soil textural classes causes the oversampling in minority soil texture classes (e.g., LS, S, and ZC).However, the number of observations for the C, CL, L, and ZL classes decreased due to under-sampling.The changes in the number of soil textural classes through resampling treatments are shown in Figure 3.

Accuracy Assessment
Resampling with the SMOTE method increased OA, precision, K-index, recall value, and F-score compared to when the imbalanced dataset was used (Table 3).Results clearly indicate that the RF model was trained better by the balanced SMOTE resampled dataset than the original imbalanced dataset.Using balanced data increased the overall accuracy by 15% (improving from 44% to 59%) and the kappa index by 22% (improving from 0.3 to 0.52) compared to the imbalanced dataset (Table 3).According to [51], this value of the kappa index represents a substantial agreement between observations and predictions.By balancing the data, model precision for all textural classes showed an increasing trend except for SCL.Recall values indicated that when balanced textural classes were modeled by RFs, more classes were classified correctly (except for CL).The number of samples in each textural class showed a significant effect on the number of correctly classified classes (i.e., recall values).In other words, the smaller the number of samples in an imbalanced textural class, the greater the number of correctly classified samples in the balanced data.As can be seen in Table 3, classes with a few samples in the imbalanced dataset (i.e., LS, S, SCL, and SL) showed the highest increase in recall values after balancing (17.75, 53, 6.37, and 7.37%, respectively).Recall values also indicate that minority classes that were overlooked (S) or misclassified as majority classes (LS, SCL, and SL) when trained using the imbalanced dataset, were well predicted with recall values of 0.51 to 0.71 using the balanced dataset (Table 3).A similar trend was observed for the F-scores.To our knowledge, in previous studies, only soil taxonomic classes have been studied as imbalanced datasets.In line with our findings, a positive relationship has been reported between the sample size and the prediction accuracy of individual soil classes in previous studies [7,52].A relatively poor prediction accuracy for soil taxonomic classes with a small sample size has been reported in central Iran [53].In modeling reference soil groups of Iran, it was indicated that reference soil groups with a larger number of samples were modeled more accurately by ML algorithms than classes with small sample sizes [36].Researchers [54] used 11 predictive models to model USDA subgroups and showed that over-sampling of the minority classes resulted in an increase in OA for some of the models.In addition, they indicated that the overlooked minority classes in the imbalanced data were predicted by over-sampling and were evident in the final map.Silva et al. [55] tested three models for predicting soil taxonomic classes in the USDA classification system.They found that the RF model which included additional sampling based on photo-interpretation showed the best performance among the studied models.Sharififar et al. [32] used Markov chain random fields in combination with an oversampling strategy to model USDA soil great groups in northwestern Iran.They showed, regardless of the models, that imbalanced data on soil taxonomic classes could affect both prediction and simulation model outputs.Moreover, they reported an improvement in the performance of the RF model after applying resampling techniques.

Covariate Importance
Minimal depths and their mean for only the 15 most important environmental covariates used for modeling imbalanced and balanced soil textural classes are presented in Figure 4. Covariates with lower MMD values show higher abilities for predicting soil textural classes.Distance layers showed a significant contribution in modeling soil textural classes when the imbalanced dataset was used (Figure 4a).This shows the high importance of spatial variance and uncertainties in the predictions, which emerging from different number of classes and also distances in the imbalance dataset.For the balanced data (Figure 4b), the contribution of covariates is different from that obtained for the imbalanced data (Figure 4a).In contrast to when the imbalanced data were used, terrain attributes and remotely sensed data were more important in predicting textural classes than distance layers when the balanced data was used (Figure 4b).The contribution of terrain-based attributes and remotely sensed data to soil texture variations has been reported in several studies [28,[56][57][58].Remotely sensed data are used in DSM mostly as proxies of parent materials and organisms/vegetation [59] which significantly affect the spatial distribution of soil texture.Terrain attributes affect water distribution and redistribution, soil erosion and deposition [60,61], and consequently soil texture.Distance layers and bioclimatic variables showed lower contributions for predicting soil textural class variations after applying the resampling strategy.
Results showed that the most important variables varied significantly when the balanced dataset was used compared to when the original dataset was used (Figure 4a,b).This is because when unbalanced data are used, the number of samples in minority classes and the associated covariates is few, and therefore, these covariates were not considered important explanatory variables by the RF algorithm.In other words, in minority soil classes, the covariate variations do not usually cover soil variations completely.Therefore, by over-sampling, the minority classes and consequently increasing the number of associated covariates, the relative importance of some covariates in predicting textural classes may increase.Overall, results indicate that balancing can increase the relative importance of some important covariates that are overlooked by the model due to the small number of observations in certain textural classes.
A problem which may arise from an imbalanced dataset is selecting the optimal feature subsets.Because a subset of features that is optimal for modeling a minority class may not be optimal for the majority class and vice versa.Therefore, the selected features can be significantly affected when a dataset with imbalanced class distribution is used.This is an important issue that is not considered by traditional feature selection methods and may lead to poor prediction for minority class examples [62,63].By applying balancing techniques, the results of feature selection algorithms will not depend on the class distribution.However, we should acknowledge that the inadequate testing of governing soil forming factors and their influence on soil texture variations in the region cannot be directly inferred from the feature importance analysis.Wadoux et al. [64] stated that pedological knowledge discovery based on ML is to be treated with caution.This is mainly because the predictive power of the features is based on the ML models and accuracy metrics (e.g., RMSE, ME, and R 2 ).

Predicted Maps
The maps of soil textural classes produced using RF models based on the balanced and imbalanced datasets are shown in Figure 5.Ten main USDA soil texture classes were identified in the study area.Importantly, the main differences between the two maps (Figure 5a,b) are because of the using balanced and imbalanced data in modelling approach.For example, the predominant soil textures identified when using imbalanced data were L and CL (Figure 5a), while SL prevailed in the areas with elevations higher than 500 m above level using balanced data (Figure 5b).The visual assessment of these maps further shows that the imbalanced data was best at predicting the distribution of the L and CL soil textures which agrees with the observed dataset shown in Figure 2.This result was expected because the greatest number of the observations were allocated in the two soil classes of L and CL (Figure 2), leading to decrease of the RF power to detect the other soil texture classes (e.g., SL).Nevertheless, the RF trained by balanced data performed better compared to that obtained by imbalanced data.Several other studies have reported that resampling techniques, i.e., over and under-sampling, improved classification accuracy [32,65,66].
We should consider the main purpose of resampling techniques is not improving the accuracy of models but enhancing the predicted maps, particularly minority soil classes.Comparing the maps indicated that soil textures containing significant amounts of sand including LS (coded in dark blue), S (coded in pale blue), and SCL (coded in pale green) are almost excluded from the map produced by the RF model using imbalanced data.While the area with sandy soils is small, it is important to identify it because light-textured soils require specific management strategies for agricultural production and other land uses.Compared to heavy soil textures, sandy soils have coarse particles.According to the balanced data, the light soil textures were particularly found in the coastal areas which were covered by sand dunes, while the imbalanced approach could not delineate these areas properly.This finding again suggests that balanced training data exhibit better classification rates in comparison to the imbalanced data.Furthermore, both approaches showed that the soil texture in the eastern part of the study area had silty soil texture classes.This could be related to regional-scale loess-paleosol deposits which developed due to wind erosion during Pleistocene climate change when there were relatively dry phases [67].In general, the clayey soil textures are mostly located in the areas under rice cultivation where fine soil particles have been moved by floods from highlands to lowlands in the distant past or have originated from in situ weathering of parent materials with a fine particles size.

Conclusions
This study provided some insight into soil texture mapping with imbalanced and balanced datasets.The effect of imbalanced classes in the training dataset when using RF models and employing DSM approaches was the main focus of this study.The effects of balance and imbalance were investigated through pre-treatment of the data.The data treatment with under-and over-sampling improved the prediction of the minority S and LS soil texture classes.
The imbalanced dataset achieved relatively poor accuracies compared to the balanced approach, but no classes were lost.The increase in the accuracy of predictions was substantial with accuracy increases ranging from 9% to 50%, except for SCL whose accuracy decreased by more than half.Zero correct predictions of sand as the most minor class were mainly due to the small number of observations in the dataset.
The RF model was found to produce reasonable outputs according to all accuracy metrics when the SMOTE dataset resampling method was used, but could not detect important covariates in the imbalanced dataset and therefore variation of minority soil classes was not entirely represented.Distance layers and climate showed a significant contribution in modeling soil textural classes in terms of MMD when imbalanced and balanced datasets were used, respectively.
Based on the results, balancing the dataset is a promising alternative to the conventional imbalanced approach of soil textural mapping since computation is not complicated.However, further investigation is needed to improve the accuracy of the applied novel models.Precise prediction of static soil properties in the form of spatial maps is very important for decision-makers.These maps have valuable information to determine appropriate soil management strategies which leads to more sustainable use of available resources.

Figure 1 .
Figure 1.Location of the study area within Iran (a) and location of the soil samples collected within the study area (b) plotted over a 90 m SRTM DEM showing variations in elevation.

Figure 2 .
Figure 2. Distribution of soil textural classes within the study area.

Figure 5 .
Figure 5. Spatial distribution of the most probable soil textural classes produced using: (a) RFs with imbalanced data and (b) RFs with balanced data in the southern Caspian Sea area of Northern Iran (C: clay; ZC: silty clay; ZCL: silty clay loam; SCL: sandy clay loam; CL: clay loam; ZL: silt loam; L: loam; S: sand; LS: loamy sand; SL: sandy loam).

Table 1 .
The percentages of soil texture classes in the original dataset.

Table 2 .
Environmental covariates used for modelling soil textural classes.
a S, L, and M letters indicate Sentinel, Landsat, and MODIS, respectively.

Table 3 .
Validation statistics calculated using the confusion matrix.