Soil Mapping Based on the Integration of the Similarity-Based Approach and Random Forests

: Digital soil mapping (DSM) is currently the primary framework for predicting the spatial variation of soil information (soil type or soil properties). Random forests and similarity-based methods have been used widely in DSM. However, the accuracy of the similarity-based approach is limited, and the performance of random forests is a ﬀ ected by the quality of the feature set. The objective of this study was to present a method for soil mapping by integrating the similarity-based approach and the random forests method. The Heshan area (Heilongjiang province, China) was selected as the case study for mapping soil subgroups. The results of the regular validation samples showed that the overall accuracy of the integrated method (71.79%) is higher than that of a similarity-based approach (58.97%) and random forests (66.67%). The results of the 5-fold cross-validation showed that the overall accuracy of the integrated method, similarity-based approach, and random forests range from 55% to 72.73%, 43.48% to 69.57%, and 54.17% to 70.83%, with an average accuracy of 66.61%, 57.39%, and 59.62%, respectively. These results suggest that the proposed method can produce a high-quality covariate set and achieve a better performance than either the random forests or similarity-based approach alone. S1 (0.058), and slope position (0.038), respectively. Compared the results of covariates the importance ranking of covariates considerably after adding six similarity covariates. Similarity covariates occupied four of the top ten most important covariates, and S1 and S2 were the top two. The results indicate the importance and value of the similarity covariates.


Introduction
Without secure soil resources, we cannot be sure of secure supplies of food, fiber, water, and the diversity of landscape [1]. Soil information is central to the rational use of soil resources and relates to the sustainability of human societies [2][3][4]. Furthermore, soil information may be used in the fields of precision agriculture, environmental change simulation, natural resource management and utilization, global change monitoring, and policy making [5][6][7][8][9]. Therefore, the acquisition of soil spatial information is of important research significance.
Digital soil mapping (DSM) is currently the primary framework for predicting the spatial distribution of soil type or other soil properties [10]. The theoretical basis of DSM is the soil-landscape concept [11], which reflects the relationships between soil information and environmental covariates. Various methods have been proposed towards DSM based on environmental covariate correlation (such as decision trees, support vector machines, random forests, and similarity-based approaches) [12][13][14][15] or spatial autocorrelation (such as simple kriging and block kriging) [16,17], or the combination of these Land 2020, 9,174 3 of 16 Land 2020, 9,174 3 of 18 (TWI), height above the nearest drainage (Hand), plan curvature, profile curvature, relief, slope, slope position, terrain characterization index (TCI), topographic position index (TPI), terrain ruggedness index (TRI), cosine value of the aspect (cosaspect) (Table 1, Figure 2a-l). The slope position was classified into five classes, including ridge, shoulder slope, back slope, foot slope, and valley [31].  Fuzzy slope position [31] SimDTA TCI Terrain characterization index [37] SimDTA TPI Topographic position index [38] SimDTA TRI Terrain ruggedness index [39] SimDTA Cosaspect Cosine value of the aspect ArcGIS 10.1 Figure 1. The study area and soil samples.

Environmental Data
The topography is the dominant factor for spatial distribution in soils across the study area. The area is small, and its parent material and climate are generally homogeneous. A digital elevation model (DEM) with a spatial resolution of 10 m resolution was produced from a topographic map at a scale of 1:10,000. The DEM was interpolated via irregular triangular networks, which comes from contour lines in the topographic map. Based on the digital elevation model, a total of twelve environmental covariates were derived and used as follows: elevation, topographic wetness index (TWI), height above the nearest drainage (Hand), plan curvature, profile curvature, relief, slope, slope position, terrain characterization index (TCI), topographic position index (TPI), terrain ruggedness index (TRI), cosine value of the aspect (cosaspect) (Table 1, Figure 2a-l). The slope position was classified into five classes, including ridge, shoulder slope, back slope, foot slope, and valley [31].

Soil Samples
There were 115 soil samples in the study area ( Figure 1). Soil samples were collected in 2005 from three sampling strategies, including integrative hierarchical stepwise sampling, subjective sampling, and systematic sampling. Thirty-two representative soil samples were collected using the integrative hierarchical stepwise sampling strategy, which aimed to collect representative samples to represent soil spatial variation by clustering environmental covariates and identifying samples based on the clustering results [30]. Slope, topographic wetness index, plan curvature, and profile curvature were used for identifying samples. Forty-four subjective samples were collected using the subjective sampling strategy, which means that experienced soil survey experts located them. Thirty-nine regular samples were collected using the systematic sampling strategy and were distributed on a grid of 1100 m (south-north direction) by 740 m (east-west direction).

Methods
In this section, we describe the methods used in this paper. The methods include (1) Environmental covariates selection; (2) Similarity model; (3) Random forests model; (4) The integration of the similarity-based method and random forests. Finally, we lay out the experiment design for assessing the soil mapping ability of the newly developed integration model. The flowchart of the method is shown in Figure 3.

Soil Samples
There were 115 soil samples in the study area ( Figure 1). Soil samples were collected in 2005 from three sampling strategies, including integrative hierarchical stepwise sampling, subjective sampling, and systematic sampling. Thirty-two representative soil samples were collected using the integrative hierarchical stepwise sampling strategy, which aimed to collect representative samples to represent soil spatial variation by clustering environmental covariates and identifying samples based on the clustering results [30]. Slope, topographic wetness index, plan curvature, and profile curvature were used for identifying samples. Forty-four subjective samples were collected using the subjective sampling strategy, which means that experienced soil survey experts located them. Thirty-nine regular samples were collected using the systematic sampling strategy and were distributed on a grid of 1100 m (south-north direction) by 740 m (east-west direction).

Methods
In this section, we describe the methods used in this paper. The methods include (1) Environmental covariates selection; (2) Similarity model; (3) Random forests model; (4) The integration of the similarity-based method and random forests. Finally, we lay out the experiment design for assessing the soil mapping ability of the newly developed integration model. The flowchart of the method is shown in Figure 3.

Environmental Covariates Selection
Due to the diversity of environmental covariates, it was necessary to select the covariates which are effective in mapping the variation of the targeted soil information. Environmental covariates were selected based on the importance of each covariate through the use of the random forests algorithm. The idea of assessing the significance of covariates or features in random forests is to determine how much each covariate contributes to each tree in the random forests, then take an average, and finally compare the contribution among all covariates [42]. The Gini index can measure the contribution of a covariate [43]. Gini index describes purity, similar to the meaning of information entropy; the smaller the value, the higher the purity [42]. The Scikit-Learn library [44] for Python was applied to the implementation of random forests and environmental covariates selection. For a given covariate set consisting of p covariates E = e 1 , e 2 , . . . , e p , the processes of covariates selection are as follows: (1) Calculate the Gini index of node m in a tree: where K represents the number of classes of node m, p mk is the proportion of category k in node m.
(2) Calculate the change in the Gini index after node m is divided into m 1 and m 2 : (3) The importance of covariate e j in node m is: where w m is the weight of node m, calculated by dividing the number of samples at node m by the total number of samples.
(4) For M, a nodes set of covariate e j in a tree, the importance of covariate e j in the ith tree is: The importance of covariate e j in the random forests composed of n trees is: (6) Normalize the importance scores of all covariates and rank the covariates according to the normalized results.

Similarity-Based Approach
The similarity-based approach assumes that locations with similar environmental conditions have similar soil types or properties [45]. Accordingly, each site in an area contains a covariate similarity vector describing how similar the site is to a set of known soil samples in a study area, with each element in the vector corresponding to a sample. The similarity value to each sample ranges from 0 to 1, with the value of 0 indicating that the environmental conditions between two locations are not similar, and the value of 1 showing that the environmental conditions between two locations are identical.
The calculation of environmental similarity between an unvisited location and a given sample is the key to the similarity-based approach. The calculation process consists of three significant steps. The first step is compilation of the environmental conditions related to the soil at each location. It is Land 2020, 9, 174 7 of 16 essential to employ effective covariates that can indicate the spatial variation of a targeted variable. The second step is collection of a soil sample set with the value of the target variable obtained. The third step is to calculate the similarity between each location in the study area and each of the soil samples [15]: in which S i , S k i , and S k ij denote the similarity at an unvisited location in the soil type scale, sample scale, and covariate scale, respectively. p, n, and m represent the number of soil types, samples of soil type k, and covariates, respectively, and i, j, and k signify unvisited location, sample location, and soil type, respectively. As suggested by Zhu [29], minimum and maximum functions were adopted to integrate the similarity vector. The following step equation shows the calculation of a specified categorical (e.g., parent material) or continuous (e.g., elevation) environmental covariate: where e vi and e vj represent the values of the vth covariate at the location i and j. SD e v is the standard deviation of the vth covariate in the study area and SD e vj is the square root of the mean deviation of the values of the vth environmental covariate at all unvisited sites from that at the sample location j. After obtaining the similarities at a location, the most appropriate soil type at the location can be inferred using the maximum operator, and the soil property can also be predicted using the similarity weighted method proposed by Zhu [29].

Random Forests
Random forests, introduced by Breiman [23], is an extended variant of bagging. Random forests is a model composed of many individual decision trees, but this model is not merely to average the prediction of all trees. The characteristics of random forests are mainly reflected in two aspects: the randomness of sampling and the randomness of feature selection [23]. The first characteristic means that each tree in a random forests model is constructed from a random sample subset. The subset is built by bootstrap sampling with replacement from the original sample set. Samples from different subsets or in the same subset can be repeated, which indicates that some samples will be used multiple times in a tree. For the second characteristic, it means that only a random subset of all the optional features are chosen when splitting each node in each tree. In other words, each splitting process of a decision tree in the random forests does not consider all the features, but randomly selects a feature subset from all the candidate features, and then selects an optimal feature from the subset for partitioning. These characteristics above can increase the diversity of decision trees in random forests, thereby improving the classification performance of the model.
The random forests algorithm consists of four main steps: (1) Select a subset from all the samples as a training set with the reiterative bootstrap sampling method; (2) Build a decision tree with the sample subset; (3) Repeat the two steps above for K times to construct K decision trees for the random forests model; and (4) Predict target variable value with the trained random forests model (multiple trees), the results of which are calculated by the prediction results of different individual trees within the forests. The construction, training, and parameter setting of a random forests model were implemented using the Scikit-Learn Python toolkit [44]. The parameters to be optimized in random forests mainly include the number of decision trees in the forests, the maximum depth of each decision tree, the minimum number of samples required to split a node, the minimum number of samples needed to be at a leaf node, and the number of features to consider when looking for the best split. The optimal parameters of a random forests model will vary among different input samples or feature sets, so model tuning must be performed.

Integration of Similarity-Based Approach with Random Forests
The similarity-based approach and random forests are integrated by taking the outputs of a similarity-based approach as the input features of a random forests model to create the integration method. The purpose of the integration method is to take the advantages of both the similarity-based approach and random forests, provide a high-quality feature set, and improve the accuracy of digital soil mapping. The integration method is different from the similarity-based approach and random forests. The difference between the integration method and the similarity-based method lies in the digital soil mapping model itself. The difference between the random forests and the integration method lies in the feature set. The former uses the selected features only, while the latter uses the selected features and the similarity features produced by the similarity-based approach. The scale of a similarity feature set can be in the soil type scale, sample scale, or covariate scale.

Experiment Design
The experiment was conducted in two ways. First, to verify the effectiveness of the integration method (SB-RF), the SB-RF is compared with the similarity-based approach (SB) and with the random forests (RF) using the selected covariates. The SB-RF, SB, and RF were used to predict soil subgroups in the study area. Since the purpose is to map the soil subgroups, the similarity features of SB-RF in the soil type scale (six similarity features in total) were used in this study. Field soil samples were divided into two parts: training samples and validation samples. The validation dataset contains 39 regular samples collected by systematic sampling. The regular samples were evenly distributed and cover most of the study area, which can objectively evaluate the results of soil mapping. The other 76 representative and subjective samples collected by integrative hierarchical stepwise and subjective sampling were used as the training dataset. Second, to reduce the impact of sampling strategies and sample size on the accuracy of each method or model, the 5-fold cross-validation was adopted for the model evaluation.

Evaluation
The performance of each method was evaluated using the validation samples. To estimate the prediction accuracy, a confusion matrix, and three criteria, including the overall accuracy (OA), the producer's accuracy (PA), and the user accuracy (UA) were calculated: where p and n are the number of soil types and the total number of validation samples, respectively, and n c , N c , and N c denote the correctly classified number, real number, and predicted number of soil type c, respectively. Figure 4 shows the ranked importance of environmental covariates for covariates selection. The larger the value of a covariate, the more important it is. A total of twelve factors were involved in the environmental covariates selection in this study. Out of these covariates, slope position has the highest value (0.15), followed by cosaspect (0.127), elevation (0.126), TRI (0.092), Hand (0.088), TWI (0.079), relief (0.076), and slope (0.075), respectively. Profile curvature, plan curvature, TPI, and TCI in the study area have lower covariate importance values, all of which were less than 0.05. Therefore, these four covariates were removed, and the top eight covariates with the highest importance value were selected for digital soil mapping.

Soil Mapping Results
Soil mapping models, including SB-RF, SB, RF, have been constructed using 76 soil training samples. These models were used to infer the soil subgroups in the study area. The optimal parameters for SB-RF and RF in this study are shown in Table 2, which were generated by grid searching. The mapping results are shown in Figure 5(a-c). The spatial distribution patterns of subgroups in soil maps generated by SB-RF, SB, and RF were different. These soil maps generated by SB, RF were more fragmented in spatial distribution than that generated by SB-RF. It is evident that Typic Hapli-Udic Isohumosols was the most widely distributed soil type in each soil map. The landform positions of soil subgroups Mollic Bori-Udic Cambosols and Typic Bori-Udic Cambosols were close, but their total area values of distribution were different. In addition, the landform positions of Pachic Stagni-Udic Isohumosols, Lithic Udi-Orthic Primosols, and Fibric Histic-Typic Haplic Stagnic Gleyosols in each map were almost the same. The comparison of mapping results and the distribution patterns of each soil subgroup indicate that the spatial distribution of subgroups in the study area was mainly affected by the mapping methods and topographic covariates.

Soil Mapping Results
Soil mapping models, including SB-RF, SB, RF, have been constructed using 76 soil training samples. These models were used to infer the soil subgroups in the study area. The optimal parameters for SB-RF and RF in this study are shown in Table 2, which were generated by grid searching. The mapping results are shown in Figure 5a-c. The spatial distribution patterns of subgroups in soil maps generated by SB-RF, SB, and RF were different. These soil maps generated by SB, RF were more fragmented in spatial distribution than that generated by SB-RF. It is evident that Typic Hapli-Udic Isohumosols was the most widely distributed soil type in each soil map. The landform positions of soil subgroups Mollic Bori-Udic Cambosols and Typic Bori-Udic Cambosols were close, but their total area values of distribution were different. In addition, the landform positions of Pachic Stagni-Udic Isohumosols, Lithic Udi-Orthic Primosols, and Fibric Histic-Typic Haplic Stagnic Gleyosols in each map were almost Land 2020, 9,174 10 of 16 the same. The comparison of mapping results and the distribution patterns of each soil subgroup indicate that the spatial distribution of subgroups in the study area was mainly affected by the mapping methods and topographic covariates.   Table 3 summarizes the validation results of SB-RF, SB, and RF. As can be seen, the SB-RF method achieved the best performance for soil mapping in this study (OA = 71.79%), followed by RF (OA = 66.67%), and SB (OA = 58.97%), respectively. The results illustrate the advantage of the integration of SB and RF and demonstrate that the SB-RF method has better performance than either the RF or SB alone. The comparison between SB and RF indicates that the DSM accuracy of RF is  Table 3 summarizes the validation results of SB-RF, SB, and RF. As can be seen, the SB-RF method achieved the best performance for soil mapping in this study (OA = 71.79%), followed by RF (OA = 66.67%), and SB (OA = 58.97%), respectively. The results illustrate the advantage of the integration of SB and RF and demonstrate that the SB-RF method has better performance than either the RF or SB alone. The comparison between SB and RF indicates that the DSM accuracy of RF is better than that of SB under the same environmental covariates condition. Moreover, the comparison between SB-RF and RF shows that the SB-RF can achieve better performance under the same model condition, and proves that the SB approach can provide a high-quality similarity feature set for the SB-RF method. Table 3. Confusion matrices built with the validation samples for the integration of similarity-based approach and random forests (SB-RF), similarity-based approach (SB), and random forests (RF).

Model
Soil Type *  I  II  III  IV  V  The SB-RF model achieved the highest producer accuracy of Typic Hapli-Udic Isohumosols, followed by RF (80%) and SB (72%), respectively. Similarly, The SB-RF model achieved the highest user accuracy of Typic Hapli-Udic Isohumosols, followed by RF (71.43%) and SB (69.23%), respectively. The Typic Hapli-Udic Isohumosols soil type has the largest number of verification samples, so the validation accuracies of the SB-RF can reflect the reliability of this model. The producer accuracy of Typic Hapli-Udic Isohumosols for RF was smaller than that of SB-RF. It might be explained by the role of similarity features produced by SB. Compared with SB and RF, the SB-RF has the best producer accuracy and user accuracy for Typic Bori-Udic Cambosols, Lithic Udi-Orthic Primosols, and Fibric Histic-Typic Haplic Stagnic Gleyosols. For Pachic Stagni-Udic Isohumosols and Mollic Bori-Udic Cambosols, the producer accuracy and user accuracy of each model vary greatly. The reason for the significant difference among the producer accuracy and the user accuracy of each soil subgroup may be attributed to the difference in the sample size of each soil type. Besides, the performance of each model is also affected by the setting of model parameters, which needs to be further studied in future work. Figure 6 presents the performance of each model for the 5-fold cross-validation. It can be observed that the overall accuracies of SB-RF, SB, and RF ranged from 55% to 72.73%, 43.48% to 69.57%, and 54.17% to 70.83%, with an average accuracy of 66.61%, 57.39%, and 59.62%, respectively. The SB-RF model has a significant improvement in accuracy over the SB and RF, which shows the superiority of the integrated method again. The comparison between the RF and SB-RF demonstrates the effectiveness of the similarity feature set. In general, the accuracy of each model in the case using 5-fold cross-validation, compared with the case using regular samples as the validation dataset, has changed notably. However, the research can still prove the superiority and effectiveness of the proposed integration method.
Primosols, and Fibric Histic-Typic Haplic Stagnic Gleyosols. For Pachic Stagni-Udic Isohumosols and Mollic Bori-Udic Cambosols, the producer accuracy and user accuracy of each model vary greatly. The reason for the significant difference among the producer accuracy and the user accuracy of each soil subgroup may be attributed to the difference in the sample size of each soil type. Besides, the performance of each model is also affected by the setting of model parameters, which needs to be further studied in future work. Figure 6 presents the performance of each model for the 5-fold cross-validation. It can be observed that the overall accuracies of SB-RF, SB, and RF ranged from 55% to 72.73%, 43.48% to 69.57%, and 54.17% to 70.83%, with an average accuracy of 66.61%, 57.39%, and 59.62%, respectively. The SB-RF model has a significant improvement in accuracy over the SB and RF, which shows the superiority of the integrated method again. The comparison between the RF and SB-RF demonstrates the effectiveness of the similarity feature set. In general, the accuracy of each model in the case using 5-fold cross-validation, compared with the case using regular samples as the validation dataset, has changed notably. However, the research can still prove the superiority and effectiveness of the proposed integration method.   Figure 7 exhibits the importance ranking of covariates in the SB-RF model. A total of fourteen covariates, including eight selected environmental covariates and similarity covariates of six soil subgroups (S1, S2, S3, S4, S5, S6), were involved in the calculation of covariate importance. The S5 achieved the best covariate importance value (0.086), followed by S3 (0.085), relief (0.082), S2 (0.081), TRI (0.081), S6 (0.078), slope (0.076), elevation (0.073), Hand (0.069), TWI (0.068), S4 (0.066), cosaspect (0.058), S1 (0.058), and slope position (0.038), respectively. Compared with the results of covariates selection, the importance ranking of covariates has changed considerably after adding six similarity covariates. Similarity covariates occupied four of the top ten most important covariates, and S1 and S2 were the top two. The results indicate the importance and value of the similarity covariates.  Figure 8 shows the predicted map generated by the RF model with soil type similarities as features only. The overall accuracy of the map is 66.67%, which is equal to the accuracy of the RF model with environmental covariates. The average accuracy of the 5-fold cross-validation of the map is 58.72%, which is close to the accuracy of the RF model with environmental covariates of 59.62%. In addition, the spatial distribution patterns of subgroups in the map are similar to the map generated by SB-RF. Based on the comparison above, we can conclude that the modeling accuracy of similarity features is close to that of environmental covariates. Due to the number of similarity features being less than that of environmental covariates, we indicate that the similarity features are more refined.  Figure 8 shows the predicted map generated by the RF model with soil type similarities as features only. The overall accuracy of the map is 66.67%, which is equal to the accuracy of the RF model with environmental covariates. The average accuracy of the 5-fold cross-validation of the map is 58.72%, which is close to the accuracy of the RF model with environmental covariates of 59.62%. In addition, the spatial distribution patterns of subgroups in the map are similar to the map generated by SB-RF. Based on the comparison above, we can conclude that the modeling accuracy of similarity features is close to that of environmental covariates. Due to the number of similarity features being less than that of environmental covariates, we indicate that the similarity features are more refined.

Impact of the Sampling Strategy and the Quality of Soil Samples
To explore the impact of sampling strategies on the performance of the SB-RF method, three SB-RF models were trained with a sample set in different sampling strategies, respectively. Choosing any one for modeling, the other two for verifying the accuracy of the model, repeating the experiment five times can get the results as shown in Table 4. The overall accuracy of SB-RF based on

Impact of the Sampling Strategy and the Quality of Soil Samples
To explore the impact of sampling strategies on the performance of the SB-RF method, three SB-RF models were trained with a sample set in different sampling strategies, respectively. Choosing any one for modeling, the other two for verifying the accuracy of the model, repeating the experiment five times can get the results as shown in Table 4. The overall accuracy of SB-RF based on representative, regular, and subjective samples ranges from 63.86% to 66.27%, 47.37% to 52.63%, and 63.38% to 64.79%, with an average accuracy of 65.31%, 49.21%, and 64.23%, respectively. The accuracy of the SB-RF based on regular samples is significantly lower than that of representative and subjective samples. The model accuracy of the representative samples is slightly higher than that of subjective samples. The comparison of different model accuracy shows that the sampling strategy affects the accuracy of the SB-RF model. The quality of soil samples can affect not only the accuracy of the SB-RF model, but also the evaluation of the prediction of the model. Previous studies have shown that the uncertainty of evaluation results change with the location and number of validation samples [46]. In this study, soil mapping results were evaluated by the regular samples. However, in the process of sample collection, the location of the collected sample set is different from the designed sample set due to the positioning error or the accessibility of the target sample. Consequently, the collected regular samples do not appear to fit a strictly regular distribution over the study area, which will increase the uncertainty of evaluation results.

Applicability and Limitations of the Integrated Approach
The case study showed that the proposed method can effectively take the advantages of both the similarity-based approach and random forests, provide a high-quality feature set, and achieve high accuracy in predicting soil types. Besides the advantages, there are limitations in the method proposed in this paper. First, the integrated method examined only a small area for mapping soil types. Second, the parameter optimization of the model is time-consuming. Third, there are many ways to select features. Different methods get different feature sets. How to select feature sets and the influence of feature selection methods on model accuracy need further study. The integration of DSM methods is a trend. Besides the integration of the similarity-based method and random forests, the integration of other methods (such as a tree-based model and neural network model) is also worth further study.

Conclusions
This paper presented a method for soil mapping by integrating the similarity-based approach and random forests. The presented method was applied to the Heshan study area to verify its effectiveness. The following conclusions can be drawn from this study: (1) The SB-RF method achieved a better performance in accuracy than either the RF or SB alone, which shows the effectiveness and superiority of the integrated method; (2) The similarity covariates produced by the similarity-based approach embedded valuable information, which can effectively improve the mapping accuracy. The sampling strategy affects the accuracy of the SB-RF method.