Extracting Typical Samples Based on Image Environmental Factors to Obtain an Accurate and High-Resolution Soil Type Map

Abstract: Soil surveying and mapping provide important support for environmental science research on soil and other resources. Due to rapid changes in land use and the long update cycle of soil maps, historical conventional soil maps (CSMs) may be outdated and have low accuracy. Therefore, there is an urgent need for accurate and up-to-date soil maps. Soil is highly correlated in space with its corresponding environmental factors, and typical samples capture an appropriate soil–environment relationship for each soil type. Understanding how to extract typical samples according to environmental factors and determine the implied soil–environment relationship is the key to updating soil maps. In this study, a hierarchical typical sample extraction method based on land use type and environmental factors was designed. According to the correspondence between soil type and land use type (ST-LU), the outdated soil map patches caused by changes in land use were excluded, and typical samples were then extracted according to the peak intervals of the soil–environmental factor histograms. Feature selection was performed through variance analysis and mutual information, and four machine learning models were used to predict soil types. The influence of environmental factors on soil prediction was also discussed in terms of variable importance analysis. Using a common validation set, the results show that the prediction accuracy of models trained on typical samples is above 0.8, while the prediction accuracy of models trained on random samples is only about 0.4. Compared with the original soil map, the accuracy and resolution of the predicted soil maps based on typical samples are greatly improved. In general, typical samples can effectively capture the actual soil–environment knowledge implied in the soil type map. By extracting typical samples from historical soil type maps and combining them with high-resolution remote sensing data, we can generate new soil type maps with high accuracy and a short update cycle. This can provide a reference for typical sampling design and soil type prediction.


Introduction
Humanity is currently facing major challenges, such as climate change, food security, land degradation, and ecosystem sustainability. These issues are closely related to soil function, and there is an urgent need for accurate and up-to-date soil information [1]. The soil sphere is an important component of the earth's ecosystem and is the material basis for the existence, evolution, and development of terrestrial ecosystems. Soil survey and mapping is the basis for the direct use and management of soil resources and is also an important support for other soil and environmental science research. Conventional soil mapping is a manual approach in which experts combine topographic maps and aerial or satellite images with field investigations to understand soil-landscape relationships. A conventional soil survey is the most reliable and authoritative approach for soil mapping. Soils are landscapes as well as profiles, and conventional soil maps (CSMs) contain a range of expert knowledge reflecting soil-environment relationships [2], making them important data sources and a basis for evaluation in soil prediction and mapping. However, a national-scale soil survey involves a large workload, requires strong professional expertise, and is subject to technical, economic, and time constraints, resulting in long survey cycles and long intervals between surveys [3]. Most CSMs are more than ten years old, and some are decades old; therefore, it is necessary to update them.
Soil and landscape are considered to be closely linked in pedology. From Jenny's soil-landscape model to McBratney's "scorpan-SSPFe" model [4,5], digital soil mapping (DSM) has made great progress. DSM uses machine learning and other methods to mine the relationships between soil and environmental covariates in order to predict soil properties or types, which reduces the reliance on expert knowledge and saves considerable human resources, material resources, and time. As such, DSM has been widely used [6-10]. Since soil-landscape knowledge is largely stable over time, mining the knowledge and laws of soil distribution contained in historical samples can greatly improve the efficiency of soil mapping.
Enhancing the quality of input data is the best way to improve soil mapping performance [11]. The quality of samples and the selection of key environmental variables largely determine the accuracy and scientific soundness of predictive mapping. Soil sample design methods can be divided into two categories: uniform sampling, which aims at overall representativeness, and typical sampling, which serves a specific purpose. The most common uniform sampling methods are the regular grid and variance quadtree methods [12,13]. Typical sampling schemes mainly include fuzzy classification sampling and multilevel representative sampling [14-16]. Moreover, conditioned Latin hypercube sampling is effective and has been widely used [17-19].
As soil types follow characteristic patterns in their spatial distribution [2], it is the typical samples that reflect the correct soil-landscape relationship, whereas samples in atypical areas do not represent the correct soil-environment relationship and can confuse soil classification. CSMs may also have mixed and incorrect patches, but this confusion is much less frequent in typical areas. Therefore, while uniform sampling may be a good strategy when sampling is very dense, it may not be suitable for soil type prediction when sampling is sparse, because it may miss some important combinations of soil type and environmental covariates. In addition, most existing typical sampling methods ignore the effects of timeliness and human factors on the accuracy of CSMs. The soil-landscape relationship is fairly stable when external environmental conditions are relatively stable, but human factors may fundamentally alter this stable soil process [20]. Because conventional soil surveys are repeated only at long intervals, the soil-landscape relationships implied in CSMs no longer correspond exactly to the actual geographical landscape once environmental conditions have changed significantly. Sampling in these non-corresponding areas can seriously affect the quality of the samples and the prediction accuracy, so typical samples in the soil map need to be extracted based on land use (LU) and other environmental factors. At present, there are few studies on typical soil samples, and the existing studies have not considered the impacts of land use change and timeliness.
In this study, a method for extracting typical soil samples according to land use type and environmental factor histograms was proposed. According to the correspondence between soil type and land use type (ST-LU), the outdated soil map patches caused by land use change were excluded, and the peak intervals of the soil-environmental factor distribution histograms were then used to determine the typical areas. Finally, typical samples were extracted hierarchically in the different typical areas. In addition, feature selection was performed through variance analysis and mutual information. Four machine learning models were used to predict soil types by mining soil-environment knowledge. We hope to improve the accuracy of soil type mapping through typical sampling strategies in order to provide a reference for conventional soil map updating and modern digital soil mapping.

Study Area
The study area is located in the northern part of Jurong City, Jiangsu Province, China, and is a hilly agricultural region with a total area of about 2150 ha (Figure 1). The study area has a north subtropical continental monsoon climate, with an average annual temperature of 15.2 °C and an annual precipitation of 1060 mm. The soil-forming parent material (PM) is mainly tertiary alluvial parent material (Q3al), quaternary alluvial parent material (Q4al), and a small amount of purplish sandstone (J33). Cultivated land and forests are the main land use types, and there are a few natural forests and planted forests.

Data Sources
Pedology holds that soils are formed under the combined action of five major soil-forming factors [4], and it provides theoretical support for DSM. According to the importance of the various soil-forming factors in the soil-forming process, the dominant and stable environmental factors were selected as predictor variables. The variability of climatic factors such as temperature and precipitation in the study area is very weak, so they were not selected as predictor variables. The soil map is raster legacy data extracted from China's Second National Soil Survey, with a resolution of 100 m. This soil map has rather few soil type classes, and the data used to draw it were gathered 20 to 30 years ago. The PM map was obtained by digitizing a paper version of the geological map of Yangzhou, China (Figure 1b).

We purchased two remote sensing images, GF2 (19 September 2021) and ZY3-02 (19 November 2019), from the China Center for Resources Satellite Data and Application (CRESDA, https://data.cresda.cn/#/home (accessed on 15 January 2022)). Based on GF2, an object-oriented supervised classification was used to obtain the LU map, with a total of six LU types (Figure 1c). Four types correspond to soil regions, namely natural forest (L1), paddy field (L2), irrigated land (L3), and planted forest (L6), and the other two types correspond to non-soil regions. In ENVI 5.3, a DEM with a resolution of 2.1 m was generated by constructing a stereo image pair from the front-view and forward-view images of ZY3-02. The DEM was then resampled to 30 m, and eleven derived topographic indices [9,21] were calculated in SAGA v.8.1.3 (System for Automated Geoscientific Analysis, http://www.saga-gis.org (accessed on 13 March 2022)): Aspect, Channel Network Base Level (CNBL), Channel Network Distance (CND), Convergence Index (CI), LS-Factor (LSF), Plan Curvature (PlC), Profile Curvature (PrC), Relative Slope Position (RSP), Slope, Topographic Wetness Index (TWI), and Valley Depth (VD).
We downloaded a Sentinel-2 (S2) image with low cloudiness, acquired during the flowering-filling period (late August to late September), from the European Space Agency (ESA) (https://scihub.copernicus.eu (accessed on 18 January 2022)). The Sentinel-2B Level-1C product of 31 August 2021 was used as the source data; atmospheric correction was performed in Sen2Cor to generate the Level-2A product, which was resampled to 10 m in SNAP. Then, five vegetation indices [22,23] were calculated in ENVI 5.3: the normalized difference vegetation index (NDVI), difference vegetation index (DVI), enhanced vegetation index (EVI), ratio vegetation index (RVI), and soil-adjusted vegetation index (SAVI) [24]. Moreover, two soil indices, the soil color index (SCI) and the soil red index (SRI), were also calculated [25] as predictors to reflect soil color and hematite characteristics.
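The five vegetation indices named above follow widely used formulas, which can be sketched per pixel as below. This is a minimal illustration and not the ENVI 5.3 workflow used in the study: the function name, the assumption that reflectance is scaled to 0-1, and the SAVI soil factor of 0.5 are all assumptions introduced here. SCI and SRI are omitted because their exact formulas are not given in the text.

```python
def vegetation_indices(nir, red, blue, soil_factor=0.5):
    """Return NDVI, DVI, RVI, EVI and SAVI for one pixel.

    Inputs are assumed to be surface reflectance values in the range 0-1
    (for Sentinel-2 these would typically be bands B8, B4 and B2).
    A pixel with red == 0 or nir + red == 0 raises ZeroDivisionError.
    """
    return {
        "NDVI": (nir - red) / (nir + red),
        "DVI": nir - red,
        "RVI": nir / red,
        # Standard EVI coefficients: gain 2.5, C1 = 6, C2 = 7.5, L = 1.
        "EVI": 2.5 * (nir - red) / (nir + 6.0 * red - 7.5 * blue + 1.0),
        # soil_factor = 0.5 is the commonly used default, assumed here.
        "SAVI": (1.0 + soil_factor) * (nir - red) / (nir + red + soil_factor),
    }


# Hypothetical reflectance values for a vegetated pixel.
print(vegetation_indices(nir=0.5, red=0.1, blue=0.05))
```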

Typical Sampling Design
When environmental conditions change significantly, the soil-landscape relationship implied in a CSM no longer corresponds to the actual geographical landscape. With relatively stable conditions of climate, topography, and parent material, the anthropogenic factor is the dominant factor in the change in soil types, which is mainly reflected in land use changes. Therefore, the regions where soil type corresponds to land use type (ST-LU) should first be extracted to exclude the influence of erroneous pixels on modeling and prediction, as shown in Figure 2.
The spatial distribution of soil is transitional to a certain degree, and confusion is large in the edge areas between different soil patches. The middle area of a single patch is stable and typical, and the environmental factors corresponding to the soil in typical areas are also typical. The peak interval of a soil-environmental factor distribution histogram represents the interval in which the soil type is most widely distributed under this environmental factor, making it a representative and typical environmental factor interval. Therefore, according to the frequency distribution histograms of soil-environmental factors, we extracted the peak intervals of the environmental factors for each soil type, and the spatial areas corresponding to the peak intervals were considered typical regions.
To avoid noise caused by inconsistent environmental data and by very small areas being classified as typical areas, we adjusted the number of histogram intervals to ensure that the area corresponding to the peak interval was meaningful and large enough to select samples. Repeated experiments revealed that the spatial area corresponding to the peak interval of each environmental factor should be controlled at 1/5 to 1/4 of the total area of each soil type. The typical regions of each soil type were extracted in turn according to the peak intervals, and the number of samples for each soil type was then determined based on the area ratio of the soil types. Finally, the typical samples were extracted hierarchically in the typical regions of each soil type.
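The peak-interval step described above can be sketched as follows. This is only one possible reading of the procedure, not the authors' implementation: the function names, the use of equal-width bins, the searched range of bin counts, and the tie-breaking rule are assumptions. The sketch takes the values of one environmental factor for all pixels of one soil type and tries several bin counts, keeping the one whose fullest bin covers a share of pixels closest to the 1/5 to 1/4 band.

```python
def histogram_peak(values, n_bins):
    """Return (low, high, share) for the fullest bin of an equal-width histogram."""
    v_min, v_max = min(values), max(values)
    if v_min == v_max:  # constant factor: the whole range is the peak
        return v_min, v_max, 1.0
    width = (v_max - v_min) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum value is placed in the last bin.
        idx = min(int((v - v_min) / width), n_bins - 1)
        counts[idx] += 1
    peak = counts.index(max(counts))
    return v_min + peak * width, v_min + (peak + 1) * width, counts[peak] / len(values)


def peak_interval(values, target=(0.20, 0.25), bin_range=range(3, 41)):
    """Pick the bin count whose peak share is inside, or closest to, the target band."""
    def miss(result):
        share = result[2]
        return max(target[0] - share, share - target[1], 0.0)

    # Ties are resolved in favour of the first (smallest) bin count tried.
    return min((histogram_peak(values, n) for n in bin_range), key=miss)


# Hypothetical elevation values (m) for the pixels of one soil type.
elevations = [0, 1, 2, 3, 3, 3, 4, 5, 8]
low, high, share = peak_interval(elevations, bin_range=(2, 4))
print(low, high, share)
```

Pixels whose factor value falls inside the returned interval would then form the typical region for that soil type and factor.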

Data-Mining Methods
In this section, we introduce four data-mining methods for predicting soil types: random forest (RF), bagged classification and regression trees (bagCART), bagged flexible discriminant analysis (bagFDA), and neural networks (NNet) [23,26,27]. Soil type was extracted in ArcGIS 10.8 as the response variable, and the environmental covariates were extracted as the predictor variables. In R, the "rf", "treebag", "bagFDA", and "nnet" methods of the "caret" package were used, respectively, to build these models, and tenfold cross-validation was used to select the optimal models.
RF is an ensemble model built on the bagging algorithm. Decision trees are used as the base classifiers, and the bootstrap method is used to sample with replacement. Training multiple base learners effectively mitigates the overfitting of a single model [28]. BagCART is an improved classification and regression trees (CART) algorithm that combines CART with bagging to improve model prediction accuracy and reduce overfitting [23]. BagFDA is a discriminant analysis model based on multivariate adaptive regression splines (MARS) and the bagging algorithm. MARS is an adaptive regression procedure that is well suited to high-dimensional problems [26]. NNet is developed from the perceptron model and uses multiple layers of perceptrons to fit a multivariate log-linear model through a neural network. The network structure consists of an input layer, a hidden layer, and an output layer.
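Three of the four models rely on bagging, so a small sketch of the idea may help. The example below is not the caret implementation used in the study: to stay short it swaps the decision tree base learner for a one-nearest-neighbour rule, and all names and data are hypothetical. It shows only the two ingredients named above, bootstrap sampling with replacement and aggregation by majority vote.

```python
import random
from collections import Counter


def nearest_neighbour_label(train, point):
    """Base learner: label of the closest training row (squared Euclidean distance)."""
    closest = min(train, key=lambda row: sum((a - b) ** 2 for a, b in zip(row[0], point)))
    return closest[1]


def bagged_predict(train, point, n_learners=25, seed=0):
    """Majority vote over base learners, each fitted on its own bootstrap sample."""
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(n_learners):
        # Bootstrap: draw len(train) rows with replacement.
        bootstrap = [rng.choice(train) for _ in train]
        votes[nearest_neighbour_label(bootstrap, point)] += 1
    return votes.most_common(1)[0][0]


# Hypothetical training rows: ((elevation, slope), soil type).
train = [((float(i), 0.0), "S2") for i in range(10)]
train += [((100.0 + i, 0.0), "S15") for i in range(10)]
print(bagged_predict(train, (3.0, 0.0)))
```

A random forest additionally restricts each tree split to a random subset of the predictors, which this sketch does not attempt to reproduce.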

Methods for Evaluating Model Performance
Two metrics were calculated to evaluate the model performance: overall accuracy (OA) and the Kappa index. Fifty validation samples were randomly extracted from the overall area and were kept apart for validation. The overall accuracy is the sum of the main diagonal components of the confusion matrix divided by the total number of samples. The Kappa index is a consistency measure that combines the total number of samples, the number of soil types, and the correctly classified samples [29]:
$$\mathrm{OA} = \frac{1}{N}\sum_{j=1}^{k} x_{jj}$$
$$\mathrm{Kappa} = \frac{N\sum_{j=1}^{k} x_{jj} - \sum_{j=1}^{k}\left(\sum_{i=1}^{k} x_{ij}\,\sum_{i=1}^{k} x_{ji}\right)}{N^{2} - \sum_{j=1}^{k}\left(\sum_{i=1}^{k} x_{ij}\,\sum_{i=1}^{k} x_{ji}\right)}$$
where $k$ is the number of classes, $N$ is the overall sample size, $x_{jj}$ is the number of correctly classified samples of class $j$, and $x_{ij}$ and $x_{ji}$ are the entries of the confusion matrix in column $j$ and row $j$, respectively, which are misclassified samples whenever $i \neq j$.
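As a worked check of the two metrics, the sketch below computes overall accuracy and the standard Cohen's Kappa from a confusion matrix. The function name and the example matrix are hypothetical; the 50 samples in the example merely mirror the size of the validation set and are not results from the study.

```python
def overall_accuracy_and_kappa(matrix):
    """Compute overall accuracy and Cohen's Kappa from a square confusion matrix.

    matrix[i][j] is the number of samples in row class i and column class j.
    Kappa is undefined (division by zero) if all samples fall into one class.
    """
    k = len(matrix)
    n = sum(sum(row) for row in matrix)
    diagonal = sum(matrix[j][j] for j in range(k))
    # Sum over classes of (row total * column total): the chance-agreement term.
    chance = sum(
        sum(matrix[j]) * sum(matrix[i][j] for i in range(k)) for j in range(k)
    )
    accuracy = diagonal / n
    kappa = (n * diagonal - chance) / (n * n - chance)
    return accuracy, kappa


# Hypothetical two-class confusion matrix with 50 validation samples.
accuracy, kappa = overall_accuracy_and_kappa([[20, 5], [10, 15]])
print(accuracy, kappa)
```

For this matrix, the diagonal sum is 35, the chance term is 25 × 30 + 25 × 20 = 1250, so OA = 35/50 = 0.7 and Kappa = (1750 − 1250)/(2500 − 1250) = 0.4.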

Analysis of Variance and Feature Selection
To improve the stability and generalizability of the model, feature selection should be performed before modeling. In this study, the associations of soil type with the numerical and categorical variables were evaluated through ANOVA and mutual information, respectively. Two hundred samples in the ST-LU regions were extracted for ANOVA, and environmental factors with p < 0.05 were selected as predictor variables. The mutual information values between LU and ST and between PM and ST were 0.67 and 0.09, respectively.
Considering that PM is one of the most important soil-forming factors, both LU and PM were used as predictor variables despite the low mutual information of PM. Finally, we selected six topographic indices (Elevation, CNBL, PlC, RSP, Slope, TWI) and five vegetation and soil indices (GF2DVI, S2RVI, S2SRI, ZY302EVI, ZY302SCI), as well as LU and PM, as the predictor variables. The types and ANOVA results of the predictor variables are shown in Table 1.
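The mutual information between two categorical variables, such as LU and ST, can be estimated from sample counts as sketched below. The function name and toy data are hypothetical, and the logarithm base and any normalization behind the reported values (0.67 and 0.09) are not stated in the text, so this sketch uses natural logarithms without normalization as an assumption.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Plug-in estimate of mutual information (in nats) between two label sequences."""
    n = len(xs)
    count_x, count_y = Counter(xs), Counter(ys)
    count_xy = Counter(zip(xs, ys))
    # Sum of p(x, y) * log(p(x, y) / (p(x) * p(y))) over observed pairs.
    return sum(
        (c / n) * math.log((c * n) / (count_x[x] * count_y[y]))
        for (x, y), c in count_xy.items()
    )


# Hypothetical samples: land use fully determines soil type here.
land_use = ["L2", "L2", "L3", "L3"]
soil_type = ["S2", "S2", "S16", "S16"]
print(mutual_information(land_use, soil_type))
```

A value of zero indicates that the two variables are independent in the sample, while larger values indicate that knowing one variable reduces the uncertainty about the other.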

Training Samples Acquisition
Typical samples were extracted according to the method described in Section 2.3. To prevent the typical area from becoming too small when the peak intervals of the environmental factors were intersected, which would affect the quality of the samples, we selected only those environmental factors whose soil-environment histograms followed or were close to a normal distribution. We also selected the four terrain factors and the two vegetation indices with the highest correlation. The pixels that satisfied at least one factor's peak interval in the topography indices and vegetation indices were then extracted as the typical regions.
Considering the randomness of sampling during modeling, it is necessary to ensure that the number of samples for each soil type is not too small. In this study, the number of typical samples for each soil type was set to at least 10, and the total number of samples was set to 200. At the same time, 200 random samples were taken from the whole soil area as a control. A simple random sampling method was used for sampling in ArcGIS, and the minimum spatial distance between all samples was set to 100 m to reduce spatial autocorrelation. Furthermore, as the areas of S3 (board-slurry small-silt soil), S6 (gray-horse-liver soil), and S9 (blue-mud-strip) are too small, accounting for less than 1% of the total area, S3 and S6 were grouped into the similar soil types S2 (small-silt soil) and S5 (horse-liver soil), respectively, while S9, with the larger area, was retained but was not involved in the modeling or prediction. According to the area ratio of soil types, stratified sampling was carried out in the typical regions, and the typical regions and modeling samples are shown in Figure 3. In addition, the mutual information values between ST and LU and between ST and PM in the typical samples were 0.69 and 0.16, respectively, while those in the random samples were only 0.04 and 0.08, indicating that typical samples could greatly improve the correlation between soil and environmental covariates.
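One way to turn the rules above (200 samples in total, at least 10 per soil type, allocation by area ratio) into sample counts is sketched below. The text does not say how the minimum and the area ratio were reconciled, so the scheme here is an assumption: every soil type first receives the minimum, and the remainder is shared in proportion to area using largest-remainder rounding. The function name and the areas are hypothetical.

```python
def allocate_samples(areas, total=200, minimum=10):
    """Allocate `total` samples to soil types: a fixed minimum each, the rest by area."""
    remaining = total - minimum * len(areas)
    if remaining < 0:
        raise ValueError("total is too small for the requested minimum per class")
    area_sum = sum(areas.values())
    quotas = {name: remaining * area / area_sum for name, area in areas.items()}
    counts = {name: minimum + int(quota) for name, quota in quotas.items()}
    # Hand out samples lost to rounding, largest fractional part first.
    leftover = total - sum(counts.values())
    by_fraction = sorted(quotas, key=lambda name: quotas[name] - int(quotas[name]), reverse=True)
    for name in by_fraction[:leftover]:
        counts[name] += 1
    return counts


# Hypothetical typical-region areas (ha) for the modeled soil types.
areas = {"S2": 500.0, "S5": 300.0, "S14": 100.0, "S15": 50.0, "S16": 50.0}
print(allocate_samples(areas))
```

The 100 m minimum spacing between samples would be enforced separately when the individual points are drawn inside each typical region.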

Model Evaluation
Four machine learning methods were applied for modeling in R in order to determine the optimal parameters and the final model. To evaluate the generalization ability of the models, fifty samples were extracted from the ST-LU regions for validation. After data standardization, the sample data were fed into the models for prediction, and the results are shown in Table 2. From Table 2, it can be concluded that the overall accuracy of the typical sample (TS) models is above 0.8, and the Kappa index is mostly above 0.7, which are very good prediction results. In the random sample (RS) models, however, the overall accuracy on the calibration set is around 0.5, and the accuracy on the validation set is mostly below 0.5. The prediction accuracy of the models trained on typical samples is significantly higher than that of the models trained on random samples, indicating that the typical samples contain an appropriate soil-environment relationship and that machine learning models are able to capture this complex knowledge. In addition, the validation accuracy of the RS models was significantly lower than their calibration accuracy, and the Kappa index of the RS models was very low. This is because the soil-environment relationships were not well captured by the set of covariates and corresponding profiles in the random samples. In other words, random samples missed important soil-environment combinations, which were better captured by sampling based on typical profiles and landscapes. As a result, the prediction performance outside the random sampling calibration set is very low.
Among the four TS models, RF and bagFDA have high prediction accuracy in both the calibration and validation sets. RF achieves the highest prediction accuracy on the validation set, with an overall accuracy and Kappa index of 0.86 and 0.78, respectively. Therefore, RF has the best prediction performance and the strongest generalization ability in the entire region, with bagFDA coming second. Overall, the performances of the four TS models were rather comparable, and all RS models performed worse than the TS models. Moreover, some of the RS models were entirely unable to predict soil types, suggesting that models trained on random samples were not suited to soil type prediction.

Feature Importance Analysis
In order to better understand the influence of each environmental factor on the soil type, the importance of variables was analyzed for the two best-performing models. As shown in Figure 4, cross-entropy was chosen as the performance indicator to rank the importance of the variables. Considering the uncertainty of the evaluation, more than 10 evaluations were performed, and the variables were sorted by their mean importance. LU and elevation are the dominant variables in both models, and the absence of LU significantly increases the cross-entropy of the models, indicating that LU has the highest variable importance, followed by elevation. This demonstrates that LU and topography are the two dominant factors influencing soil type, which is particularly significant in farming areas with high anthropogenic influence. Moreover, CNBL is the third most important variable due to the consistency between the paddy field and river distributions, a factor that deserves attention in farming areas. In addition, the bagFDA model relies too much on one variable, which could introduce greater instability into the model. The RF model, on the other hand, is less dependent on any single variable and has a better generalization ability.
According to the RF model, the variable importance for the different soil types can be generated and expressed as a percentage of importance (Figure 5). We can conclude that L2 (paddy field) is the most important variable affecting soil classification and plays a decisive role in the prediction of S2 and S16 (dry-farming loess). As L2 is the most anthropogenically influenced type of LU, it is the main soil-forming factor for paddy soils, which leads to L2 being the main environmental variable for distinguishing paddy soils (S2 and S5) from yellow-brown earths (S16) in cultivated soils. Moreover, elevation holds the greatest importance for S15 (dry-hard loess) because S15 is generally distributed at relatively high locations, such as the top of a hill or the foot of a slope, making elevation the main environmental variable that distinguishes S15 from other soil types. In addition, GF2DVI, S2SRI, and ZY302EVI also play an important role among the soil types because of the strong correlation between the vegetation indices and LU. This also reflects the fact that multi-temporal images can differentiate soil types based on the phenological changes of vegetation and have good application prospects in soil classification.
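A loss-based importance ranking of this kind can be sketched with permutation importance, where a variable's importance is the mean increase in cross-entropy after its values are shuffled. Whether the study permuted or dropped variables is not stated, so permutation is an assumption here, and the function names, the toy model, and the data are hypothetical. The ten repeats echo the repeated evaluations mentioned above.

```python
import math
import random


def cross_entropy(predict_proba, rows, labels):
    """Mean negative log-probability that the model assigns to the true class."""
    eps = 1e-12  # guards against log(0)
    losses = [-math.log(max(predict_proba(row)[label], eps)) for row, label in zip(rows, labels)]
    return sum(losses) / len(losses)


def permutation_importance(predict_proba, rows, labels, feature, n_repeats=10, seed=0):
    """Mean increase in cross-entropy after shuffling one feature across rows."""
    rng = random.Random(seed)
    baseline = cross_entropy(predict_proba, rows, labels)
    increases = []
    for _ in range(n_repeats):
        shuffled = [row[feature] for row in rows]
        rng.shuffle(shuffled)
        permuted = [dict(row, **{feature: value}) for row, value in zip(rows, shuffled)]
        increases.append(cross_entropy(predict_proba, permuted, labels) - baseline)
    return sum(increases) / n_repeats


def toy_model(row):
    """Hypothetical classifier that looks only at land use."""
    if row["LU"] == "L2":
        return {"S2": 0.9, "S16": 0.1}
    return {"S2": 0.1, "S16": 0.9}


rows = [{"LU": "L2", "elev": float(i)} for i in range(4)]
rows += [{"LU": "L3", "elev": float(i)} for i in range(4)]
labels = ["S2"] * 4 + ["S16"] * 4
for name in ("LU", "elev"):
    print(name, permutation_importance(toy_model, rows, labels, name))
```

Because the toy model ignores elevation, shuffling it leaves the loss unchanged, whereas shuffling LU raises the loss, which is the pattern a dominant variable would show.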
In this study, a chi-squared test and an analysis of variance were also performed on the two most important variables (LU and elevation) based on the typical samples (Figure 6). The p values of LU and elevation were close to 0, and the p values of L2 and L3 were very small, indicating that these two LU types have an important impact on soil classification. In addition, the effect sizes of LU and elevation reached 0.61 and 0.84, respectively, so we can infer that elevation is the more robust predictor of ST, while LU has a slightly weaker effect in areas less affected by human influence but still has a high effect size (0.61). In particular, elevation is highly significant for the soil types other than S2 and S5, notably S14 (yellow sand soil) and S15, which is consistent with the above analysis results and also shows that elevation has an important indicative role in soil classification.
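The text does not state which effect size statistics lie behind the values 0.61 and 0.84. As an assumption, the sketch below shows two common choices: eta squared for a one-way ANOVA on a numerical variable such as elevation, and Cramér's V for a chi-squared test on a categorical variable such as LU. The function names and the example data are hypothetical.

```python
import math


def eta_squared(groups):
    """One-way ANOVA effect size: between-group sum of squares over total sum of squares."""
    values = [v for group in groups for v in group]
    grand_mean = sum(values) / len(values)
    ss_total = sum((v - grand_mean) ** 2 for v in values)
    ss_between = sum(
        len(group) * (sum(group) / len(group) - grand_mean) ** 2 for group in groups
    )
    return ss_between / ss_total


def cramers_v(table):
    """Chi-squared effect size for a contingency table given as a list of rows.

    Every row and column total must be non-zero, otherwise an expected
    count of zero causes a division by zero.
    """
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return math.sqrt(chi2 / (n * (min(len(row_totals), len(col_totals)) - 1)))


# Hypothetical elevations (m) grouped by soil type, and an LU-by-ST count table.
print(eta_squared([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]))
print(cramers_v([[10, 0], [0, 10]]))
```

Both statistics range from 0 (no association) to 1 (the grouping fully explains the variable).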

Soil Spatial Distribution Mapping
The conventional soil map and environmental covariates data were brought into the final models to generate prediction maps with a resolution of 10 m, as shown in Figure 7. For the TS models, the prediction maps of the four data-mining methods have similar soil distribution patterns, and all have high prediction accuracy. Among them, bagFDA has fewer fine-grained patches and maintains better patch integrity, which is related to LU being its main predictor variable. The main differences between the four TS models are in two soil types, S5 and S14, which may be due to a lack of relevant variables leading to confusion in classification.
Compared to the original soil map (Figure 3a), the soil distribution predicted with the TS models has a clear geographical regularity and is consistent with the distribution of the actual field landscape. Because environmental factors such as LU, elevation, and RSP, which reflect the detailed characteristics of soil in the micro-landscape and micro-topography, are included, the machine learning models can improve the prediction performance by mining the soil-landscape relationship.
With the assistance of high-resolution remote sensing images and DEM, the prediction accuracy of soil maps was greatly improved. This demonstrates that a large part of the soil-environment knowledge embedded in CSMs has long-term stability and scientific validity (e.g., [30,31]), although legacy soil maps may need to be updated (e.g., [32]). In our study, through combining data of typical samples from CSMs with modern terrain and remote sensing data, accurate and scientific soil mapping could be carried out.
In terms of different sample design methods, the soil prediction accuracy and distribution patterns of the TS models are significantly better than those of the RS models. Most of the regions in the RS prediction maps are S2 and S5, and two soil types were missing in the prediction results of the NNet model. Moreover, the soil distribution in the RS prediction maps does not correspond to the actual landscape. The mutual information values of LU and PM with ST in the typical samples were 0.69 and 0.16, respectively, whereas those of the random samples were only 0.04 and 0.08. The correlation between the random samples and the environmental covariates was therefore low, which indicates that the random samples could not represent the correct and typical soil-environment relationship, resulting in very poor prediction performance.
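The comparison of mutual information between typical and random samples can be illustrated with a minimal sketch. It assumes the plain plug-in estimate of mutual information (in nats) between two categorical variables; the estimator and normalization behind the reported values are not stated here, and the function name `mutual_information` and the toy land use and soil type labels are hypothetical.

```python
from collections import Counter
from math import log


def mutual_information(xs, ys):
    """Plug-in mutual information (nats) between two categorical sequences."""
    if len(xs) != len(ys) or len(xs) == 0:
        raise ValueError("sequences must be non-empty and of equal length")
    n = len(xs)
    joint = Counter(zip(xs, ys))  # joint counts of (x, y) pairs
    px = Counter(xs)              # marginal counts of x
    py = Counter(ys)              # marginal counts of y
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x, y) * log(p(x, y) / (p(x) * p(y))), written with counts
        mi += (c / n) * log(c * n / (px[x] * py[y]))
    return mi


# Hypothetical toy data: soil type (ST) labels shared by both sample sets.
soil_type = ["S1", "S1", "S2", "S2", "S3", "S3"]
# Land use in "typical" samples follows soil type exactly.
typical_lu = ["paddy", "paddy", "forest", "forest", "dryland", "dryland"]
# Land use in "random" samples is only loosely related to soil type.
random_lu = ["paddy", "forest", "dryland", "paddy", "forest", "dryland"]

mi_typical = mutual_information(typical_lu, soil_type)
mi_random = mutual_information(random_lu, soil_type)
print(f"typical: {mi_typical:.3f}, random: {mi_random:.3f}")
```

In this toy case the typical samples give the larger value because each land use class maps to a single soil type, which mirrors the direction of the difference reported above.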

Discussion
CSMs were compiled based on typical profiles and landscapes. Typical samples reflect typical and representative soil-landscape relationships, that is, the expert knowledge implied in CSMs (e.g., [33,34]). Under random uniform sampling, some samples fall in atypical areas and therefore cannot represent the correct expert knowledge, which leads to misclassified or missing soil types. Therefore, in order to predict soil types, it is necessary to extract typical samples based on soil-environment relationships. In addition, the soil-environmental factor histograms apply only to numerical variables, while categorical variables such as LU and PM often have a greater impact on soil types; anthropogenic factors (e.g., LU) in particular can fundamentally affect soil development. Extracting the logically corresponding regions according to ST-LU is another approach for obtaining relevant (typical) samples based on expert knowledge. Overall, extracting typical samples using ST-LU and the soil-environmental factor histograms can effectively improve the representativeness and typicality of samples and the accuracy of soil classification.
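The two-step extraction discussed above (an ST-LU filter followed by a histogram peak interval) can be sketched as follows. This is an illustration under stated assumptions, not the exact procedure of this study: the peak interval is taken to be the single most populated histogram bin, only one numerical factor (elevation) is used, and the names `peak_interval`, `extract_typical`, `allowed_lu`, and all data values are hypothetical.

```python
from collections import Counter


def peak_interval(values, bin_width):
    """Return (low, high) bounds of the most populated histogram bin."""
    bins = Counter(int(v // bin_width) for v in values)
    # Highest count wins; ties are resolved in favor of the lowest bin.
    peak_bin = max(bins, key=lambda b: (bins[b], -b))
    return peak_bin * bin_width, (peak_bin + 1) * bin_width


def extract_typical(pixels, allowed_lu, bin_width):
    """Keep pixels that pass the ST-LU filter and lie in the peak interval.

    pixels: list of dicts with keys 'st' (soil type), 'lu' (land use),
            and 'elev' (elevation, the only numerical factor used here).
    allowed_lu: dict mapping soil type -> set of compatible land use types.
    """
    # Step 1: drop pixels whose land use contradicts their mapped soil type.
    kept = [p for p in pixels if p["lu"] in allowed_lu.get(p["st"], set())]
    # Step 2: per soil type, keep pixels inside the histogram peak interval.
    typical = []
    for st in sorted({p["st"] for p in kept}):
        group = [p for p in kept if p["st"] == st]
        low, high = peak_interval([p["elev"] for p in group], bin_width)
        typical.extend(p for p in group if low <= p["elev"] < high)
    return typical


# Hypothetical toy pixels sampled from a conventional soil map.
pixels = [
    {"st": "S1", "lu": "paddy", "elev": 12.0},
    {"st": "S1", "lu": "paddy", "elev": 14.0},
    {"st": "S1", "lu": "paddy", "elev": 18.0},
    {"st": "S1", "lu": "paddy", "elev": 35.0},   # outside the peak interval
    {"st": "S1", "lu": "forest", "elev": 13.0},  # rejected by ST-LU
    {"st": "S2", "lu": "forest", "elev": 210.0},
    {"st": "S2", "lu": "forest", "elev": 215.0},
    {"st": "S2", "lu": "forest", "elev": 260.0},  # outside the peak interval
]
allowed_lu = {"S1": {"paddy"}, "S2": {"forest"}}

typical = extract_typical(pixels, allowed_lu, bin_width=10.0)
print([(p["st"], p["elev"]) for p in typical])
```

With several numerical factors, one option would be to intersect the peak intervals of each factor, but that choice is likewise an assumption of this sketch.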
The regions with high prediction uncertainty were mainly located in transition zones, because the topographic factors could not discriminate between soil types there and the soil-landscape relationships were not clearly captured. Therefore, adding landform types, which better reflect the geomorphology, to the predictor variables (e.g., [35,36]) might further improve the accuracy of classification (e.g., [37–39]). In addition, compared to numerical variables, which are based on image pixels, LU and landform types are based on patch units, which can effectively attenuate the pretzel phenomenon (fine-grained patch "noise") of soil prediction maps and maintain the spatial integrity of soil patterns. Nevertheless, the accuracy of land use maps derived from remote sensing should be estimated via validation (e.g., [40–42]), as errors in LU classification may propagate to soil type predictions.
Regarding the environmental covariates of soil mapping, in addition to topographic and climatic factors, LU has an important impact on soil type, particularly in agricultural areas (e.g., [43]). Furthermore, according to the degree of influence and the length of land use and cover change (LUCC) time series (e.g., [44,45]), we can explore the soil formation process and temporal changes. In addition, categorical variables such as LU and landforms (e.g., [46]) can be regarded as the result of the combined effects of environmental factors, which are important for soil prediction.

Conclusions
In this study, the outdated pixels in the historical CSM were excluded according to the correspondence between soil type and LU. Then, typical regions and typical samples were extracted according to the peak intervals of the soil-environmental factor histograms. Finally, the map of soil types was updated using the environmental covariates derived from high-resolution remote sensing images. Compared with the historical CSM and the prediction maps based on random samples, the prediction maps based on typical samples have higher prediction accuracy. At the same time, the resolution is greatly improved compared with the early CSMs. Therefore, typical samples capture the appropriate soil-environment relationships of soil types, and typical sampling is a very suitable strategy for soil type mapping, which can provide a reference for updating conventional soil type maps and for digital soil mapping.
LU and topography are important environmental covariates for soil prediction. By adding categorical variables, such as LU and landforms, which can be seen as the result of the comprehensive effect of environmental factors, we can improve the prediction accuracy of soil type and reduce the noise caused by continuous variables. In addition, long-term LUCC time series should also be considered in soil mapping. Based on changes in LUCC, we can explore the soil formation process over a long period of time. This would allow us to better understand the spatial and temporal changes of soil.

Figure 2. The flow path of typical sample extraction.

Figure 4. Permutation based on variable importance measures for the explanatory variables. Note: The box plot indicates the model error due to 10 variable importance measures, and the bar height indicates the mean of the 10 importance measures.

Figure 5. The variable importance of different soil types in the TS_RF model.

Figure 6. Analysis of the first two important variables based on typical samples.

Figure 7. The spatial distribution maps of soil type predictions.

Table 1. Types of predictor variables and analysis of variance.
Note: ***, significant correlation at the 0.001 level; *, significant correlation at the 0.05 level. The analysis values of LU and PM are mutual information with soil type (ST). CNBL, channel network base level; PlC, plan curvature; RSP, relative slope position; TWI, topographic wetness index; DVI, difference vegetation index; RVI, ratio vegetation index; EVI, enhanced vegetation index; SRI, soil red index; SCI, soil color index; LU, land use; PM, parent material.

Table 2. The overall accuracy and Kappa index of different sampling methods and models.
Note: Models with the highest prediction accuracy are marked in bold, and models with the second highest accuracy are marked in bold italics.
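For reference, the two metrics named in the Table 2 caption can be computed from paired reference and predicted labels as in the minimal sketch below. It assumes the standard definitions of overall accuracy and Cohen's Kappa; the function name `overall_accuracy_and_kappa` and the toy labels are hypothetical and do not reproduce any value from this study.

```python
from collections import Counter


def overall_accuracy_and_kappa(y_true, y_pred):
    """Return (overall accuracy, Cohen's Kappa) for two label sequences."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("sequences must be non-empty and of equal length")
    n = len(y_true)
    # Observed agreement: share of validation points predicted correctly.
    p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # Expected chance agreement from the marginal class frequencies.
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    p_e = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    # Kappa is undefined when chance agreement is already perfect.
    kappa = (p_o - p_e) / (1 - p_e) if p_e < 1 else float("nan")
    return p_o, kappa


# Hypothetical validation labels (soil type codes).
y_true = ["S1", "S1", "S2", "S2", "S5", "S5"]
y_pred = ["S1", "S1", "S2", "S5", "S5", "S5"]

oa, kappa = overall_accuracy_and_kappa(y_true, y_pred)
print(f"overall accuracy = {oa:.3f}, kappa = {kappa:.3f}")
```

Kappa discounts the agreement expected by chance, so it is lower than overall accuracy whenever a few dominant classes account for most of the correct predictions.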