Mapping at 30 m Resolution of Soil Attributes at Multiple Depths in Midwest Brazil

The Midwest region in Brazil has the largest and most recent agricultural frontier in the country where there is no currently detailed soil information to support the agricultural intensification. Producing large-extent digital soil maps demands a huge volume of data and high computing capacity. This paper proposed mapping surface and subsurface key soil attributes with 30 m-resolution in a large area of Midwest Brazil. These soil maps at multiple depth increments will provide adequate information to guide land use throughout the region. The study area comprises about 851,000 km2 in the Cerrado biome (savannah) in the Brazilian Midwest. We used soil data from 7908 sites of the Brazilian Soil Spectral Library and 231 of the Free Brazilian Repository for Open Soil Data. We selected nine key soil attributes for mapping and aggregated them into three depth intervals: 0–20, 20–60 and 60–100 cm. A total of 33 soil predictors were prepared using Google Earth Engine (GEE), such as climate and geologic features with 1 km-resolution, terrain and two new covariates with 30 m-resolution, based on satellite measurements of the topsoil reflectance and the seasonal variability in vegetation spectra. The scorpan model was adopted for mapping of soil variables using random forest regression (RF). We used the model-based optimization by tuning RF hyperparameters and calculated the scaled permutation importance of covariates in R software. Our results were promising, with a satisfactory model performance for physical and chemical attributes at all depth intervals. Elevation, climate and topsoil reflectance were the most important covariates in predicting sand, clay and silt. In general, for predicting soil chemical attributes, climatic variables, elevation and vegetation reflectance provided to be the most important of predictive components, while for organic matter it was a combination of climatic dynamics and reflectance bands from vegetation and topsoil. The multiple depth maps showed that soil attributes largely varied across the study area, from clayey to sandy, suggesting that less than 44% of the studied soils had good natural fertility. We concluded that key soil attributes from multiple depth increments can be mapped using Earth observations data and machine learning methods with good performance.


Introduction
Soil is a foundation for production systems, having capabilities for stabilizing the ecosystem [1].Reliable spatial soil information can improve natural capital assessment, becoming important for food production, in large countries or emerging economies where the major demographic growth is expected [2].Soil mapping is expensive and time-demanding, consequently performing adequate maps in large areas takes several years and require significant economic resources.Such fact is observed in countries like Brazil, which is covered by small scale soil maps, mostly developed by Brazilian government institutions using RADAMBRASIL (Radar on Amazon and Brazil) project data (1:1,000,000 or nominally 2 km) [3].In this case, such maps are not capable of supporting any decision making in regional or local scales.
The huge volume of quantitative (pedometric) data required in the production of soil attribute maps, for large areas, limits the feasibility of conventional (traditional) manual (expert-based) soil mapping [3,4].Several key soil factors are still not fully represented by classical environmental covariates, being necessary to develop new covariates that provide improved proxies for describing spatial soil variations.Advances in Earth observation (satellite images and products), digital elevation models and digital soil mapping (DSM) frameworks, based on machine learning and cloud-based computing, might be a solution to the lack of adequate soil data [5].An Earth observation product that has raised attention in DSM is the satellite image.Such data can retrieve medium-to high-resolution information and are easily acquired.Recently, studies have employed multi-temporal images in soil assessment and mapping.Such data provides measurements of topsoil reflectance, which are directly related to clay content, organic matter, mineralogy, moisture and soil color [6].The synergy between satellite images and DSM is described by Diek et al. [7], who performed a multi-temporal composite from the Airborne Prism Experiment (APEX).By overlapping images, the authors doubled the amount of bare surface pixels in the scene and presented an enhanced spatial representation of soil surface.Later, Diek et al. [8] developed a method for identifying the least-vegetated pixels (e.g., barest pixel) in a dense Landsat time series.Such data was used to estimate soil attributes and evaluate the contribution of remote sensing (RS) to conventional and digital soil-mapping procedures.Similarly, Rogge et al. [9] proposed the Soil Composite Mapping Processor (SCMaP), which is an approach able to use per-pixel compositing to address the issue of limited soil exposure.Another bare surface composition technique was proposed by Demattê et al. [10], called the Geospatial Soil System (GEOS3).These authors validated the method by comparing the bare surface data (described as SySI) to laboratory spectral measurements, and found a canonical correlation of 0.93.Later, Fongaro et al. [11] used such composite images to digitally map soils from southeastern Brazil.These authors described an expressive enhancement in clay content's digital mapping when employing SySI and terrain derivatives.The R 2 and root mean squared error (RMSE) improved from 0.64 and 93.44 g kg −1 to 0.83 and 65.36 g kg −1 , respectively.Finally, Mendes et al. [12] indicated that besides surface layer mapping, SySI can also aid in the prediction of soil subsurface attributes.
The prediction in DSM is usually based on machine learning techniques, which fit models for the spatial prediction of soil variables (i.e., maps of soil attributes and classes at different resolutions) [4].While machine learning supports the soil spatial predictions [13], cloud-based computing provides a superior architecture for the execution of such complex algorithms [14].These techniques are very attractive, once it result in the automation of processes, reducing overall soil data production costs, combining statistics, data science, soil science, physical geography, remote sensing (RS), geoinformation science and a number of other sciences [3,5,10,13].
A brief search in literature regarding the terms "soil" and "machine learning" resulted in more than 72,000 publications, from which 7200 items were published in the first half of 2019 and 4000 discuss random forest (RF) algorithms.The RF algorithm was first introduced by Breiman (2001) and became a standard nonparametric classification and regression tool.The method establishes prediction rules based on various types of predictor variables, without making any prior assumption on the form of their association with the response variable [15].The RF is one of most popular algorithms in DSM, being employed in several soil mapping studies [5,[16][17][18][19].Many inter-comparisons between machine learning algorithms are described in literature, and in most cases, authors indicated RF as the most adequate algorithm for digitally mapping soils.Keskin et al. [20] compared many models to quantify stochastic and/or deterministic components of soil carbon (C) pools.The prediction performance indicated the RF as the best algorithm.The covariables that best described variations in C pools were the biotic and hydro-pedological ones.Lithologic and climatic factors had a reasonable influence in C predictions, while topographic factors did not contribute to soil C modeling.Similarly, Nussbaum et al. [21] evaluated six approaches for digitally mapping 14 soil attributes at four depths.They found small differences in predictive performances, but RF was often the best among all methods.Hengl et al. [22] mapped 14 soil properties from African soils, combining quality-controlled point data and a large number of covariates.The random forest was the best method, outperforming linear regression with an average decrease of 15%-75% in RMSE across soil properties and depths.
In Brazil, there are no soil attribute maps with complete coverage across the Brazilian Midwest, which could support management and policy decisions.This region has the largest and most recent agricultural frontier in Brazil [23], which contributes about 34% and more than 10% to the agricultural production and gross domestic product of the country, respectively [24].Thus, we intend to produce up-to-date maps of surface and subsurface key soil attributes in a large extension of the Midwest of Brazil.These maps at multiple depth increments might provide adequate information to conduct account for the multi-functionality of soil in the region.Therefore, our aims are to (a) define composite images (described hereafter as SySI and SyVI), which describes the reflectance variability of bare surfaces and natural vegetation; (b) employ SySI and SyVI along with terrain derivatives, geologic and climate variables as covariates in the digital mapping of key soil attributes in the Midwest Brazil; (c) evaluate the performance of the random forest algorithm, implemented in a cloud-based computing, to describe the spatial variability of soils from the study site; (d) identify the model covariates that were most relevant to describe the soil variability in Midwest Brazil.We expect that Earth observation data and machine learning, coupled with Brazilian available legacy soil datasets, to promote a favorable framework to produce accurate soil predictions for this important agricultural region.We assume that it is possible to map physical and chemical soil attributes for three standard depth intervals (0-20, 20-60 and 80-100 cm) with 30 m-resolution across the Midwest region in Brazil.

Study Area and Soil Data
The study area comprises about 851,000 km 2 in the Cerrado biome (savanna) in the Brazilian Midwest (Figure 1), with extensive plateaus covered by Cerrado vegetation and gallery forest.The climate is tropical humid, which has two well-defined seasons, wet in summer and dry in winter, with annual precipitation ranging from 1200 to 1800 mm.According to the 1:1,000,000-scale map of pedology [25], the Ferralsols, Lixisols, Plinthosols, Arenosols and Regosols [26] are dominant soils of the region, which developed from highly diversified lithologies, consisting of volcanic, metamorphic, and sedimentary rocks, who reworked surface materials [27].
We obtained soil data from 7908 sites of the Brazilian Soil Spectral Library (BSSL) [28] and 231 of the Free Brazilian Repository for Open Soil Data (FEBR) [29].The BSSL started in 1995 as a collaborative network formed by several institutions all over Brazil.The FEBR contains legacy soil observations data collected by Brazilian government agencies since the 1960s.
We selected nine soil attributes for mapping in the study area (Figure 1), such as sand, silt and clay contents, organic matter (OM = organic carbon × 1.72), pH measured in water (pH H 2 O) and in potassium chloride (pH KCl), cation exchange capacity (CEC = Ca 2+ + Mg 2+ + K + + H + + Al 3+ ), and base saturation (V% = Ca 2+ + Mg 2+ + K + × 100 ÷ CEC) and aluminum saturation (m% = Al 3+ × 100 ÷ Ca 2+ + Mg 2+ + K + + Al 3+ ) at 0-20 cm, 20-60 and 60-100 cm depth intervals.These soil attributes are commonly used (as key criteria) to guide agricultural recommendations, to evaluate the locations most suitable for farming and delineation of soil management zones.According to [30], maximum rooting depth of crops by far can exceed 100 cm soil depths.Tus, soil attributes from 0 to 100 cm depth can affect plant growth and yield.When exploring the complete dataset, we checked for possible duplicated data and typos.To remove outliers from the dataset before modelling, we used more than one condition by nesting IF functions in Microsoft Excel.For example, to remove sand, silt and clay contents smaller or greater than 1000 g kg −1 [=IF(SUM(Sand;Silt;Clay)=1000);"OK";"REMOVE")], or testing IF the relationships (V% vs. pH vs. m%, OM vs. CEC) were coherent.A large proportion of the data had information based on the laboratory method of Embrapa [31], while the remaining were transformed to the same standard units.
[=IF(SUM(Sand;Silt;Clay)=1000);"OK";"REMOVE")], or testing IF the relationships (V% vs. pH vs. m%, OM vs. CEC) were coherent.A large proportion of the data had information based on the laboratory method of Embrapa [31], while the remaining were transformed to the same standard units.
Finally, we performed a chord diagram based on Pearson correlation to check weighted relationships between soil attributes using the circlize package version 0.4.8 [32] in the R software [33].In that diagram, each soil attribute is represented by a fragment on the outer part of the circular layout, where the size of the connections is proportional to the value of the correlation.
The framework used for digital mapping of soil attributes (Figure 2), was fully implemented via the cloud-based platform of Google Earth Engine (GEE) [14] and the R environment for statistical computing [33].
. Figure 1.Spatial distribution of soil observations displayed over a 1:1,000,000-scale map of the main soil classes of the study area [25].Soil classes were defined according to World Reference Base [26].Finally, we performed a chord diagram based on Pearson correlation to check weighted relationships between soil attributes using the circlize package version 0.4.8 [32] in the R software [33].In that diagram, each soil attribute is represented by a fragment on the outer part of the circular layout, where the size of the connections is proportional to the value of the correlation.
The framework used for digital mapping of soil attributes (Figure 2), was fully implemented via the cloud-based platform of Google Earth Engine (GEE) [14] and the R environment for statistical computing [33].

2.2.Preparing Soil Covariates
Soil covariate layers can be used as predictors ("independent variables") in the statistical modelling.Their preparation is time and resources consuming, involving a huge image processing to transform large environmental databases into relevant predictors for machine learning of soil attributes.Therefore, efforts to produce appropriate predictors to explain the spatial distribution of soil attributes (at detail and generalization) increases the accuracy of the models.Various covariates (e.g.climate, terrain attributes and RS data) representing soil state factors have been widely used in statistical models to predict soil texture, bulk density, organic carbon, nutrients (Ca, Mg, K, Na, N, P), available water capacity, pH and CEC [4, 17,22,[34][35][36][37].
For mapping the selected soil attributes, a group of covariates were obtained (Table 1) to act as proxies for describing soil (s), climate (c), vegetation (o), relief (r), parent material (p), age of surface (a) and spatial position (n) in the scorpan prediction model [3], using the GEE [14].A robust approach, for fitting single models using input data at multiple resolutions is to combine covariates (with different grid cell size) to a single, common resolution.However, no information gain is yield to the

Preparing Soil Covariates
Soil covariate layers can be used as predictors ("independent variables") in the statistical modelling.Their preparation is time and resources consuming, involving a huge image processing to transform large environmental databases into relevant predictors for machine learning of soil attributes.Therefore, efforts to produce appropriate predictors to explain the spatial distribution of soil attributes (at detail and generalization) increases the accuracy of the models.Various covariates (e.g., climate, terrain attributes and RS data) representing soil state factors have been widely used in statistical models to predict soil texture, bulk density, organic carbon, nutrients (Ca, Mg, K, Na, N, P), available water capacity, pH and CEC [4, 17,22,[34][35][36][37].
For mapping the selected soil attributes, a group of covariates were obtained (Table 1) to act as proxies for describing soil (s), climate (c), vegetation (o), relief (r), parent material (p), age of surface (a) and spatial position (n) in the scorpan prediction model [3], using the GEE [14].A robust approach, for fitting single models using input data at multiple resolutions is to combine covariates (with different grid cell size) to a single, common resolution.However, no information gain is yield to the downscaled covariate.We defined the target grid resolution to 30 m that was in accordance with the majority of covariates used.We consider the inverse distance weighting (IDW) interpolator the most appropriate in our case to downscale the 1 km covariates (climate and geology) to 30 m resolution.This is because IDW attenuates the influence of distant points by its use of inverse distance weight given an assumption of positive spatial autocorrelation [38], and also because it is easy to be implemented and available in GEE [14].Knowing that soils were formed in response to different forming processes operating over different ranges of distances or scales [3,39], the use of multi-resolution covariates may help the prediction models to capture the multi-scale soil spatial variations.

Climate Data
Annual temperature average, range and seasonality, and annual precipitation and seasonality values were obtained from the WorldClim dataset [40] at a spatial resolution of 1 km, and then were downscaled to 30 m pixel resolution by IDW.These data layers derived from numerous weather stations data interpolated by thin-plate smoothing spline, using latitude, longitude, and elevation as independent variables [40].The WorldClim is the highest resolution continuous climate database available for the study region.

Relief and Geology Data
We derived local terrain attributes, including elevation, slope, aspect, horizontal and vertical curvature and topographic position index (TPI) from the 30 m ALOS digital elevation model [41] within GEE.Slope, aspect and curvatures, were calculated from the partial derivatives of terrain using a 3 × 3 moving window [42].The TPI was calculated by subtracting the elevation in meters at a given location (or cell) to the mean elevation of all cells within a neighborhood specified by a radius of 3 km.Highly positive values are associated with peaks and ridges, while highly negative values are associated with valley bottoms and sinks.We obtained the density of geological lineaments by counting the meters of structural lines obtained from a 1:1,000,000-scale map [27] in 1 km grids, and then transformed to raster and downscaled to 30 m pixel resolution by the IDW method.

Landsat-Derived Data
Data.The Landsat program has been observing the Earth continuously from 1972 through the present day.We used Landsat surface reflectance data (Tier 1, Collection 1) of different sensors covering the study area from 1982 to 2019, including the Thematic Mapper (TM, Landsat 4-5), the Enhanced Thematic Mapper Plus (ETM+, Landsat 7), and the Operational Land Imager and Thermal Infrared Sensor (OLI/TIRS, Landsat 8) with 16 days revisiting time and 30 m resolutions [43,44].Considering these products are gridded into common characteristics (resolution, projection, spatial extent, scale values and spectral ranges), we performed an inter-sensor harmonization to combine their separated collections into a single dataset.The bands of each sensor, positioned in equivalent spectral regions, were matched into a common name (e.g., Blue, Green, Red, NIR, SWIR 1 , SWIR 2 and LST) using the specific band number (Table A1).The quality assessment bands were used to remove cloudy and cloud shadow pixels.We calculated the land surface temperature (LST, in degrees celsius scaled from 0 to 10,000) for each image in three steps: (1) we calculated the normalized difference vegetation index (NDVI, Equation (1)); (2) estimated the land surface emissivity (LSE, Equation (2)) using the NDVI-based method [45]; (3) and converted the brightness temperature (BT) data to LST using the Stefan-Boltzmann law expressed in Equation (3) [46].This approach enabled to obtain LST from the available Landsat data in GEE.

NDVI =
NIR − Red NIR + Red (1) SySI.We implemented the Geospatial Soil Sensing System (GEOS3) [10] into GEE to generate a 30 m synthetic soil image (SySI) using the harmonized Landsat data.The GEOS3 is a data mining algorithm that uses classifications rules to identify soil at pixel level on denser satellite time series.The rules are a set of spectral indices and thresholds that mask out non-soil pixels by flagging soil pixels as a valid value and the remaining pixels as unavailable information (NA).We used NDVI (Equation ( 1)), normalized burn ratio 2 (NBR2, Equation (4)) and soil spectral tendency (Equation ( 5)).The spectral indices thresholds were defined as −0.15 < NDVI < 0.25 and −0.15 < NBR2 < 0.15.Thus, selected soil pixels were aggregated into a single composite (SySI) by computing band-to-band the median of the reflectance values, over the time series.For our study, SySI represents the soils surface of agriculture areas and other natural surfaces with low vegetation cover and rock outcrops, when the vegetation was absent or almost absent, typical for savanas.The GEOS3 has also been implemented in different regions in Brazil for mapping soil variables [11,12,47].Similar approaches were developed to produce bare soil composites based on Landsat data and accurately employed for soil mapping and management in Germany [9] and the Swiss Plateau and Europe [8].
SyVI.To take advantage of the spatio-temporal variation of vegetation that might be linked to soil distribution, the GEOS3 [10] was adapted into GEE to produce a 30 m synthetic vegetation image in the wet (SyVI w ) and dry (SyVI d ) seasons by constraining the harmonized Landsat data.We constrained the wet and dry seasons from November to March and from May to September, respectively, between 1982 and 1994 when natural vegetation predominated over the landscape.In this work, SyVI represents potential natural vegetation (PNV), without or with minimal human intervention, that could be used as an indirect method for estimating soil variables [3].The PNV classification rules were constructed by combining the NDVI (Equation ( 1)), NBR2 (Equation (4)), the vegetation spectral shape index (VSI, Equation ( 6)) and soil index (SI, Equation ( 7)).The VSI and SI were elaborated by visual interpretation of the spectral shape of different types of vegetation collected from Landsat images using the MapBiomas dataset as a reference [23].To retrieve PNV reflectance in the study area, the thresholds were adjusted to NDVI ≥ 0.20, NBR2 ≥ 0.18, VSI < 11,000 and SI > 2. Therefore, selected PNV pixels were aggregated into a single composite (SyVI w , SyVI d ) by computing band-to-band the median of the reflectance values over the time series for both seasons.
Kriging.To achieve 100% coverage of the products, we predicted the gaps by the ordinary kriging interpolation method in GEE.Thus, the spectral values were randomly sampled from the composites (SySI, SyVI w and SyVI d ) using two observation per km 2 .For each band, we fitted the spherical model to the empirical semivariogram to obtain the parameters (range, sill, nugget and maximum distance) and make the spatial prediction of the values [48].Finally, we overlaid the composites on top of kriged images and merged them to obtain continuous images, which preserved the original values and incorporated the kriged where a gap occurred.The spatially continuous images (original + kriged), named SySI, SyVI w and SyVI d , were used as covariates for mapping soil attributes (Table 1).
Quality.We assessed the kriging results by sampling band-to-band 1 value km −2 on overlapped areas between the synthetic image (original) and kriged (non-merged to synthetic image) and calculating the Pearson's correlation for the seven spectral bands.We checked the quality of the continuous composites (original + kriged) by assessing the reflectance values on the spectral profile and the soil line method [49], and the spatial consistency with land cover classification [23].The soil line uses a scatterplot to display the reflectance between Red and NIR spectral regions.Both methodologies can be used to analyze the spectral patterns of the composites and to determine if they are consistent with the classical patterns of soils and vegetation.

Random Forest (RF) Regression
For DSM, we have implemented a state factor approach (scorpan) [3] to model the distribution of soil attributes (Table 2), taking all data on factors of soil formation (Table 1) into account at the same time and letting the decision tree algorithms reveal the patterns.Under that DSM framework, we select RF regression for soil predictions.RF is a tree-based machine learning algorithm which consists of many decision or regression trees where each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the data [50].The output of the model is an average of all the regression trees.The RF method is popular in DSM because it has proven to be efficient mapping soil attributes across a wide range of data scenarios and scales of soil variability [4, 16,17,21,22,35,51].

Calibration and Model Tuning
To generalize patterns and to minimize possible artifacts in the final maps, the covariates (Table 1) were smoothed computing the median values within a 4 × 4 moving window.At each soil observation (Table 2), the values were extracted and used as input data for calibrating RF regressions [50] using the ranger package version 0.11.1 [52] in the R software [33].Usually, most modeling studies employed default hyperparameters, which can prompt models to under or over fit and misinterpretations of results.Thus, to improve the performance of RF models [15], we performed a grid search for (optimal) tuning hyperparameters investigating a range of values, where mTry was 6, 24, 33, minimum node size was 5, 20, 50.The mTry regulates the number of variables that can be randomly sampled in each split of the trees.The minimum node size controls the tree depth by setting the minimal number of samples for the terminal nodes.We used 500 trees for stable a variable estimates [15].

Validation and Variable Importance
In order to evaluate the models' performance for the prediction of each soil attribute at each of the three depths a 10-fold cross-validation was used.The observations were split into 10-folds by using the caret package version 6.0-84 [53].According to Padarian et al. [13], the k-cross-validation is a more stable method, where the dataset is partitioned into k groups or folds, where k − 1 groups are used for training and 1 group for validation, repeating the training k times, each with a different validation group.For each predictive model, we derived the RMSE, coefficient of determination (R 2 ) and ratio of the performance to inter-quartile distance (RPIQ = (Q3 − Q1)/RMSE), where Q1 and Q3 are the 1st (25%) and 3rd (75%) quartiles.The RPIQ is based on prediction error and quartiles, which better represents the spread of the population and easier comparable across model validation studies.Generally, smaller values of RMSE and larger R 2 and RPIQ indicate better model performance [54].We selected the optimized model by the minimum RMSE of the 10-fold cross-validation [15,51].
To quantify the most influential covariates used in the models, the scaled permutation importance was calculated for each soil attribute prediction [50], which were graphically displayed using the folds estimates.

Prediction of Continuous Soil Attributes
The optimized models were used to predict the spatial distribution of soil variables (Table 2) using RF optimized hyperparameters in GEE.In this study, the error or inaccuracy were not spatially examined as maps, because the GEE did not supported bootstrapping technique and the GEE' RF in probability mode only works for binary (presence/absence) datasets [14].
For practical assessment and to validate the results to their possible parent materials, we grouped lithologies using the predicted soil attribute maps.Thus, we obtained a new outcome containing geological domains from a pedological viewpoint.Lithological data was obtained from a legacy geological 1:1,000,000-scale map [27] and used (each geometry) for sampling the mean value from 0 to 100 cm depth interval of predicted soil attributes.Afterward, we clustered the lithologies (geometries) into geological domains using the averaged soil value, where the main lithotypes were identified according to metadata (table data) of the geological map.

Summary and Relationships between Soil Attributes
The observations aggregated into the soil dataset (showed in Figure 1 and summarized in Table 2) covered the main peological classes that occurred.Overall, the mean clay content ranged from around 271 g kg −1 at the surface to 313 g kg −1 in the 60-100 cm depth interval.At the surface, clay content ranged from 10 to 920 g kg −1 , while at deeper layers, the maximum values were 930 and 950 g kg −1 (Table 2).There is relatively little silt in the studied soils, and the mean values does not vary much with depth.Silt content ranged from around 77 g kg −1 (0-20 cm) to 67 g kg −1 (60-100 cm).The silt data at the three depths was positively skewed.The average sand content ranged from 652 g kg −1 at the surface to 619 g kg −1 in the 60-100 cm depth.At all depth intervals, values ranged from no sand to 975 g kg −1 (Table 2).
The average OM content ranged from around 21 g kg −1 at the surface to 9 g kg −1 in the 60-100 cm depth.At all depth intervals, the mean values of pH H 2 O were greater than pH KCl, ranging from 5.6 (0-20 cm) to 5.3 (60-100 cm).The average pH KCl ranged from 4.9 at the surface to 4.8 in the 60-100 cm depth.At the surface, CEC ranged from 2 to 641 mmol c kg −1 , while at deepest, the maximum values were between 1 and 582 mmol c kg −1 (Table 2).At all depth intervals, V and m% values ranged from 0 to 100%, with averages varying inversely with depth (Table 2).
Correlation between soil attributes had similar patterns for each depth interval (Figure 3a-c).Sand and clay presented the highest negative correlation and a little lower for silt at all depth intervals.The pH H 2 O, pH KCl and V% positively correlated in the three depths, and negatively with m%.OM positively correlated with CEC, and V negatively with m% at each of the three depth intervals.Chemical attributes weakly correlated among each other at shallow depths became stronger with increasing depth intervals (Figure 3d), whereas the strongest slightly decreased at deeper depths.

Synthetic Soil Image (SySI) and Synthetic Vegetation Image (SyVI)
The harmonized Landsat data and the data mining algorithms implemented in GEE enabled to obtain a SySI with 443,000 km 2 (52%) coverage, which was later kriged to close the gaps.The bare soil frequency (data not presented) at each locations ranged from 1 to 1303 pixels and average of 12 pixels between the 1982 and 2019 years.For potential natural vegetation from 1982 to 1994, we obtained a SyVI w with 814,175 km 2 (95.2%) and a SyVI d with 847,426 km 2 (99.1%) coverage during the wet and dry seasons, respectively.The PNV frequency (at every locations) in the wet season ranged from 1 to 85 pixels, and mean values of 6 pixels.During the dry season, the PNV frequency had average values of 19 pixels, ranging from 1 to 185 pixels.The kriged images had satisfactory correlation (Pearson) with the originals, which presented average values of 0.66 (SySI), 0.78 (SyVI w ) and 0.81 (SyVI d ) for the seven spectral bands (Figure 4a-c).The full coverage SySI, SyVI w and SyVI d with the NA gaps interpolated by kriging (original + krikeg) were displayed in Figure 4a-c.The soil line for bare soil (Figure 4d) had an adjustment near to the 1:1 trend line, with highly correlated values (R 2 of 0.95), while raw (unprocessed) pixels extracted from a median composite (between 2017 and 2019) had a scatter distribution.For PNV, the soil line had clustered values with lower reflectance intensities (Figure 4e-f) compared to the raw pixels sampled from the median composite (between 2017 and 2019) over croplands mapped by MapBiomas [23].The mean NIR reflectance was higher for SyVI w (2,471) than for SyVI d (2,123), while the mean Red reflectance was higher for SyVI d (611) than for SyVI w (536).
The spectral signature for bare soil (Figure 4g) had a constant ascendant pattern from Blue to SWIR 1 regions, while the PNV (Figure 4h-i) had an opposite overall pattern, ascending from Blue to NIR and then descending to SWIR 2 .The SySI averaged a LST of 38.7 • C (Figure 4g), higher than for the SyVI w and SyVI d , with mean values of 22.58 and 23.04 • C (Figure 4h-i), respectively.
Remaining soil covariates (Table 1) were placed in the Appendix A as Figure A1.

Model Assessments
Table 3 shows the performance of optimized RF regression models on calibration ( cal ) and validation ( 10cv ) sets.Predicted vs observed scatterplots from 10-fold cross-validation derived from the models of sand, silt and clay were placed in the Appendix A as Figure A2, while the remaining soil attributes are displayed in Figure A3.We obtained decreasing RMSE and increasing RPIQ with increasing depth interval, both in calibration and validation data.The relatively low positive values of RMSE 10cv suggested that the soil variables were slightly overestimated for all the models.On average, RPIQ 10cv and R 2 10cv increased slightly from 0-20 to 60-100 cm depth, while decreased for silt and CEC.Sand and clay presented the best model' predictive capacity with the highest RPIQ 10cv (from 3.8 to 4.3), followed by m% > pH KCl > OM > V% > pH H 2 O > CEC > silt, ranging from 1.2 to 3.0 (Table 3).
Overall, the amount of variation explained by the spatial prediction models in validation were reasonable at all depths, with higher values for sand and clay (R 2 10cv from 0.81 to 0.85) followed by silt (R 2 10cv from 0.64 to 0.66).Chemical attributes were best explained for pH KCl (R 2 10cv from 0.19 to 0.64), m% (R 2 10cv from 0.26 to 0.56), OM (R 2 10cv from 0.30 to 0.53) and CEC (R 2 10cv from 0.40 to 0.48).The poorest performances were for V% (R 2 10cv from 0.18 to 0.36) and pH H 2 O (R 2 10cv from 0.21 to 0.35) (Table 3).We observed that R 2 10cv and RPIQ 10cv had a positive relationship in most models (Figure 5), where higher values indicate greater robustness in predictive capability.Models with poor performance exhibited a scatterplot (predicted vs. observed) with higher dispersion and weaker trend, while good models showed more distributed values following a stronger linear trend (Figures A2  and A3). .

Best Predictors
Figure 6 shows the permutation importance (%) of all the 33 covariates in RF models for the spatial prediction of 9 soil attributes at three depth intervals.From a general view (global values, Figure 6), the results indicated that the most important covariates were elevation, the five climate layers, SWIR2-NIR-Blue reflectance bands derived from SySI, ranging their estimates from 22% to 42%.The importance values did not vary much with depth, except for OM and CEC, which had slight differences.That is because the regional patterns from the coarser covariates could help the RF models in stratifying the region at the coarser level, while the more detailed information from the finer resolution covariates can represent the variability within the regional patterns.

Best Predictors
Figure 6 shows the permutation importance (%) of all the 33 covariates in RF models for the spatial prediction of 9 soil attributes at three depth intervals.From a general view (global values, Figure 6), the results indicated that the most important covariates were elevation, the five climate layers, SWIR 2 -NIR-Blue reflectance bands derived from SySI, ranging their estimates from 22% to 42%.The importance values did not vary much with depth, except for OM and CEC, which had slight differences.That is because the regional patterns from the coarser covariates could help the RF models in stratifying the region at the coarser level, while the more detailed information from the finer resolution covariates can represent the variability within the regional patterns.
Elevation, climate and soil reflectance derived from SYSI (NIR, SWIR 2 and LST) were the most important covariates in predicting sand, clay and silt (Figure 6).In general, for predicting soil chemical attributes, climatic variables, elevation and SYSI (NIR and SWIR 2 bands) seemed to be the most important, while for OM it was a combination of climatic dynamics and reflectance bands derived from SYSI, i.e., SWIR 2 , Blue and SWIR 1 .
Furthermore, the results indicated that PNV reflectance and temperature derived from SyVI w and SyVI d , geological lineaments density, topographic position index and slope were mid important (from 12% to 21%) for whole soil attributes at all depths, with slightly higher values for the chemicals such as OM, pH, CEC and V% (Figure 6).In all cases, the least important were aspect, horizontal and vertical curvatures, which had an average importance of less than 10%.Elevation, climate and soil reflectance derived from SYSI (NIR, SWIR2 and LST) were the most important covariates in predicting sand, clay and silt (Figure 6).In general, for predicting soil chemical attributes, climatic variables, elevation and SYSI (NIR and SWIR2 bands) seemed to be the most important, while for OM it was a combination of climatic dynamics and reflectance bands derived from SYSI, i.e.SWIR2, Blue and SWIR1.
Furthermore, the results indicated that PNV reflectance and temperature derived from SyVIw and SyVId, geological lineaments density, topographic position index and slope were mid important (from 12% to 21%) for whole soil attributes at all depths, with slightly higher values for the chemicals such as OM, pH, CEC and V% (Figure 6).In all cases, the least important were aspect, horizontal and vertical curvatures, which had an average importance of less than 10%.

Soil Maps at Multiple Depths
Figure 6a-c shows the maps of sand, silt and clay contents (g kg -1 ) in each of the three depth intervals.These maps were made publicly available for download as integer GeoTIFF format at 250 m-resolution [55].The soils of the study region were dominated by high to moderate amounts of sand, moderate clay and little silt.Sand and clay maps were inversely distributed in the region (Figure 7a, c), due to their negative correlation (Figure 3).The largest sand contents were located southeast of the study area, decreasing gradually to the north and severely to the east.The silt and

Soil Maps at Multiple Depths
Figure 6a-c shows the maps of sand, silt and clay contents (g kg −1 ) in each of the three depth intervals.These maps were made publicly available for download as integer GeoTIFF format at 250 m-resolution [55].The soils of the study region were dominated by high to moderate amounts of sand, moderate clay and little silt.Sand and clay maps were inversely distributed in the region (Figure 7a,c), due to their negative correlation (Figure 3).The largest sand contents were located southeast of the study area, decreasing gradually to the north and severely to the east.The silt and clay maps followed a very similar spatial distribution (Figure 7b,c), due to their positive correlation (Figure 3).There was more clay and silt in the east highlands of the study area, while a decreasing value was observed on the west lowlands.
In general, mean sand content decreased with depth from around 522 g kg −1 in the surface to 467 g kg −1 in the deepest layer, while at the same depths, mean clay content increased from 336 to 400 g kg −1 .Average silt content remained relatively uniform with increasing depth (Figure 7).
For each depth, a map representing the sum of the sand, silt and clay contents (Figure 7d) were used to show where the estimates of soil texture diverged from 1000 g kg −1 .On average, 87% of the predicted summed for the three depths ranged from 800 to 1200 g kg −1 .Under and overestimates in soil texture were visually more related to the spatial patterns of the silt map (Figure 7b).Maps of the soil chemical attributes (Table 2) for all depths were placed in the Appendix A, as Figure A4.
clay maps followed a very similar spatial distribution (Figure 7b, c), due to their positive correlation (Figure 3).There was more clay and silt in the east highlands of the study area, while a decreasing value was observed on the west lowlands.

Soil Data
The soil data (BESB and FEBR) aggregated into depth intervals yielded consistent information for mapping of the selected key soil attributes (Table 2).Data showed that soil attributes and their relationship among each other were depth-dependent for different soil depth increments.In general, depth gradients in soil attributes can be assigned to the influence of soil forming factor [3].The pedogenic processes caused different correlations among soil properties at different depths [56], thus providing differences in nutrient release for vegetation to which they may reply back with different root systems.Similar patterns were described by Goebes et al. [57], who distinguished between stable (e.g., sand, silt and clay) and dynamic (e.g., soil pH, nutrient contents, base saturation) soil attributes throughout the whole soil profile.
Despite soil observations being spatially dense and represented the main soil classes that occurs in the region (Figure 1), there are still some gaps in terms of spatial coverage.Natural vegetation and pasturelands are still under-represented.Indeed, there are many soil observations still unavailable in the region that could become open access and be used to increase spatial coverage and improve predictions [29].Considering our large study area, we performed relatively low-cost mapping because of the use of legacy soil observations and comparatively fewer new soil observations, free RS-based covariates and the open-source R software and Google Earth Engine cloud-based platform, for data processing and visualization.

Machine Learning
RF was fairly satisfactory for DSM across the large extent of the study area (~851,000 km 2 ), where the soil observations cannot be fully representative and the relationship between soil attributes and covariates is usually complex and non-linear [4,50].Therefore, the regression models used covariates that captured spatial patterns from broader to more local levels across different landscapes [5].Our validation results were similar to or even higher to those obtained in other DSMs using machine learning and cross-validation [16,17,19,21,22,34,58].Most DSM studies recently described in literature also found tree-based models as the best option for soil spatial predictions [16,21,22].Performance in these cases usually vary between 0.3 to 0.5 (R2), with clay content being the best predicted attribute.Variable importance indicated satellite images [16,59], elevation and climate data [16,17,60] as relevant covariates.
The good performance of the models showed that RF optimization was able to generate robust and accurate spatial predictions.This approach agreed with Probst et al. [15], who provided different optimization strategies and reported that tuning the RF hyperparameters improved the performance of regression models.Sand, silt and clay had the best performances because they are stable soil attributes, and the chemicals are dynamic along the soil profile [4].The pH and nutrient contents may change relatively quickly (within years) related to biological processes, vegetation cover and management practices [57].Gomes et al. [17] mapped soil organic carbon at five standard depths (from 0 to 100 cm) for Brazilian territory, where RF showed the best performance for all depths, with the highest performance at 30-60 cm for validation (R 2 = 0.33).Bui et al. [60] reported similar performances for topsoil (R 2 = 0.49) and subsoil (R 2 = 0.36) when used analogous covariates and data mining for mapping soil organic carbon in Australia.
The better model performance in lower layers are related to soil conditions at such depths.A possible factor impacting surface-subsurface predictions are the agricultural practices, where soil management could be increasing the system's complexity [12].While the chemical and physical weathering are more intense and active in surface, alterations in depth tend to be less intense [59].This suggests that the models for topsoil were more influenced by climatic variables, i.e., precipitation and temperature, which lowered the performances.Therefore, subsurface soils usually have conditions closer to the ones observed in pristine areas, and could have a better relationship with soil forming factors and covariates considered in our study.

Interpretation of Covariate Layers
We did not perform covariates selection (elimination) because this approach could generated additional load of interpretation to the project, and because RF can be used to fit models with large number of covariates [5].For instance, Nussbaum et al. [21] evaluated six approaches for DSM of several soil variables (totaling 48 responses) using from 300-500 environmental covariates where RF models had the highest overall performances.Miller et al. [61] demonstrated that the best performing model was produced when using multi-resolution covariates, compared to a single resolution, for modeling the distribution of soil attributes at surface and subsurface.Relevant covariates for soil prediction had large importance values, whereas covariates not associated with the soil attributes showed values close to zero (Figure 6).
Our results showed that direct measurement (where soil areas are exposed) of topsoil reflectance patterns by RS was a strong contributor to soil mapping.The topsoil reflectance from SySI (Figure 4a) was important for the spatial prediction of soil attributes at the rooting depth of crops [30] in Midwest Brazil (Figure 6).That was possible because the spectral patterns of SySI can provide valuable information on pedogenic processes, which are useful for understanding and predicting soil variation [56].The SySI also can indicate the soil weathering products, which cause spatial variations in the soil color and temporally stable soil attributes, such as mineralogy and texture [10,62].Thus, complementary RS data can improved prediction models, as reported by Loiseau et al. [16] where adding RS covariates increased the R 2 and decreased the bias of the clay content estimation on bare topsoil layers (e.g., 0-30 cm).
It is recognized within the soil science community that vegetation plays significant roles on soil formation [3].However, Savin et al. [63] stated that the use of vegetation patterns from RS for soil interpretation is insufficiently studied.Some previous works found that the spectral response of vegetation in natural conditions can be confidently used as an indirect indicator of soil attributes [64][65][66].Our results pointed out that PNV (from SyVI w and SyVI d ) was influential for modelling soil attributes at all depths, especially for chemical variables (Figure 6).The soil-vegetation connection can be due to the spatial and seasonal differences in reflectance intensities between wet and dry conditions (Figure 4b-c), that showed relations between the spectral patterns of natural vegetation and soil attributes from 0 to 100 cm depth (Figure 6).Since vegetation is temporally dynamic, the relationships are largely controlled by available soil moisture and, to a lesser extent, chemical soil properties such as pH and fertility [64].Thus, average seasonal spectral patterns of vegetation provide a better indication of soil variables than only a single snapshot of surface reflectance, and it is probably the best way to effectively represent the cumulative influence of living organisms on soil formation [4].
Climate, relief and geology played significant roles in model prediction (Figure 6) because they can significantly influence the soil-vegetation feedback, as described by [3].The climate and geologic heterogeneity of the study region affected soil patterns at the macroscale (regional), followed by relief (especially elevation), which moderates many of the macroclimatic regimes, and landforms, at the meso and microscales [39,42].Das Sumit [67] demonstrated that geological lineaments density was strongly related to drainage density, soil texture and soil depth, controlling the movement of groundwater through soil.
Landforms affect surface water dynamic and exposure to radiant solar energy, which directly influence soil-forming processes [42].Within a landform, there exists slight differences in local edaphic conditions, such as soil texture and mineralogy, and soil moisture and temperature regimes [39].These local conditions provides the most significant alterations of the soil reflectance patterns [10,62] and segregation of the plant communities [39], which could be captured and measured by the SySI, SyVI w and SyVI d at the finest (local) resolution.Generally, the Keys to Soil Taxonomy [68] uses the same differing criteria to define families of soils.
Individual relationships between soil variables and environmental covariates can also be interpreted and understood in terms of pedological knowledge.For instance, higher SWIR reflectance may be associated with high amounts of sand in soil and hence lower CEC; higher precipitation and cooler temperatures frequently increase the OM content due to the speed of accumulation is higher than the speed of decomposition.For a large number of soil attributes, however, relationships are not clearly linear and often many soil covariates are equally important [4].

Reliability and Interpretation of Soil Maps
Soil attributes largely varied across the area (Figures 7 and A4).This can be due to the tropical climate that exposed the parent materials of the studied soils (with different resistances) to intense weathering [69].We identified clayey and nutrient-poor soils throughout the southeast, covering 24% of the studied area (Figure 8).The region developed upon sedimentary rocks (mostly argillite, siltite, arenite) which formed smooth hills, and over ferruginous laterite crusts supporting residual lowered plateaus in continuous dissection process [70].The soils from these rocks (geological domain 1) had the highest clay contents with lowest chemical fertility and, in some cases, can be very acidic and contain ferruginous concretions (typically reddish color) that hinder the farming.Nevertheless, it is possible to observe several cropland areas distributed on soils of this domain [23], probably after undergoing soil chemical correction.Sandy soils were expressive in the region, comprising 32% of the studied area (Figure 8).The lowest occurrence was in the northeast with 6% of sandy soils developed from sedimentary rocks (domain 6).They widely occurred in the southwest and midwest, developed over metavolcanosedimentary rocks (14%) and sedimentary and acid-subacid volcanic rocks (12%), domains 7 and 8 respectively.Such geological domains were mainly formed by arenite, conglomerate, siltite, calcareous, metaconglomerate, quartzite, phyllite, orthogneiss, andesite and rhyolite, which naturally tend to generate flattened reliefs such as smooth hills and plateaus [70].These lithologies generally develop sandy soils with low chemical fertility (Figure 8).However, its high permeability and smooth reliefs facilitate agricultural mechanization after soil acidity correction and fertilization.Clayey and medium textured soils with the best chemical conditions covered 44% of the area, widely distributed along the central portion over domains 2 to 5 (Figure 8).These domains were represented by basic-ultrabasic volcanic rocks such as basalt, diabase and gabbro, and sedimentary rocks such as argillite, siltite and calcarenite [27].According to the geodiversity of the region [70], the areas also were constituted by an association of metamorphosed volcanic and sedimentary (metavolcanosedimentary) rocks frequently containing amphibolite, serpentine, dunites and peridotites, metacarbonates, phyllite and paragneiss.All these lithologies reworked nutrient-richer surface materials, which released nutrients into the soil and provided better fertility conditions (see domains 2 and 3 in Figure 8).Furthermore, granitoids occurred sparsely mixed with calcareous and schist (see domain 4 in Figure 8), providing more dissected relief than the neighboring lands, such as hills and low mountains that hinder the agricultural mechanization [27].In those areas, higher elevations with denser vegetation had larger soil OM contents (Figures A1 and A4), mainly due to cooler and wetter climate regimes and lesser human disturbance, which promoted accumulation processes [69].In the floodplain areas over domain 5 (Figure 8), fertility conditions may be linked to the good fertility of the areas that surround it, from where it receives a high volume of water, sediments and wastes [70].
Sandy soils were expressive in the region, comprising 32% of the studied area (Figure 8).The lowest occurrence was in the northeast with 6% of sandy soils developed from sedimentary rocks (domain 6).They widely occurred in the southwest and midwest, developed over metavolcanosedimentary rocks (14%) and sedimentary and acid-subacid volcanic rocks (12%), domains 7 and 8 respectively.Such geological domains were mainly formed by arenite, conglomerate, siltite, calcareous, metaconglomerate, quartzite, phyllite, orthogneiss, andesite and rhyolite, which naturally tend to generate flattened reliefs such as smooth hills and plateaus [70].These lithologies generally develop sandy soils with low chemical fertility (Figure 8).However, its high permeability and smooth reliefs facilitate agricultural mechanization after soil acidity correction and fertilization.

Possible Applications of the Maps
It is important to note that there are currently no detailed soil attribute maps with complete coverage over Midwest Brazil and that their production costs money.Our soil attribute maps can be used for different purposes, at different spatial scales from farm, local to regional.They provide a first complete assessment of key soil attributes across the Midwest region, and can be used to, for example, as input data in biological-chemical-physical modelling and in assessments of dynamic environmental processes.Together with other information, the maps can be used to obtain basic information for the implementation of colonization projects, rural subdivisions, integrated studies of micro-basins, local planning for the use and conservation of soils in areas projected for the development of agricultural, livestock and forestry projects, as well as civil engineering.The maps can also guide future soil sampling for inventory at different scales.

Conclusions and Final Considerations
We have demonstrated that key soil attributes from multiple depth increments can be mapped using Earth observation data and machine learning with good performances.These maps had a satisfactory performance for physical (0.64 > R 2 10cv > 0.85) and chemical (0.18 > R 2 10cv > 0.64) attributes at all depth intervals (0-20, 20-60 and 60-100 cm), being spatially consistent with the main lithologies from which they originated.
The methodological approach was able to capture the spatial distribution of nine soil variables.The predicted soil maps suggest that less than 44% of the studied soils had good natural fertility.Nevertheless, its dominant smooth reliefs facilitate agricultural mechanization, which allow the soil pH correction and fertilization.
Although we had representative soil observations, chemical attributes were particularly more challenging to map because to their high dynamic, with their concentration changing in a short space of time due to many natural and human-induced factors.
Our results support the notion that multi-resolution covariates derived from the topsoil and natural vegetation reflectance are important predictors of soil variability together with relief and climate data.
Since covariates widely used in digital soil mapping are globally available, such as elevation and climate data, this approach may be useful for other initiatives where obtaining the soil (SySI) and vegetation (SyVI) covariates is feasible, that is, locations around the world with bare soil and natural vegetation occurring with enough coverage within the considered satellite time series.

Figure 1 .
Figure 1.Spatial distribution of soil observations displayed over a 1:1,000,000-scale map of the main soil classes of the study area [25].Soil classes were defined according to World Reference Base [26].

Figure 2 .
Figure 2. Digital soil mapping framework used for generating soil attribute maps.

Figure 2 .
Figure 2. Digital soil mapping framework used for generating soil attribute maps.

30 Figure 3 .Figure 3 .
Figure 3. Chord diagram based on Pearson correlation (r) among all measured soil attributes at (a) 0-20 cm, (b) 20-60 cm and (c) 60-100 cm depth intervals; and (d) overall correlation with depth intervals.Blue and red colors symbolize positive and negative correlations, respectively.3.2.Synthetic Soil Image (SySI) and Synthetic Vegetation Image (SyVI)The harmonized Landsat data and the data mining algorithms implemented in GEE enabled to obtain a SySI with 443,000 km 2 (52%) coverage, which was later kriged to close the gaps.The bare soil frequency (data not presented) at each locations ranged from 1 to 1303 pixels and average of 12 pixels between the 1982 and 2019 years.For potential natural vegetation from 1982 to 1994, we obtained a Remote Sens. 2019, 11, x FOR PEER REVIEW 12 of 30

Figure 4 .
Figure 4. Soil covariates obtained by data mining and statistics of Landsat data.a) SySI (RGB: Red, Green, Blue), b) SyVIw and c) SyVId (RGB: SWIR1, NIR, Red) subsets.Soil line charts for d) SySI vs raw pixels, e) SyVIw vs wet season crops obtained from raw pixels and f) SyVId vs dry season crops obtained from raw pixels.Minimum, average and maximum spectra collected from g) SySI, h) SyVIw and i) SyVId.The visualization of the images was adjusted by stretching the range of pixel values between 2% and 98%.Optical bands are positioned in the mean spectral range from 485 to 2215 nm, and the thermal band at 11,450 nm.̅ : average of Pearson's correlation from the seven spectral bands, performed by sampling 1 value km -2 on overlapped areas between the synthetic image (original) and kriged (non-merged to synthetic image).

Figure 4 .
Figure 4. Soil covariates obtained by data mining and statistics of Landsat data.(a) SySI (RGB: Red, Green, Blue), (b) SyVI w and (c) SyVI d (RGB: SWIR 1 , NIR, Red) subsets.Soil line charts for (d) SySI vs raw pixels, (e) SyVI w vs wet season crops obtained from raw pixels and (f) SyVI d vs dry season crops obtained from raw pixels.Minimum, average and maximum spectra collected from (g) SySI, (h) SyVI w and (i) SyVI d .The visualization of the images was adjusted by stretching the range of pixel values between 2% and 98%.Optical bands are positioned in the mean spectral range from 485 to 2215 nm, and the thermal band at 11,450 nm.r: average of Pearson's correlation from the seven spectral bands, performed by sampling 1 value km −2 on overlapped areas between the synthetic image (original) and kriged (non-merged to synthetic image).
Remote Sens. 2019, 11, x FOR PEER REVIEW 14 of 30 RMSEcal: Root Mean Square Error of calibration; RMSE10cv: Root Mean Square Error of 10-fold crossvalidation; RPIQcal: Ratio of the Performance to Inter-Quartile distance of calibration; RPIQ10cv: Ratio of the Performance to Inter-Quartile distance of 10-fold cross-validation; R 2 cal: Coefficient of determination of calibration; R 2 10cv: Coefficient of determination of 10-fold cross-validation.

Figure 5 .
Figure 5. Performance indicators of optimized models used in the soil predictions by depth interval.

Figure 5 .
Figure 5. Performance indicators of optimized models used in the soil predictions by depth interval.

Figure 6 .
Figure 6.Permutation importance (%) of covariates for prediction soil attributes at 0-20 cm (A), 20-60 cm (B) and 60-100 cm (B) depth intervals.The mean values were calculated from the importance obtained by the 10-fold used in cross-validation.CEC: cation exchange capacity; m%: Aluminum saturation; V%: Base saturation.Global represents averaged importance values for all soil attributes.

Figure 6 .
Figure 6.Permutation importance (%) of covariates for prediction soil attributes at 0-20 cm (A), 20-60 cm (B) and 60-100 cm (B) depth intervals.The mean values were calculated from the importance obtained by the 10-fold used in cross-validation.CEC: cation exchange capacity; m%: Aluminum saturation; V%: Base saturation.Global represents averaged importance values for all soil attributes.

Figure 7 .
Figure 7. Maps and mean lat/long distribution chart of (a) sand, (b) silt and (c) clay content (g kg -1 ) for different depth intervals (0-20 cm, 20-60 cm and 60-100 cm).The sum of the sand, silt and clay contents [Sand+Silt+Clay] for each depth interval appears in (d).The visualization of the images was adjusted by stretching the range of pixel values between 2% and 98%.

Figure 7 .
Figure 7. Maps and mean lat/long distribution chart of (a) sand, (b) silt and (c) clay content (g kg −1 ) for different depth intervals (0-20 cm, 20-60 cm and 60-100 cm).The sum of the sand, silt and clay contents [Sand+Silt+Clay] for each depth interval appears in (d).The visualization of the images was adjusted by stretching the range of pixel values between 2% and 98%.

30 Figure 8 .
Figure 8. Geological domains and summary values of predicted soil attributes averaged from the three depth intervals.The geological domains were obtained by clustering lithologies using the averaged soil data.The main lithotypes were identified within each domain according the geological map [27].

Figure 8 .
Figure 8. Geological domains and summary values of predicted soil attributes averaged from the three depth intervals.The geological domains were obtained by clustering lithologies using the averaged soil data.The main lithotypes were identified within each domain according the geological map[27].

Figure A1 .
Figure A1.Environmental covariates used in the Random Forest modelling of soil attributes data.Terrain features derived from ALOS digital elevation model: (a) Elevation in meters, (b) Slope in degrees, (c) Aspect in degree, (d) Topographic Position Index, (e) Horizontal Curvature and (f) Vertical Curvature in meters.(g) Geological Lineaments Density representing meters of structural features per km 2 , derived from legacy maps of the Geological Survey of Brazil (CPRM).Climate data obtained from WorldClim: (h) Annual Precipitation in mm, (i) Coefficient of variation of the Precipitation Seasonality, (j) Annual Mean Temperature, (k) Temperature Annual Range and (l) Temperature Seasonality in ºC.

Figure A1 .
Figure A1.Environmental covariates used in the Random Forest modelling of soil attributes data.Terrain features derived from ALOS digital elevation model: (a) Elevation in meters, (b) Slope in degrees, (c) Aspect in degree, (d) Topographic Position Index, (e) Horizontal Curvature and (f) Vertical Curvature in meters.(g) Geological Lineaments Density representing meters of structural features per km 2 , derived from legacy maps of the Geological Survey of Brazil (CPRM).Climate data obtained from WorldClim: (h) Annual Precipitation in mm, (i) Coefficient of variation of the Precipitation Seasonality, (j) Annual Mean Temperature, (k) Temperature Annual Range and (l) Temperature Seasonality in • C.

Figure A2 .
Figure A2.Predicted vs. observed (a) sand, (b) silt and (c) clay contents by depth intervals of 10-fold cross-validation derived from optimized random forest regression.

Figure A2 . 30 Figure
Figure A2.Predicted vs. observed (a) sand, (b) silt and (c) clay contents by depth intervals of 10-fold cross-validation derived from optimized random forest regression.Remote Sens. 2019, 11, x FOR PEER REVIEW 25 of 30

Figure A4 .
Figure A4.Maps of (a) organic matter, (b) pH H 2 O, (c) pH KCl, (d) cation exchange capacity, (e) base saturation and (f) aluminum saturation predicted at three depth intervals (0-20 cm, 20-60 cm and 60-100 cm).The visualization of the images was adjusted by stretching the range of pixel values between 2% and 98%.

Table 1 .
List of Covariates used for Digital Mapping of Soil Attributes.
VNIR: Visible and Near infrared spectral range (~450-900 nm); SWIR: Shortwave infrared spectral range (~1550-2350 nm); TIR: Thermal infrared spectral range (~10,400-12,500 nm). a Lines counted in grids of 1 km 2 , transformed to raster and interpolated to 30 m pixel resolution by IDW method.b Interpolated to 30 m pixel resolution by Inverse Distance Weighted method; CV: coefficient of variation.

Table 3 .
Hyperparameters and Performance of the Optimized Models used for Spatial Predictions of Continuous Soil Attributes at Distinct Depth Intervals.