Accurate and Precise Prediction of Soil Properties from a Large Mid-Infrared Spectral Library

Diffuse reflectance spectroscopy (DRS) is emerging as a rapid and cost-effective alternative to routine laboratory analysis for many soil properties. However, it has primarily been applied in project-specific contexts. Here, we provide an assessment of DRS spectroscopy at the scale of the continental United States by utilizing the large (n > 50,000) USDA National Soil Survey Center mid-infrared spectral library and associated soil characterization database. We tested and optimized several advanced statistical approaches for providing routine predictions of numerous soil properties relevant to studying carbon cycling. On independent validation sets, the machine learning algorithms Cubist and memory-based learner (MBL) both outperformed random forest (RF) and partial least squares regressions (PLSR) and produced excellent overall models with a mean R2 of 0.92 (mean ratio of performance to deviation = 6.5) across all 10 soil properties. We found that the use of root-mean-square error (RMSE) was misleading for understanding the actual uncertainty about any particular prediction; therefore, we developed routines to assess the prediction uncertainty for all models except Cubist. The MBL models produced much more precise predictions compared with global PLSR and RF. Finally, we present several techniques that can be used to flag predictions of new samples that may not be reliable because their spectra fall outside of the calibration set.


Introduction
Soil is an essential part of the natural environment that not only influences the distribution of plants, animals and landforms but also plays key roles in providing ecosystem services necessary for mankind, including climate regulation, soil fertility and fiber and food production [1,2].Anthropogenic activities have greatly altered the composition and functioning of soils [3,4].Quantifying the impact of anthropogenic activities on soil carbon sequestration and loss requires at least sporadic monitoring of soil physical, chemical and biological properties that are most relevant to controlling soil carbon cycling rates.However, current technologies for monitoring and characterizing most soil properties are expensive and often time-consuming.For example, the total cost of standard soil characterization procedures at the US National Soil Survey Center is about $2500 per pedon with processing times of 6-12 months [5].As a result, there is an increasing need to develop rapid and cost-effective techniques to characterize soil resources.
Diffuse reflectance spectroscopy (DRS) has been demonstrated to be a viable alternative for rapidly characterizing and measuring soil properties compared with time-consuming and expensive conventional soil laboratory analysis [6].Visible (vis; 400-700 nm), near-infrared (NIR; 700-2500 nm), and mid-infrared (MIR; 2500-25,000 nm) regions have been used widely to characterize soil minerals and organic matter at the global [5,7,8], regional [9,10], national [11,12] and local scale [13].Vis-NIR can be preferable to MIR due to its low instrumental cost and potential for field deployment [14].Therefore, for soil surveys that require high sampling density with basic soil properties measurement (for, e.g., organic carbon, soil texture), vis-NIR may be preferable.In contrast, the MIR region contains strong molecular vibrations of most important soil minerals and organic components [7,15].As a result, models built using MIR databases often perform better compared with the vis-NIR database for many soil properties, particularly for more minor soil constituents [6,16].
Accurate quantification of soil properties requires building predictive models by selecting the most diverse calibration set and applying the model to a new set of samples (validation set) that were not used during the calibration process [17].The most common calibration methods for building these models are based on partial least squares regression (PLSR) [12,18,19].PLSR is particularly useful for building models that contain a large number of predictor variables by taking into account the correlation between spectra and soil attributes [20].The spectra are decomposed into a set of eigenvectors and scores, followed by regression analysis of soil attributes.Since PLSR performs decomposition and regression simultaneously, it is very useful in soil spectroscopy because it successfully deals with collinearity and is computationally faster when dealing with large predictor variables [21].
Some difficulties appear when calibration models derived from large and spatially diverse spectral libraries are used to predict soil properties of a small area [22].PLSR models built using diverse spectral libraries can introduce extraneous information when predicting soil properties from a small area, and this subsequently results in large prediction errors [23].The extraneous information is due to the intrinsic nonlinearity of very heterogeneous spectral libraries [24].As a result, linear PLSR models are not able to effectively deal with complex and heterogeneous spectral libraries [25,26].To address this shortcoming, different approaches focusing on enhancement of the predictive power of local samples have been developed [27][28][29].These include adding new samples relevant to the local site, selecting a subset of spectrally similar samples to make predictions and using a distance-weighted matrix to impart more influence to the spectra closer to the prediction samples [30][31][32].Locally based approaches (e.g., Ramirez-Lopez et al. [24]) have been developed to specifically deal with this issue by only selecting a small number of the most spectrally similar samples to build a PLSR calibration model.
Recently, machine learning (ML) algorithms, such as artificial neural networks (ANN), support vector machines (SVM), Cubist, random forest (RF) and memory-based learning (MBL), have been increasingly used to model and map soil properties at local to global scales [19,25,30,33,34].Because ML methods are able to deal with complex nonlinear relationships between the predictor and response variables [35], their application to soil spectroscopy often results in improved calibration and validation results for most soil properties compared with PLSR [36].Among different ML methods used to make predictions, SVM, ANN, Cubist and MBL have been found to perform better than PLSR for most of the soil properties [16,37,38].SVMs are kernel-based algorithms that follow data transformation into a high-dimensional space to construct a hyperplane by maximizing the distance to the nearest data point of any of the input classes [39].ANNs use a supervised learning approach to fit the relationship between predictor and response variables by adjusting the weights through an optimization process until the error between the observed and predicted value is minimized [40].The Cubist model is a classification and regression tree approach, where the prediction is based on intermediate linear models that are subsequently formed at each tree node [38,41].MBL is a data-driven approach, conceptually similar to Cubist, where linear (e.g., PLSR) models are built to predict each new sample on the basis of a set of locally stable samples [42].Although SVM, ANN, Cubist and MBL perform better compared with other machine learning (e.g., random forest) and linear multivariate methods (PLSR, multivariate adaptive regression splines: MARS), inconsistent results in terms of model performance have been reported for some soil properties.For example, Wijewardane et al. [16] used an MIR spectral library to show superior model performance using ANN compared with PLSR for all soil properties (9 out of 12) except clay, silt and sand.In contrast, PLSR models were superior compared with ANN when predicting soil pH at soil moisture levels below 20% using a vis-NIR spectral library [43].Similarly, a comparison between PLSR and Cubist using MIR data showed that Cubist outperformed the former in the prediction of total carbon and clay, but the prediction of cation exchange capacity was better using a PLSR model [38].Inconsistencies in the prediction performance of machine learning and multivariate linear regression for some soil properties can be attributed to (1) the size of the spectral library used to the build calibration model; (2) the lack of standardized methodology applied during sample preparation, spectrum acquisition and pretreatment; and (3) the representativeness of the samples used to build calibration models across heterogeneous soil conditions.
Although it has been suggested that, for calibration, MIR spectra outperform vis-NIR spectra for many soil properties, the vis-NIR research community has invested more resources in developing techniques for dealing with large and complex spectral libraries [36,42].Relatively few studies have used large MIR spectral libraries to predict soil and mineral properties, and fewer have tested different statistical approaches.Madari et al. [44] used the MIR spectra of 1135 soil samples collected from diverse locations in Brazil to predict total carbon, while Terhoeven-Urselmans et al. [8] used 971 soil samples on the basis of globally distributed soil profiles to build calibration models for different soil properties.Baldock et al. [11] acquired MIR spectra of 20,495 soil samples collected from 4526 locations across Australia and found that PLSR models developed for different regions outperformed a single national PLSR model for inorganic, organic and total carbon.Recently, Wijewardane et al. [16] used an MIR spectral library containing 20,000 soil samples in the US and found that machine learning approaches were necessary to build successful models for most soil properties.
Existing studies using soil spectroscopy have focused on developing accurate models by assessing model performance with independent samples using validation statistics, such as R-square, root-mean-square error (RMSE), bias and ratio of performance to deviation (RPD) [11,36,45].While these validation statistics are useful in determining the overall model performance, uncertainty estimates of each new sample in the validation sets are often not provided [11,16,28].Reporting uncertainty of individual samples in the validation sets is useful for determining the precision and trustworthiness of individual predictions and considered a good analytical practice [46].It is also important for many applications relevant to understanding soil carbon cycling from the carbon accounting perspective [47].Common approaches used to estimate uncertainty using PLSR include U-deviation, as implemented in the Unscrambler software, and jackknifing, which is based on ordinary least squares regression [48,49].U-deviation is an empirically derived formula that considers the variance in validation sets and the error in the product of scores and loadings to provide uncertainty estimates for each new prediction [50].Jackknife is a resampling technique aimed at retrieving a set of regression coefficient vectors that provide information about the variability and the standard error of regression coefficients [51,52].Although uncertainty estimation of multivariate regression models has been implemented in several commercial software (e.g., Unscrambler, MATLAB), relatively few studies have reported the uncertainty estimates of each new prediction.Likewise, uncertainty estimates using machine learning (e.g., random forest) are mostly accomplished by using quantile regression, where a full conditional distribution of response variables is generated to retrieve the 95% confidence interval using the range between the 2.5 and 97.5 percentiles of the distribution [53].
In this study, we utilized the USDA National Soil Survey Center's Kellogg Soil Survey Laboratory (NSSC KSSL) MIR spectral library and associated soil characterization database, which now includes >50,000 MIR spectra collected on soils from the United States.We present our evaluation of the capability of this large spectral library to make routine predictions of several soil properties that are critical to understanding soil carbon cycling.We hypothesized that machine learning approaches, because of their ability to find patterns in complex data, outperform traditional PLSR models.Therefore, three machine learning approaches (MBL, Cubist and RF) and a global PLSR model were tested for their ability to produce accurate and precise (i.e., low uncertainty) predictions.

The NSSC-KSSL Spectral Library
The soil samples used in this study are extracted from the existing soil spectral library at the USDA NSSC-KSSL.While the current NSSC-KSSL soil characterization database contains >110,000 soil samples, the current (as of June 2018) MIR spectral library consists 15,118 pedons, representing 54,211 samples that have some associated analytical data (Figure 1).The 54,211 samples represent the wide range of land use and geologic conditions, allowing us to test several statistical approaches relevant to developing robust calibration models.All data are publicly available via request from the USDA National Soil Survey Center.In this study, we focused on 10 soil physical and chemical properties, including organic carbon (OC, %), which was measured as total carbon by elemental analysis minus any inorganic carbon measured manometrically; calcium carbonate (CO 3 , %) measured by manometer; cation exchange capacity (CEC, cmolc kg −1 ) and exchangeable calcium (Ca, cmolc kg −1 ) measured by displacement with ammonium oxalate, buffered at pH 7; clay (%) by sedimentation, pH in 1:1 water suspension; bulk density (BD, g/cm −3 ); dithionite citrate extractable aluminum (Al, %); acid oxalate extractable iron (Fe, %); and organic carbon density (OCD, kg m −3 ) calculated as the product of OC × BD × (1 − CF), where CF is coarse fragments (>2 mm) by volume.Detailed lab protocols can be found elsewhere [54].
The existing soil spectral library at the USDA NSSC-KSSL consists of BD data obtained with two different measurement techniques [55]: (1) clod and (2) core methods.The clod method involves extraction of an intact soil clod followed by the determination of the dry weight of the soil.The volume of the clod is measured after equilibration at 1/3 bar water tension.In contrast, the core method involves extracting soil samples using a known volume of a metal cylinder.The extracted samples are then oven-dried, sieved and weighed to determine the BD of the fine earth fraction.The clod has been the standard soil survey method but is only successful in soils with a fair degree of coherence; therefore, the method is biased away from sandy or organic-rich topsoils.In this study, we evaluated the models using clod, core and combined (clod and core together) methods.
The MIR spectra were acquired on air-dried and ground soil samples (to pass an 80-mesh sieve) using a Bruker Vertex 70 FTIR spectrometer with an HTS-XT high-throughput accessory.The sampling plate in the spectrometer consists of standardized 96-well microplates.On each 96-well microplate, 23 samples were prepared in quadruplicate, while the four remaining blank spots were used for reference readings.At each spot, 32 co-added scans were collected at a resolution of 4 cm −1 .Both the sample MIR and reference scans are available in OPUS format, which was converted to text file for developing calibration models using an open-source programming platform [56].

Pre-Processing of MIR Spectra and Analytical Data
The four replicates of spectra corresponding to each sample were averaged to create a dataset containing both laboratory measurements of soil properties and MIR spectra.Spectra were truncated to 6000-600 cm −1 , and regions showing atmospheric CO 2 features (2389-2268 cm −1 ), which are not associated with changes in soil samples [57], were removed.Spectra were then baseline corrected using the baseline-offset transformation before developing the calibration models.Since multivariate regression methods intrinsically require the data to be normally distributed, we assessed different transformation approaches prior to building multivariate PLSR models.We first chose two analytical data with normal (pH) and non-normal (OC) distributions.The normality of these two analytical data was explored by building histograms and calculating skewness and kurtosis (Figure S1 in Supplementary Materials).We then built PLSR models using box-cox-, square-root-and log-transformed and untransformed analytical data by dividing the data into calibration (80%) and validation (20%) sets to assess the prediction performance of near-normally distributed pH and non-normally distributed OC.On the basis of this analysis, we found that PLSR models built with square-root-transformed analytical data were, on average, superior (low RMSE and high R 2 ) compared with box-cox-transformed and untransformed data, particularly when PLSR calibration models were applied to independent validation sets (Table S1; Figure S2).However, for other calibration methods (random forest, Cubist and MBL), none of these data transformations improved the prediction performance of pH and OC (Table S1).Although machine learning methods, such as random forest, Cubist and MBL, do not require the data to be normally distributed, for consistency, analytical data were square-root-transformed prior to developing calibration models using all methods (Figures S3 and S4).

Sample Selection, Outlier Detection and Model Performance Assessment
Large datasets invariably have erroneous data.Those errors can occur in both analytical and spectral data.Therefore, to detect outliers in the spectral library, we followed a two-step procedure.In the first step, we built PLSR models and tested their performance using all samples in the spectral library.Then, we defined a criterion to detect outliers based on the predictive performance of the model.In this study, outliers are defined as those sample points that fall outside a defined standard deviation threshold from the fitted (one-to-one) line.The standard deviation threshold was optimized such that it removed a maximum of 1% of the samples that fell beyond the threshold value in both the calibration and validation datasets (Table 1; Figure S5).This method of detecting outliers is predicated upon being able to build reasonably reliable PLSR models but is superior to any method that focuses only on the spectral data.
In the case of CO 3 , we first built a random forest classification model to detect the presence/absence of CO 3 according to results from a fizz test using 1 N HCl.Random forest performed better (out-of-bag error rate of 12%) in classifying the presence/absence of CO 3 (N = 12,205 for presence and N = 6387 for absence) than PLSR model.For further analysis, we only used samples that had detectable CO 3 according to the random forest output.BD clod and BD core measure weight per unit volume of the <2 mm fraction, with volume measured after equilibration at 1/3 bar water tension in the case of the clod method and at field moisture in the case of the core method.BD all combines information from both the clod and core method, providing a much wider range of bulk density estimates.In the case of carbonates (CO 3 ), we first built a random forest classification model to detect the presence/absence of CO 3 (N = 18,592) according to the results of a fizz test.Then, we removed outliers and built multivariate and machine learning models on samples that had detectable CO 3 (N = 12,205).Abbreviations-BD: bulk density; CEC: cation exchange capacity; OCD: organic carbon density.
Following outlier detection, we divided the MIR spectral library into calibration and validation datasets using a Kennard Stone (KS) algorithm [58].The KS algorithm is a deterministic approach that uses Euclidean or Mahalanobis distance to select a set of samples uniformly distributed in principal component (PC) space.Using the prospectr package in R, we selected 80% of the most representative samples in the library to calibrate the model.The remaining 20% of the samples were assigned to validation datasets to assess the predictive performance of the MIR models.The performance of the MIR models was assessed using bias, R 2 , RMSE and RPD on independent validation sets (detailed in Section 2.5).

Partial Least Squares Regression (PLSR)
Partial least squares regression (PLSR) is a multivariate regression technique widely used in chemometrics to study the relationship between highly collinear multi-dimensional predictor variables and a response variable.The PLSR algorithm follows a linear multivariate model that selects orthogonal (latent) factors to maximize the covariance between predictor (e.g., MIR data) and response (e.g., soil properties) variables [20,59].As we deal with 2737 predictor variables for every response variable, the PLSR procedure helps to reduce the number of predictor (spectra) variables to a few independent variables that explain the most variation in the MIR spectra.In this study, global PLSR models were developed using the pls package [59], which decomposes the predictor (X) and the response (y) variables into scores (T) and loadings (P and q) as follows: where E and f are the residuals associated with the predictor and response variables, respectively.The final regression model used to predict soil properties is of the following form: where β 0 is the intercept of the global PLSR model; β 1 , β 2 ... β n are the regression slopes for n latent variables (principal components); and t 1i , t 2i ... t ni are the scores from principal component 1 to n for response variable i.
In global PLSR, the optimal number of principal components (PC) to retain in the final model is determined using a one-sigma heuristic approach [60].In this approach, the number of PCs corresponding to the initial best model (lowest RMSE) is determined using 10-fold cross-validation by examining models with up to 20 PCs.Then, the model with the fewest components that still falls within one standard error of the residuals (observed-predicted) of the overall best model is retained as the optimal number of PCs.This was accomplished using the selectNcomp function in pls package [59].

Memory-Based Learning (MBL)
Memory-based learning (i.e., local modeling) is a lazy learning approach in which, for each new target spectrum requiring a prediction of a given response variable, a new target function is fitted using a relatively small subset of spectrally similar samples found in a large reference set [24].The spectrally similar samples (neighbors) can be found by using different similarity search methods, such as the Pearson's correlation coefficient between spectra or Mahalanobis distances in the principal component (PC) space (or PC distances).Common methods to fit the local target functions include PLSR, Gaussian process [24] and weighted-average PLSR [61].In this study, we used PLSR models to fit the local target functions, while neighbor searches were performed using Mahalanobis distances in the PC space.The number of PCs to retain was selected using an optimized principal component approach (oPC-M) [24].MBL was conducted using the resemble package in R. Modifications to the MBL were made to output spectrally similar neighbors based on Euclidean distance in the PC space and provide uncertainty estimates of each new prediction.

Random Forest (RF)
Random forest is an ensemble machine learning technique that was developed as an extension of classification and regression trees (CART) to improve the prediction performance of the model [62].The model-building process is the same as that for CART, where a recursive partitioning of the dataset is done to explore the relationship between the response and predictor variables [63].Unlike CART, numerous trees are generated by using a subset of predictor variables, and all the responses are aggregated to get one single prediction.While generating each tree, a bootstrap sample (samples with a replacement) of the original data is selected, and the performance of each tree is validated using one-third of the samples that were not used for building that tree (out-of-bag error estimate).Random forest was implemented in R using the ranger package [64], where the number of trees to build is provided using the num.trees parameter.The number of trees can substantially affect the accuracy of the random forest model and depend on the number of response variables and predictor variables.When observations are sufficiently larger than the number of predictor variables, fewer trees are required and vice versa.For example, Hengl et al. [65] indicated that 150 trees are optimal to achieve a stable random forest model, beyond which a trade-off between the computation time and model accuracy offers no additional benefits.Given that the number of response variables is much larger than the number of predictor variables, we used 150 trees as a compromise between accuracy and computational time in this study.

Cubist
Cubist is a machine learning algorithm that constructs model trees using the CART approach [63], with linear models at each terminal nodes [66].Unlike other decision trees (for example, random forest) that retrieve the final model on the basis of discrete values, Cubist uses a set of multivariate models associated with a set of rules at each terminal node.The final prediction is based on the linear model that satisfies the set of conditions associated with the predictor variables [67].The splitting criteria are also slightly different between CART and Cubist, with CART using the variance and Cubist using the standard deviation as a measure of error [68].The Cubist model was implemented in R using the Cubist package [67], which is an extension of Quinlan's M5 model tree.Cubist requires setting up a committees parameter, where iterative model trees are created in sequence according to the number of committees using a boosting-like scheme.However, as a compromise between accuracy and the computational time required to run the Cubist, we did not implement the boosting-like algorithm, and we set the number of committees to 1 during the course of this study.

Model Performance
In this study, we assessed the overall accuracy and precision of each model using the coefficient of determination (R 2 ), ratio of performance to deviation (RPD), root-mean-square error (RMSE) and model bias.The R 2 , RPD, RMSE and bias were computed for each property and each modeling approach according to the following equations: where y i is the observed value of the ith sample measured by conventional laboratory methods, ŷi is the predicted value of the ith sample, n is the number of samples and σ is the standard deviation of the observed soil property in the calibration or validation set.We further defined criteria to assess the relative fit of the model using RPD: (1) RPD ≥ 2.0 (excellent models); (2) 1.4 ≤ RPD < 2 (fair models); and (3) RPD < 1.4 (non-reliable models) [69].However, it is important to note that the criteria to define excellent and poor models using RPD are rather arbitrary, and there is no statistical basis on how these classification thresholds are determined [70].We feel it is more important that a user evaluates model performance with respect to the objective of the study.

Individual Prediction Uncertainty
In addition to overall model performance, characterizing the prediction uncertainty of each sample is crucial in order to know the quality or reliability of each new prediction when relying purely on spectral-based predictions [70].Therefore, we estimated the uncertainty associated with each new prediction in the validation sets using two different approaches.In the case of the PLSR and MBL models, we used the U-deviation method, as implemented in the Unscrambler software package (CAMO) and subsequently improved by De Vries and Ter Braak [50].The U-deviation method is an empirically derived formula that takes into account the residual variance in the validation sets and the error in the product of scores and loadings [50].Mathematically, PLSR and MBL make new predictions using the following relationship: where t pred,i is the score for predicted sample i, and q T is the y-loadings from the calibration set.Thus, U-deviation can be estimated from the error in y-mean and the error in the product of score and loadings.The final equation for estimating U-deviation is as follows: where σ valid is the y-residual variance in the validation set, σ X is the x-residual variance in the calibration set, σ X,valid is the average x-residual variance in the validation set, H 0 is the leverage of the prediction sample and can be seen as the distance of each sample in the validation sets to all the samples in the calibration set, A is the number of principal components used to build the model and n is the number of samples in the validation set.
In the case of random forest, we used the infinitesimal jackknife to provide the standard error associated with each new prediction in the validation datasets.The general approach of the jackknife follows a resampling technique whereby prediction is computed by omitting one observation at a time.These individual predictions are then combined to provide an estimate of the variance or bias correction [71,72].In contrast, rather than remove one observation at a time, the infinitesimal jackknife approach assigns a weight to each observation to provide the estimate variance [73].The infinitesimal jackknife approach has been considered to give more stable predictions compared with the jackknife approach [72].Therefore, we used the infinitesimal jackknife approach to provide uncertainty estimates of each new prediction using the ranger package [56].Unfortunately, assessing the prediction uncertainty in the Cubist model using a jackknife or other bootstrapping approach is currently computationally prohibitive with the large datasets used in this study.
Although the model uncertainty was estimated for all soil properties and summarized as the mean deviation of the validation set, we only show detailed uncertainty estimates for two soil properties (OC and BD) using the validation sets.These two soil properties were selected on the basis of the best (OC) and worst (BD) model performance in the validation sets using the local PLSR model.

Trustworthiness of New Predictions
Individual prediction uncertainty estimates were also used to assess the trustworthiness of new predictions.When an MIR-based prediction model is used independently of laboratory data, there must be a way of flagging when predictions should not be trusted because the new sample falls outside of the calibration space [74].To determine the trustworthiness of the predicted soil property using PLSR and MBL models, we first determined the outliers in the spectra using F-ratios (F) [75]: where M is the number of samples in the calibration set, i denotes the spectral residuals for the ith sample in the calibration sets and j denotes the spectral residuals of all samples in the calibration set with the ith sample excluded.The F-ratio and M − 1 degrees of freedom are used to derive the probability distribution such that spectra with a probability of greater than 0.99 are flagged as an outlier in the analysis: In the case of the PLSR model, a prediction was considered unreliable if the spectra fell outside the calibration space with an F-ratio greater than 0.99.However, since MBL uses spectrally similar neighbors to make predictions for new samples, we defined an additional criterion to detect samples yielding unreliable predictions.New predictions were considered unreliable if the new sample used spectrally similar neighbors with an F-ratio greater than 0.99 to make the prediction.Since the RF model uses a different approach to make predictions and does not rely on the relative location of spectra in orthogonal space, we detected outliers using the probability distribution of the relative deviation (uncertainty/predicted value) from the validation sets.Samples were flagged as untrustworthy when the probability of the relative deviation was greater than 0.99.
Data processing, analysis and prediction were carried out using high-performance Google Cloud computing (https://cloud.google.com)with 52 GB RAM, Google bucket unlimited hard disk space and 8 cores running on a Debian GNU/Linux 9 operating system and R 3.2.3platform.Multiple VM instances were created in Google Cloud to speed up data processing, analysis and prediction.Total computational time from scratch was approximately 121 CPU hours.The script used for processing, analysis and prediction of all soil properties is freely available in the WHRC GitHub (https://github.com/whrc).

Exploratory Analysis of the KSSL MIR Library
Many of the soil properties in this study deviate substantially from the normal distribution, as indicated by their high skewness values (a measure of symmetry) and kurtosis (a measure of heavy or light tail relative to the normal distribution) (Table 1).Clay content, bulk density and pH appear to be normally distributed, as their skewness values are close to 0, and their kurtosis values are close to 3. On the other hand, CO 3 , CEC, exchangeable Ca and the extractable phases of Al and Fe are skewed and heavily tailed, indicating a non-normal distribution.Across different soil horizons, OC was found to be the highest in the O horizon with a mean value of 35.7 wt %, while carbonates (CO 3 ) were the highest in the B, B/C and C horizons with mean values of 7.8, 8.5 and 9.2 wt %, respectively.While half of the samples did not have a recorded soil order, all soil orders except oxisols (n = 6) and gelisols (n = 214) are represented by >500 samples (Table 2).

Overall Model Performance
Results show that excellent calibration models were obtained with RPD ≥ 2.0 for all soil properties using the Cubist and RF models (Table 3).Global PLSR also produced excellent calibration models for all soil properties except Fe (fair model; RPD = 1.5).When the performance of the calibration models was assessed using R 2 , RF produced calibration models with R 2 ≥ 0.95 for all properties.Cubist also produced calibration models with R 2 ≥ 0.95 for all properties except Fe (R 2 = 0.88) and BD (R 2 = 0.88, 0.87 and 0.88 using clod, core and combined clod and core approaches, respectively).The global PLSR models produced calibration models with R 2 ≥ 0.94 for OC, CO 3 and CEC, while it had slightly lower R 2 for clay, OCD, Ca, Al, pH and BD (using core and combined approach) (0.80 ≤ R 2 < 0.90).The calibration models with the lowest R 2 were obtained for BD (R 2 = 0.77 using the clod approach) and Fe (R 2 = 0.58), respectively.The optimal number of principal components selected in the global PLSR models using the one-sigma approach was 20 for all soil properties except BD, for which 19 and 16 components were selected for the clod and core methods, respectively, and 18 components were used for the combined BD model.Since MBL produces new calibration models for the prediction of each sample, there are no calibration statistics for the local model.Ratio of performance to deviation (RPD) is calculated as the ratio of standard deviation of the observed soil property to the root-mean-square error in the prediction, and mean deviation (MeanDev) is the average of the uncertainty estimates for all samples in the validation sets.All predictions were back-transformed prior to calculating goodness-of-fit statistics.
When these models were tested on an independent validation set, MBL (Figure 2) demonstrated excellent performance, with RPD ≥ 2.0 for all soil properties.Cubist (Figure S6) and RF (Figure S7) also showed excellent model performance, with RPD ≥ 2.0 for all soil properties except Fe and BD using the combined clod and core methods (fair models; 1.7 ≤ RPD ≤ 1.9).The global PLSR (Figure S8) produced fair models when tested on an independent validation set for pH (RPD = 1.9),Fe (RPD = 1.7) and BD using the clod-only and combined clod and core methods (RPD = 1.8 and 1.7, respectively).When model R 2 was used to test the predictive performance of all models, Cubist (Figure S6) and MBL (Figure 2) predicted all soil properties with R 2 ≥ 0.90 except OCD, pH, Fe and BD (Tables 3  and 4).The random forest models (Figure S7) showed the greatest drop in performance between calibration and validation sets, with R 2 values dropping by 0.16 and 0.26 for the more difficult to predict properties (pH and Fe, respectively).In the case of BD, model fit (R 2 ) dropped by 0.19, 0.18 and 0.25 using the random forest models (clod, core and combined approaches, respectively).Cubist also demonstrated a tendency to overfit for these three properties (pH, Fe and BD using all three methods).On the other hand, the global PLSR models (Figure S8) held up to the independent validation data and produced very similar performance statistics (Tables 3 and 4).For the properties with the best predictions (OC, CO 3 , CEC and clay)-i.e., properties for which all models produced R 2 > 0.92-the RMSE values suggested that MBL outperformed the Cubist, PLSR and RF models.The RMSE values for RF and PLSR were 35-86% higher, while those of Cubist were 3-9% higher than the RMSE values for MBL for the top four properties (with R 2 > 0.92).For other soil properties, such as Ca, Al, pH and Fe, the RMSE values for the RF and PLSR models were 23-150% higher than the RMSE values using the MBL and Cubist models.For bulk density, which is a more difficult to predict soil property, evaluation of model performance on independent validation sets indicated that MBL (Figure 3) and RF (Figure S9) were slightly superior compared with the Cubist and PLSR models (RMSE improved by 7-21%) using combined samples (clod and core) (Tables 2 and 3).However, the MBL, Cubist and RF models showed a similar performance (RMSE = 0.21) in the case of BD estimated using the core method.For BD estimated using the clod method, MBL and RF slightly outperformed the Cubist and PLSR models (10-20% improvement in RMSE).

Absolute Model Error and Prediction Uncertainty
While the RPD, R 2 and RMSE values give an indication of overall model performance, examination of the absolute error (observed-predicted values) in the validation sets is particularly diagnostic.A comparison of absolute model error for OC shows that MBL predicted 77% and 94% of the samples, with absolute model error ≥ 0.2 wt % and ≥ 1.0 wt %, respectively (Figure 4a).Cubist produced very similar results, with 79 and 94% of the samples below the two thresholds.In the case of poorly predicted BD, MBL was still the superior model, predicting 58% and 85% of the samples with absolute model error ≥ 0.1 g/cm 3 and ≥ 0.2 g/cm 3 , but was comparable to the absolute model error based on Cubist and RF (Figure S10).Despite an R 2 > 0.98 for the OC validation set using MBL and Cubist, about 6% of these samples were still predicted with poor accuracy (defined as absolute error > 1.0%).Figure S11 illustrates that these poor predictions can occur for almost any measured value of OC, but the relative error (model error/observed) was generally much lower for higher values.Assessment of the prediction uncertainty for these soil properties indicated that the MBL model consistently provided a low prediction error compared with the PLSR and RF models.The mean prediction ± prediction error for OC in the independent validation sets was 5.17 ± 0.08, 5.06 ± 0.26 and 5.07 ± 0.39 (wt %) using the MBL, PLSR and RF models, respectively.Likewise, in the case of BD, mean prediction ± prediction error was 1.30 ± 0.02, 1.30 ± 0.08, 1.29 ± 0.08 (g/cm 3 ) using the MBL, PLSR and RF models, respectively.Overall, our results show that the uncertainty estimates for OC using MBL (observed mean = 5.18 wt %) and BD (observed mean = 1.32 g/cm 3 ) were within 1.5% and 2.3% of the predicted mean (5.17 wt % for OC and 1.3 g/cm 3 for BD), respectively.Unfortunately, there is currently no method to assess the prediction uncertainty in Cubist.In the case of MBL, we further determined the samples with the highest and lowest prediction error for OC from the independent validation sets (N = 8498).The largest prediction error in the local model was often associated with picking neighboring samples that are farther apart in orthogonal space (Figure S12).For example, results show that the average distance between the neighbors and the prediction sample was 0.15 units for the sample with the largest prediction error.In contrast, the average distance in the case of the prediction sample with least prediction error was 0.10 units, indicating that prediction errors are low when samples that are spectrally similar are used to make predictions of any new samples.
We also assessed the trustworthiness of each new prediction in the validation sets by detecting outliers using the probability distribution of the relative deviation for RF and F-ratio (Equations ( 9) and ( 10)) for the MBL and PLSR models.The RF model flagged 86 samples, while the MBL and PLSR models flagged 33 and 7 samples, respectively, out of 8512 samples in the validation sets for OC (Figures 4 and S13).In the case of BD, the RF model flagged 34 samples, while the MBL and PLSR models flagged 33 and 10 samples, respectively, out of 3306 samples in the validation sets (Figures S10  and S13).The samples flagged as having been unreliably predicted were distributed across the entire distribution of the absolute error and prediction uncertainty (Figures 4 and S10), indicating that poorly predicted samples (high absolute error) were not flagged as outliers and did not have large prediction errors using all three models.Additionally, results show that the absolute error was significantly related to prediction uncertainty for OC but not for BD (Figures 4 and S10).

KSSL MIR Library and Its Non-Normal Distribution
In this study, we utilized a large nationally comprehensive soil MIR library (Figure 1) available through the USDA NSSC-KSSL soil database, which contains 15,118 unique pedons representing 50,893 samples with varying amounts of associated analytical data.As emphasized by Wijewardane et al. [16], the soil properties in the database mostly follow a non-normal distribution (Table 1).One way to improve the prediction of soil properties with data that follow a non-normal distribution is to transform the data using a variety of approaches, such as box-cox, natural log and square root transformation [8,11,30,45,76,77].For example, Waruru et al. [77] applied natural log and square root transformation to highly skewed and slightly skewed soil properties, respectively.Baldock et al. [11] showed that square-root-transformed OC performed better than log-transformed OC by improving R 2 and reducing the model bias in both the calibration and validation sets.In this study, since some soil properties were highly skewed compared with others, we developed calibration models and tested their performance on independent validation sets using three different transformations (square root, box-cox and log) for two soil properties representing high skewness (OC) and near-normal distribution (pH) (Figures S1 and S2).The results from this exercise indicate that, for skewed data (OC), square root transformation produced the best PLSR model performance.For normally distributed data (pH), there was no improvement in performance.Additionally, for the machine learning (RF and Cubist) and local models (MBL), transformation was not necessary (Table S1).

Model Performance for a Range of Soil Properties
Our results indicate that MIR spectroscopy combined with a highly heterogeneous database can provide excellent predictions of most soil properties, with validation RPD ≥ 2.0, using both machine learning and regression approaches (Tables 3 and 4).Only a few soil properties have RPD < 2.0 using the Cubist, RF and global PLSR models (Fe and BD all for Cubist and RF models; Fe, BD core and BD all for PLSR model).Using R 2 to assess the prediction performance also produced similar results, with validation R 2 ≥ 0.98 for OC and CO 3 and validation R 2 ≥ 0.83 for CEC, clay, OCD, Ca and Al, regardless of the methodology used to develop the calibration models.For pH, Fe and BD (particularly for the clod and combined methods), the model choice made a significant difference in the quality of predictions (Tables 3 and 4).Better prediction results were attributed particularly to strong absorption bands associated with soil mineral and organic bonds in the MIR region [6,7].Additionally, due to the large spatial variation in soil properties, the relationship between spectra and soil attributes can be complex, requiring a large sample size to adequately represent the distribution of samples across space [5].Models built using a large spectral database can help to minimize calibration errors and better predict soil attributes from independent datasets [45].However, only a few large-scale vis-NIR databases and even fewer MIR libraries are available [78].In addition, these spectral libraries have rarely been considered as an operational tool for the assessment and prediction of soil properties [10].
When the entire spectral library was used to build a single global model (global PLSR), soil properties such as OC and CO 3 were accurately predicted with an R 2 ≥ 0.98 (Table 3).Other studies using an MIR library to build global PLSR models have reported similar results for these properties.Baldock et al. [11] found a good prediction of OC (R 2 = 0.93) and CO 3 (R 2 = 0.93) in Australian soils.Similarly, McCarty et al. [15] reported a better prediction of OC (R 2 = 0.82) and CO 3 (R 2 = 0.87) using 273 samples from diverse locations in the central United States, while Grinand et al. [79] predicted OC with an R 2 of 0.89 and CO 3 with an R 2 of 0.97 in European soils using >1700 samples.The better PLSR performance in this current study in the prediction of OC and CO 3 is likely associated with the number of samples used to build the calibration models [25].For example, using 10 different calibration intensities (from 10 to 100%), Clairotte et al. [25] found that the standard error of the validation samples decreased when the size of samples used to build the calibration model increased from 10 to 100%.Given the size of the MIR library used in this study, it is not surprising that the calibration models performed better when predicting OC and CO 3 compared with previous studies.
Assessment of the models on the basis of RMSE values presents a more nuanced picture, with greater discrepancies between different models (Table 3).While OC in an independent validation set was predicted with an R 2 of 0.99 by all models, the RMSE was 80% higher for the global PLSR model compared with Cubist and MBL.This increase in RMSE for the global model is most likely attributed to the fact that, for the prediction of any particular sample, there is a lot of extraneous information (i.e., samples with highly dissimilar properties) contained in the global PLSR model [76].MBL, a nonlinear modeling approach, was designed to handle this exact problem [24] and had lower RMSE values for all properties compared with the global PLSR (Table S3).The superior performance of local PLSR compared with global PLSR is a result of local PLSR searching for spectrally similar neighbors in the PC space [24,80], thereby allowing the effective removal of uninformative and less relevant samples for each new prediction in the large spectral library.Using local regression, previous studies have found typically better prediction compared with global PLSR [24,25], although this is not always the case [26].For example, the addition of new samples improved the prediction performance of CO 3 but led to the inaccurate prediction of other soil properties [26].The improved model performance of CO 3 was due to the addition of new samples that were rich in carbonates.
Machine learning models can also handle this extraneous (nonlinear) information shortcoming in global PLSR models by finding patterns in the spectral library and sub-setting the data accordingly [37].The Cubist and RF models generally outperformed the global PLSR models.Cubist consistently produced better prediction results with higher fitting R 2 and lower RMSE than RF for all soil properties except BD.Previous studies have demonstrated that the Cubist modeling approach is slightly superior in the prediction of soil properties compared with other machine learning approaches, such as support vector machines [10].This is because Cubist is a form of piecewise linear decision trees that are able to make very efficient spectral variable selections and handle nonlinear relationships between dependent and independent variables [81].The Cubist model constructs a regression tree in which each terminal node contains multivariate linear models instead of discrete values.The prediction of a soil property of a new sample is achieved first by identifying where the sample falls in the tree node, and then a linear model is fitted using all combinations of the samples within that node [82].However, several studies have demonstrated that calibration models developed using Cubist tend to show a good model fit at low values, while the residuals increase at higher values [38,66].Doetterl et al. [66] suggested that large residuals at high values are not a general feature of Cubist but are associated with a limited sampling density at high concentration values.In this study, we did not find any large residuals at higher values for all soil properties, which is possibly related to the inclusion of enough samples with high values.Interestingly, RF developed superior calibration models for all soil properties compared with Cubist, but the application of the calibration models to independent validation sets resulted in a significant decrease in performance compared with the Cubist model.This might possibly be associated with more continuous predicted values in the case of the Cubist algorithm, because each linear model at the terminal node in Cubist allows for a smoother transition in the prediction between trees [67].However, in RF, a random subset of training data are selected through bagging, and the final prediction is the average of all individual tree outputs [62].

The Importance of Estimating Prediction Uncertainty
While the RMSE values reported in Table 3 give an indication of the overall model performance over a large range of each soil property (Table S3), those RMSE values do not give an indication of the uncertainty in individual predictions [68].For example, all models produced excellent fits for OC (R 2 = 0.99), but the RMSE values were 0.70-1.27wt %, which would be considered unacceptably high for most management applications.Therefore, we used approaches to provide uncertainty estimates for each prediction using the global PLSR, MBL and RF models.Here, there is a clear difference in model performance, with MBL producing significantly narrower prediction intervals compared with global PLSR and RF (Table 3, Figure 4).
Both the MBL and PLSR models use a similar approach to uncertainty estimation, but the models using MBL yielded a much narrower range of uncertainty estimates.This is because MBL searches for the most spectrally similar neighbors corresponding to each new prediction sample in the spectral library, thus introducing low residual variance to the calibration subset.The U-deviation uses four major parameters to estimate the prediction uncertainty: (1) x-residuals in the calibration set; (2) x-residuals in the validation set; (3) y-residuals in the validation set; and (4) the average distance of the validation samples to all the samples in the calibration sets used to make predictions [50].Of the four parameters, the y-residuals in the validation set are the most difficult to estimate in the case of MBL, because this model uses different calibration subsets for predicting samples in the validation sets.As a result, the y-residual variance was estimated using a leave-group-out cross-validation.In the case of global PLSR, estimating the prediction uncertainty using U-deviation is fairly simple, because there is only one single model to make predictions for all samples in the validation sets.Therefore, y-residual variance is simply the variance in the response variable of the validation sets.
Prediction uncertainty using RF was higher than the global PLSR for OC but similar to the PLSR model for BD (Figures 2 and 3).This difference may be due to the use of a different approach to estimating the prediction uncertainty.In the case of RF, we used the non-parametric infinitesimal jackknife approach, which is a modified jackknife approach [72].In the jackknife approach, each observation is omitted to recompute the prediction of the remaining observations.This process is repeated for all the observations to produce an estimate of the variance.In the infinitesimal jackknife approach, rather than omitting one observation at a time, the observation is given a slightly lower weight compared with the other observations.Compared with the global PLSR and MBL uncertainty estimates, the infinitesimal jackknife approach relies more on the variance between the prediction and the observation.
Prediction uncertainty and outlier detection have two important functions in operationalizing soil spectroscopy as a routine analysis tool: (1) they give a true estimate of the uncertainty of new predictions and (2) they allow for an assessment of the trustworthiness of a new prediction.This second function is critical in that there needs to be a way to assess when a new sample has not been predicted well and should be sent to the lab for regular analysis [81].In this study, a new prediction was defined as trustworthy when the spectra corresponding to each new prediction had an F-ratio ≤ 0.99 using the MBL and PLSR models.In the case of the RF model, the trustworthiness was assessed using the probability distribution of the relative deviation (uncertainty/predicted value), with values ≤ 0.99 considered representative of a reliable prediction.Typically, MBL model predictions were the most trustworthy on the basis of analyzing the absolute error in validation set prediction (Figures 4 and S12).High trustworthiness in the case of MBL is likely associated with picking up k spectrally similar neighbors in the reference sets that are highly relevant to a sample in the validation sets.In contrast, prediction developed using the PLSR models was less trustworthy, particularly when the spectral library used to build the calibration models contained samples that cover a wide range of geographic and climatic conditions, thereby introducing high variance and bias to the calibration spectra.For example, using the PLSR model, only 51% of the samples were predicted with absolute error ≥ 0.2 wt % (Figure 4).In the case of RF, the jackknifing approach provides estimates of bias and variance by removing each sample at a time from the dataset, and therefore uses less information (fewer samples) than other bootstrapping techniques [83].As a result, the prediction uncertainty estimated using jackknifing is more conservative and provides estimates with larger uncertainties compared with other approaches [84], but it is typically considered unbiased [85].Additionally, since random forest uses an ensemble learning method for the prediction of a new sample, the trustworthiness of each new prediction is independent of the location of the samples in the calibration space (Figure S13).

Best Model Performance
Overall, our results indicate that MBL and Cubist outperformed the global PLSR and RF models.In large and complex datasets, the relationship between soil properties and spectra can be highly nonlinear [14,86].As a result, both MBL and Cubist were able to better predict soil physical and chemical properties.The results of this study are consistent with previous studies reporting their best model performance using local models or Cubist [25,87,88].In particular, MBL models are able to better manage nonlinearity and extraneous information in the spectra by using spectrally similar neighbors in the reference sets to fit a new target function for each sample in the validation set [89].This allows MBL to remove unrelated and extraneous samples during calibrations [87].
While the performance of Cubist and MBL on independent validation sets was essentially a draw across all properties (Table 3, Figure 4a), MBL has two advantages, which we believe make this model a superior choice for developing soil spectroscopy into a robust and routine predictive tool.First, there is no unique calibration model, meaning that as the spectral library grows, there is no need to periodically update calibration models.Second, an estimate of uncertainty can be calculated for each prediction, and this is something that has not yet been implemented in Cubist.A prediction interval provides both a true estimate of the precision of an individual prediction and a way of assessing the trustworthiness of that new prediction when there are no lab data.
Although MBL and Cubist outperformed the global PLSR and RF models, the high computational demands for building MBL and Cubist can impede the application of these models, particularly when dealing with large and complex datasets.The PLSR model was approximately 9.0 and 5.0 times faster than MBL and Cubist, respectively, while the PLSR and RF models have similar computational times.The higher computational time associated with MBL is not surprising, because the MBL function requires locating spectrally similar neighbors in PC space to fit a local target function for each sample in the validation sets [24].Regardless of the high computational demand associated with MBL, the improvement in RMSE and narrow prediction interval compared with the global PLSR and RF models indicates that the MBL should be used as a predictive tool when dealing with large and complex datasets.

Conclusions
The MIR library used in this study primarily comprises samples from the US soil survey conducted over broad geographic locations, representing a range of climate, land use and geologic conditions from the United States.We demonstrated that, when combined with sophisticated chemometric tools, this MIR spectral library can provide accurate and precise predictions of numerous soil properties of new samples collected from at least the broad soil distribution of the USA.These results are particularly promising given how labor-intensive many of the traditional methodologies are for the properties examined in this study.The results of this paper form a basis for moving forward with an operational system to provide routine predictions with uncertainty estimates of these soil properties.In order to turn DRS into a routine operational tool, we suggest that three sets of information be provided: (1) overall model accuracy assessment (R 2 and RMSE) on an independent validation set; (2) precision or confidence intervals about a new prediction and (3) an outlier assessment for flagging new samples that fall outside of the calibration space and predictions that might be untrustworthy.Importantly, this system, as validated in this study, is only appropriate for new samples scanned on the same instrument used to create the spectral library.For this tool to be of general use to any soil scientist with a diffuse reflectance FTIR spectrometer, additional research is necessary to establish the need for and the approach to effectively implementing calibration transfer between spectrometers.

Figure 1 .
Figure 1.Locations of the soil pedons with mid-infrared (MIR) spectra available through the USDA National Soil Survey Center's Kellogg Soil Survey Laboratory (NSSC KSSL) soil database for samples from the United States.Soil pedons with an exact GPS location = 4744; soil pedons with county centroids = 10,302.

Figure 2 .
Figure 2. Comparison of the memory-based learner (MBL) model predictions with an independent validation set of laboratory-measured Al, Ca, CEC, clay, CO 3 , Fe, OC, OCD and pH.

Figure 3 .
Figure 3.Comparison of the MBL model predictions from an independent validation set with observations obtained using combined clod and core (a), clod-only (b) and core-only (c) methods for measuring bulk density (BD).

Table 4 .
Calibration and validation results of soil bulk density using PLSR and machine learning (memory-based learner (MBL), Cubist and random forest (RF)) models.Prediction of soil bulk density was assessed by fitting the model using data from two analytical methods (clod-volume measured at 1/3 bar water tension and core-volume measured at field soil moisture condition) and the two analytical methods combined (BD all ).All predictions were back-transformed prior to calculating goodness-of-fit statistics.

Figure 4 .
Figure 4. Absolute model error and uncertainty estimates (deviation) of independent validation sets for OC using the MBL, PLSR, RF and Cubist models.The plot in the top panel (a) shows the cumulative rank of the absolute difference between the predicted and observed values (N = 8512).The numbers in parentheses are the % of samples above 0.2 and 1.0 wt % of absolute error.Only absolute error is shown for Cubist.The figures in the bottom panel (b-d) show the relationship between absolute model error and deviation using the MBL, PLSR and RF models.The black cross symbols are the samples that were flagged as untrustworthy predictions using the MBL, PLSR and RF models.

Table 1 .
Summary statistics of soil properties available through the NSSC KSSL MIR library.

Table 2 .
Distribution of samples in the NSSC KSSL MIR library as a function of soil order and horizon.

Table 3 .
Calibration and validation results of nine soil properties using partial least squares regression (PLSR) and machine learning (memory-based learner (MBL), Cubist and random forest (RF)) models.