Using Mid-Infrared Spectroscopy to Optimize Throughput and Costs of Soil Organic Carbon and Nitrogen Estimates: An Assessment in Grassland Soils

: Low-cost techniques, such as mid-infrared (MIR) spectroscopy, are increasingly necessary to detect soil organic carbon (SOC) and nitrogen (N) changes in rangelands following improved grazing management. Speciﬁcally, Adaptive Multi-Paddock (AMP) grazing is being implemented to restore grassland ecosystems and sequester SOC often for commercialization in C markets. To determine how the accuracy of SOC and N predictions using MIR spectroscopy is affected by the number of calibration samples and by different predictive models, we analyzed 1000 samples from grassland soils. We tested the effect of calibration sample size from 100 to 1000 samples, as well as the predictive ability of the partial least squares (PLS), random forest (RF) and support vector machine (SVM) algorithms on SOC and N predictions. The samples were obtained from ﬁve different farm pairs corresponding to AMP and Conventional Grazing (CG), covering a 0–50 cm soil depth proﬁle along a latitudinal gradient in the Southeast USA. Overall, the sample size had only a moderate inﬂuence on these predictions. The predictive accuracy of all three models was less affected by variation in sample size when >400 samples were used. The predictive ability of non-linear models SVM and RF was similar to classical PLS. Additionally, all three models performed better for the deeper soil samples, i.e., from below the A horizon to the –50 cm depth. For topsoil samples, the particulate organic matter (POM) content also inﬂuenced the model accuracy. The selection of representative calibration samples efﬁciently reduces analysis costs without affecting the quality of results. Our study is an effort to improve the efﬁciency of SOC and N monitoring techniques.


Introduction
The ability to rapidly screen thousands of soil samples with a cost-effective and reproducible methodology is an extremely attractive prospect for improving soil carbon and health. Therefore, high-throughput technologies are needed as a low-cost alternative capable of accelerating screening and monitoring soil more effectively. Mid-infrared spectroscopy (MIR) as a tool for soil analysis is a promising alternative for scaling up conventional laboratory assays in vast areas with a high carbon sequestration potential in order to mitigate climate change impacts [1,2].
Grassland soils have a significant potential for soil organic carbon (SOC) sequestration at scale [1]. They also stock high amounts of nitrogen (N) per unit of SOC [3]. N availability may limit SOC sequestration in these soils depending on environmental conditions [4,5]. Over the years, several methods have been proposed to estimate a wide range of soil properties in grasslands and rangelands. Remote sensing, for example, has been widely applied to monitor temporal and spatial patterns [6,7]. Unfortunately, the use of remote technology to predict soil properties still has some limitations, resulting in inaccurate values with substantial uncertainties when estimating SOC and N stocks [8,9]. This is mostly due to visible near-infrared (VNIR) and shortwave (SWIR) spectral ranges, which have a low spatial resolution and can be affected by atmospheric distortion leading to an extremely low signal-to-noise ratio [10]. On the other hand, MIR has demonstrated the potential to achieve high accuracy for C and N predictions even in soils formed with different parent materials [1,2]. However, the precision and performance of MIR at the project scale have not been tested to quantify SOC and N content under different land management practices.
Given the huge demand for these analyses to improve soil health and promote ecosystem service markets, rigorous testing of MIR approaches is needed to optimize throughput and the cost of quantifying SOC and N in response to changes in management practices. For example, this could help compare the SOC sequestration potential by the adoption of regenerative grazing management, such as Adaptive Multi-Paddock (AMP), with conventional grazing (CG) [11,12]. Despite the well-known potential of MIR spectroscopy to obtain highly accurate soil C and N information, research on sample size and adequate calibration set size has not received much attention. It has been challenging to get clear and consistent guidelines for the optimal calibration set size, given that predictive accuracy is also partly determined by sampling design (i.e., spatial scales, sampling depth, land use) as well as the choice of model algorithms [13,14].
Consequently, over the last few decades, chemometrics/machine learning tools have been increasingly applied to spectral data in order to maximize the models' predictive accuracy for estimating soil parameters. The predictive power of emerging machine learning algorithms can offer substantial gains in accuracy relative to conventional chemometric methods. Partial least squares (PLS) is a linear and commonly chemometric technique used to estimate different soil properties. However, PLS is sometimes less accurate when using non-linear data [15,16]. Therefore, greater attention to non-linear modeling techniques is being paid, as such methods offer greater flexibility over linear methods, owing to their ability to capture more complicated relationships between specific spectral reflectance signatures and soil properties [17].
Additionally, a strategy for selecting an optimal calibration sample size is critical in order to build models with a satisfactory predictive ability. This is mainly because larger datasets may not always reduce model uncertainty. While many samples are often required to obtain robust calibrations in order to detect changes in SOC or N stock following improvements in management [18], analyzing a large number of samples for C and N on an elemental analyzer slows throughput and increases costs. Furthermore, measurements can be limited due to low helium supply [19]. Therefore, this study aims to optimize calibration sample sizes by evaluating how SOC and N estimation accuracy is affected by the number of calibration samples using different predictive models. The specific objectives of this study are to: (1) identify the optimal conditions for building SOC and N calibrations (e.g., sample size, sampling locations) in grazed pasture soils, (2) evaluate different machine learning algorithms for SOC and N predictions, and (3) compare the cost-effectiveness of using an optimal calibration dataset without affecting the model's accuracy.

Study Sites and Soil Sampling
The study sites and sampling details have been previously presented by [12]. Briefly, sites represented a latitudinal gradient from Adolphus, Kentucky, through Woodville, Mississippi ( Figure 1). The samples were collected in different soil series as defined in USDA Soil Taxonomy including Emory silt loam, Hartsell fine sandy loam, Loring silt loam, Trimble gravelly silt loam, and Cumberland gravelly/non-gravelly, ranging from loam to silty clay loam. The selection of the most representative five pairs of neighboring AMP and CG farms was based on the farms that most closely represented our definition of AMP grazing with a neighbor practicing CG, which is the most common and representative grazing. At each farm, 42 soil cores were sampled following the VM0021 "Soil Carbon Quantification Method" [20], for a total of 420 cores. At each farm pair, soil cores were distributed in Environments 2022, 9, 149 3 of 18 two representative catenae on a common soil type across paired farms, with three sampling zones (e.g., upper, middle, and lower slope) per catena (Supplemental Figure S3), and seven cores per sampling zone. Soil cores (1 m deep) were collected with a Giddings hydraulic probe mounted on an ATV using 5 cm in diameter sleeves, and further separated by depth in the laboratory. For this study we used the A horizon (0 to approx. 20 cm; topsoil) and the increment depths below A to 30 and 30-50 cm. loam to silty clay loam. The selection of the most representative five pairs of neighboring AMP and CG farms was based on the farms that most closely represented our definition of AMP grazing with a neighbor practicing CG, which is the most common and representative grazing. At each farm, 42 soil cores were sampled following the VM0021 "Soil Carbon Quantification Method" [20], for a total of 420 cores. At each farm pair, soil cores were distributed in two representative catenae on a common soil type across paired farms, with three sampling zones (e.g., upper, middle, and lower slope) per catena (Supplemental Figure S3), and seven cores per sampling zone. Soil cores (1 m deep) were collected with a Giddings hydraulic probe mounted on an ATV using 5 cm in diameter sleeves, and further separated by depth in the laboratory. For this study we used the A horizon (0 to approx. 20 cm; topsoil) and the increment depths below A to 30 and 30-50 cm.

Soil Analysis
Soil C and N concentrations were initially determined by dry combustion on a Costech ECS 4010 elemental analyzer (Costech Analytical Technologies, Valencia, CA, USA) on 2 mm sieved, finely ground and oven-dried soil samples as described in detail in [11]. In the few soils positive to the fizz test, soil inorganic C concentrations were quantified using an acid pressure transducer connected to a voltage meter [21] and subtracted from the total C concentration to determine SOC concentrations.
Soil organic matter physical fractionations were separated by size and density [22] on the 2 mm sieved samples composited by sampling zone and depth layer, to obtain a light particulate organic matter (POM), heavy POM, and a mineral associated organic

Soil Analysis
Soil C and N concentrations were initially determined by dry combustion on a Costech ECS 4010 elemental analyzer (Costech Analytical Technologies, Valencia, CA, USA) on 2 mm sieved, finely ground and oven-dried soil samples as described in detail in [11]. In the few soils positive to the fizz test, soil inorganic C concentrations were quantified using an acid pressure transducer connected to a voltage meter [21] and subtracted from the total C concentration to determine SOC concentrations.
Soil organic matter physical fractionations were separated by size and density [22] on the 2 mm sieved samples composited by sampling zone and depth layer, to obtain a light particulate organic matter (POM), heavy POM, and a mineral associated organic matter (MAOM) as described in [12]. All fractions were analyzed for %C and %N on an elemental analyzer as described above for the bulk soils. For the purposes of this study, the light and heavy POM C values were used.

MIR Measurements
For spectral analysis, air-dried soil samples (<2 mm) were ground using a mortar and a pestle and subsequently analyzed using a Digilab FTS 7000 spectrometer (Varian, Inc., Palo Alto, CA, USA) with a Pike AutoDIFF diffuse reflectance autosampler (Pike Technologies, Madison, WI, USA) for spectral analysis. The MIR (4000-400 cm −1 ) pseudo absorbance was obtained using a KBr background and deuterated triglycine sulfate detector. Each spectrum was made of 64 co-added scans and 4 cm −1 resolutions.

Sample Selection
A Kennard-Stone (KS) algorithm was used to split the whole dataset (1612 spectra) into a training set (1000 spectra), and a validation dataset of 612 spectra ( Figure 2). The KS was performed to ensure two subsets that follow the statistical distribution of the original dataset [23]. From the training subset, ten sample sizes of an increasing number of calibration samples, ranging between 100 and 1000, were randomly obtained to avoid optimistically biased performance estimates. Five replicates of each sample size were generated using the R package dplyr [24].
Environments 2022, 9, x FOR PEER REVIEW 4 of 18 matter (MAOM) as described in [12]. All fractions were analyzed for %C and %N on an elemental analyzer as described above for the bulk soils. For the purposes of this study, the light and heavy POM C values were used.

MIR Measurements
For spectral analysis, air-dried soil samples (<2 mm) were ground using a mortar and a pestle and subsequently analyzed using a Digilab FTS 7000 spectrometer (Varian, Inc., Palo Alto, CA, USA) with a Pike AutoDIFF diffuse reflectance autosampler (Pike Technologies, Madison, WI, USA) for spectral analysis. The MIR (4000-400 cm −1 ) pseudo absorbance was obtained using a KBr background and deuterated triglycine sulfate detector. Each spectrum was made of 64 co-added scans and 4 cm −1 resolutions.

Sample Selection
A Kennard-Stone (KS) algorithm was used to split the whole dataset (1612 spectra) into a training set (1000 spectra), and a validation dataset of 612 spectra ( Figure 2). The KS was performed to ensure two subsets that follow the statistical distribution of the original dataset [23]. From the training subset, ten sample sizes of an increasing number of calibration samples, ranging between 100 and 1000, were randomly obtained to avoid optimistically biased performance estimates. Five replicates of each sample size were generated using the R package dplyr [24].

Model Calibration
Prior to applying the machine learning algorithms, the spectra were preprocessed using the Savitzky-Golay smoothing filter method. The partial least squares (PLS), random forest (RF), and support vector machine (SVM) models were trained comparatively to develop calibration models for predicting C and N content. The machine learning analyses were implemented using the Caret package in the R software [25], using the default method for optimizing hyperparameters. These algorithms were selected because they differ in their linear and non-linear functional capabilities. PLS is a linear regression model

Model Calibration
Prior to applying the machine learning algorithms, the spectra were preprocessed using the Savitzky-Golay smoothing filter method. The partial least squares (PLS), random forest (RF), and support vector machine (SVM) models were trained comparatively to develop calibration models for predicting C and N content. The machine learning analyses were implemented using the Caret package in the R software [25], using the default method for optimizing hyperparameters. These algorithms were selected because they differ in their linear and non-linear functional capabilities. PLS is a linear regression model that can work efficiently with spectral data at a lower computational capacity [26]. While SVM and RF have non-linear fitting capabilities, they differ in computational demand [27,28].

Model Validation
To objectively reflect the models' predictive performance, the validation was carried out using an external dataset (610 samples), which was not involved in the model training process. Therefore, we assessed the model's accuracy as a function of the number of calibration samples while keeping the validation dataset constant. This type of validation is more rigorous in order to avoid over-optimistic calibrations that may be unable to cope with unknown samples [29]. Different metrics were computed to quantify the overall model performance; root mean square error (RMSE), coefficient of determination R-square (R 2 ), mean absolute error (MAE), and Nash-Sutcliffe efficiency coefficient (NSE). As regards these statistics, the best model should have the highest R 2 and NSE, and the lowest RMSE and MAE. The metrics RMSE and MAE are scale-dependent metrics with the same unit of measurement as the dependent variable, whereas NSE and R metrics are dimensionless metrics. Compared with MAE, the RMSE give a relatively high weight to large errors because the errors are squared before averaging. The sensitivity of the RMSE to outliers is the most common concern with the use of this metric; however, RMSE tends to become larger than MAE as the sample size increases [30,31]. The R 2 value close to 1 indicates that the predicted values may fit the measured data, whereas NSE shows how well the predicted data fit to the 1:1 line. When NSE = 1, it indicates a perfect match of the model to the measured data [32,33]. The metrics were calculated as follows: where y i andŷ i correspond to the measured and predicted values, respectively; y i is the mean value of the measured value and n is the number of measured data. ANOVA was conducted to test statistical differences between medians of sample sizes. Subsequently, Fisher's Least Significant Differences test (LSD) was applied when the Kruskal-Wallis results had statistical significance. After testing the effect of sample size, we selected the smallest size that was able to provide more accurate estimates for SOC in order to evaluate the impact of sampling depths and farm soils on C estimates. Model residuals (residual = measured − predicted) were standardized by dividing them using standard deviation [34]. The positive and negative values of the standardized residuals also indicate whether the expected values might be over-or underestimated by the model. The effect of particulate organic matter (POM) on the model accuracy was evaluated by plotting the standardized residual corresponding to the C model against the heavy sand-sized and free light POM content.

Descriptive Analysis of Soil Data
This dataset covers a broad range of SOC and N concentrations and climatic conditions (Table 1). In AMP grazing sites, SOC ranged from 1.28 ± 0.86 to 1.87 ± 1.36 and from 0.13 ± 0.09 to 0.21 ± 0.14 for N in the 0-20-cm depth increment. SOC and N concentrations were higher in farm pair 5 relative to the other in 0-20 cm depth increments. For these soil profiles (farm pair 5), the climate is slightly warmer and wetter with MAT of 19 • C and MAP of 1649 mm. For CG soils, lower values of SOC ranging 1.03 ± 0.63 and 1.4 ± 0.87 Environments 2022, 9, 149 6 of 18 were found in the topsoil (0-20-cm) compared with AMP grazing. In the deepest soil depth increment (30-50 cm), SOC ranged from 0.26 ± 0.42 to 0.13 ± 0.07, while the N concentrations differed slightly among sites. Table 1. Soil and climate characteristics in the neighboring adaptive multi-paddock (AMP) and conventional (CG) grazed pairs included in this study, data are from [11]. Mean and standard deviation of soil organic carbon (SOC), nitrogen (N), and bulk density (BD) are shown for each depth increment. Climate variables correspond to mean annual precipitation (MAP) and mean annual temperature (MAT).

Location
MAT (  This high soil variability along the latitudinal gradient ( Figure 1) encompassed soilspecific spectra characteristics in each farm pair (Supplementary Figure S1). The PCA score plots of the MIR spectra generated a clear separation of the five farm pairs into clusters, indicating high heterogeneity in the soil composition of the spectral database (i.e., organic matter, iron oxides and clay), while there were spectral similarities between CG and AMP soils in each pair. Pair 3 appears to show a greater separation than the other 4 farm pairs.
The calibration and validation datasets of the measured SOC and N concentration followed a right-skewed distribution (Supplementary Figure S2). In fact, most of the

Model Comparison and Influence of Training Sample Size
The performance of the PLS, SVM, and RF models for predicting SOC (Figure 3) and N ( Figure 4) values improved rapidly when the training dataset was increased to 400 samples. The standard deviation of all the different sample sizes was higher in a small-sized set (100-200 samples), both for SOC and N. We can observe a rapid decrease in RMSE and MAE in the 100 to 400 sample range. RMSE and MAE have similar values for all datasets, with RMSE sometimes being slightly larger. The prediction accuracy did not improve with over 400 samples for any model. A similar pattern of results was observed when the three different models were used to predict N (Figure 4). In addition, all models underestimated SOC when concentrations were above 2.0% ( Figure 5). The same behavior was observed for N concentrations above 0.20% ( Figure 6). The effect of sample size differed among models; for example, according to RMSE, R-square, and Nash-Sutcliffe criteria, the PLS model performed better than the RF and SVM models when sample size increased (Figures 3 and 4). The RF model yielded more accurate C and N predictions with sample sizes in the range of 400-600. However, when the sample size used for calibration was higher (n = 1000), the prediction was more scattered, tending to strongly overestimate both SOC and N (Figures 5 and 6).     Since the performance of different modeling approaches plateaued at approximately 400 samples, we tested the models calibrated using 400 samples as our "optimum" for analyzing residuals. To test the statistical significance as the sample size increased from 100 to 400, the metrics values of the three statistical approaches were evaluated with pairwise comparison (p-value < 0.05) ( Table 2). PLS, RF, and SVM models performed similarly when the sample size was approximately 400 samples for SOC and N. However, the predictive ability of PLS and SVM algorithms improved significantly, increasing the number of calibration samples from 100 to 400. With 400 samples, the mean values of RMSE were 0.35 ± 0.02, 0.39 ± 0.04 and 0.39 ± 0.02 for SOC, as well as 0.042 ± 0.004, 0.046 ± 0.007, 0.041 ± 0.001 for N, using PLS, RF, and SVM, respectively. The values of the R-square of the three algorithms were 0.90, representing a good fit between the predicted and measured values. Nash-Sutcliffe coefficients exceeded 0.8 for PLS at 400 samples, tending to be less likely to over-underestimate predicted values of SOC and N. Since the performance of different modeling approaches plateaued at approximately 400 samples, we tested the models calibrated using 400 samples as our "optimum" for analyzing residuals. To test the statistical significance as the sample size increased from 100 to 400, the metrics values of the three statistical approaches were evaluated with pairwise comparison (p-value < 0.05) ( Table 2). PLS, RF, and SVM models performed similarly when the sample size was approximately 400 samples for SOC and N. However, the predictive ability of PLS and SVM algorithms improved significantly, increasing the number of calibration samples from 100 to 400. With 400 samples, the mean values of RMSE were 0.35 ± 0.02, 0.39 ± 0.04 and 0.39 ± 0.02 for SOC, as well as 0.042 ± 0.004, 0.046 ± 0.007, 0.041 ± 0.001 for N, using PLS, RF, and SVM, respectively. The values of the R-square of the three algorithms were 0.90, representing a good fit between the predicted and measured values. Nash-Sutcliffe coefficients exceeded 0.8 for PLS at 400 samples, tending to be less likely to over-underestimate predicted values of SOC and N. Table 2. Mean and standard deviation of performance metric in terms of average root mean square error (RMSE), coefficient of determination (R-square), mean annual error (MAE) and Nash-Sutcliffe coefficient as a function of sample size using partial least squares (PLS), random forest (RF) and support vector machine (SVM) models based on five replicates randomly selected of each sample size. For each method, five calibrations were built for each sample size. The entire dataset included 1000 samples.

Influence of Sampling Depth and Farm Site on Model Accuracy
The plateau of the performance metrics was shown in Figures 3 and 4, where the calibrations exhibit only slight changes for sample sizes larger than 400. Each residual obtained by model calibrations using 400 samples was plotted to assess model fit at different depths ( Figure 7). Since the training algorithms retained comparable predictive abilities for SOC and N, the following results were based only on C values. The models satisfactorily predicted SOC at 10-50 cm depth. However, high positive residuals were observed in the upper A horizon (0-10 cm), which had higher and more variable SOC concentrations (Table 1). This suggests that all models underestimated C content in topsoils with high SOC. In the upper A horizon, the SVM algorithm outperformed other algorithms. However, SVM could not improve the accuracy of the predicted values for SOC concentrations between 0.5-1.0%, especially those at lower (10-50 cm) depths. It is noteworthy that grazing practices and farm sites exhibited a moderate effect on SOC prediction results (Supplementary Figure S3). Because of the tendency of all models to underestimate SOC content in the upper A horizon, model residuals for SOC content were analyzed against heavy sand-sized OM (heavy POM) and free light POM (Figure 8), since POM is typically a SOC fraction that increases with high SOC values [3]. The results indicate that SOC prediction accuracy decreases as POM increases, both in heavy and light fractions, despite having somewhat different dynamics (Figure 8). These relationships were characterized by linear (R 2 = 0.35) and exponential growth (R 2 = 0.21) behavior in heavy POM and free light POM, respectively. Figure 7. Effect of sampling depth on soil organic carbon (SOC) prediction using (A) partial least squares (PLS), (B) random forest (RF), and (C) support vector machine (SVM) models. The training dataset contains 400 samples. All standardized residuals (residual = measured − predicted) were then plotted against their predicted C content. If the calibration model is appropriate, the residuals should be distributed randomly around the y = 0 line.
Because of the tendency of all models to underestimate SOC content in the upper A horizon, model residuals for SOC content were analyzed against heavy sand-sized OM (heavy POM) and free light POM (Figure 8), since POM is typically a SOC fraction that increases with high SOC values [3]. The results indicate that SOC prediction accuracy decreases as POM increases, both in heavy and light fractions, despite having somewhat different dynamics (Figure 8). These relationships were characterized by linear (R 2 = 0.35) and exponential growth (R 2 = 0.21) behavior in heavy POM and free light POM, respectively.

Cost Analysis
To evaluate the cost and time advantage of using MIR to estimate SOC and N in projects with large quantities of samples (e.g., over 400), MIR spectroscopy (PerkinElmer, Spectrum 3 FT-IR spectrometer), and dry combustion (Costech ECS CHNSO elemental analyzer), techniques were compared in terms of equipment, maintenance, consumables, and technician costs. The cost analysis assumes both instruments require the same sample preparation (i.e., sieving, oven-dried finely ground). Table 3 shows that adopting MIR technology does appear to be cost and time effective. The use of spectroscopy could increase throughput time from 4 to 12 samples per hour, whereas decrease cost by 2.5 times per sample. Although both methodologies have a similar instrument cost, the cost per sample was lower for MIR. In the dry combustion method, consumables associated with carrier gas (helium), purge gas (oxygen), and other supplies increase the cost. The labor cost per sample using dry combustion and MIR methods is USD 4.38/sample and USD 1.46/sample, respectively (labor costs were estimated at the rate of USD 17.5/h of labor). For this number of samples (n = 400), the total dry combustion cost is nearly USD 7000, while it is only USD 2000 for spectroscopy. Likewise, annual maintenance costs associated with using MIR are USD 3000, whereas the costs for the elemental analyzer are USD 1700. Using an optimal dataset to perform robust calibration from a large data pool might save approximately USD 4200 and 180 h (7.5 days) of technician time. Table 3. Cost comparison analysis (USD) for measuring 400 and 1000 samples using FT-IR spectrometer and combustion analyzer.

Cost Analysis
To evaluate the cost and time advantage of using MIR to estimate SOC and N in projects with large quantities of samples (e.g., over 400), MIR spectroscopy (PerkinElmer, Spectrum 3 FT-IR spectrometer), and dry combustion (Costech ECS CHNSO elemental analyzer), techniques were compared in terms of equipment, maintenance, consumables, and technician costs. The cost analysis assumes both instruments require the same sample preparation (i.e., sieving, oven-dried finely ground). Table 3 shows that adopting MIR technology does appear to be cost and time effective. The use of spectroscopy could increase throughput time from 4 to 12 samples per hour, whereas decrease cost by 2.5 times per sample. Although both methodologies have a similar instrument cost, the cost per sample was lower for MIR. In the dry combustion method, consumables associated with carrier gas (helium), purge gas (oxygen), and other supplies increase the cost. The labor cost per sample using dry combustion and MIR methods is USD 4.38/sample and USD 1.46/sample, respectively (labor costs were estimated at the rate of USD 17.5/h of labor). For this number of samples (n = 400), the total dry combustion cost is nearly USD 7000, while it is only USD 2000 for spectroscopy. Likewise, annual maintenance costs associated with using MIR are USD 3000, whereas the costs for the elemental analyzer are USD 1700. Using an optimal dataset to perform robust calibration from a large data pool might save approximately USD 4200 and 180 h (7.5 days) of technician time. Table 3. Cost comparison analysis (USD) for measuring 400 and 1000 samples using FT-IR spectrometer and combustion analyzer.

Discussion
We investigated the effect of calibration set size on SOC and soil N predictions in AMP and CG grazing systems across a latitudinal gradient in the Southeastern U.S. Additionally, the sample size was evaluated using three different models of increasing complexity (PLS > SVM > RF). Our study demonstrates that MIR spectroscopy could be used in largescale grassland SOC and soil health projects as an alternative for obtaining reliable SOC and soil N estimates with higher throughput and lower costs than elemental analyses using dry combustion. We started with a highly variable set of grassland soils in terms of their MIR spectra, SOC, and N values as well as grazing management in order to include representative conditions for soil analyses in large-scale regional soil grassland projects.
In general, we observed a moderate effect of sample size on the ability to predict SOC and N in grassland soils. As sample size increased, calibration only slightly improved the model's predictive ability, which plateaued around 400 samples. In fact, our results also indicated that there would not be any advantage in increasing the calibration sample size beyond 400 samples. Thus, for example, large-scale grassland projects with over 400 soil samples could analyze 400 samples for C and N using dry combustion in an elemental analyzer, scan all the samples using MIR spectroscopy, and estimate the remaining unknown concentrations with the model predictions. Furthermore, we observed that a larger sample size often resulted in worse predictions. This occurred when the RF model was trained using 800-1000 samples. By contrast, when the sample size was too small (<200 samples) to train the models, the validations were biased toward producing over-estimates.
Our results indicate that all the models underestimated SOC and N content over 2.0% and 0.20%, respectively. Contrary to our expectations, these trends were not improved by applying the non-linear RF and SVM models. Because our dataset was strongly skewed toward low SOC and N concentrations, our suggestion for optimal calibration sample size values does not apply to other datasets with significantly different distributions. Therefore, generalizations could be highly inaccurate for other datasets since we must also take into consideration the distribution of calibration and validation datasets, as well as the range we want to predict more accurately. Similarly, in previous studies [35,36], accuracy has been shown to improve with a higher number of training samples. However, few studies have investigated the impact of calibration sample size on SOC and N.
The PLS model had the smallest median RMSE and a lower underestimation of SOC content in the entire range of calibration sample sizes when compared with RF and SVM. Hence, we did not find any advantage of applying non-linear algorithms such as RF and SVM to our dataset. Therefore, PLS was particularly useful for achieving higher accuracy with less computational power. These results are consistent with previous studies [35], where a smaller sample size is required in order to yield useful predictions using PLS instead of more complex algorithms. In addition, RF had a notably lower performance, even though we expected it to substantially improve its predictive ability when the sample size was increased. It has been suggested that the RF model is able to deal with many predictors and few samples, resulting in low overfitting [37]. Therefore, the RF training algorithms in high-dimensional datasets such as ours (800-1000 samples) might need more trees and a higher degree of tree depth, which could be a serious problem as the observed variables increase. In contrast, the predictive ability of SVM was distinctly lower when smaller sample sizes were used (100-300 samples). These observations are similar to those reported in other studies [17,38], where PLS outperformed SVM.
We have also investigated other factors influencing the tendency to overestimate measured values when calibrations (n = 400) were used. This tendency is not surprising, considering that the relatively poorly predicted C content above 2% at the 0-10 cm depth was influenced mainly by the POM content above this threshold. There was, however, no such effect on accuracy when farm pair location and grazing practices were evaluated. The full potential of MIR to predict soil properties is not always satisfactorily achieved if some of the unknowns' compositional or analytical profile is substantially different from that of the samples in the calibration set [39]. The light component of POM--which mainly consists of partially decomposed plant material with different chemical characteristics --might affect the performance of these models, resulting in less reliable estimates at <10 cm depth [40]. Therefore, the topsoil samples collected play a critical role, while the selection of algorithm and sample size might be less important. In this case, when analyzing shallow samples with high POM concentrations the use of elemental analysis by dry combustion may be more suitable than spectroscopy for C estimates. We also recommend additional work to train MIR to specifically quantify SOC content in POM [41,42].
A rough cost estimate shows that measuring SOC and N using MIR is significantly less expensive than dry combustion techniques. As other groups have shown for these approaches [43,44], MIR spectroscopy offers a cost-effective alternative to conventional methods for C and N estimates. For the MIR method, the highest SOC and N concentrations tended to be under-estimated; however, we believe it will not adversely affect the reliable assessment of these parameters, while it would clearly reduce the costs. In our case, there is a tolerable margin of error for predictions, which depends on the data and application. For instance, for the purposes of restoring soil, the error in relatively organic-poor samples is significantly diminished when compared to the undisturbed ones.

Conclusions
We have defined an optimal calibration set size for SOC and soil N predictions for grazing systems using models of increasing complexity coupled with MIR spectroscopy. Overall, we demonstrated a high predictive performance of PLS, RF and SVM models using datasets with about 400 samples. We found that large sample sets did not necessarily improve the accuracy of the three training algorithms. Moreover, an optimal dataset for one algorithm is not always optimal for other machine learning models. Consequently, there is no cut-off criteria for choosing the ideal sample size, since prediction accuracy depends significantly on each dataset's sample distribution, training algorithm, and sampling depth. The spectroscopic method was highly accurate for SOC and N estimates; however all models overestimated the highest values of SOC and N content. In addition, the nonlinear models were not able to improve upon the classical PLS performance. Our results also suggest that POM will likely introduce uncertainty into the models. Lastly, MIR spectroscopy provides a cost-effective alternative to conventional SOC and N combustion analysis. This study offers insight into MIR methodology as an efficient, scalable, and cost-effective tool that provides reliable C and N estimates for grazing soils.
Supplementary Materials: The following supporting information can be downloaded at: https:// www.mdpi.com/article/10.3390/environments9120149/s1, Figure S1: Principal component analysis (PCA) score plot showing farm pairs (1 to 5) with high spatial spectral heterogeneity between soils. Axis 1 accounts 42% of variance in the data set and Axis 2 accounts for 25%; Figure S2: Frequency histograms for calibration (1000 samples) and validation dataset (612 samples) dataset for A) soil organic carbon (SOC) and (B) total soil nitrogen (N). The predictions were achieved by keeping the same frequency distribution between the calibration and validation dataset using a Kennard-stone sampling algorithm; Figure S3: Effect of sampling grazing practice and farm-pair soil on soil organic carbon (SOC) prediction using partial least squared (PLS). The training dataset contains 400 samples. All standardized residuals (residual = measured -predicted) were then plotted against their predicted C content. If the calibration model is appropriate, the residuals should be distributed randomly around the y = 0 line.