The Influence of Spectral Pretreatment on the Selection of Representative Calibration Samples for Soil Organic Matter Estimation Using Vis-NIR Reflectance Spectroscopy

In constructing models for predicting soil organic matter (SOM) by using visible and near-infrared (vis–NIR) spectroscopy, the selection of representative calibration samples is decisive. Few researchers have studied the inclusion of spectral pretreatments in the sample selection strategy. We collected 108 soil samples and applied six commonly used spectral pretreatments to preprocess soil spectra, namely, Savitzky–Golay (SG) smoothing, first derivative (FD), logarithmic function log(1/R), mean centering (MC), standard normal variate (SNV), and multiplicative scatter correction (MSC). Then, the Kennard–Stone (KS) strategy was used to select calibration samples based on the pretreated spectra, and the size of the calibration set varied from 10 samples to 86 samples (80% of the total samples). These calibration sets were employed to construct partial least squares regression models (PLSR) to predict SOM, and the built models were validated by a set of 21 samples (20% of the total samples). The results showed that 64−78% of the calibration sets selected by the inclusion of pretreatment demonstrated significantly better performance of SOM estimation. The average improved residual predictive deviations (∆RPD) were 0.06, 0.13, 0.19, and 0.13 for FD, log(1/R), MSC, and SNV, respectively. Thus, we concluded that spectral pretreatment improves the sample selection strategy, and the degree of its influence varies with the size of the calibration set and the type of pretreatment.


Introduction
Soil organic matter (SOM) has become a popular topic in the past decade because of its vital role in ecosystem quality, food security, and global climate change [1][2][3]. Spatial and temporal monitoring and mapping of SOM are essential and urgent. However, traditional methods of measuring SOM, such as combustion or chromate oxidation, are expensive and time-consuming [4,5].
Visible and near-infrared (vis-NIR) spectroscopy is an inexpensive and quick technique for the measurement of soil properties (e.g., SOM), and has been continuously developed over the past 30 years [6,7]. Vis-NIR spectra contain the overtone and combination bands of functional group absorptions, such as C-H, C=O, N-H, and O-H, providing rich information about soil properties [8][9][10]. Five parts are usually involved in SOM estimation: (i) field sampling; (ii) measurements, in which the SOM content is determined and soil reflectance spectra are obtained; (iii) preprocessing, in which the reflectance spectra are preprocessed; (iv) calibration, in which a subset of samples is selected for building multivariate regression models that relate SOM content to reflectance data; and (v) validation, in which a subset of independent samples is used to assess the accuracy of the built multivariate regression models [11,12]. In part iv, the selection of representative calibration samples is decisive because all subsequent analyses rely on the selected subset [13]. Thus, special care must be given to the sample selection strategy [14,15].
Various approaches have been proposed to select representative samples, such as Kennard-Stone sampling (KS) [16], the D-optimal procedure [17], and sample set partitioning based on joint x-y distances (SPXY) [18]. Some studies have investigated the inclusion of auxiliary information or covariates in sample selection, such as landscape, geographical information, soil type, soil moisture, and parent material [19][20][21][22][23][24][25][26][27]. However, most methods of selecting samples use the raw spectra, which can be influenced by instrumental status, experimental conditions, soil particle size, and surface roughness [15,28]. Thus, extraneous interference may need to be removed before sample selection, an issue that is rarely discussed.
Spectral pretreatment is usually employed to reduce spectral noise when soil spectra are correlated with soil properties by using multivariate regression models, such as partial least squares regression (PLSR) [29,30]. Some studies have discussed suitable data pretreatments for regression analysis [12,29,31]. When selecting samples, some studies have merely utilized pretreated spectra, such as principal component analysis (PCA) scores, the logarithmic function (log(1/R)), and derivatives [9,32,33]. However, the effects of spectral pretreatment on the sample selection strategy have not yet been adequately studied. Moreover, whether the pretreatment of spectra is useful in selecting representative samples remains unclear.
The KS algorithm is a popular method that allows for the selection of samples with a uniform distribution over the predictor space [16,34,35]. KS selects samples based on spectral characteristics, and it is sensitive to the spectral change after pretreatment. Thus, we chose the KS algorithm.
This study aimed to explore the effects of spectral pretreatment on sample selection. We first applied spectral pretreatment and then selected samples. Six commonly used spectral pretreatments were included in sample selection. We then explored whether such a design can provide more representative samples and improve the subsequent performance of SOM models with vis-NIR spectroscopy.

Study Area
The study area is located in Chahe Town (29°45′18″ N, 113°52′30″ E) in central China (Figure 1). Chahe Town lies on the alluvial plain of Honghu Lake (43,100 ha) with an extensive network of rivers. This plain is one of the major rice-producing areas in China. The plain has a subtropical monsoon climate, with a mean daily temperature of 3.8 °C in winter and 28.9 °C in summer. The mean annual precipitation is 1154 mm, and the rainfall is concentrated in summer (from May to June). The main types of soil are Typical Haplaquept, Dystrochrept, Eutroboralf, and Hapludalf, according to the World Reference Base (WRB) for Soil Resource [36]. The main types of land use are paddy field and irrigated cropland. The region has undergone dramatic changes in land use and land cover, caused by human activities since the 1950s, which have recently raised ecological concerns [37,38].

Sample Collection
Topsoil samples (0-15 cm; n = 108) were collected in an area of 4340.85 ha in December 2011 (Figure 1). The distance between sampling positions was at least 20 m [39]. All samples were packed in sealed plastic bags and then transported to the laboratory. Two outliers were identified according to the 3σ criterion, and the remaining 106 samples were used.
Remote Sens. 2018, 11, x; FOR PEER REVIEW

Spectral Measurement and Chemical Analysis
The soil samples were air-dried and ground to pass through a 2-mm sieve. Each sample was divided into two parts: one was for spectral measurement and the other was for chemical analysis. The soil samples were scanned using an ASD FieldSpec® 3 portable spectroradiometer (Analytical Spectral Devices Inc., Boulder, CO, USA) with a spectral range of 350-2500 nm. Spectral measurement was conducted in a dark room, and a halogen lamp was used as the source of light at an incidence angle of 45°. The fiber probe was installed 12 cm above the sample surface at a zenith angle of 90°. For each sample, approximately 300 g of soil was placed in a 20-cm-diameter dish with a thickness of about 10 mm. Scans were made 10 times and averaged. SOM was indirectly determined by the potassium dichromate volumetric method in accordance with the Chinese standard (specification of soil test, SL237-1999) [40].

Spectral Pretreatment
SG smoothing is a popular smoothing filter for pretreating soil spectra [48]. It is a low-pass filter that smooths spectra by eliminating high-frequency noise and passing low-frequency signals [49]. The filter fits successive subsets (windows) of adjacent data points with a low-degree polynomial by linear least squares. For SG, the size of the window (filter width) was 15 nm and the order of the polynomial was 2 [50]. Derivatives can remove unimportant baseline signals, background interference, and spectral overlapping [51]. FD, which is the most commonly used pretreatment for removing baseline offset, was adopted in this study. SG smoothing was carried out before computing the derivatives.
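The window-based polynomial fit described above can be sketched in a few lines of NumPy. This is only an illustration, not the authors' implementation: the function names `savgol_coeffs` and `savgol_apply` are hypothetical, the 15 nm window is assumed to correspond to 15 bands at 1 nm spacing, and edge bands without a full window are simply dropped.

```python
import numpy as np
from math import factorial


def savgol_coeffs(window, polyorder, deriv=0, delta=1.0):
    """Savitzky-Golay weights that estimate the deriv-th derivative at the window centre."""
    if window % 2 == 0 or window <= polyorder:
        raise ValueError("window must be odd and larger than polyorder")
    half = window // 2
    # Design matrix of band offsets -half..half with columns z^0, z^1, ..., z^polyorder.
    offsets = np.arange(-half, half + 1)
    design = np.vander(offsets, polyorder + 1, increasing=True)
    # Row `deriv` of the pseudo-inverse maps the window values to the least-squares
    # polynomial coefficient c_deriv; the derivative at the centre is deriv! * c_deriv.
    weights = np.linalg.pinv(design)[deriv]
    return weights * factorial(deriv) / delta ** deriv


def savgol_apply(spectra, window=15, polyorder=2, deriv=0, delta=1.0):
    """Filter each spectrum (row); only bands with a complete window are returned."""
    weights = savgol_coeffs(window, polyorder, deriv, delta)
    spectra = np.atleast_2d(np.asarray(spectra, dtype=float))
    # np.convolve flips its kernel, so the weights are reversed to obtain a correlation.
    return np.array([np.convolve(row, weights[::-1], mode="valid") for row in spectra])
```

Calling `savgol_apply(R, deriv=0)` gives smoothed spectra, and `savgol_apply(R, deriv=1)` gives a smoothed first derivative in one pass, which is one common way of combining SG smoothing with FD.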
The transformation of reflectance R into log(1/R) highlights the edges of absorption features and helps achieve linearization between the spectra and SOM content [28,52]. Linearization is important for regression models because many modeling methods expect linear responses, which are easier to model than non-linear ones.
MC does not reduce multicollinearity in multivariate regression models, but it improves the numerical stability of some models (e.g., PLSR) [53,54]. MC calculates the mean spectrum of a data set and subtracts it from each spectrum:

$$x_i(\mathrm{MC}) = x_i - \bar{X} \tag{1}$$

where $x_i$ is the measured spectrum of sample #i, $\bar{X}$ is the mean spectrum of the data set, and $x_i(\mathrm{MC})$ is the corrected spectrum.
MSC was first introduced by Martens et al. [55] in 1983 and described in detail by Geladi et al. [47] in 1985. MSC attempts to eliminate or minimize the impact of scattering [28]. MSC hypothesizes that all samples have the same scatter level as the reference spectrum (e.g., the mean spectrum) [56]. Thus, MSC first performs a regression of a measured spectrum against the reference spectrum, and then corrects the measured spectrum using the constructed linear regression model. MSC is calculated using Equations (2) and (3):

$$x_i = a_i \mathbf{1} + b_i \bar{X} + e_i \tag{2}$$

$$x_i(\mathrm{MSC}) = \frac{x_i - a_i \mathbf{1}}{b_i} \tag{3}$$

where $x_i$ is the measured spectrum of sample #i; $a_i$ and $b_i$ are the intercept and slope, respectively; $\bar{X}$ is the mean spectrum of the data set; $e_i$ is the regression residual; $x_i(\mathrm{MSC})$ is the corrected spectrum; and $\mathbf{1}$ is a vector of ones.
SNV corrects the multiplicative interferences of light scatter and particle size [29], and it has a similar function to MSC [57]. In SNV, each spectrum is transformed by subtracting the spectrum mean and dividing by the spectrum standard deviation [58]. The main difference between the two methods is that SNV is applied to an individual spectrum, whereas MSC uses a reference spectrum [41]. SNV is calculated as follows:

$$\bar{x}_i = \frac{1}{m}\sum_{j=1}^{m} x_{ij} \tag{4}$$

$$x_{ij}(\mathrm{SNV}) = \frac{x_{ij} - \bar{x}_i}{\sqrt{\dfrac{\sum_{j=1}^{m}\left(x_{ij} - \bar{x}_i\right)^2}{m - 1}}} \tag{5}$$

where m is the number of wavelengths, $\bar{x}_i$ is the mean reflectance of sample #i over all wavelengths, and $x_{ij}$ and $x_{ij}(\mathrm{SNV})$ are the measured and corrected reflectance of the jth wavelength for sample #i, respectively.
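The four pointwise pretreatments described above are short enough to sketch directly. The following is a minimal NumPy illustration under stated assumptions, not the authors' code: spectra are stored as rows of a samples-by-wavelengths array, the function names are hypothetical, log(1/R) is taken as base 10 (the text does not state the base), and SNV uses the sample standard deviation (m - 1 in the denominator).

```python
import numpy as np


def log_inverse(R):
    """log(1/R) transform; base 10 is an assumption, the source does not state the base."""
    return np.log10(1.0 / np.asarray(R, dtype=float))


def mean_center(X):
    """MC: subtract the mean spectrum of the data set from every spectrum."""
    X = np.asarray(X, dtype=float)
    return X - X.mean(axis=0)


def msc(X, reference=None):
    """MSC: regress each spectrum on a reference spectrum, then remove intercept and slope."""
    X = np.asarray(X, dtype=float)
    ref = X.mean(axis=0) if reference is None else np.asarray(reference, dtype=float)
    corrected = np.empty_like(X)
    for i, spectrum in enumerate(X):
        slope, intercept = np.polyfit(ref, spectrum, 1)  # spectrum ~ intercept + slope * ref
        corrected[i] = (spectrum - intercept) / slope
    return corrected


def snv(X):
    """SNV: centre and scale each spectrum by its own mean and standard deviation."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=1, keepdims=True)
    std = X.std(axis=1, ddof=1, keepdims=True)
    return (X - mean) / std
```

Note the different axes: MC and MSC depend on the whole data set through the mean spectrum, whereas SNV and log(1/R) treat every spectrum independently.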

Sample Selection Method
The KS algorithm sequentially selects samples that are uniformly distributed over the spectral space by maximizing the Euclidean distance between them [35]. The Euclidean distance is based on spectral characteristics and calculated using Equation (6):

$$d(m,n) = \sqrt{\sum_{i=1}^{I}\left(x_{mi} - x_{ni}\right)^2} \tag{6}$$

where $x_{mi}$ and $x_{ni}$ are the reflectance of samples #m and #n at the ith wavelength, respectively, and I is the number of wavelengths.
First, KS selects the pair of samples with the maximum distance between them. Second, KS selects samples one at a time from the remaining subset: the next sample is the one farthest from the samples already selected. This iteration is repeated until the required number of samples is obtained.
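The selection procedure can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name `kennard_stone` is hypothetical, and "farthest from the samples already selected" is interpreted in the usual maximin sense (the candidate whose distance to its nearest selected sample is largest).

```python
import numpy as np


def kennard_stone(X, n_select):
    """Return the indices of n_select rows of X chosen by the Kennard-Stone rule."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    if not 2 <= n_select <= n:
        raise ValueError("n_select must be between 2 and the number of samples")
    # Pairwise Euclidean distances (Equation (6)), computed from the Gram matrix.
    sq = (X ** 2).sum(axis=1)
    dist = np.sqrt(np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0))
    # Step 1: start with the two samples that are farthest apart.
    first, second = np.unravel_index(np.argmax(dist), dist.shape)
    selected = [int(first), int(second)]
    remaining = [i for i in range(n) if i not in selected]
    # Step 2: repeatedly add the candidate farthest from its nearest selected sample.
    while len(selected) < n_select:
        nearest = dist[np.ix_(remaining, selected)].min(axis=1)
        pick = remaining[int(np.argmax(nearest))]
        selected.append(pick)
        remaining.remove(pick)
    return selected
```

Passing pretreated spectra (for example, SNV-corrected rows) instead of raw reflectance changes only the distance matrix, which is exactly how pretreatment enters the selection.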

Inclusion of Pretreatment in Sample Selection
All 106 samples were sorted in ascending order based on the concentration of SOM, and then 21 samples were chosen at an interval of four samples as the validation set (20% of the total samples). The validation set was used to validate the built models. Such a division strategy is different from sampling for map validation in digital soil mapping, such as purposive sampling and probability sampling [26,37,59,60]. In doing so, we ensured that the validation samples were evenly distributed in the range of the SOM concentration and covered the SOM diversity of expected future samples [44]. Moreover, such a division strategy was commonly adopted in previous studies using vis-NIR spectroscopy to estimate soil properties [27,38,61,62]. The raw spectra of the remaining 85 samples were pretreated by six methods, namely SG, FD, MC, log(1/R), MSC, and SNV. The raw spectra were also used for comparison.
The samples for the calibration set were selected using KS and pretreated spectra.The size of the calibration set was successively increased from 10 samples to 85 samples in increments of one (i.e., 10, 11, . . ., 85).
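The rank-based division into validation and calibration candidates might look like the sketch below. The reading of "an interval of four samples" as "every fifth sample in the sorted list" and the starting position (`offset=2`) are assumptions made for illustration; the function name `split_by_som_rank` is hypothetical.

```python
import numpy as np


def split_by_som_rank(som, step=5, offset=2):
    """Sort samples by SOM and take every `step`-th one (from `offset`) for validation."""
    order = np.argsort(np.asarray(som, dtype=float), kind="stable")
    validation = order[offset::step]
    calibration = np.setdiff1d(order, validation)  # all remaining sample indices
    return calibration, validation
```

With 106 samples this yields 21 validation samples and 85 calibration candidates. The calibration set sizes 10, 11, ..., 85 can then be generated by running the sample selection method on the candidate spectra once per size.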

PLSR Models
PLSR is a popular technique used to correlate soil spectra with SOM [33,63,64]. PLSR first projects the spectral data onto a low-dimensional space by maximizing the covariance between the soil spectra and SOM. Multiple regression analysis is then performed in the low-dimensional space. The calibration sets were used to construct the PLSR models. The number of latent variables was determined by leave-one-out cross-validation (LOOCV). We focused on sample selection; hence, no pretreatment was used in PLSR.
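To make the two steps (projection, then regression) and the LOOCV choice of latent variables concrete, here is a compact single-response PLS sketch using the NIPALS deflation scheme in NumPy. It is an illustration under assumptions, not the software used in the study: the function names are hypothetical, and picking the number of latent variables with the lowest cross-validated RMSE is an assumed criterion that the text does not specify.

```python
import numpy as np


def pls1_fit(X, y, n_components):
    """Fit a single-response PLS model; returns (coefficients, x_mean, y_mean)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xr, yr = X - x_mean, y - y_mean
    W, P, q = [], [], []
    for _ in range(n_components):
        w = Xr.T @ yr                 # direction of maximum covariance with y
        w = w / np.linalg.norm(w)
        t = Xr @ w                    # scores of the latent variable
        tt = t @ t
        p = Xr.T @ t / tt             # X loadings
        c = yr @ t / tt               # regression of y on the scores
        Xr = Xr - np.outer(t, p)      # deflate X and y before the next component
        yr = yr - c * t
        W.append(w)
        P.append(p)
        q.append(c)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    coef = W @ np.linalg.solve(P.T @ W, q)
    return coef, x_mean, y_mean


def pls1_predict(model, X):
    coef, x_mean, y_mean = model
    return (X - x_mean) @ coef + y_mean


def loocv_rmse(X, y, n_components):
    """Leave-one-out cross-validated RMSE for a given number of latent variables."""
    errors = []
    for i in range(len(y)):
        keep = np.arange(len(y)) != i
        model = pls1_fit(X[keep], y[keep], n_components)
        errors.append(pls1_predict(model, X[i:i + 1])[0] - y[i])
    return float(np.sqrt(np.mean(np.square(errors))))


def choose_n_components(X, y, max_components):
    """Pick the number of latent variables with the lowest LOOCV RMSE (assumed criterion)."""
    scores = [loocv_rmse(X, y, k) for k in range(1, max_components + 1)]
    return int(np.argmin(scores)) + 1, scores
```

In practice, an established implementation such as scikit-learn's `PLSRegression` would normally be preferred; the sketch only exposes the mechanics.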

Performance of Models
The performance of the PLSR models was assessed by a useful statistic, namely the residual predictive deviation (RPD), which was calculated as follows [65]:

$$\mathrm{RPD} = \frac{\mathrm{SD}}{\mathrm{RMSEP}} \tag{7}$$

where SD is the standard deviation of the reference values, and RMSEP is the root-mean-square error of prediction.
We compared the results of including pretreatment in sample selection with those of not including pretreatment in sample selection, and we calculated ∆RPD as follows:

$$\Delta\mathrm{RPD} = \mathrm{RPD}_{\mathrm{Pretreated}} - \mathrm{RPD}_{\mathrm{Raw}} \tag{8}$$

where $\mathrm{RPD}_{\mathrm{Pretreated}}$ is the RPD of the PLSR model when calibration samples were selected using the pretreated spectra, and $\mathrm{RPD}_{\mathrm{Raw}}$ is the RPD of the PLSR model when calibration samples were selected using the raw spectra.
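Both statistics are simple to compute. In the sketch below, the function names are hypothetical and SD is taken as the sample standard deviation (n - 1 denominator) of the validation reference values, which is an assumption because the text does not state the estimator.

```python
import numpy as np


def rmsep(y_true, y_pred):
    """Root-mean-square error of prediction."""
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))


def rpd(y_true, y_pred):
    """Residual predictive deviation: SD of the reference values divided by RMSEP."""
    sd = float(np.std(np.asarray(y_true, dtype=float), ddof=1))  # assumed sample SD
    return sd / rmsep(y_true, y_pred)


def delta_rpd(rpd_pretreated, rpd_raw):
    """Change in RPD when samples are selected on pretreated rather than raw spectra."""
    return rpd_pretreated - rpd_raw
```

A positive ∆RPD means that selecting the calibration samples on pretreated spectra gave a more accurate model on the same validation set.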

Descriptive Statistics of Soil Samples
SOM content varied from 4.06 g kg⁻¹ to 47.34 g kg⁻¹ (Table 1). The coefficient of variation (CV) was 0.39, which indicated that SOM was of medium variability (0.1 < CV < 1.0) [66]. The skewness was −0.19, which is close to zero, implying that the numbers of samples with low and high SOM contents were similar. The kurtosis was −1.06, which meant that there were fewer samples around the mean SOM content than in a normal distribution, and the distribution was relatively flat.
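For reference, the descriptive statistics discussed above can be computed as in the following sketch. The function name is hypothetical, and the skewness and excess kurtosis use simple moment estimators; statistical packages often apply small-sample corrections, so values may differ slightly from those in Table 1.

```python
import numpy as np


def describe_som(values):
    """Range, mean, SD, CV, skewness, and excess kurtosis of a set of SOM values."""
    v = np.asarray(values, dtype=float)
    mean = v.mean()
    sd = v.std(ddof=1)                      # sample standard deviation
    z = (v - mean) / v.std(ddof=0)          # standardized values for the moments
    return {
        "min": float(v.min()),
        "max": float(v.max()),
        "range": float(v.max() - v.min()),
        "mean": float(mean),
        "sd": float(sd),
        "cv": float(sd / mean),
        "skewness": float(np.mean(z ** 3)),
        "kurtosis": float(np.mean(z ** 4) - 3.0),  # 0 for a normal distribution
    }
```

A negative excess kurtosis, as reported here, corresponds to a flatter-than-normal distribution.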

Soil Spectral Characteristics
The spectral profile showed three prominent absorption peaks at ~1420, 1920, and 2220 nm, which were mainly caused by the hydroxyl group (OH) of free water at 1420 and 1920 nm and the Al-OH lattice structure in clay minerals at 2220 nm (Figure 2) [67]. The shape of the soil spectral reflectance curves was consistent with the results of other studies [14,68].



Accuracy of SOM Prediction after Including Pretreatment in Sample Selection
The control group, in which no pretreatment was applied to sample selection, was the basis for the following comparison, and its performance is illustrated in Figure 3a. When only a few samples (<16) were selected, the model performed poorly, with an RPD of only ~1.03. The RPD increased drastically to 1.52 at a calibration set size of 17 samples, remained stable with minor volatility (|∆RPD| ≤ 0.24) at calibration set sizes <58 samples, increased slowly (from 1.58 to 1.85) at calibration set sizes of 58-68 samples, increased sharply (from 1.85 to 2.22) at calibration set sizes of 68-71 samples, and then remained stable again.


Figure 3. Panel key: FD denotes first derivative (b); MC denotes mean centering (c); log(1/R) denotes logarithmic function (d); MSC denotes multiplicative scatter correction (e); SNV denotes standard normal variate (f). RPD denotes residual predictive deviation.
SG and MC had no impact on sample selection (Figure 3a,c). SG only slightly changed the spectra; thus, its effect was nearly negligible. According to Equations (1) and (6), MC does not change the Euclidean distance between samples; hence, it did not affect the sample selection, which was based on Euclidean distance.
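The invariance under MC follows in one line. Writing $\bar{X}_i$ for the value of the mean spectrum at the ith wavelength (a notation introduced here for the derivation), the mean spectrum cancels in every pairwise difference:

$$d_{\mathrm{MC}}(m,n) = \sqrt{\sum_{i=1}^{I}\left[\left(x_{mi} - \bar{X}_i\right) - \left(x_{ni} - \bar{X}_i\right)\right]^2} = \sqrt{\sum_{i=1}^{I}\left(x_{mi} - x_{ni}\right)^2} = d(m,n)$$

Because KS depends on the spectra only through these pairwise distances, it selects the same samples before and after MC.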
With the exception of MC and SG, the other four pretreatments affected sample selection differently (Figure 3b,d-f). For FD, the RPD increased linearly from 1.11 to 2.24 when more samples were selected (Figure 3b). FD considerably improved sample selection at calibration set sizes of 37-69 samples.
log(1/R) improved sample selection at calibration set sizes ≤68 samples, and the RPD with only 17 samples was as high as 1.96 (Figure 3d). The RPD of the raw spectra did not exceed 1.96 until more than 69 samples were selected. Thus, with the aid of spectral pretreatment, 17 samples obtained the same results as 69 samples.

Proportion of Pretreatment's Positive or Negative Influence on Sample Selection
The proportion of pretreatment's positive or negative influence on sample selection is shown in Figure 4. With the exception of SG and MC, the other four pretreatments improved sample selection in at least 64% of cases, namely FD (64%), log(1/R) (78%), MSC (72%), and SNV (72%). A satisfactory result was also achieved for the average degree of pretreatment's influence on sample selection (Figure 4). MSC performed best with the highest average ∆RPD (0.19), and the improvement was considerable. SNV and log(1/R) obtained the same average ∆RPD (0.13). FD slightly influenced sample selection with an average ∆RPD of 0.08.
The boxplot of the RPD values of the PLSR models after including pretreatment in sample selection is shown in Figure 5. Pretreatment improved sample selection in terms of the mean, median, first quartile, and third quartile of the RPD values. The analysis of variance (ANOVA) showed that log(1/R) and MSC significantly changed sample selection (p < 0.05) (Table 2). Thus, pretreatment's influence on sample selection was significant and positive in most cases.
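The proportions and the average ∆RPD reported above can be derived from two RPD series indexed by calibration set size. The sketch below is illustrative only: the function name is hypothetical, and averaging ∆RPD over all calibration set sizes (rather than only over the improved ones) is an assumption.

```python
import numpy as np


def summarize_delta_rpd(rpd_pretreated, rpd_raw):
    """Share of calibration sets improved or worsened by pretreatment, and the mean change."""
    delta = np.asarray(rpd_pretreated, dtype=float) - np.asarray(rpd_raw, dtype=float)
    return {
        "positive": float(np.mean(delta > 0)),   # proportion with a higher RPD
        "negative": float(np.mean(delta < 0)),   # proportion with a lower RPD
        "mean_delta": float(delta.mean()),       # average change in RPD over all sizes
    }
```

Each input would hold one RPD per calibration set size (10, 11, ..., 85), evaluated on the same validation set.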


Euclidean Distance between Samples after Pretreatment
The Euclidean distance between samples without pretreatment is shown in Figure 6a. Samples were sorted by SOM concentration. Sample #1 and sample #85 demonstrated the biggest difference in SOM concentration and exhibited the largest spectral distance (bright red). The distance from a sample to itself was zero (dark blue). A "zonal distribution" was observed: the color changed gradually from the bottom right (bright red) to the top left (dark blue). In each zone, the difference in SOM concentration of each pair of samples was roughly the same, as was the distance between each pair of samples. The spectral distance gradually increased with the difference in SOM concentration, resulting in a "zonal distribution".
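A distance matrix of this kind, and its change after pretreatment, could be computed as sketched below. The function names are hypothetical. Because raw and pretreated spectra can differ greatly in scale, each matrix is divided by its maximum before the difference is taken; this normalization is an assumption for illustration, since the text does not say how the change was scaled.

```python
import numpy as np


def pairwise_euclidean(X):
    """Matrix of Euclidean distances between all pairs of spectra (rows of X)."""
    X = np.asarray(X, dtype=float)
    sq = (X ** 2).sum(axis=1)
    return np.sqrt(np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0))


def sorted_distance_matrix(X, som):
    """Distance matrix with samples ordered by ascending SOM content."""
    order = np.argsort(np.asarray(som, dtype=float), kind="stable")
    return pairwise_euclidean(np.asarray(X, dtype=float)[order])


def normalized_distance_change(X_raw, X_pretreated, som):
    """Change in max-normalized distance after pretreatment (assumed scaling)."""
    raw = sorted_distance_matrix(X_raw, som)
    pre = sorted_distance_matrix(X_pretreated, som)
    return pre / pre.max() - raw / raw.max()
```

Positive entries mark sample pairs that pretreatment pushes apart relative to the rest, and negative entries mark pairs that it pulls together.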

Influence of Pretreatment on Sample Selection
The influence of spectral pretreatment on sample selection varies with the size of the calibration set. For example, log(1/R) considerably improved sample selection at small sample sizes, whereas its influence was weaker at large sample sizes (Figure 3d). In addition, when a large proportion of samples was to be selected, spectral pretreatment only slightly influenced sample selection. One reason behind this is that the impact of a newly selected sample on the calibration set weakened with increasing size of the calibration set.
There might be two potential applications. On the one hand, a subset of samples selected by the inclusion of pretreatment could lead to better performance than the total dataset. For example, 58 samples performed better (RPD = 2.34) than the total dataset (RPD = 2.22) with the inclusion of MSC in sample selection. On the other hand, pretreatment may help to reduce the number of calibration samples. For example, 17 samples selected by including the pretreatment of log(1/R) achieved the same model accuracy as 65 samples selected without pretreatments. Isaksson et al. used cluster analysis techniques to select 20 samples that obtained the same model performance as the original 114 samples [69]. Shetty et al. selected 19 samples that represented 118 samples by using Puchwein's method [13]. In the present study, similar results were obtained by applying pretreatment to sample selection.
Different types of pretreatment can have varied effects on sample selection. MSC performed best, followed by log(1/R), SNV, and FD. By contrast, SG and MC had no impact on sample selection. The reason behind this is that different spectral pretreatments improve the quality of spectra in various ways, such as eliminating the impact of scattering and particle size (SNV and MSC) [29], highlighting the edges of absorption features and achieving linearization (log(1/R)) [28,52], and removing baseline signals and spectral overlapping (FD) [51]. When the pretreated spectra are different, the sample selection method selects different samples. Furthermore, the dataset might be affected by many factors, and a single pretreatment might not be able to deal with all of them. Therefore, the optimal pretreatment depends on the dataset used, and a combination of multiple pretreatments might be more useful for sample selection, which requires further study.

How Pretreatment Affects Sample Selection
Pretreatment affects sample selection by changing the spectral distance (the Euclidean distance) between samples, and the change might be an enhancement or a weakening. For log(1/R), MSC, and SNV, the change is near (mean SOM concentration, low SOM concentration) and (mean SOM concentration, high SOM concentration). Thus, the change near the mean SOM concentration may facilitate sample selection. The reason behind this is that the sample closest to the mean is deemed to be the most representative [33,70]. However, FD changed the distance near (high SOM concentration, low SOM concentration), and its influence was weaker than that of the other three pretreatments.
The change in distance affects the process of selecting samples. Figure 7 illustrates an example of how KS selects 14 samples. After pretreatment with log(1/R), more representative samples were selected, and the RPD increased from 1.03 to 1.40. The major differences in sample selection between raw and pretreated spectra were at low SOM (Sample #10), mean SOM (Samples #35 and #55), and high SOM (Samples #76 and #80). This finding was in line with our previous results on the distance change. Another difference is that the intervals between adjacent samples became more reasonable. For example, samples #42 and #61 were selected without pretreatment, and the interval was large. However, after pretreatment, samples #40 and #55 were selected, and the interval decreased. Thus, pretreatment in sample selection could determine a subset of samples that spans the same space but is more evenly distributed in the space [71]. From the perspective of regression, a flat distribution of data is more favorable than a normal distribution [13].
Pretreatment is usually used in multivariate regression models (e.g., PLSR). In the present study, pretreatment was used in sample selection. In multivariate regression analysis, the aim of pretreatment is to improve model performance, and pretreatment works by linearizing the response of a variable and removing noise [72]. When selecting samples, however, the aim is to select representative samples, and pretreatment works by changing the differences or similarities between samples [56,73]. Thus, the effects of pretreatments on multivariate regression analysis and sample selection differed. This difference was verified by the results of Table 3 and Figure 4. In addition, a pretreatment that was useful in multivariate regression analysis was not necessarily useful in sample selection, and vice versa. For example, log(1/R) considerably improved sample selection, but it worsened the PLSR model.
The soil samples we used were air-dried, ground, and sieved. During in-situ application, field spectra are affected by environmental factors, such as soil water content and soil surface roughness [74]. The removal of the effects of water from field spectra could increase the performance of multivariate regression models [25,75]. The use of spectral pretreatment to remove the effects of environmental factors on field spectra might also help sample selection methods, which requires further study.

Conclusions
The present study included spectral pretreatment in a sample selection strategy to select calibration samples for SOM estimation using vis-NIR spectroscopy. From our results, we draw the following conclusions: (i) the inclusion of spectral pretreatment in sample selection can select more representative samples and improve the subsequent performance of SOM estimation; and (ii) the degree of the influence of pretreatment on sample selection can differ depending on the size of the calibration set and the type of pretreatment.
Despite our success in including pretreatment in sample selection to improve SOM estimation using vis-NIR spectroscopy, sample selection and SOM estimation can still be improved. Future research should focus on other sample selection methods, increased sampling density, field spectra, and the inclusion of multiple pretreatments in sample selection. Our study is based on local samples, but our strategy might also be applicable at a national scale.

Figure 1. Maps that show the location of the sampled region, the positions of the sampling sites, and the landscapes, as indicated by a Landsat 7 enhanced thematic mapper plus (ETM+) scan line corrector off (SLC-off) image with a composition of bands 4 (red), 3 (green), and 2 (blue).

Figure 2. The spectral reflectance of soil samples (n = 106). The principal positions of spectral absorption by organics and water are highlighted.

Figure 4. The results of the proportion of calibration sets when pretreatment influenced sample selection positively (dark gray bar) or negatively (light gray bar) and the average ∆RPD (black bar). SG denotes Savitzky-Golay smoothing. FD denotes first derivative. MC denotes mean centering. log(1/R) denotes logarithmic function. MSC denotes multiplicative scatter correction. SNV denotes standard normal variate.

Figure 5. A boxplot of the RPD of the partial least squares regression (PLSR) model after including pretreatment in sample selection. FD denotes first derivative. log(1/R) denotes logarithmic function. MSC denotes multiplicative scatter correction. SNV denotes standard normal variate.

Table 2. A comparison of different pretreatments in terms of residual predictive deviation (RPD) according to the analysis of variance (ANOVA) with a Games-Howell post-hoc test.
Type of pretreatment: None, FD, log(1/R), MSC, SNV. RPD (mean ± std. deviation), N = 76: None, 1.73 ± 0.33; FD, 1.81 ± 0.34 (p = 0.62)

Figure 6. The Euclidean distance among samples of raw spectra (a) and the change in Euclidean distance after the spectra were pretreated by first derivative (FD) (b), logarithmic function (log(1/R)) (c), multiplicative scatter correction (MSC) (d), and standard normal variate (SNV) (e). All the calibration samples are sorted in ascending order according to SOM content and then numbered #1, #2, ..., #85.

Figure 7. A subset of 14 samples selected based on raw and pretreated spectra. The gray ellipse shows the major difference in sample selection between raw and pretreated spectra.

Table 1. Descriptive statistics of 106 soil samples for the calibration and validation sets.
1 Range denotes the difference between the maximum and minimum observations. 2 SD denotes standard deviation. 3 CV denotes coefficient of variation. SOM, soil organic matter.

Table 3. Cross-validation of applying pretreatments to multivariate regression analysis for estimating soil organic matter (SOM).
Note: SG denotes Savitzky-Golay smoothing, FD denotes first derivative, MC denotes mean centering, log(1/R) denotes logarithmic function, MSC denotes multiplicative scatter correction, SNV denotes standard normal variate, RMSEcv denotes the root-mean-square error of cross-validation, R²cv denotes the coefficient of determination for cross-validation, RPD denotes residual predictive deviation, and ∆RPD denotes changed RPD.