A Near Standard Soil Samples Spectra Enhanced Modeling Strategy for Cd Concentration Prediction

: Visible and near-infrared (VNIR) spectroscopy technology for soil heavy metal (HM) concentration prediction has been widely studied. However, its spectral response characteristics are still uncertain. In this study, a near standard soil Cd samples (NSS Cd ) spectra enhanced modeling strategy was developed in order to to reveal the soil cadmium (Cd) spectral response characteristics and predict its concentration. NSS Cd were produced by adding the quantitative Cd solution into background soil. Then, prior spectral bands (i.e., the bands with higher variable importance in projection (VIP) score in NSS Cd spectra) were used for predicting Cd concentration in soil samples collected from the Hengyang mining area and Baoding agriculture area. The partial least squares (PLS) and competitive adaptive reweighted sampling-partial least squares (CARS-PLS) were used for validation. Compared to using entire VNIR spectral ranges, the new modeling strategy performed very well, with the coefﬁcient of determination (R 2 ) and the ratio of prediction to deviation (RPD) showing an improvement from 0.63 and 1.72 to 0.71 and 1.95 in Hengyang and from 0.54 and 1.57 to 0.76 and 2.19 in Baoding. These results suggest that NSS prior spectral bands are critical for soil HM prediction. Our results represent an exciting ﬁnding for the future design of remote sensing sensors for soil HM detection.


Introduction
Soil is a fundamental medium for material and energy circulation in terrestrial ecological systems. With rapid urbanization and industrialization, soils have been contaminated by heavy metal (HM) due to various smelting, mining, dust settlement, agricultural, and industrial producing activities [1][2][3]. Soil HM not only causes soil health degradation but also imposes significant harm on organisms by accumulating in the food chain. Therefore, soil HM contamination has gradually emerged as a serious problem all over the world [4][5][6], especially in China where in 2014 it was reported that about 19.4% of farmlands were polluted by heavy metals [7,8].
Conventional methods for HM investigation based on soil sampling in the field and subsequent chemical analysis in the laboratory are inefficient and costly [9,10]. In comparison, visible and near-infrared reflectance spectroscopy (VNIR) remote sensing technique is non-destructive, rapid, repeatable, and spatially and temporally continuous. For the VNIR technique, the prediction of soil Cd concentration relies on its relationship with soil spectral response characteristics. So far, a series of empirical models have been developed for predicting HM concentrations with VNIR spectra. These models include principal component analysis (PCA), partial least squares regression (PLS), random forest (RF), artificial neural network (ANN), and support vector machine (SVM) [11][12][13][14][15][16]. In this process, the subset bands of spectra have been found to be more effective in predicting soil HM due to the removement of redundant information from the entire VNIR spectral ranges (350-2500 nm) [17]. Currently reported methods for selecting subset bands of spectra mainly contain genetic algorithm (GA), uninformative variable elimination (UVE), competitive adaptive reweighted sampling (CARS), and so on [13,[17][18][19][20][21]. Among them, the CARS-PLS model is recognized as an efficient and competitive way to select the optimal combination of key subset bands of spectra [17].
To further reveal the spectral response characteristics of soil HM, some researchers attempted to determine the important spectral bands by using coefficients of Pearson correlation and variable importance in projection (VIP) in PLS modeling [22][23][24][25][26][27][28]. However, according to the previous findings in mining [3,[29][30][31], sediment [32,33], agricultural [10,34], and urban areas [24,25], the spectral reflectance of naturally contaminated soil samples (NCS) can be affected not only by soil HM but also by the heterogeneous constituents of the soil. For this reason, the reported spectral bands for soil HM prediction so far are still uncertain and controversial [35].
To exclude the influence of heterogeneous soil and reveal the importance of reflectance of each spectral band, a control variate technique which can exclude the influence of complex factors was developed [36]. After that, this control variate technique was widely validated in predicting soil petroleum hydrocarbons, crops HM contamination, etc. [37,38]. Particularly in the field of soil HM prediction, the effectiveness of the control variate technique has also been implemented in extracting the spectral response characteristics of near standard soil samples (NSS) [39,40]. Thus, with the support of the control variate technique, it is promising to predicting the soil HM concentrations by reasonably using the prior spectral bands from NSS. Therefore, taking the cadmium (Cd) as an example, the objectives of this study are: (1) to reveal the spectral response characteristics and determine important spectral bands denoting Cd concentration variations in soil based on NSS Cd ; and (2) to develop a modeling strategy enhanced by the NSS Cd prior spectral bands with the CARS-PLS and PLS models, and subsequently verify its feasibility of detecting soil Cd concentration. We hope the proposed strategy could be extended to other soil HMs, and guide the exploitation of remote sensing products, such as the spectral index or the improvement of sensors, for soil HM detection. Figure 1 shows the entire study framework. As illustrated, the whole experiment process consists of three steps: (1) soil sampling, producing the NSS Cd by adding standard Cd solution to background soil, and chemical analysis; (2) spectra measurement and pretreatment; and (3) model construction and validation. In this study, the near stand soil Cd samples (NSS Cd ) are proposed and defined as the samples with expected Cd concentration. The first step aims to collect the field soil samples from two case areas (i.e., Hengyang, Baoding), produce NSS Cd with the background soil (i.e., unpolluted soil), and measure the Cd concentrations of all these soil samples. The second step, spectra measurement and pretreatment, involves measuring the hyperspectral signals of soil samples with a spectrometer and removing the random noise of the measured spectra. The third step involves the main process of the NSS spectra enhanced modeling strategy: the NSS Cd prior spectral bands are extracted by the VIP method, and these spectral ranges are used to predict Cd concentration in NCS collected from the two case areas with both the CARS-PLS and PLS model. Remote Sens. 2021, 13, x FOR PEER REVIEW 3 of 18

Field soil Sampling
Two case areas are, respectively, located in southern Hengyang and northern Baoding in China (Figure 2a). The Hengyang study area is close to the Shuikoushan mining area at the upstream of the Xiangjiang river basin. The main soil type of this area is red soil, which contains Fe-oxides. The historic mining activities had discharged HM into the soil by chimneys exhaust, solid waste accumulation, and sewage [41]. The potential pollution is threatening the surrounding residents' health. The Baoding study area is an agricultural land in Hebei province. This area is used as orchard and paddy field. Long-term irrigation and fertilization have led to the accumulate of heavy metal. The main soil type is loess.
In the Hengyang area, 57 naturally contaminated samples in topsoil (0-20 cm) were collected within the boundary from 112°35′24′′ E to 113°36′37′′ E and from 26°32′37′′ N to 25°34′12′′ N. These soil sampling sites are near the agricultural land, industrial land, and hydrology networks, etc. In the Baoding area, 42 naturally contaminated samples in topsoil (0-20 cm) within the boundary from 116°3′12′′ E to 116°3′27′′ E and from 38°55′38′′ N to 38°36′30′′ N were collected. These sampling sites are mainly located in the orchard and paddy field. Detailed spatial distribution of sampling sites are shown in Figure 2b,c. For each sample site, 500 g soil was collected and packed into plastic bags with a label, and the location was recorded by a global position system (GPS).  Two case areas are, respectively, located in southern Hengyang and northern Baoding in China (Figure 2a). The Hengyang study area is close to the Shuikoushan mining area at the upstream of the Xiangjiang river basin. The main soil type of this area is red soil, which contains Fe-oxides. The historic mining activities had discharged HM into the soil by chimneys exhaust, solid waste accumulation, and sewage [41]. The potential pollution is threatening the surrounding residents' health. The Baoding study area is an agricultural land in Hebei province. This area is used as orchard and paddy field. Long-term irrigation and fertilization have led to the accumulate of heavy metal. The main soil type is loess.

Near Standard Soil Cd Samples Production
To produce the NSSCd with Cd concentration ranges from low to high, the background soil must be unpolluted and homogenized. For this, we selected a high ground site in the neighbor of the Hengyang mining area, which was hardly polluted by mineral slag, sewage, etc. We collected about 30 kg of background soil from the same site. Furthermore, all of the background soil was stirred after plant residues and stones were elimi-  were collected. These sampling sites are mainly located in the orchard and paddy field. Detailed spatial distribution of sampling sites are shown in Figure 2b,c. For each sample site, 500 g soil was collected and packed into plastic bags with a label, and the location was recorded by a global position system (GPS).

Near Standard Soil Cd Samples Production
To produce the NSS Cd with Cd concentration ranges from low to high, the background soil must be unpolluted and homogenized. For this, we selected a high ground site in the neighbor of the Hengyang mining area, which was hardly polluted by mineral slag, sewage, etc. We collected about 30 kg of background soil from the same site. Furthermore, all of the background soil was stirred after plant residues and stones were eliminated.
After the pretreatment, five samples (each sample contains about 100 g) from the different parts of the whole background were used for chemical analysis in order to acquire their basic constituent information. The detailed process was the same as the description in Section 2.2.3. The average pH value of these background soils is 4.17, and the Cd concentration is 0.38 mg/kg, while the soil organic matter (SOM) is 6.78 g/kg. Detailed constituents are presented in Appendix A, Table A1. To make the expected concentrations of NSS Cd correspond to a practical situation, it is necessary to consider the statistics of Cd concentrations in the previous investigations (Appendix A, Table A2). In general, the Cd concentrations in soil are lower than 2 mg/kg. However, in mining areas, these concentrations are significantly higher and can even reach 200 mg/kg. Thus, the expected Cd concentrations of NSS Cd were set from 0.5 to 54 mg/kg with the interval at 1 mg/kg. The interval of expected Cd concentrations was set to 0.1-0.2 mg/kg below 2 mg/kg. Table 1 shows the final expected Cd concentrations of NSS Cd . Following the expected scheme of Table 1, each sample was individually made up of the C mL standard Cd solution (GSBG 62040-90(4801). Concentration, 1000 ug/L; Medium, 10% HCl) and 100 g of background soil and mixed thoroughly. The quantity of Cd solution (C) was calculated by Equation (1) [40]: where C represents the volume (mL) of the standard Cd solution adding into background soils samples; D represents the quantity of air-dried background soil samples without plant residues and stones (g), D = 100; A represents the expected concentrations of NSS Cd (mg/kg) in Table 1; S represents the Cd concentration of the background soil (which is presented in Appendix A, Table A1); ρ is the Cd concentration of standard Cd solution (mg/L), ρ = 1. After that, all the NSS Cd were settled for two months until the solution was naturally dried. Finally, this NSS Cd could be used for the subsequent process.
In the NSS Cd production process, there were measuring errors of the volume of Cd solution (C) and quantity of background (D). This was especially the case for D; while the soil was moved to the container after weighting, the quality was lost easily because of the residual soul on instruments. So, these factors inevitably lead to a deviation between the expected and actual Cd concentration of NSS Cd . It was still necessary to verify the Cd concentration of NSS Cd by chemical analysis.

Laboratory Chemical Analysis
Before the chemical analysis, all the NCS and NSS Cd were pretreated in the following steps: (1) air-drying in the ventilated and shady place; (2) eliminating the plant residues and stones from the NCS; and (3) grinding by the agate mortar and sieving by using a 100-mesh polyethylene sieve. Then, a few soils were taken out from each sample and digested by the mixed-acid HNO 3 -HF-HClO 4 in a microwave digestion instrument. Finally, Cd concentrations of soil samples were determined by atomic absorption spectrophotometry (AAS). The measurement accuracy of the soil Cd concentration is 0.03 mg/kg (GBT23739-2009), and the lowest detectable limit is at 0.005 mg/kg (GB15618-1995). Meanwhile, soil pH value was measured with a pH meter after shaking 10 g in a suspension of soil:water at a ratio of 1:5 for 30 min [42]. The whole process of soil samples chemical analysis was air-dried at the Soil and Fertilizer Institute in Hunan, China. The Cd concentration of NSS Cd was present in Table 1. Detailed discussion about the difference of the measured values of Cd concentration of NSS Cd were described in Section 2.2.2.

Spectra Measurement and Pretreatment
The instrument used for measuring spectra of soil samples in this study was the PSR-3500 spectrometer (Spectral Evolution Inc., Lawrence, MA, USA), covering a spectral range from 350 to 2500 nm. Spectral resolutions of the PSR-3500 spectrometer are 3.5 nm at 700 nm, 10 nm at 1500, and 7 nm at 2100 nm. The measurement process was carried out in a darkroom. After the pretreatment in Section 2.2.3, all the samples were placed on a black cloth in sample number order, while the top was smoothed down by lab spoon before spectra measurement. A 50 W halogen with a 30 • zenith angle was used as the unique light source. The distance between the light source and the soil samples was 60 cm. The spectra of samples were calibrated by using a standardized white BaSO 4 panel before measuring every three samples. During the measurement process, the fiber optic probe was placed approximately 3 cm above the target soil sample and had a 45 • zenith angle. Finally, each soil sample was scanned five times to measure the spectral reflectance, and the average spectrum was recorded.
For the measured spectra of soil samples, the spectral bands at intervals of 350-400 nm and 2400-2500 nm were, for the first, eliminated due to the low signal-to-noise ratio. Then, Savitzky-Golay smoothing with 15 points and second-order polynomials was adopted to filter and reduce the random noise of spectra. Additionally, outlier spectra caused by factors such as operation mistakes were further identified and removed using the spectral angle method, as these 'outliers' are significantly different from the normal soil spectra. The spectral angle is an index to evaluate the similarity between two spectra and can be calculated by the equation as follows: where t j represents the mean spectral reflectance in the j-th band of a set of samples; r ij represents the spectral reflectance in the j-th band of the i-th sample; and p is the total number of spectral bands. If the θ i is greater, there is a greater difference between the spectra of the i-th sample and the average spectra. The θ of all samples were analyzed by box plot. The first, third quartiles (Q 1 and Q 3 ), and the interquartile range (IQR = Q 3 − Q 1 ) were calculated. The samples which θ range outside the interval (Q 1 − 1.5 IQR, Q 3 + 1.5 IQR) are the outlier, the relevant spectrum can be regarded as error data. These samples were finally eliminated in this study.

Prior Spectral Bands Extraction
For the prior spectral bands, they were extracted with the VIP score of the PLS modeling [24,43]. The PLS model can be expressed as Equations (3) and (4) [44]: where, X (n × p) represents the spectra reflectance of n samples, p represents the number of bands, T (n × h) is the scores of h latent variables, P (p × h) is the loading matrix, and y (n × 1) is the concentration of HM. b (h × 1) is the regression coefficient of T. E (n × p) and f (n × 1) are the random error matrix. The result of the PLS model is influenced by the number of latent variables (h). The optimal PLS model was determined through the model performance, which is in detail described in Section 2.4.2. VIP score refers to the variable importance in PLS projections. The idea is to accumulate the importance of each variable j being reflected by w from each component. In general, the variable j should be selected if VIP j > 1 [26]. Thus, in this study, the prior spectral bands were selected with this criterion. VIP score can be calculated based on the determined optimal PLS model as Equation (5) [25]:

Model Calibration and Validation
The PLS model is the primary method employed to predict soil HM using VNIR spectroscopy technology due to its the ability to process the data in which the number of variables greatly exceeded the number of samples (especially multicollinearity data) [45]. On this basis, as an advanced variable selection technique, CARS can improve the original PLS model [17]. The process of CARS contains the N times iterative sampling of subset bands of spectral. In this study, N was set to 100. Each subset bands of spectral sampling from the NCS based includes the following steps: (1) choosing k subset bands of spectral samples from the calibration set by Monte Carlo to build PLS model and evaluating the importance of each spectral band by normalized weight; (2) reducing the number of model variables (i.e., subset bands of spectral) with the adaptive reweighted sampling method and finalizing the number with exponentially decreasing function; (3) repeating step 1 and step 2 until the sampling times reached N; and (4) determining the optimal variable selection while the root mean square error of leave-one-out cross-validation (RMSECV) is the minimum. Figure 3 presents the process of the CARS variable selection. While the RMSECV is the minimum, the optimal variable selection is determined and it is marked with a vertical line in the third-row subfigures. The CARS algorithm was performed in the libPLS package (download from www.libpls.net, [21], accessed on 26 December 2018). PLS and CARS-PLS models were carried out in MATLAB 2016a. the minimum. Figure 3 presents the process of the CARS variable selection. While the RMSECV is the minimum, the optimal variable selection is determined and it is marked with a vertical line in the third-row subfigures. The CARS algorithm was performed in the libPLS package (download from www.libpls.net, [21], accessed on 26 December 2018.). PLS and CARS-PLS models were carried out in MATLAB 2016a. The NSS spectra enhanced modeling strategy was proposed in this study. The VIP scores of NSSCd spectra band were calculated with Equation (5). The NSS prior spectral bands were selected with the criterion" VIP > 1", and these bands were used to predict Cd concentration in NCS. The CARS-PLS and PLS model were carried out the NSS spectra enhanced modeling strategy (CARS-PLSNSS-VIP-VNIR and PLSNSS-VIP-VNIR). To verify the effectiveness of the proposed modeling strategy, the conventional method of predicting Cd concentration using the entire VNIR was the contrast experiment (CARS-PLSVNIR and PLSVNIR).
To verify the stability of the NSSCd spectra enhanced modeling strategy, the calibration and validation set of models were selected every one in five samples (i.e., 1/5 of total samples) while the samples were sorted in ascending Cd concentration. The rest of the soil samples were used as a calibration (i.e., training) set. This method can ensure that the concentration of calibration and validation sets distribute similarly [3]. The number of the validation set can be also set to 1/4, 1/3, and 1/2 of the total samples in the model validation.
As redundant latent variables will result in overfitting, leave-one-out cross-validation was used to determine the optimal number of latent variables (LVs) of PLS according to the following criterion: the later LV will refuse to be added while the improvement of RMSECV is less than 4% [3].
For the precision evaluation of the CARS-PLSNSS-VIP-VNIR and PLSNSS-VIP-VNIR models, the determination coefficient (R 2 ), root mean square error of prediction (RMSEP), and the ratio of prediction to deviation (RPD) was used. A reliable model is supposed to have high R 2 , low RMSE, and high RPD. RPD is defined as the ratio of the standard deviation of the data set to RMSEP: The NSS spectra enhanced modeling strategy was proposed in this study. The VIP scores of NSS Cd spectra band were calculated with Equation (5). The NSS prior spectral bands were selected with the criterion "VIP > 1", and these bands were used to predict Cd concentration in NCS. The CARS-PLS and PLS model were carried out the NSS spectra enhanced modeling strategy (CARS-PLS NSS-VIP-VNIR and PLS NSS-VIP-VNIR ). To verify the effectiveness of the proposed modeling strategy, the conventional method of predicting Cd concentration using the entire VNIR was the contrast experiment (CARS-PLS VNIR and PLS VNIR ).
To verify the stability of the NSS Cd spectra enhanced modeling strategy, the calibration and validation set of models were selected every one in five samples (i.e., 1/5 of total samples) while the samples were sorted in ascending Cd concentration. The rest of the soil samples were used as a calibration (i.e., training) set. This method can ensure that the concentration of calibration and validation sets distribute similarly [3]. The number of the validation set can be also set to 1/4, 1/3, and 1/2 of the total samples in the model validation.
As redundant latent variables will result in overfitting, leave-one-out cross-validation was used to determine the optimal number of latent variables (LVs) of PLS according to the following criterion: the later LV will refuse to be added while the improvement of RMSECV is less than 4% [3].
For the precision evaluation of the CARS-PLS NSS-VIP-VNIR and PLS NSS-VIP-VNIR models, the determination coefficient (R 2 ), root mean square error of prediction (RMSEP), and the ratio of prediction to deviation (RPD) was used. A reliable model is supposed to have high R 2 , low RMSE, and high RPD. RPD is defined as the ratio of the standard deviation of the data set to RMSEP: RPD = SD/RMSEP (6)

Cd Concentration Statistics of Soil Samples
The statistics of soil samples Cd concentrations are presented in Table 2 Table 1. With the factor mentioned in Section 2.2.2, the measured values are 10% higher than the expected values. However, they are mainly consistent with the expected values in Table 1, which is acceptable for being used to analyze in model construction. According to the Environmental Quality Standard for Soils (GB 15618-1995) published by the Ministry of Environmental Protection of China, the tolerable value is 0.3 mg/kg for soils with pH ≤ 6.5 and 0.6 mg/kg for soils with pH ≥ 6.5. The Cd concentrations of all of the soil samples in Hengyang exceed the tolerable value. It indicates that there is serious contamination in Hengyang. In contrast, the Cd concentrations of all of the soil samples in Baoding were below the tolerable value. The data should be normal distribution for modeling, and the Kolmogorov-Smirnov method was used to test the distribution. The Cd concentration of samples in Hengyang is a skewed distribution because of the high value of Cd concentration in several of the samples. Therefore, in the latter, the box-cox transform was used to correct for data distribution. The Cd concentration of samples in Baoding is s normal distribution, which does not require transformation.

Spectral Response Characteristics of Soil Samples
The spectra of soil samples were presented in Figure 4. According to Figure 4, for all of the samples, there are four main absorptions around 900 nm, 1400 nm, 1900 nm, and 2200 nm. The absorption around 1400 nm and 1900 nm are caused by hydroxyl functional groups of free water [46,47], and the absorption around 2200 nm is caused by the clay lattice Al-OH [48][49][50]. However, there are obvious differences between the spectra curve of three sample sets.

Prior Spectral Bands Extraction from NSSCd
For NSSCd, the optimal precision of predicting Cd concentration is presented as follows: The R , RMSE with PLS model are 0.78, 9.33, and RPD = 2.05. The model is reliable. The PLS model regression coefficient and VIP score for each band are presented in Figure  5. According to the criterion "VIP > 1" mentioned in Section 2.4.1, the prior spectral bands extracted from NSSCd are mainly in the range of 400-570 nm, 940-990 nm, 1390-1430 nm, and 1670-2020 nm. In previous studies, several spectral bands with high correlation to Cd concentration were found, including bands around 540 nm [51], 490-580 nm [31], 1000 nm [29], 1400 nm [52], and 2000 nm [24]. The prior spectral bands of NSSCd covers the results in previous studies. The variety of NSSCd spectra is almost caused by Cd concentration. For the NCS in the Hengyang study area, the depth of absorptions is greater than the NCS in Baoding, which is caused by the abundant ferric and aluminum oxide in red soil. Additionally, the spectra of samples in Hengyang are various because the samples were collected from different sites in mine tailing, farmland, factories, etc. The NCS in the Baoding study area were all collected from agricultural land (according to Appendix A Table A1) and all of the degrees of variation of soil constituents are lower than samples in Hengyang. The soil texture is relatively unitary, so the spectral curves are uniform.
Compared to the NCS, the NSS Cd spectra are more uniform in shape because of the production with the same background soil. All constituents are the same except for the Cd concentration. The Cd concentration is the only factor which influenced the spectra of NSS Cd , This situation is similar to the samples in Baoding. The increasing trend of spectral reflectance in the visible range and the depth of absorptions are similar to some of the samples in the Hengyang study area. It is probably due to the fact that the soil type of background soil is similar to the NCS in Hengyang.

Prior Spectral Bands Extraction from NSS Cd
For NSS Cd , the optimal precision of predicting Cd concentration is presented as follows: The R 2 , RMSE with PLS model are 0.78, 9.33, and RPD = 2.05. The model is reliable. The PLS model regression coefficient and VIP score for each band are presented in Figure 5. According to the criterion "VIP > 1" mentioned in Section 2.4.1, the prior spectral bands extracted from NSS Cd are mainly in the range of 400-570 nm, 940-990 nm, 1390-1430 nm, and 1670-2020 nm. In previous studies, several spectral bands with high correlation to Cd concentration were found, including bands around 540 nm [51], 490-580 nm [31], 1000 nm [29], 1400 nm [52], and 2000 nm [24]. The prior spectral bands of NSS Cd covers the results in previous studies. The variety of NSS Cd spectra is almost caused by Cd concentration. Therefore, these prior spectral bands imply general spectral response characteristics and are used to enhance the Cd concentration prediction in NCS.
samples in the Hengyang study area. It is probably due to the fact that the soil type of background soil is similar to the NCS in Hengyang.

Prior Spectral Bands Extraction from NSSCd
For NSSCd, the optimal precision of predicting Cd concentration is presented as follows: The R , RMSE with PLS model are 0.78, 9.33, and RPD = 2.05. The model is reliable. The PLS model regression coefficient and VIP score for each band are presented in Figure  5. According to the criterion "VIP > 1" mentioned in Section 2.4.1, the prior spectral bands extracted from NSSCd are mainly in the range of 400-570 nm, 940-990 nm, 1390-1430 nm, and 1670-2020 nm. In previous studies, several spectral bands with high correlation to Cd concentration were found, including bands around 540 nm [51], 490-580 nm [31], 1000 nm [29], 1400 nm [52], and 2000 nm [24]. The prior spectral bands of NSSCd covers the results in previous studies. The variety of NSSCd spectra is almost caused by Cd concentration. Therefore, these prior spectral bands imply general spectral response characteristics and are used to enhance the Cd concentration prediction in NCS.

Prediction Precision of NSS Cd Enhanced Model
The RMSECV against the number of LVs is shown in Figure 6, which determines the optimal number of LVs (validation set ratio = 1/5). In Figure 6a, the RMSECV is the local minimum while the number of LVs is 3. With the increase of LVs, the RMSECV is decreased by less than 4%. According to the criterion in Section 2.4.2, the latter added LVs is unnecessary. The optimal number of LVs for Hengyang is 3. In Figure 6b, the RMSECV decreases sharply while the number of LVs increasing to 3 for the CARS-PLS VNIR model and reached the minimum while LVs is 4 for the CARS-PLS NSS-VIP-VNIR . For the PLS NSS-VIP-VNIR and PLS VNIR , while the 6th LV was added, the decrease of RMSECV is less than 4%. The optimal number of LVs is regarded as 5. The optimal number of LVs can be determined in the same way while the validation ratio changed, and the results are presented in Table 3. selected by KS is lower ( is ranged in 0.36-0.53). The KS method can ensure that the distribution of calibration is uniform to the total samples along the spectra. This perhaps leads to the difference distribution of Cd concentration between calibration and validation sets since the sample number is not large enough. Unlike the KS method, the method that selected every one in five samples with the ascending sorted Cd concentration ensure that the distribution of calibration is uniform to the total samples along the Cd concentration.   The cases that the NSS Cd enhanced models get higher precision than using entire VNIR were marked with bold font. And the exceptional cases were marked with *.
The results of the NSS Cd enhanced models (PLS NSS-VIP-VNIR and CARS-PLS NSS-VIP-VNIR ) are presented in Table 3. As a comparison, prediction precision that using the entire VNIR (PLS VNIR and CARS-PLS VNIR model) is also presented. The higher precision is marked with bold font. Figures 7 and 8 are the scatter plots of the observed against predicted Cd concentration. The red dots and green dots represent the validation and calibration set respectively. The 95% confidence interval of regression line is marked with a red shaded region.
For the Hengyang sample set, while the validation set ratio is set to 1/5, compared to the PLS VNIR model, the  In addition, other sample selection method such as Kennard-Stone (KS) is also widely utilized in soil spectroscopy for selecting calibration [53,54]. The results of the NSS enhanced model which validation set is selected by the KS algorithm are presented in Appendix A, Table A3. Comparing to the results in Table 3, the results which validation selected by KS is lower (R 2 p is ranged in 0.36-0.53). The KS method can ensure that the distribution of calibration is uniform to the total samples along the spectra. This perhaps leads to the difference distribution of Cd concentration between calibration and validation sets since the sample number is not large enough. Unlike the KS method, the method that selected every one in five samples with the ascending sorted Cd concentration ensure that the distribution of calibration is uniform to the total samples along the Cd concentration.
Though the band number of NSS Cd prior spectral bands is significantly less than the entire VNIR, these bands perhaps contain the main spectral information for predicting Cd concentration and have the potential to reveal spectral response characteristics. Some previous studies have proposed that there is redundant information in the entire VNIR, and that using subset bands of spectra represents an effective method for improving soil HM prediction [3].
Additionally, in this study, NSS Cd prior spectral bands extracted by VIP scores were more effective than the band selected by the CARS method. It was noticed that the precision of the CARS-PLS VNIR is lower than the PLS NSS-VIP-VNIR model both for the NCS in Hengyang and Baoding. In some cases, the CARS method was useless for improving precision. For example, the PLS NSS-VIP-VNIR model gets higher precision than the CARS-PLS NSS-VIP-VNIR model when the validation set ratio is 1/4 and 1/2, respectively. As a variable selection method, CARS can eliminate uninformative variables based on weight. However, the results of variable selection are perhaps influenced by the spectra variation caused by heterogeneous soil. The spectra bands selected by the CARS method are inconsistent in different areas, which may possibly lead to controversial spectral bands for predicting soil HM as in previous studies. In contrast, the NSS Cd prior spectral bands are extracted based on the prior knowledge of NSS Cd spectra, which are more likely to reveal the spectral characteristic of soil Cd. This is the main reason that NSS Cd prior spectral bands are more effective in predicting Cd concentration than spectra bands selected by the CARS method. Therefore, the NSS Cd prior spectral bands can be widely applied to predict Cd concentration in different areas.
Moreover, an NSS enhanced model based on machine learning was also carried out. Random forest (RF) is popular in spectral reflectance due to the accuracy of its classifications [55,56]. It is also applicable to regression and used in some cases [57][58][59]. The result of the RF NSS-VIP-VNIR model is presented in Appendix A, Table A3. However, the precision of the RF NSS-VIP-VNIR model is lower than the PLS NSS-VIP-VNIR and CARS-PLS NSS-VIP-VNIR models. Machine learning is suitable for non-linear regression problems with a large number of samples. However, in this study, due to limitation of the number of samples, machine learning did not perform well. However, for the large number samples, it is worth exploring the capability of the NSS enhance model combined with machine learning for further study.
In this study, although NSS Cd prior spectral bands can be applied to these two areas, it does not mean that it can be applied to all possible study areas. Because soil type is an important factor which can influence spectra, it is necessary to verify the applicability of the NSS Cd spectra enhanced modeling strategy on more types of soil. In addition, the mechanism for predicting soil HM concentration using VNIR spectroscopy is based on HM adsorption on Fe-oxides, clays, and SOM. Therefore, while the other constituent concentrations (e.g., SOM, clays, etc.) of background soil are varied, it is uncertain whether the result of prior spectral bands extracted by VIP scores from NSS Cd will differ from the result in this paper. Moreover, considering the high Cd concentration range of NSS Cd (0.47-58.95 mg/kg), it is difficult to ensure NSS Cd prior spectral bands' reliability for NCS with a low Cd concentration.
For this reason, in further studies, the producing NSS with heterogeneous background soil should be produced to reveal the spectral response characteristic of HM in heterogeneous soil (e.g., the NSS produced based on soil with a high SOM and low SOM concentration). The proposed NSS Cd spectra enhanced modeling strategy also should be used in more study areas with different soil types to verify its' transferability [60]. For other types of soil HM, the spectral characteristics also can reveal by this method. These NSS prior spectral bands are significant for the development of sensors for detecting soil HM. It is a promising hyperspectral remote sensing technique for rapidly detecting soil HM concentrations on a large scale.

1.
The NSS Cd spectra enhanced modeling strategy can effectively predict Cd concentration in different areas.

2.
NSS Cd prior spectral bands are important for the selection of spectral response characteristics from VNIR of natural soil samples.

3.
The VIP method is more helpful to select the key band for predicting Cd concentration than the CARS method. Institutional Review Board Statement: Not applicable.

Informed Consent Statement: Not applicable.
Data Availability Statement: The data of soil samples from Hebei Province are openly available in Appendix A at 10.1016/j.scitotenv.2018.08.442, reference [3]. The data presented in this study are available on request from the corresponding author. The data are not publicly available due to restrictions privacy.
Acknowledgments: Soil samples from Hebei Province, northern China were provided by Xia Zhang with the Chinese Academy of Science, China.

Conflicts of Interest:
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.
Appendix A