Evaluating Calibration and Spectral Variable Selection Methods for Predicting Three Soil Nutrients Using Vis-NIR Spectroscopy

: Soil nutrients, including soil available potassium (SAK), soil available phosphorous (SAP), and soil organic matter (SOM), play an important role in farmland soil productivity, food security, and agricultural management. Spectroscopic analysis has proven to be a rapid, nondestructive, and effective technique for predicting soil properties in general and potassium, phosphorous, and organic matter in particular. However, the successful estimation of soil nutrient content by visible and near-infrared (Vis-NIR) reﬂectance spectroscopy depends on proper calibration methods (in-cluding preprocessing transformation methods and multivariate methods for regression analysis) and the selection of appropriate variable selection techniques. In this study, raw spectrum and 13 preprocessing transformations combined with 2 variable selection methods (competitive adaptive reweighted sampling (CARS) and the successive projections algorithm (SPA)) and 2 regression algorithms (support vector machine (SVM) and partial least squares regression (PLSR)), for a total of 56 calibration methods, were investigated for modeling and predicting the above three soil nutrients using hyperspectral Vis-NIR data (400–2450 nm). The results show that ﬁrst-order derivatives based on logarithmic and inverse transformations (FD-LGRs) can provide better predictions of soil available potassium and phosphorous, and the best form of soil organic matter transformation is SG+MSC. CARS was superior to the SPA in selecting effective variables, and the PLSR model out-performed the SVM models. The best estimation accuracies (R 2 , RMSE) for soil available potassium, phosphorous, and organic matter were 0.7532, 32.3090 mg/kg; 0.7440, 6.6910 mg/kg; and 0.9009, 3.2103 g/kg, respectively, and their corresponding calibration methods were (FD-LGR)/SPA/PLSR, (FD-LGR)/SPA/PLSR, and SG+MSC/CARS/SVM, respectively. Overall, for the prediction of the soil nutrient content, organic matter was superior to available phosphorous, followed by available potassium. It was concluded that the application of hyperspectral images (Vis-NIR data) was an efﬁcient method for mapping and monitoring soil nutrients at the regional scale, thus contributing to the development of precision agriculture.


Introduction
As the foundation for most terrestrial life, soil has unrivaled complexity and dynamicity [1]. Soil contains minerals, organic matter, uncountable numbers of organisms, and varying amounts of air, water, and essential nutrients which provide life support for the growth of terrestrial plants and other organisms [2]. Meanwhile, as definitive indicators of represented by the first derivative (FD), second derivative (SD) [13], and logarithmic transformation (LG) [24] have been used to remove the baseline while improving the correlation to the sample concentrations and the linear trend [5]. A continuous wavelet transformation (WT) [36] and a Fourier transform (FT) [37], belonging to the high-frequency noise removal method, have been used to enhance the features in the spectrum [5]. Vis-NIR data are usually characterized by a high spectral resolution with a large data volume and multicollinearity [38]. Currently, some studies using Vis-NIR spectroscopy to predict the soil nutrient content under field and laboratory conditions use full-spectrum wavelengths involved in modeling analysis, and some of these studies have also confirmed the superiority of the predictions using the full-spectrum over the correlation coefficient wavelength selection method [39,40]. However, when using other wavelength selection methods such as a genetic algorithm (GA), this showed the improvements in the accuracy and robustness of the model compared with the full-spectrum approach [41]. To some extent, reducing the spectral dimension is another common method to further optimize the model [20] because it can filter out some noisy, unreliable, and irrelevant variables from the whole spectral data [42]. Thus, several previous studies have demonstrated that calibration models are more accurate when wavelength variable selection methods are applied [22,[42][43][44][45]. Among them, the most commonly used method is Pearson correlation analysis [6,30,46,47]. To a certain extent, the correlation coefficient (r) and significance level reflect the correlation relationship between soil elements and wavelength variables [46], and one or more bands with the highest correlation coefficient can be used to build the model. Furthermore, competitive adaptive reweighted sampling (CARS) [48], a genetic algorithm (GA) [49], and the successive projections algorithm (SPA) [50] are also widely used variable selection techniques in spectral modeling. The variables selected by these feature collection algorithms can be used as the input variables for modeling and then predicting soil nutrients or properties. In terms of model analysis methods, there are many methods, such as multiple linear regression (MLR), least squares regression, and partial least squares regression (PLSR), which are easy to use and act as simple model structures. Among them, PLSR is a commonly used method in hyperspectral regression modeling [17,30,44,[51][52][53]. However, in reality, correlations between wavelength variables and soil nutrients are rarely linear. Therefore, nonlinear prediction methods have become popular. For example, artificial neural networks (ANNs), support vector machines (SVMs), random forests (RFs), and extreme learning machines (ELMs) can explain complex nonlinear relationships to a certain extent. These methods have been employed in many fields [44,[54][55][56][57][58], and the results have shown that nonlinear prediction methods have high accuracy and perform better than linear methods.
However, due to the heterogeneity of the geographical environment, no single (or combination of) preprocessing method is suitable for different geographic soilscapes, and the same is true for wavelength variable selection and modeling methods [13,59]. Additionally, the reflectance of the soil spectrum in Vis-NIR (350-2500 nm) is generally low (compared with any other typical object), and the extension of the inversion model based on measured spectra is limited. There has been some research on the use of logarithmic (log(R)), reciprocal (1/R), and first-order differential (R') transformation as a preprocessing method [24,26], but this research has been rarely mentioned, especially for the combined transformation of logarithmic or reciprocal derivatives. On the one hand, this study focuses on preprocessing methods that combine logarithmic, reciprocal, and first-order differential transformation, including the first-order differential of the reciprocal logarithm (FD-RLG, log(1/R)') and the first-order differential of the reciprocal of the logarithm (FD-LGR, (1/log(R))'). We also process the spectral data in five different ways (SNV, MSC, SG, FD, and LG). On the other hand, according to our reviews, only a few studies have focused on the application of three variable selection techniques (CARS, SPA, and GA) and two regression methods (PLSR and SVM) in soil available potassium (SAK), soil available phosphorus (SAP), and soil organic matter (SOM) estimation at the same time, especially with Southwest China as the study area. Thus, the purpose of this study is to (a) compare the existing preprocessing methods and the combined transformation for SAK SAP and SOM estimation; (b) evaluate the differences of the two existing spectral wavelength selection methods (CARS and SPA) and select the best method as the modeling input; (c) compare the performance of the linear (PLSR) and nonlinear (SVM) multivariate methods for three soil nutrient estimations; and (d) evaluate the feasibility of establishing SAK, SAP, and SOM fertility levels using the output optimal model combination and provide scientific theoretical support for the spectral prediction of soil nutrient content and the development of precision agriculture.

Description of the Study Area
The study region was in the Xihe River watershed (30 • 38 28 -30 • (Figure 1a) of China, which has been referred to as a land of abundance since ancient times because of its temperate climate and fertile soil [60]. This region is characterized by relatively gentle terrain with elevations ranging from 468 m to 882 m. The elevation and slope both decrease from northwest to southeast, with a total area of approximately 2.81 × 10 4 ha. The region has a subtropical humid monsoon climate with an annual mean temperature and precipitation of approximately 15.9 • C and 1012.4 mm, respectively. The soil in Chengdu was developed based on Quaternary, tertiary, Jurassic, and Cretaceous parent rocks, according to the classification and codes for Chinese soil (National Standard of China, GB/T 17296-2009), and the representative soil type in this area (Chongzhou) is paddy soil. The region's soil-forming parent material mainly consists of Minjiang gray alluvium, West River purple alluvium, and purple-gray alluvium mixed with alluvium of the Minjiang River and West River, and the soil's parent material mainly consists of Minjiang gray alluvium, West River purple alluvium, Minjiang West River alluvium mixed with purple-gray alluvium and redeposited yellow mud. The land use types are primarily cultivated land, mainly including wheat rice and rapeseed rice, two typical water drought rotation modes, rapeseed corn, and horticultural crops, with two dry farming methods. soil available phosphorus (SAP), and soil organic matter (SOM) estimation at the same time, especially with Southwest China as the study area. Thus, the purpose of this study is to (a) compare the existing preprocessing methods and the combined transformation for SAK SAP and SOM estimation; (b) evaluate the differences of the two existing spectra wavelength selection methods (CARS and SPA) and select the best method as the model ing input; (c) compare the performance of the linear (PLSR) and nonlinear (SVM) multi variate methods for three soil nutrient estimations; and (d) evaluate the feasibility of es tablishing SAK, SAP, and SOM fertility levels using the output optimal model combina tion and provide scientific theoretical support for the spectral prediction of soil nutrien content and the development of precision agriculture.

Description of the Study Area
The study region was in the Xihe River watershed (30°38′28″-30°50′12″N, 103°26′45″-103°40′52″E) of Chongzhou (Figure 1c) west of the Chengdu Plain (Figure 1b). The Chengdu Plain is situated in the western Sichuan Basin (Figure 1a) of China, which has been referred to as a land of abundance since ancient times because of its temperate cli mate and fertile soil [60]. This region is characterized by relatively gentle terrain with ele vations ranging from 468 m to 882 m. The elevation and slope both decrease from north west to southeast, with a total area of approximately 2.81 × 10 4 ha. The region has a sub tropical humid monsoon climate with an annual mean temperature and precipitation o approximately 15.9 °C and 1012.4 mm, respectively. The soil in Chengdu was developed based on Quaternary, tertiary, Jurassic, and Cretaceous parent rocks, according to the clas sification and codes for Chinese soil (National Standard of China, GB/T 17296-2009), and the representative soil type in this area (Chongzhou) is paddy soil. The region's soil-form ing parent material mainly consists of Minjiang gray alluvium, West River purple allu vium, and purple-gray alluvium mixed with alluvium of the Minjiang River and Wes River, and the soil's parent material mainly consists of Minjiang gray alluvium, Wes River purple alluvium, Minjiang West River alluvium mixed with purple-gray alluvium and redeposited yellow mud. The land use types are primarily cultivated land, mainly including wheat rice and rapeseed rice, two typical water drought rotation modes, rape seed corn, and horticultural crops, with two dry farming methods.

Field Sampling and Soil Analysis
Soil samples were collected at a 0-20 cm depth at 105 sampling points (Figure 1c) in October 2018. Here, we adopted a five-point sampling method. First, a square region (1 m 2 ) was selected, and then each soil sample was collected at five points and mixed as one representative sample (i.e., four corners plus the center point). At the same time, the decaying branches and withered leaves, stones, and other non-soil objects on the soil surface were removed. We used GNSS receivers to record the geographic coordinates and altitude at the center of the plot for each sampling site [21]. In addition, the land use type (i.e., woodland or paddy field) of the sample site was also recorded and used to analyze the spectral differences of the soil nutrients (SAK, SAP, and SOM) of different surface features. The soil samples were brought back to the laboratory for physical and chemical analyses.
Each sample was taken to the laboratory for air drying in a natural indoor environment for one month. After air drying, impurity removal, grinding, and sieving (<2 mm), each sample was divided into two parts. One part was designated for standard chemical laboratory analysis and the other for hyperspectral Vis-NIR acquisition [61]. The chemical analysis for SOM was performed with the potassium bichromate titrimetric method [62], the SAK was measured with the ammonium acetate extraction-flame photometric detection method, the titration of the SAK extracts was performed with 1 mol L −1 NH 4 -OAc with a 1:5 weight-to-volume ratio, and the SAP was measured with the sodium hydrogen carbonate solution-Mo-Sb anti-spectrophotometric method. Titration of the SAP extracts was performed with 0.5 mol L −1 NaHCO 3 [63].

Laboratory Spectral Measurements and Data Preprocessing
The soil spectral reflectance was measured using an ASD FieldSpec 3 spectrometer (Analytical Spectral Devices Inc., Boulder, CO, USA), which was able to measure the reflectance in the interval from 350 to 2500 nm with a 1-nm spectral resolution. Each sample was placed into a Petri dish (diameter of 9 cm and thickness of 2 cm), and a piece of black light-absorbing cloth was placed under the targeted object as a background [5]. The light source was a halogen lamp with a power of 50 W. The spectrum was measured through the soil sample with a viewing zenith angle of 30 • at a 30-cm distance from the light source, and a Spectralon ® white reference panel was used as a reference of 100% reflectance (absolute reflectance) for calibration of the spectroradiometer every 20 min to ensure that the instrument was calibrated before the spectral measurements were processed. For each soil sample, spectral reflectance was obtained from four measurements with a 90 • turn between measurements, and a total of 40 spectra were averaged into 1 spectrum for each sample.
The wavelengths on the fringe (350-399 nm and 2451-2500 nm) of the spectrometers had a relatively low signal-to-noise ratio. To eliminate the influence of noise, the spectral regions mentioned above were removed [64,65]. To further reduce the influence of the instrument, experimental environment, baseline shifts, overall curvature, and soil samples on the spectral data and intensify the more chemically related peaks in the spectrum, several commonly used preprocessing transformations were applied in this study, including the Savitzky-Golay filter (SG) [29] with a polynomial order of 2 and window size of 11. Along with a first derivative (FD), SNV, MSC, LG, FD-RLG, and FD-LGR, seven different transformation forms were involved in this paper. Among them, different transformation forms had different effects, such as FD and SD, which could reduce the baseline variation and increase spectral peak features. The SG method was used to smooth and remove random noise from the spectra [66]. In this paper, the transformation forms of FD, SNV, MSC, LG, FD-RLG, and FD-LGR were performed on the SG spectra.

Competitive Adaptive Reweighted Sampling
The principle of CARS is derived from the principle of "survival of the fittest" in Darwin's theory of evolution [48]. The CARS method selects the wavelength with a large Remote Sens. 2021, 13, 4000 6 of 20 absolute regression coefficient in the PLS model through adaptive reweighted sampling (ARS) technology, eliminates the wavelength points with small weights, and selects the subset with the lowest root mean square error of cross validation (RMSE CV ) value through interactive verification to obtain the optimal variable combination. Further detailed information and steps for the CARS algorithm can be found in [10,48,67].

Successive Projections Algorithm
As a forward feature variable selection algorithm that minimizes the collinearity of vector space, the SPA has been widely used for extracting sensitive variables in recent years [31,44]. In the wavelength selection process, the wavelength with the largest projection vector is added to the wavelength combination, and each newly selected wavelength has the smallest linear relation with the previous wavelength. Therefore, useful information can be optimized to the maximum extent, and the combination of variables can be obtained with the minimum collinearity and the least redundant information of hyperspectral data, which is helpful for improving the model prediction ability.

Regression Algorithms
As the SAK, SAP, and SOM contents may have linear and nonlinear relationships with the spectral reflectance or transformed spectrum, linear algorithms and nonlinear algorithms are commonly used for modeling, and the proper model is selected to predict the soil nutrient content. In this study, we used a linear regression method (PLSR) and a nonlinear method (SVM). These two regression algorithms were chosen because of their ability and applicability in spectral wavelength selection and robustness when used in parallel [68]. A brief summary of these techniques (PLSR and SVM) is provided below with key references for more detailed information.
As a well-known linear multivariate algorithm, PLSR has been used in chemometric and quantitative spectral analyses and other wide applications. PLSR is particularly appropriate when the number of independent variables is greater than the number of dependent variables [69]; in particular, there is multicollinearity in independent variables. PLSR is related to principal component regression (PCR), but instead of looking for the hyperplane with the maximum variance between the dependent variables and independent variables, PLSR finds a linear regression model by projecting the dependent variables and independent variables into a new space [70], and through integration, compression, and regression steps, continuous orthogonal factors (i.e., LVs, latent variables for PLSR) are selected to maximize the covariance between the predictive variables and the response variables [22,43]. The PLSR method combines the advantages of three components: principal component analysis (PCA), multiple linear regression, and canonical correlation analysis [31,46]. Leave-one-out cross-validation (LOOCV) was used to evaluate the PLSR models for the calibration set and to select the optimal number of LVs for PLSR [71,72].
The support vector machine (SVM) model is a nonlinear and supervised learning method that, based on statistical learning theory [73], can transform the problem of the data from low-dimensional nonlinear to high-dimensional linear [74]. Support vector machine regression (SVR) can be obtained by extending the SVM from the classification field to the regression field. Here, we still followed the SVM representation, which is essentially SVR. At this time, the standard SVM algorithm is also known as support vector classification (SVC), and the hyperplane decision boundary in SVC is the regression model of SVR. SVR is different from the traditional regression method; the loss function is redefined by introducing relaxation variables (ε), and by using boundary samples, training data points fall into or close to the margin defined by ε as much as possible. The sizes of the coefficients and prediction errors are reduced simultaneously by using a loss function to obtain the best model [75]. In this paper, we selected the radial basis function as the kernel function.

Accuracy Comparison
This paper used determination coefficients (R 2 ), the root mean square error (RMSE), and the ratio of performance to deviation (RPD) to evaluate the accuracy of model inversion. These three statistical indicators have been widely used for regression models of the Vis-NIR spectrum [22,53,76,77]. Regarding the stability and fitness of the R 2 reaction model, the closer R 2 approached 1, the higher the model fit and the more stable it was. The RMSE of prediction measures the spread of prediction errors through the arithmetic mean of errors, and RPD, which is the standard deviation divided by the RMSE, indicates the predictive capacities of the calibration models. Low RMSE and high RPD values indicate good model prediction performance. In this study, according to [78], three categories of the prediction ability were performed: category A (RPD < 1.4) with poor accuracy, making it generally difficult to predict the content effectively; category B (1.4 ≤ RPD ≤ 2) with moderate accuracy, indicating that the prediction ability of the model was moderate and could be roughly estimated; and category C (RPD > 2) with good accuracy, indicating that the model had excellent prediction ability. Detailed information about the formulas can be found in the literature [5]. MATLAB 2020 R2019a was used to select the spectral variables and establish the models. The main method process is shown in Figure 2.
Remote Sens. 2021, 13, x FOR PEER REVIEW 7 of 21 the best model [75]. In this paper, we selected the radial basis function as the kernel function.

Accuracy Comparison
This paper used determination coefficients (R 2 ), the root mean square error (RMSE), and the ratio of performance to deviation (RPD) to evaluate the accuracy of model inversion. These three statistical indicators have been widely used for regression models of the Vis-NIR spectrum [22,53,76,77]. Regarding the stability and fitness of the R 2 reaction model, the closer R 2 approached 1, the higher the model fit and the more stable it was. The RMSE of prediction measures the spread of prediction errors through the arithmetic mean of errors, and RPD, which is the standard deviation divided by the RMSE, indicates the predictive capacities of the calibration models. Low RMSE and high RPD values indicate good model prediction performance. In this study, according to [78], three categories of the prediction ability were performed: category A (RPD < 1.4) with poor accuracy, making it generally difficult to predict the content effectively; category B (1.4 ≤ RPD ≤ 2) with moderate accuracy, indicating that the prediction ability of the model was moderate and could be roughly estimated; and category C (RPD > 2) with good accuracy, indicating that the model had excellent prediction ability. Detailed information about the formulas can be found in the literature [5]. MATLAB 2020 R2019a was used to select the spectral variables and establish the models. The main method process is shown in Figure 2.

Descriptive Statistics of the Soil Nutrients
The total number of soil samples was 105, which were split into a calibration set and a validation set by the Kennard-Stone (KS) method at a proportion of 7:3 [53]. In reality, the levels of the soil properties were affected not only by the natural environment but also by human behaviors (i.e., before sample collection, there were fertilization activities in the field). Here, we detected outliers by using the Monte Carlo cross validation (MCCV) method. The total number of SOM samples was determined to be 97, while for SAK it was 103, and for SAP it was 101.
Summary statistics for the entire calibration and validation datasets for the soil nutrients measured in the laboratory are provided in Table 1. The SAP (2.23-49.91 mg/kg)

Descriptive Statistics of the Soil Nutrients
The total number of soil samples was 105, which were split into a calibration set and a validation set by the Kennard-Stone (KS) method at a proportion of 7:3 [53]. In reality, the levels of the soil properties were affected not only by the natural environment but also by human behaviors (i.e., before sample collection, there were fertilization activities in the field). Here, we detected outliers by using the Monte Carlo cross validation (MCCV) method. The total number of SOM samples was determined to be 97, while for SAK it was 103, and for SAP it was 101.
Summary statistics for the entire calibration and validation datasets for the soil nutrients measured in the laboratory are provided in Table 1 highly variable, 15% < CV < 35% is moderate, and CV < 15% is low variability. The coefficients of variation (CVs) of the SAP, SOM, and SAK in the calibration dataset were 65.06%, 43.62%, and 24.08%, and these values in the validation dataset were higher than those in the calibration dataset at 80.28%, 50.76%, and 26.73%, respectively. The skewness of the SAP, SOM, and SAK were 1.33, 0.73, and 0.31, respectively, and all were greater than zero; the distribution of SAP, SOM, and SAK could be considered offset to the right. The kurtosis of the SAP was 2.05. Hence, the distribution of the SAP could be considered a steep distribution, but the SOM and SAK were flat distributions and smoother-than-normal distributions, respectively.

Spectral Characteristic Analysis
The average spectral reflectance and its degree of distribution of the different soil nutrients are illustrated in Figures 3 and 4. In the entire band range (400-2450 nm), with an increase in the SOM, SAK, and SAP content, the spectral reflectance decreased gradually, but the exhibited spectral shapes were similar. This trend was significant at wavelengths of 400-1000 nm, which were mainly associated with minerals that contained iron, as well as the presence of SOM, SAK, and SAP [38,72,80]. In the Vis region, the reflectance spectrum affected by the soil chromophore and the black color of the SOM showed different spectral intensities. However, the NIR region (1000-2500 nm) was dominated by absorptions related to water, clay minerals, organic matter, and carbonates, and the spectral intensity variation was linked to the double-frequency and combined-frequency absorption of chemical bonds such as N-H, C-H, and C-O. In addition, water had strong absorption bands at approximately 1200 nm, 1400 nm, 1780 nm, and 2200 nm and weak absorption bands at approximately 1900 nm. The wavelengths of strong or weak water absorption were different from some existing studies [38,59,81,82], and the relevant reasons require further study.
The SOM, SAK, and SAP contents in garden plots, paddy fields, woodlands, and dry land showed a certain gradient distribution. The SOM, SAK, and SAP contents were lowest in the woodlands, and the land use types with the highest SAK or SAP and SOM contents were garden plots and paddy fields, respectively. The content of SAK in the dry land was higher than that in the paddy fields, but the content of SAP was the opposite. For the SOM, the dry land SOM was behind the paddy field and garden field SOM. According to the climate characteristics, paddy field planting in Chongzhou usually occurs 2-3 times, so the lowest SOM, SAK, and SAP in being in the woodlands may partly have been due to more crop residue left in the area with 2-3 plantings a year.
Remote Sens. 2021, 13, x FOR PEER REVIEW 9 of 21 so the lowest SOM, SAK, and SAP in being in the woodlands may partly have been due to more crop residue left in the area with 2-3 plantings a year.

Feature Wavelength Selection
To reduce the number of wavelengths and obtain a simpler, more reliable model, CARS and SPA were applied to select the feature wavelengths with the main effective information from the whole spectrum. Figure 5 shows the concentrated map of the selected bands of the different spectral variable selection methods. As shown in Figure 5, there were significant differences in the feature wavelengths selected by the CARS and SPA algorithms under 13 spectral transformations. First, the number of feature wavelengths selected was different, and CARS' were significantly higher than SPA's. Compared with SPA, the distribution of feature wavelengths was more scattered. Second, due to the differences in the responses of wavelengths to soil nutrients, under the same algorithm, there were significant differences in the selection of feature wavelengths. Except for being selected only once as the feature wavelength, using SPA as the spectral variable selection method, the range of the SAK was mainly concentrated at 400-421 nm, 996 nm, 1350 nm, 1351 nm, 1680 nm, 2372 nm, and 2448 nm, while the SAP was mainly concen-

Feature Wavelength Selection
To reduce the number of wavelengths and obtain a simpler, more reliable model, CARS and SPA were applied to select the feature wavelengths with the main effective information from the whole spectrum. Figure 5 shows the concentrated map of the selected bands of the different spectral variable selection methods. As shown in Figure 5, there were significant differences in the feature wavelengths selected by the CARS and SPA algorithms under 13 spectral transformations. First, the number of feature wavelengths selected was different, and CARS' were significantly higher than SPA's. Compared with SPA, the distribution of feature wavelengths was more scattered. Second, due to the differences in the responses of wavelengths to soil nutrients, under the same algorithm, there were significant differences in the selection of feature wavelengths. Except for being selected only once as the feature wavelength, using SPA as the spectral variable selection method, the range of the SAK was mainly concentrated at 400-421 nm, 996 nm,

Feature Wavelength Selection
To reduce the number of wavelengths and obtain a simpler, more reliable model, CARS and SPA were applied to select the feature wavelengths with the main effective information from the whole spectrum. Figure 5 shows the concentrated map of the selected bands of the different spectral variable selection methods. As shown in Figure 5, there were significant differences in the feature wavelengths selected by the CARS and SPA algorithms under 13 spectral transformations. First, the number of feature wavelengths selected was different, and CARS' were significantly higher than SPA's. Compared with SPA, the distribution of feature wavelengths was more scattered. Second, due to the differences in the responses of wavelengths to soil nutrients, under the same algorithm, there were significant differences in the selection of feature wavelengths. Except for being selected only once as the feature wavelength, using SPA as the spectral variable selection method, the range of the SAK was mainly concentrated at 400-421 nm, 996 nm, 1350 nm, 1351 nm, 1680 nm, 2372 nm, and 2448 nm, while the SAP was mainly concentrated at 400-436 nm, around 1000 nm,

Model Performances of Different Calibration Methods
The preprocessing transformation methods provided different calibration and prediction accuracies for the SAK, SAP, and SOM regarding the R 2 , RMSE, and RPD values. Table 2 and Figure 6 and Figure 7 show the R 2 , RMSE, and RPD performance with different calibration methods for the prediction of the SAK, SAP, and SOM. The standard deviation (SD) between groups through SG transformation (SDSG = 0.1881) was smaller than that of the original transformation (SDraw = 0.2220), and based on the standard deviation of the raw spectral (RS), the difference of the samples in the group was the largest. The calibration accuracy based on SG transformation was not significantly improved compared with the unprocessed method, but the effect obtained for the specific part of the transformation form was better (e.g., RS or SNV). Some of them had differences in R 2 that were greater than 0.1 (e.g., RS and FD-LGR). The highest R 2 values of the SAK, SAP, and SOM were 0.8931, 0.9518, and 0.9277, respectively, which were obtained with the calibration methods of FD-RLG/PLSR, FD-RLG/PLSR, and FD-LGR/SVM, while the calibration methods with the lowest accuracy were more irregular. The highest R 2 value of the SAK was 0.1150 (RS/PLSR), while it was 0.3713 for the SAP (SNV/PLSR), and 0.2379 for the SOM (FD-RLG/PLSR). As shown by the calibration results in Table 2, among the three investigated soil nutrients, the SAP was the most accurate calibration element with an average R 2 = 0.7356 followed by the SOM (R 2 = 0.7165), and the worst was for the SAK (R 2 = 0.5522). Similarly, the dispersion of these three soil nutrients of modeling R 2 from large to small was the SAK, SOM, and SAP. Furthermore, for the two regression methods, the modeling accuracy of the SVM was significantly higher than that of PLSR, regardless of the kind of soil nutrients, and the average R 2 values of the SVM and PLSR were 0.8054 and 0.5309. Except for the SAK, the discrete degree of modeling accuracy data (R 2 ) of the other Figure 5. Concentrated map of selected wavelengths of the different spectral variable selection methods (the horizontal axis represents the wavelength from 400 nm to 2450 nm, the vertical axis represents the three soil nutrients, the different size indicates the numbers of times each wavelength was selected by the 14 spectral forms).

Model Performances of Different Calibration Methods
The preprocessing transformation methods provided different calibration and prediction accuracies for the SAK, SAP, and SOM regarding the R 2 , RMSE, and RPD values. Table 2 and Figures 6 and 7 show the R 2 , RMSE, and RPD performance with different calibration methods for the prediction of the SAK, SAP, and SOM. The standard deviation (SD) between groups through SG transformation (SD SG = 0.1881) was smaller than that of the original transformation (SDraw = 0.2220), and based on the standard deviation of the raw spectral (RS), the difference of the samples in the group was the largest. The calibration accuracy based on SG transformation was not significantly improved compared with the unprocessed method, but the effect obtained for the specific part of the transformation form was better (e.g., RS or SNV). Some of them had differences in R 2 that were greater than 0.1 (e.g., RS and FD-LGR). The highest R 2 values of the SAK, SAP, and SOM were 0.8931, 0.9518, and 0.9277, respectively, which were obtained with the calibration methods of FD-RLG/PLSR, FD-RLG/PLSR, and FD-LGR/SVM, while the calibration methods with the lowest accuracy were more irregular. The highest R 2 value of the SAK was 0.1150 (RS/PLSR), while it was 0.3713 for the SAP (SNV/PLSR), and 0.2379 for the SOM (FD-RLG/PLSR). As shown by the calibration results in Table 2, among the three investigated soil nutrients, the SAP was the most accurate calibration element with an average R 2 = 0.7356 followed by the SOM (R 2 = 0.7165), and the worst was for the SAK (R 2 = 0.5522). Similarly, the dispersion of these three soil nutrients of modeling R 2 from large to small was the SAK, SOM, and SAP. Furthermore, for the two regression methods, the modeling accuracy of the SVM was significantly higher than that of PLSR, regardless of the kind of soil nutrients, and the average R 2 values of the SVM and PLSR were 0.8054 and 0.5309. Except for the SAK, the discrete degree of modeling accuracy data (R 2 ) of the other two soil nutrients is shown, as PLSR is more discrete than the SVM.   The RMSE explains the difference between the samples and the model predictions, and RPD represents the ratio of the standard error of the prediction to the standard deviation of the sample. The details of the RPD can be found in Section 2.6. Generally, a model that performs well will have a high R 2 and RPD value and a low RMSE value. The RMSE value trend of the three soil nutrients was similar to the statistical indicator R 2 ; the lowest R 2 values of the SAK, SAP, and SOM were 17.6897 mg/kg, 1.7887 mg/kg, and 2.2046 g/kg, as obtained with the calibration method of FD-RLG/PLSR, FD-RLG/PLSR, and FD-LGR/SVM, respectively, while the calibration methods with the highest RMSEs were irregular as well. For the difference in R 2 , the highest RMSE value of the SAK was 61.7496 mg/kg (SG-SNV/SVM), while it was 6.3700 mg/kg for the SAP (SNV/PLSR) and 6.7730 g/kg for the SOM (FD/PLSR). Considering that the RMSE value could be affected by different soil nutrient dimensions, we used the coefficient of variation (CV) to evaluate the degree of dispersion between groups. The results show that, except for the SAP, the variation degree of the RMSE values of the other two soil nutrients (SAK and SOM) was shown as the SVM being more discrete than PLSR, which was different from the results for R 2 . For soil nutrients, the variation degree of the SAK was greater than that of the SOM, followed by the SAP. Here, we also used the difference analysis of significance to evaluate the difference of the accuracy value (R 2 and RMSE) based on SG transformation and the difference between the two modeling methods. The results showed that SG transformation had no significant difference, as the p values of the R 2 and RMSE were 0.6451 and 0.9904, respectively. Comparing the two modeling methods of different soil nutrients, the accuracy values of PLSR and the SVM had extremely significant differences (p < 0.001). The RPD value indicates the prediction performance of the model, as shown in Figure 6. The model prediction performance of the three soil nutrients shows that the SAP and SOM were better than the SAK. The RPD value of the SVM was generally higher than that of PLSR.

RS
Through the above calibration results, we can conclude that the SVM modeling algorithm was better than PLSR because its comprehensive performance was better than PLSR for the R 2 , RMSE, and RPD indicators. However, through the prediction results, we calculated the ∆R 2 , which indicated a numerical difference between the modeling determination coefficient and the predicted determination coefficient. The ∆R 2 variation ranges of the SAK, SAP, and SOM were −0.4045-0.6860, −0.1172-0.2070, and −0.5591-0.2704, respectively, and the corresponding standard deviations were 0.2981, 0.2033, and 0.2671, respectively. In addition, the analysis results of significant differences between the groups showed extremely significant differences (p < 0.001). Specifically, SG transformation had little effect on the degree of dispersion of data in the group. Generally, the degree of dispersion in the original form was slightly higher than that processed by SG transformation, and the numerical difference of ∆R 2 was not significant. For the two model algorithms, there were significant differences between PLSR and the SVM for ∆R 2 (p < 0.001), and the variation ranges and data discreteness of PLSR were slightly higher than those of the SVM. Abbreviations used: raw spectral (RS), soil nutrients including soil organic matter (SOM), soil available potassium (SAK), soil available phosphorus (SAP), Savitzky-Golay filtering (SG), first derivative (FD), logarithmic transformation (LG), standard normal variate (SNV), multiplicative scatter correction (MSC), partial least squares regression (PLSR), support vector machine (SVM), first-order differential of the reciprocal logarithm (FD-LGR, log(1/R)'), and first-order differential of the reciprocal of the logarithm (FD-RLG, (1/log(R))').

Model Performances of Different Variable Selection Methods
The purpose of variable selection methods is to select wavelengths whose information content is minimally redundant. Figures 6 and 7 show the results of the calibration and prediction accuracy under different variable selection methods. The concentrated map of selected bands of CARS and the SPA ( Figure 5) also reflects the relationship between the number of bands selected, and during the modeling process, we counted the number of wavelengths that were ultimately involved in the modeling. Under the CARS algorithm, the SAK, SAP, and SOM were all approximately 16. For the SPA algorithm, the number of soil nutrients had a certain difference, and the average numbers of the SAK, SAP, and SOM were 6, 20 and 9, respectively. The R 2 average value of the calibration set showed that CARS was higher than the SPA; however, there was no significant difference between the three soil nutrients through significant difference analysis (p > 0.05). Specifically, the SAP performed better than the SOM, and the SAK was the worst for the two variable selection methods.

Model Performances of Different Variable Selection Methods
The purpose of variable selection methods is to select wavelengths whose information content is minimally redundant. Figures 6 and 7 show the results of the calibration and prediction accuracy under different variable selection methods. The concentrated map of selected bands of CARS and the SPA ( Figure 5) also reflects the relationship between the number of bands selected, and during the modeling process, we counted the number of wavelengths that were ultimately involved in the modeling. Under the CARS algorithm, the SAK, SAP, and SOM were all approximately 16. For the SPA algorithm, the number of soil nutrients had a certain difference, and the average numbers of the SAK, SAP, and SOM were 6, 20 and 9, respectively. The R 2 average value of the calibration set showed that CARS was higher than the SPA; however, there was no significant difference between the three soil nutrients through significant difference analysis (p > 0.05). Specifically, the SAP performed better than the SOM, and the SAK was the worst for the two variable selection methods.
In the prediction results, the variable selection methods used for the highest R 2 values for the SAK, SAP, and SOM were the SPA, SPA, and CARS, respectively. Additionally, through the CV value, except for the SAP, the variation degree of the R 2 values of the other two soil nutrients (SAK and SOM) was shown as the SPA being more discrete than CARS, and the SAK was the most discrete no matter which selection method was chosen. The maximum CV was 48.0522%, which appeared in the SPA variable selection method for the SAK, and the lowest CV was 18.1591%, which appeared in the SPA variable selection for the SAP.

Preprocessing Transformations
The spectrum contains useful information as well as considerable redundant and useless information. Consequently, the presence of unexpected irrelevant information in the spectra can also significantly affect the performance of the calibrated models used to quantify the SAK, SAP, and SOM. In this paper, the transformation forms of FD, SNV, MSC, LG, DF-RLG, and DF-LGR were utilized on the SG spectra. Thus, together with the raw spectral form (R and SG), there were a total of 14 transformation forms. These preprocessing transformations based on various mathematical functions can be used for eliminating the influence of noise, correcting for nonlinearity, measurement and sample variations, and noisy spectra [64]. The results of the regression with PLSR, the SVM, and the performance of the models during the prediction phase are shown in Table 2 and Figures 6 and 7. In general, the prediction accuracy under some transformation forms was improved after SG processing, but the significant difference analysis of the prediction results with and without SG treatment showed that the results did not differ significantly (p > 0.05). This was different from most of the current studies, where the SG method was used in combination with other methods to obtain an optimal pretreatment method that was proven to produce good results [34,81]. The SG method was used to smooth and remove random noise (physical jitter, external environmental interference, etc.) from the spectra. The effect of SG filtering was not obvious (though the enhancement was obvious compared with the original spectrum), and the prediction accuracy of PLSR with some preprocessing transformations (e.g., LG and FD-RLG) for predicting the SAK, SAP, and SOM even decreased after SG filtering. which may have been related to the process of noise signal reduction. Part of the important corresponding nonlinear information was removed by SG filtering [5]. In addition, the most commonly used preprocessing transformations performed better than raw spectral reflectance, and similar results have been presented in other works (e.g., [5,83]). This means that various transformations of the spectral variables can successfully eliminate the effects of physical phenomena such as the light scattering effects of particles of different sizes and shapes when using hyperspectral Vis-NIR data for prediction [35].

Feature Wavelengths
The spectrum contains information about the fundamental composition of the soil, which can be used to describe the soil types and their changes in the landscape [84]. Soil's spectral information is affected by its physical properties, chemical composition, and mineral composition [85]. From the microscopic level, the chemical bonds of different molecules vibrate at a characteristic frequency under the action of electromagnetic energy. In this process, there is a process of absorption, reflection, and scattering of energy, which may have a certain relationship with a specific wavelength [86]. Therefore, under the condition of known reflectivity, we can obtain information on the soil property content through the relationship between reflectivity and the soil's nutrients [44,80]. Spectroscopy has been widely recognized as an effective tool for analyzing soil nutrients [84]. Some soil properties or nutrients (e.g., soil moisture content or soil organic matter) can establish a direct relationship with the nutrient content at a specific spectral wavelength. O-H stretch vibrations near 1400 nm [87], 2200 nm [88], and 1900 nm correspond to H 2 O molecules [89]. However, some soil nutrients (e.g., soil potassium and soil phosphorus) need to rely on indirect inversion of other soil component contents, because there is no direct response associated with them at the spectral wavelength and they usually exist at low concentrations [23,24]. This is just as some research teams have revealed the effects of hydrogen bond perturbations in water systems [90]. When these elements (indirect response soil nutrients) are measured, the changes in the spectral profile are due to indirect effects of the association of soil nutrients with other elements (direct response elements). The direct response elements will also be influenced by the environment, and the effects of such changes are dynamic and complex. In this paper, the most common sensitive wavelengths associated with SAK are 400-483 nm, around 728 nm, 967-1031 nm, 1271-1409 nm, 1643-1789 nm, 1975-2004 nm, 2109-2174 nm, and 2312-2449 nm. For SAP, they are 400-450 nm, 1000-1083 nm, 1292-1417 nm, around 1557 nm, 1604 nm, 1659 nm, 1835-2044 nm 2113-2216 nm, and 2355-2450 nm. These feature wavelength selection results are similar to those found in [91,92]. Based on soil absorbance in the visible-near-infrared regions [84], the feature wavelengths of the SAK are mainly related to ferrihydrite, goethite, amine (N-H), organic matter, free water (O-H), cellulose, lignin, starch, the first overtone of O-H stretch, Al-OH or Mg-OH, among others. The components corresponding to the wavelength selection of the SAP are similar to those of the SAK, and the element types are complex and inconsistent. This adds a challenge to the prediction of SAK and SAP concentrations [5]. Due to the overtones and combination absorptions of O-H, C-H, and N-H bonds, the SOM has broad, sensitive bands from the visible to the shortwave infrared range (350-2500 nm) [88]. The range of wavelengths selected for the SOM in this paper were 405-442 nm, 543-788 nm, around 1000 nm, 1295 nm, 1347-1358 nm, 1608-1620 nm, 1835-1934 nm, 2210 nm, and 2309-2448 nm. Some of these wavelengths were consistent with those selected (570-700 nm, 769 nm, 1020 nm, 1340 nm, and 1986 nm) in existing studies [93][94][95]. Figure 5 shows the distribution of selected wavelengths in the 400-2450 nm spectral range under different spectral variable selection methods. The number of effective wavelength variables selected by the CARS method was more than that of the SPA for the SAK, SAP, and SOM. Through the significant difference analysis, there was no significant difference in the accuracy of the models corresponding to the two methods (p > 0.05). Moreover, although the SPA method greatly reduced the number of spectral variables, in Table 2 and Figures 6 and 7, it can be seen that the corresponding integrated stability and accuracy of the model was inferior to that of the CARS method. This may be related to the processing mechanism and process of the SPA algorithm. The wavelength selection process adds the wavelength with the largest projection vector to the wavelength combination, and each newly selected wavelength has the smallest linear relationship with the previous wavelength so it can maximize the preferential selection of useful information, reduce the variable covariance, and reduce the amount of hyperspectral redundant information to enhance the model's computing power. In this process, unstable variables are selected, and some important and relevant variables that contain both confounding and informative variables are removed [44].

The Availability of the Model
According to Figures 6 and 7 and Table 2, both preprocessing transformations, variable selection methods, and regression algorithms influenced the model's performance regarding R 2 , RMSE, and RPD. Similar results have been presented by others (e.g., [5,96]). The results in this paper show that, for the modeling of the SOM, the SVM was better than PLSR, but for the SAK and SAP, PLSR was better than the SVM. It was also shown in [5] that PLSR performed better than the LS-SVM and BPNN. Figure 6 shows the difference between the modeling and prediction results in terms of the metric R 2 , which enabled us to roughly see the stability of the modeling approach. The results connected by the red line, which deviates more from the horizontal axis 0 and is basically above the horizontal axis 0, indicate an obvious overfitting situation of the SVM compared with the blue line. At the same time, we found that the modeling results using the PLSR algorithm were partially underfitted, but the overall performance was better than that of the SVM algorithm. The modeling prediction accuracy for the SAP and SOM was significantly better than that for the SAK. This also shows that the spectral inversion analysis of trace elements such as potassium (K) is more difficult than that for other elements [24]. Table 3 illustrates the best preprocessing transformation, variable selection methods, and regression methods after combining the modeling and prediction accuracy results for the SAK, SAP, and SOM ((FD-LGR)/SPA/PLSR for the SAK, (FD-LGR)/SPA/PLSR for the SAP, and SG+MSC/CARS/SVM for the SOM. It can also be concluded that the integrated inversion of the SOM was significantly better than that of the SAP and SAK. For the results of the final model, although the study areas had the same soil type, historically different land-use types (garden plot, paddy field, woodland, dry land, and wasteland) and different terrain, treatments (variable fertilization, variable straw returning, and variable planting density) resulted in significant data variation of the available nutrient contents. These effects were manifested in the different spectral curves, making it difficult to have a general model to predict the soil nutrient content at present. Moreover, proper sampling strategy in a more homogeneous area and a large dataset may increase the performance of the model. This conclusion was confirmed in the prediction of soil organic carbon [75], and further in-depth studies are needed for nutrients such as SAK, SAP, and SOM.

Conclusions
This study measured the spectra of 103 sample sites in the Xihe River watershed of Chongzhou in the western Chengdu Plain and collected corresponding soil samples. To reveal the relationship between the soil spectral information and soil nutrient content, in this study, two regression methods (SVMR and PLSR) combined with two variable selection methods (CARS and SPA) and 13 preprocessing transformations were used to estimate the SAK, SAP, and SOM contents based on Vis-NIR reflectance spectroscopy. The results illustrate that the prediction performance could be significantly improved by applying proper calibration methods, and the details for this are as follows: 1. In addition to RS, the other 13 kinds of spectra showed different changes in the SAK, SAP, and SOM contents to varying degrees, and SG transformation did not significantly improve the results except for RS. First-order derivatives based on logarithmic and inverse transformations (FD-LGR) can provide better predictions of the SAK and SAP nutrient contents, and the best form of SOM transformation is SG+MSC.
2. In terms of selecting effective wavelength variables for modeling, the CARS technique was superior to the SPA techniques, although the number of bands selected by the SPA methods was much smaller.
3. According to our comprehensive statistical comparison, the PLSR models performed better than the SVM model for estimation because of the overfitting problem of the SVM. 4. As a result, hyperspectral Vis-NIR data (400-2450 nm) showed a good ability to predict the SOM, a moderate ability to predict the SAP, and poor ability to predict the SAK.
According to numerical and visual tradeoffs, in this paper, (FD-LGR)/SPA/PLSR, (FD-LGR)/SPA/PLSR, and SG+MSC/CARS/SVM were selected as the most suitable methods for predicting the SAK, SAP, and SOM. Their corresponding predicted R 2 and RMSE were 0.7532 and 32.3090 mg/kg, 0.7440 and 6.6910 mg/kg, and 0.9009 and 3.2103 g/kg, respectively.
The current study evaluating calibration and spectral variable selection methods for predicting three soil nutrients using Vis-NIR spectroscopy also provides a framework for predicting various soil nutrients that meet the requirements of modern soil management and high-quality development. Further work should investigate other calibration methods and pay more attention to optimizing these methods to improve the predictive performance. In addition, the number of soil samples and the variation in soil types can affect the shape of the spectral curve, and research should focus on the relationship between soil spectral characteristics and chemical composition-especially for elements without direct correlation wavelengths in the band range-to make the model more stable and robust. Moreover, indepth research on the prediction mechanism is also important to enhance the generalization of the model.

Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.

Data Availability Statement:
The data presented in this study are available on request from the corresponding author. The data are not publicly available due to the project requirements.

Conflicts of Interest:
The authors declare no conflict of interest.