Estimation of Organic Carbon in Anthropogenic Soil by VIS-NIR Spectroscopy: E ﬀ ect of Variable Selection

: Visible and near-infrared reﬂectance (VIS-NIR) spectroscopy is widely applied to estimate soil organic carbon (SOC). Intense and diverse human activities increase the heterogeneity in the relationships between SOC and VIS-NIR spectra in anthropogenic soil. This fact results in poor performance of SOC estimation models. To improve model accuracy and parsimony, we investigated the performance of two variable selection algorithms, namely competitive adaptive reweighted sampling (CARS) and random frog (RF), coupled with ﬁve spectral pretreatments. A total of 108 samples were collected from Jianghan Plain, China, with the SOC content and VIS-NIR spectra measured in the laboratory. Results showed that both CARS and RF coupled with partial least squares regression (PLSR) outperformed PLSR alone in terms of higher model accuracy and less spectral variables. It revealed that spectral variable selection could identify important spectral variables that account for the relationships between SOC and VIS-NIR spectra, thereby improving the accuracy and parsimony of PLSR models in anthropogenic soil. Our ﬁndings are of signiﬁcant practical value to the SOC estimation in anthropogenic soil by VIS-NIR spectroscopy.


Introduction
Soil holds the most massive storage of organic carbon in the terrestrial ecosystems [1,2]. Natural soil may transform into anthropogenic soil by long-term human activities [3]. Soil organic carbon (SOC) storage has also been altered due to this transformation [4]. Given that SOC plays a pivotal role in global warming [5,6], carbon cycle [7,8], and food security [6,9], changes of SOC storage may influence

Sampling Area and Soil Samples
The study area is located in the Jianghan Plain, China. It is known as 'Land of Fish and Rice'. The elevation changes from 22 m to 28 m. The study area is under a subtropical warm monsoon climate. The mean annual temperature is 16.10 • C, the mean annual precipitation is 1154 mm, and the mean annual relative humidity is 80%. The land-use types include cropland, woodland, and meadows. Cropland patches are highly fragmented, and some of them are close to settlements and various water bodies (breeding, ponds, irrigated canals, lakes, and rivers) [52]. Diverse land management practices are carried out in our study area according to our field survey. Intensive human activities have led to the heterogeneity of the relationship between VIS-NIR spectra and SOC [18].
A total of 108 topsoil samples (0-15 cm) were collected from 20 December 2011 to 21 December 2011. Each soil sample was a mix of five soil subsamples, which were collected from the center and four corners in a square of 1 m 2 [53]. The geographical coordinates of these samples were recorded by a hand-held global positioning system, and the geographical distributions are shown in Figure 1. The total collected soil samples (Dataset 0) were divided into three datasets according to sampling locations, land use and land cover types (Dataset 1, Dataset 2, and Dataset 3, respectively). Samples of the three datasets were collected from three sites with different human activities on a small scale [18]. Samples of Dataset 1 was collected from cropland that was adjacent to a breeding pond. Dataset 2 was sampled from cropland that was surrounded by cropland. Dataset 3 included samples of various land-use types (cropland, artificial forest, meadows, and breeding ponds). These samples were put in sealed plastic bags with sampling sequence labels and then were sent to the laboratory at room temperature on 22 December 2011. After a principle components analysis and a 3σ standard, five outliers were discarded, and 103 samples were retained for further data analysis in this study.

VIS-NIR Spectral Measurement and SOC Analysis
In the laboratory, soil samples were air-dried at 20-30 • C for 1 week, then ground, and passed through a 2-mm sieve [54]. An ASD FiledSpec 3 portable spectro-radiometer with a spectral range of 350-2500 nm, and a spectral resolution of 1 nm was used to scan soil samples in a dark room to avoid stray light interference. All samples were put separately in dishes with a 20-cm diameter. A halogen lamp placed at 30 cm distance and an angle of 45 • was used to illuminate soil samples. The detection fiber probe was placed vertically to soil samples at 12 cm distance. A white Spectralon panel was used to calibrate the spectrometer before measuring spectrum of the first soil sample and repeated every six soil samples [53]. A total of 10 scans were recorded for each soil sample, which were averaged in one sample spectrum [55]. Through these procedures, the reflectance spectra of the 108 samples were obtained. The spectra in the range of 350-399 nm and 2450-2500 nm were removed due to serious noises. The remained spectra (400-2449 nm) were further resampled to 10 nm to extract 205 wavebands.
The SOC content was measured by wet oxidation at 180 • C with a mixture of potassium dichromate and sulfuric acid [18]. It should be noted that the oxidation of active organic carbon by this approach is incomplete, which underestimates the SOC content. A "standardized" corrective factor ranging from 1.10 to 1.40 could be used in practice [56]. In our study, we used the "raw" SOC content without using a "standardized" corrective factor. This allowed comparing the results of our study with other studies that also use the "raw" SOC content.
Remote Sens. 2020, 12, x FOR PEER REVIEW 4 of 19 Figure 1. Location of the study area and soil sampling.

VIS-NIR Spectral Measurement and SOC Analysis
In the laboratory, soil samples were air-dried at 20-30 °C for 1 week, then ground, and passed through a 2-mm sieve [54]. An ASD FiledSpec 3 portable spectro-radiometer with a spectral range of 350-2500 nm, and a spectral resolution of 1 nm was used to scan soil samples in a dark room to avoid stray light interference. All samples were put separately in dishes with a 20-cm diameter. A halogen lamp placed at 30 cm distance and an angle of 45° was used to illuminate soil samples. The detection fiber probe was placed vertically to soil samples at 12 cm distance. A white Spectralon panel was used to calibrate the spectrometer before measuring spectrum of the first soil sample and repeated every six soil samples [53]. A total of 10 scans were recorded for each soil sample, which were averaged in one sample spectrum [55]. Through these procedures, the reflectance spectra of the 108 samples were obtained. The spectra in the range of 350-399 nm and 2450-2500 nm were removed due to serious noises. The remained spectra (400-2449 nm) were further resampled to 10 nm to extract 205 wavebands.
The SOC content was measured by wet oxidation at 180 °C with a mixture of potassium dichromate and sulfuric acid [18]. It should be noted that the oxidation of active organic carbon by this approach is incomplete, which underestimates the SOC content. A "standardized" corrective factor ranging from 1.10 to 1.40 could be used in practice [56]. In our study, we used the "raw" SOC content without using a "standardized" corrective factor. This allowed comparing the results of our study with other studies that also use the "raw" SOC content.

Spectral Variable Selection
CARS and RF were chosen to achieve spectral variable selection in our study. CARS is designed on the principle named 'survival of the fittest' [16,37]. When using this algorithm to select the spectral variables, each spectral waveband is treated as an individual. The adaptive individuals are left behind as the final spectral variable subset, while the maladaptive individuals are eliminated. The scheme of the CARS algorithm can be summarized as five steps: (1) utilize the Monte Carlo algorithm to randomly choose n samples and build Partial least squares regression (PLSR) models for these subsets; (2) retain variables with high regression coefficients and remove variables with low regression coefficients through a exponentially decreasing function and adaptive reweighted sampling algorithm; (3) build PLSR model and compute the root mean square error of cross validation (RMSE cv ) with the retained variables as a new variable subset; (4) repeat steps 1-3 for N runs to obtain N new variable subsets; and (5) choose the new subsets with the lowest RMSE cv as the optimal variable subset [41]. The run times (N) was set to 50 in this study.
Random frog (RF) is also a variable selection algorithm, which is inspired by the reversible jump Markov Chain Monte Carlo (RJMCMC) technique [40]. It is mathematically easier and more computationally efficient than RJMCMC. RF searches for the best variable subset from a full variable Remote Sens. 2020, 12, 3394 5 of 18 set by considering the interaction between spectral variables. Five steps are needed to realize this algorithm: (1) randomly initialize variable subset K0 consisting of Q variables; (2) obtain new variable subset K* using the way of the normal distribution, then update the variable subset K0 as K1 through the cross-validated misclassification errors of K0 and K*; this step needs to be iterated many times; (3) compute each variable selection probability after the iterations finish, and variables with same selection probability are selected as variable subsets; (4) build PLSR model with each variable subset, and compute RMSE cv ; and (5) choose the variable subset with the lowest RMSE cv as the optimal variable subset. X (m×n), Y (n×1), N, and Q are the key input parameters for this algorithm, where X is the spectral variables, and Y is the SOC content, in which m is the number of samples, and n is the number of spectral variables. N is the number of iterations, and Q is the number of variables consisting of the initialized variable subset [44]. The number of iterations (N) was set to 50 in this study. In this study, CARS and RF were performed in Matlab (R2018b, MathWorks, Inc., Natick, MA, USA) with the libPLS toolbox (Version 1.98), which was available at http://www.libpls.net/download.php.

Model Calibration and Validation
The 103 samples were divided into calibration and validation datasets using the concentration gradient method [16]. According to the SOC values, the 103 samples were sorted in ascending order. Then, the 103 samples were divided into a total of 35 groups. The last group included one sample, and the other groups had three samples. In each group, the first and third samples were selected and added into the calibration dataset, whereas the second sample was added into the validation dataset. Thus, the calibration dataset contained 69 samples, and the validation dataset contained 34 samples.
Partial least squares regression (PLSR) was chosen to develop regression models between SOC and VIS-NIR spectra variables, which is proposed by Wold et al. [57]. It creates a linear regression model by projecting the independent and dependent variables into a new space [58]. This method integrates the advantages of principal component analysis, canonical correlation analysis and linear regression. It can overcome the multicollinearity problem and improve model performance. In this study, the PLSR models were built on the calibration dataset and the performance of the PLSR models was evaluated on the validation dataset. The coefficient of determination (R 2 ), the root mean squared error (RMSE) and the residual prediction deviation (RPD) were calculated using Equations (1)-(3) [57]. A desirable PLSR estimation model should have high R 2 and RPD with a low RMSE on the validation dataset.
where i is the ith soil sample, n is the number of soil samples, y is the mean of the measured SOC content, y i is the measured SOC content for the ith soil sample,ŷ i is the predicted SOC content for the ith soil sample, SD is the standard deviation of the measured SOC content of the validation dataset, and RMSE P is the root mean square error of the validation dataset.  Figure 2 shows raw spectra and pretreated spectra with different pretreatments. In Figure 2a, raw spectra had a similar shape for all samples, whose reflectance rose rapidly before 850 nm and slowly after 850 nm. There were three absorption peaks with different depths distinguishable in the near-infrared region, which are attributed to the hydroxyl group of free water (1410 nm and 1920 nm) and the Al-OH group of clay minerals (2210 nm) [59]. After FD, the most spectral values tended to approach zero, highlighting the bands where the curvature of raw spectra changed greatly. Spectra after Log (1/R)) pretreatment showed typical decline in absorption at different rates with the range of 400-2449 nm. After MC, spectral values were between −0.15 and 0.15. There was still a significant fluctuation at around 1410 nm, 1920 nm, and 2210 nm. The spectra after MSC and SNV had similar shapes, but with different ranges.

Correlation Analysis
The differences in the correlation coefficient curves reveal heterogeneous relationships between raw VIS-NIR spectra and SOC ( Figure 3). In Figure 3, blue '+', green '+', and magenta '+' symbols refer to locations of VIS-NIR spectral variables having significant correlations for Dataset 1, Dataset 2, and Dataset 0, respectively (at a significance level of 0.05). It was revealed that SOC had a

Correlation Analysis
The differences in the correlation coefficient curves reveal heterogeneous relationships between raw VIS-NIR spectra and SOC ( Figure 3). In Figure 3, blue '+', green '+', and magenta '+' symbols refer to locations of VIS-NIR spectral variables having significant correlations for Dataset 1, Dataset 2, and Dataset 0, respectively (at a significance level of 0.05). It was revealed that SOC had a significantly negative correlation with raw VIS-NIR spectra in the region of 400-2449 nm for Dataset 1 and Dataset 0. The spectral variables with significant negative correlations distributed in the region of 480-900 nm and 1170-1870 nm for Dataset 2, whereas no significant correlations were observed for Dataset 3. Figure 2. The raw and pretreated soil spectra. (a) None: raw spectra; (b) FD: the spectra after first derivative; (c) Log (1/R): the absorption spectra; (d) MC: the spectra after mean centering; (e) MSC: the spectra after multiplicative scatter correction; and (f) SNV: the spectra after standard normal variate.

Correlation Analysis
The differences in the correlation coefficient curves reveal heterogeneous relationships between raw VIS-NIR spectra and SOC ( Figure 3). In Figure 3, blue '+', green '+', and magenta '+' symbols refer to locations of VIS-NIR spectral variables having significant correlations for Dataset 1, Dataset 2, and Dataset 0, respectively (at a significance level of 0.05). It was revealed that SOC had a significantly negative correlation with raw VIS-NIR spectra in the region of 400-2449 nm for Dataset 1 and Dataset 0. The spectral variables with significant negative correlations distributed in the region of 480-900 nm and 1170-1870 nm for Dataset 2, whereas no significant correlations were observed for Dataset 3.   Figure 4 shows details of spectral variables selected by CARS, taking the Log (1/R) spectra as an example. Figure 4a shows the number of selected spectral variables during 50 iterations. As the number of iterations increases, the selected spectral variables first decreased sharply and then gradually. Figure 4b shows the change of 5-flod RMSEcv during the 50 iterations. As the number of iterations increased, the RMSEcv value first declined and then ascended. At 21 iterations, RMSEcv reached a minimum. Figure 4c shows the regression coefficient path of all spectral variables (regression coefficient path: the changing trend of regression coefficient values during the 50 iterations). At iteration one, the regression coefficients of spectral variables were similar. With the increase of iterations, the regression coefficient of some spectral variables gradually increased, while they remained zero for others. According to Figure 4a-c, at iteration 21, the selected spectral variables were the optimal subset that helped to achieve the best model performance.

Spectral Variable Selection
reached a minimum. Figure 4c shows the regression coefficient path of all spectral variables (regression coefficient path: the changing trend of regression coefficient values during the 50 iterations). At iteration one, the regression coefficients of spectral variables were similar. With the increase of iterations, the regression coefficient of some spectral variables gradually increased, while they remained zero for others. According to Figure 4a-c, at iteration 21, the selected spectral variables were the optimal subset that helped to achieve the best model performance. For the other spectral pretreatments, the number of selected spectral variables also decreased with iterations. Different iterations were required to reach the minimum RMSEcv for spectra with different pretreatments (e.g., 25, 23, 25, 25, and 28 for the pretreatments of None, FD, MC, MSC, and SNV, respectively). For the other spectral pretreatments, the number of selected spectral variables also decreased with iterations. Different iterations were required to reach the minimum RMSE cv for spectra with different pretreatments (e.g., 25, 23, 25, 25, and 28 for the pretreatments of None, FD, MC, MSC, and SNV, respectively). Figure 5 shows the spectral variables selected by the CARS method with different spectral pretreatments. A total of 21, 26, 31, 21, 21, and 16 spectral variables were selected as the optimal spectral subset for the pretreatments of None, FD, Log (1/R), MC, MSC, and SNV, respectively. Most of the selected spectra variables were concentrated in the 1800-2449 nm spectral range. Only one spectral variable was selected in the range of 400-780 nm for MSC and SNV. No spectral variable was selected in the range of 1200-1800 nm for FD and SNV, and neither in the range of 780-1200 nm for MSC.
Remote Sens. 2020, 12, x FOR PEER REVIEW 9 of 19 Figure 5 shows the spectral variables selected by the CARS method with different spectral pretreatments. A total of 21, 26, 31, 21, 21, and 16 spectral variables were selected as the optimal spectral subset for the pretreatments of None, FD, Log (1/R), MC, MSC, and SNV, respectively. Most of the selected spectra variables were concentrated in the 1800-2449 nm spectral range. Only one spectral variable was selected in the range of 400-780 nm for MSC and SNV. No spectral variable was selected in the range of 1200-1800 nm for FD and SNV, and neither in the range of 780-1200 nm for MSC.  Figure 6 shows the probability that each spectral variable is repeatedly selected during the 50 iterations. With the increase of selection probability, the number of selected spectral variables reduced. For different spectral pretreatments, each spectral variable had different selection  Figure 6 shows the probability that each spectral variable is repeatedly selected during the 50 iterations. With the increase of selection probability, the number of selected spectral variables reduced. For different spectral pretreatments, each spectral variable had different selection probability. The number of selected spectral variables was the least when spectral pretreatment was FD. Figure 7 shows the change of 5-Flod RMSE cv in different spectral variable selection probability. Spectral variables were selected as the optimal subset with the lowest 5-Flod RMSE cv . According to Figure

Accuracy of Estimation after Different Pretreatment and Variable Selection Techniques
To evaluate the effectiveness of different variable selection methods and the influence of different pretreatments tested, PLSR was used to establish a series of SOC models, whose results are shown in Table 2 and Figure 9. Table 2. Accuracies of soil organic carbon (SOC) estimation using full-spectrum-based partial least squares regression (PLSR) models, competitive adaptive reweighted sampling (CARS)-based PLSR models, and random frog (RF)-based PLSR models after different spectral pretreatments.

Accuracy of Estimation after Different Pretreatment and Variable Selection Techniques
To evaluate the effectiveness of different variable selection methods and the influence of different pretreatments tested, PLSR was used to establish a series of SOC models, whose results are shown in Table 2 and Figure 9. Table 2. Accuracies of soil organic carbon (SOC) estimation using full-spectrum-based partial least squares regression (PLSR) models, competitive adaptive reweighted sampling (CARS)-based PLSR models, and random frog (RF)-based PLSR models after different spectral pretreatments.  Both CARS and RF could enhance the performance of PLSR models (except for the combination of CARS and FD, Figure 9a), compared to the full-spectrum models. The RPD of the full-spectrum models was 1.81, while the RPD of CARS and RF models were of equal value of 1.94 ( Table 2). The best Rp 2 in each spectral variable selection category slightly increased from 0.80 (Full spectrum) to 0.81 (CARS) and 0.83 (RF), and RPD increased from 1.96 (Full spectra) to 2.05 (CARS) and 2.11 (RF).

Spectral Variable Selection
Spectral pretreatments have different effects on the full-spectrum PLSR models and the PLSR models obtained after spectral variable selection (Table 2 and Figure 9b). For the full-spectrum PLSR models, FD, Log (1/R) and MC had positive effects, as Rp 2 increased from 0.70 to 0.80; RMSEp decreased from 3.60 g/kg to 3.17 g/kg; and RPD increased from 1.72 to 1.96. MSC and SNV had negative effects, as Rp 2 remained 0.70; RMSEp increased from 3.60 g/kg to 3.66 g/kg; and RPD decreased from 1.72 to 1.69. Full-spectrum PLSR model with FD had the largest RPD difference (0.23), and that with MSC and SNV had the lowest RPD difference (−0.03).
For the CARS + PLSR models, the spectral pretreatments of Log (1/R) and MC enhanced the PLSR model performance, but the spectral pretreatments of FD, MSC and SNV weakened PLSR model performance. The best Rp 2 , RMSEp and RPD values were 0.81, 3.02 g/kg and 2.05, respectively. The worst Rp 2 , RMSEp and RPD values were 0.73, 3.42 g/kg and 1.81, respectively. CARS + PLSR model with MSC had the largest RPD difference (−0.25), and that with MC had the lowest RPD difference (0.01).
For the RF + PLSR models, all spectral pretreatments had positive effects. Rp 2 increased from 0.72 to 0.83; RMSEp decreased from 3.34 g/kg to 2.94 g/kg; and RPD increased from 1.86 to 2.11. The RF + PLSR model with Log (1/R) had the largest RPD difference (0.23), and that with SNV had the lowest RPD difference (−0.03).
The number of selected spectral variables after RF was larger than that after CARS. For RF, the minimum and maximum numbers of selected spectral variables were 21 and 106, respectively, whereas there were 16 and 31 for CARS. Besides, the number of latent variables after RF was larger than that after CARS. Figure 9. (a) Ratio of prediction deviation (RPD) difference between competitive adaptive reweighted sampling (CARS)/random frog (RF) and full spectrum SOC models after the same spectral pretreatments (spectral pretreatments include non-pretreatment (None), first derivative (FD), Log (1/R), mean centering (MC), multiplicative scatter correction (MSC), and standard normal variate (SNV)); (b) RPD difference between non-pretreated and pretreated SOC models after the same variable selection algorithms (variable selection algorithms include full spectrum, CARS, and RF).
Both CARS and RF could enhance the performance of PLSR models (except for the combination of CARS and FD, Figure 9a), compared to the full-spectrum models. The RPD of the full-spectrum models was 1.81, while the RPDs of CARS and RF models were of equal value of 1.94 ( Table 2). The best R p 2 in each spectral variable selection category slightly increased from 0.80 (Full spectrum) to 0.81 (CARS) and 0.83 (RF), and RPD increased from 1.96 (Full spectra) to 2.05 (CARS) and 2.11 (RF). Spectral pretreatments have different effects on the full-spectrum PLSR models and the PLSR models obtained after spectral variable selection (Table 2 and Figure 9b). For the full-spectrum PLSR models, FD, Log (1/R) and MC had positive effects, as R p 2 increased from 0.70 to 0.80; RMSE p decreased from 3.60 g/kg to 3.17 g/kg; and RPD increased from 1.72 to 1.96. MSC and SNV had negative effects, as R p 2 remained 0.70; RMSE p increased from 3.60 g/kg to 3.66 g/kg; and RPD decreased from 1.72 to 1.69. Full-spectrum PLSR model with FD had the largest RPD difference (0.23), and that with MSC and SNV had the lowest RPD difference (−0.03). For the CARS + PLSR models, the spectral pretreatments of Log (1/R) and MC enhanced the PLSR model performance, but the spectral pretreatments of FD, MSC and SNV weakened PLSR model performance. The best R p 2 , RMSE p and RPD values were 0.81, 3.02 g/kg and 2.05, respectively.
The worst R p 2 , RMSE p and RPD values were 0.73, 3.42 g/kg and 1.81, respectively. CARS + PLSR model with MSC had the largest RPD difference (−0.25), and that with MC had the lowest RPD difference (0.01). For the RF + PLSR models, all spectral pretreatments had positive effects. R p 2 increased from 0.72 to 0.83; RMSE p decreased from 3.34 g/kg to 2.94 g/kg; and RPD increased from 1.86 to 2.11. The RF + PLSR model with Log (1/R) had the largest RPD difference (0.23), and that with SNV had the lowest RPD difference (−0.03). Therefore, the Log (1/R) was the best pretreatment for spectral variable selection. Log (1/R) + CARS/RF + PLSR models provided the best performance (Log (1/R) + CARS + PLSR model: R p 2 = 0.81, RMSE p = 3.02 g/kg, and RPD = 2.05, Log (1/R) + RF + PLSR model: R p 2 = 0.83, RMSE p = 2.94 g/kg, and RPD = 2.11). The number of selected spectral variables after RF was larger than that after CARS. For RF, the minimum and maximum numbers of selected spectral variables were 21 and 106, respectively, whereas there were 16 and 31 for CARS. Besides, the number of latent variables after RF was larger than that after CARS.

The Effect of Spectral Variable Selection Techniques on Model Accuracy
Numerous studies have proven that SOC estimation by VIS-NIR spectra is efficient and accurate. Nevertheless, it remains a challenge to apply VIS-NIR spectroscopy to estimate SOC in anthropogenic soils that are characterized by high heterogeneity in the relationship between VIS-NIR spectra and SOC. The study area, Jianghan plain, has been under a long-term period of human activities with a highly fragmented landscape [52,60]. Our study reported that there was strong heterogeneity in the relationship between SOC and VIS-NIR spectra. With soil samples collected from the study area, we investigated the effect of two spectral variable selection techniques to improve the performance of the PLSR models in SOC estimation in anthropogenic soils. Our results demonstrated that spectral variable selection could improve the accuracy of PLSR models, wherein RPD increased by 0.13 (without spectral variable selection: the best R 2 = 0.80, RPD = 1.96 and RPD = 1.81; with spectral variable selection: the best R 2 = 0.83, RPD = 2.11 and RPD = 1.94). This is because that spectral variable selection could eliminate unimportant information, reserve relevant information, and reduce spectral collinearity [37]. The performance of the proposed spectral variable selection was comparable to other strategies that aimed to improve the accuracy of PLSR models for anthropogenic soil. Liu et al. compared a variety of sample selection algorithms, which aimed to develop a representative calibration dataset for SOM estimation [18]. The best RPD achieved in their study was lower than the RPD in this study. Liu et al. further combined the Kennard-Stone algorithm and spectral pretreatment to choose representative calibration samples, and achieved an RPD of 1.85, which was still poorer than that obtained in the current study [59]. Wang et al. proposed the MVARC-R-KS method to select representative calibration samples (not spectral variables as in the current study), which has resulted in good accuracy of PLSR models [61]. They reported that the best RPD was 1.81, which was also lower than the RPD in this study. These mentioned strategies mainly focus on the selection of representative calibration samples to improve the accuracy of PLSR models, while our strategies focus on the selection of representative spectral variables. A combination of these two strategies to further improve the performance of PLSR models in SOC estimation could be explored in future rese arch.
We also tested the performance of spectral pretreatments with spectral variable selection techniques. The PLSR models performed differently using different selected spectral variables after different spectral pretreatments. For the CARS-based models, Log (1/R) and MC enhanced the accuracy of PLSR models, while FD, MSC and SNV led to accuracy deterioration. FD has not improved the estimation performance of the PLSR model. This might be due to the lack of variables in the spectral region of 1200-1800 nm, a spectral range that contains relevant overtones and combinations of fundamental bonds of O-H and C-O groups that are associated with SOC. The amplification of noise may also hinder useful signals [62]. MSC has failed to improve model performance as only one variable was selected in the visible region of 400-780 nm, which is an important region associated with soil color variation in the blue band around 450 nm. The darker the soil color, the higher the SOC content [63]. This single selected variable may not have accounted for sufficient information to capture variation in soil darkness that could be linked with variation in SOC. The number of selected spectral variables with SNV was large than that with MSC, which explains why the PLSR model with SNV outperforms that with MSC. For the RF-based models, all pretreatments enhanced the accuracy of PLSR models. It could be attributed to the fact that RF could select important spectral variables as comprehensively as possible while reducing collinearity between spectral variables. Among them, Log (1/R) + CARS/RF + PLSR models provided the best estimation accuracy. Since SOC absorbs energy at specific frequencies in VIS-NIR, Log (1/R) enhanced its absorption feature [64].
It should be noted that the effects of spectral pretreatments are determined by the quality of raw spectra. In other words, spectral pretreatments would be necessary in the case that raw spectra are influenced by variable soil physical conditions and by the surrounding environment during measurement (e.g., temperature, ambient light). In this study, however, all the samples were air-dried, ground and passed through a 2-mm sieve, and VIS-NIR spectra were measured in a well-controlled environment. Therefore, both CARS and RF could develop satisfactory SOC estimation models without spectral pretreatments. Meanwhile, this study provided that Log (1/R) should be adopted to further improve the accuracy of SOC estimation models.

The Effect of Spectral Variable Selection Techniques on Model Parsimony
Model parsimony is an important issue that has been less investigated in previous studies. With the demand for advanced regression algorithms, fewer spectral variables are desired for SOC estimation and mapping when using spectra data. This study demonstrated fewer spectral variables were needed to build PLSR models by CARS and RF algorithms, as the numbers of selected variables were 31 and 83 for CARS and RF, respectively.
The search mechanism of spectral variables by CARS and RF algorithms is different, which results in a different number of selected spectral variables and their distributions. CARS selects spectral variables that have the largest regression coefficient through continuous iterations. In our study, the selected spectral variables were mainly concentrated in the spectral region around 540 nm and 1800-2400 nm. This result was similar to that of Vohland et al. and Hong et al. [16,33,42]. Vohland et al. selected variables at 1875 nm, 2205-2220 nm, and 2330-2345 nm to successfully estimate SOC content in the Eifel-Moselle-Hunsr-ück region. Hong et al. reported spectral variables at 400-800 nm and 2000-2400 nm to be associated with SOC. It should be noted that sensitive spectral variables for SOC may differ from one dataset to another. This explains why the selected spectral variables by CARS are not the same in different studies. The number of the selected spectral variables by RF is larger than that by CARS, and the distribution of its selected spectral variables is sparser. It can be explained by the fact that RF globally searches the optimal variable combination based on random frog leaping. The selected spectral variables by RF were mainly distributed around 430 nm, 610 nm, 800 nm, 1000 nm, 1100 nm, 1200 nm, 1420 nm, 1500 nm, 1800 nm, 1920 nm, 2000 nm, 2100 nm, 2200 nm, and 2350 nm.
In summary, the selected spectral variables in the visible region by both methods are mainly concentrated in the locations of 400 nm, 530 nm and 610 nm. These wavelengths are associated with variation in chromophores and the darkness of humic acid (e.g., due to blue absorption band at 450 nm and perhaps red color absorption band at 680 nm) [65]. Spectral variables in the near-infrared region are chosen because some fundamental vibrational bonds are associated with SOC. These bonds mainly include C-H, N-H, C-O, C-O, O-H, and Al-OH [54,66]. Table 3 shows the possible assignments of fundamental bonds, absorption wavelength, and related soil constituents for the main selected spectral variables in the near-infrared region by RF and CARS. Some selected spectral wavelengths in this study do not coincide with possible wavelengths of fundamental vibrational bonds. It could be attributed to slight shifts in wavelength locations due to inharmonic molecules vibrations [54]. Table 3. Possible assignments of fundamental bonds, absorption wavelength, and related soil constituent for the selected spectral variables in the near-infrared region by competitive adaptive reweighted sampling (CARS) and random frog (RF) [54,66].

The Implication of the Proposed Strategy
Accuracy and parsimony of PLSR models could be improved by the proposed modeling approach investigated in this study, but could not be optimal at the same time. The more spectral variables, the more complex PLSR models and vice versa. However, the inclusion of a larger number of input spectral variables could be redundant, while a smaller number of spectral variables could cause loss of significant spectral information related to SOC. Both cases will result in low accuracy of PLSR models. An optimal number of selected spectral variables could balance the model parsimony and accuracy. Nevertheless, an optimal number of spectral variables is difficult to determine and varies from one dataset to another. Therefore, it is essential to search for the best pretreatment to obtain the most appropriate number of spectral variables. For the CARS algorithm, the PLSR model achieved the best accuracy when a total of 31 spectral variables were selected after the Log (1/R) spectra pretreatment. The number of selected spectral variables were smaller with other spectral pretreatments, which resulted in deteriorated accuracy of PLSR models due to the exclusion of significant variables related with SOC. For the RF algorithm, selected spectral variables with Log (1/R), MC, MSC, and SNV were of similar distribution. The PLSR model with Log (1/R) outperformed the models using the other three spectral pretreatments. This is possibly because of the redundant information that resulted from the larger number of selected spectral variables obtained with MC and MSC compared to that with Log (1/R). The number of spectral variables selected with SNV was smaller than that with Log (1/R), which may indicate the loss of important spectral information by SNV. Since the Log (1/R) has resulted in selecting an appropriate number of spectral variables, the best PLSR models could be established for SOC estimation using both spectral variable selection techniques (RF: R 2 = 0.83 and RPD = 2.05; CARS: R 2 = 0.81, and RPD = 2.11).

Conclusions
In this study, competitive adaptive reweighted sampling (CARS) and random frog (RF) algorithms in combination with five different spectral pretreatments were used to select spectral variables, which were used as input in partial least squares regression (PLSR) to estimate soil organic carbon (SOC) in the Jianghan Plain of China. According to our results, the following conclusions can be made: (i) Both spectral variable selection algorithms (e.g., CARS and RF) could improve the accuracy and parsimony of PLSR models; (ii) the accuracy of the PLSR model obtained after RF is better than that with CARS, although the parsimony of the latter model was worse (the best models with RF: R 2 = 0.83, RPD = 2.11, and the number of spectral variables = 83; the best models with CARS: R 2 = 0.81, RPD = 2.05, and the number of spectral variables = 31); (iii) the effects of spectral pretreatments vary among spectral variable selection algorithms. All FD, Log (1/R), MC, MSC, and SNV could improve the accuracy of PLSR models with RF, whereas only Log (1/R) and MC could slightly improve the accuracy of PLSR models with CARS; and (iv) appropriate number and distribution of spectral variables could be selected by Log (1/R) after both CARS and RF.
Although our study improved the accuracy and parsimony of PLSR models for SOC estimation using two spectral selection algorithms in anthropogenic soil, laboratory non-imaging sensors were adopted. Future researches should focus on airborne and satellite sensors to explore their potential for estimation and mapping topsoil SOC.