Combining Partial Least Squares and the Gradient-Boosting Method for Soil Property Retrieval Using Visible Near-Infrared Shortwave Infrared Spectra

: Soil spectroscopy has experienced a tremendous increase in soil property characterisation, and can be used not only in the laboratory but also from the space (imaging spectroscopy). Partial least squares (PLS) regression is one of the most common approaches for the calibration of soil properties using soil spectra. Besides functioning as a calibration method, PLS can also be used as a dimension reduction tool, which has scarcely been studied in soil spectroscopy. PLS components retained from high-dimensional spectral data can further be explored with the gradient-boosted decision tree (GBDT) method. Three soil sample categories were extracted from the Land Use/Land Cover Area Frame Survey (LUCAS) soil library according to the type of land cover (woodland, grassland, and cropland). First, PLS regression and GBDT were separately applied to build the spectroscopic models for soil organic carbon (OC), total nitrogen content (N), and clay for each soil category. Then, PLS-derived components were used as input variables for the GBDT model. The results demonstrate that the combined PLS-GBDT approach has better performance than PLS or GBDT alone. The relative important variables for soil property estimation revealed by the proposed method demonstrated that the PLS method is a useful dimension reduction tool for soil spectra to retain target-related information. VIP score derived from coefficients to assess the importance of input variables. It calculates the contribution of independent variables to the contribution of the dependent variable. For the gradient-boosting method, the importance of input variable can be calculated based on metric of “split” or “gain”. “Split” is the number of times a variable is used in a model and “gain” is the total gain of splits that use the variable. We use split as the descriptor of relative variable importance in this study. The more a variable is used to make key decisions with decision trees, the higher its relative importance.


Introduction
Monitoring the status of soil is very important for tackling many challenges including food security, climate change, land degradation, and biodiversity [1]. Traditional laboratory technologies to analyse soil are often time consuming and expensive, and these soil analyses are usually limited to a few samples and lack information on the spatial variability of soil [2]. Soil spectroscopy, as a fast, cost-effective, and environmental-friendly analytical technique, has successfully been utilised to retrieve soil properties and has experienced a tremendous increase in the past years. It has been shown that soil spectra across the Visible Near-Infrared Shortwave Infrared (VIS-NIR-SWIR; 400-2500 nm) spectral region are characterised by significant spectral signals [3][4][5][6], which makes it possible for quantitative analysis of soil properties. Furthermore, the wide spread use of visible and infrared spectroscopy can resolve the trade-off between the growing need of large scale soil information and its high cost [7]. Using spectral measurements and corresponding soil properties measured by soil analyses, soil spectroscopy can be adopted to quantitatively estimate many soil properties, such as organic matter, heavy metals, clay content, exchangeable potassium, and electrical conductivity [8][9][10].
The multivariate analysis technique is vital for quantitative analysis of soils. Partial least squares (PLS) regression is frequently used for spectroscopic data and demonstrates good capability for

Partial Least Squares Algorithm
PLS regression has proven to be a very successful method for multivariate data analysis. It is a standard tool in chemometrics and has received a great amount of attention in the field of soil spectroscopy. It is similar to principal component regression (PCR), as both can overcome the problems of high dimensionality and multicollinearity. In its classical form, the PLS method is based on the nonlinear iterative partial least squares (NIPALS) algorithm. To calibrate a PLS regression model for each soil property, the optimal number of latent variables was identified by performing a 10-fold cross validation, and the root-mean-square error (RMSE) in the cross-validation was used as decision criterion. Besides directly applying PLS regression to soil spectra, the transformed PLS components were also used as inputs for the following gradient-boosting model.

Gradient-Boosted Decision Trees (GBDT)
Gradient-boosting is a machine learning technique for regression and classification problems, which was developed by Jerome Friedman [35,36]. One of the widely used gradient-boosting The Vis-NIR soil spectra were measured using a FOSS XDS Rapid Content Analyser (FOSS NIRSystems Inc., Hilleroed, Denmark), operating in the 400-2500 nm wavelength range, with 0.5 nm spectral resolution. Pre-processing included removal of the data at wavelengths of 400-500 nm that showed instrumental artefacts, transformation of absorbance (A) spectra into reflectance (1/10 A ) spectra, continuum removal, Savitzky-Golay Filter with a window size of 51 and 2nd order polynomial, and resampling to contain 200 bands. 13 soil properties were analysed in a central laboratory. Three key soil properties, soil organic carbon (OC), total nitrogen content (N), and clay, were selected as our studied properties. A brief statistical summary of soil properties is listed in Table 1.

Partial Least Squares Algorithm
PLS regression has proven to be a very successful method for multivariate data analysis. It is a standard tool in chemometrics and has received a great amount of attention in the field of soil spectroscopy. It is similar to principal component regression (PCR), as both can overcome the problems of high dimensionality and multicollinearity. In its classical form, the PLS method is based on the nonlinear iterative partial least squares (NIPALS) algorithm. To calibrate a PLS regression model for each soil property, the optimal number of latent variables was identified by performing a 10-fold cross validation, and the root-mean-square error (RMSE) in the cross-validation was used as decision criterion. Besides directly applying PLS regression to soil spectra, the transformed PLS components were also used as inputs for the following gradient-boosting model.

Gradient-Boosted Decision Trees (GBDT)
Gradient-boosting is a machine learning technique for regression and classification problems, which was developed by Jerome Friedman [35,36]. One of the widely used gradient-boosting methods is GBDT, which is highly adaptable and able to model feature interactions and inherently perform feature selection [37]. These features have made GBDT one of the most widely used machine learning algorithms. Gradient-boosting develops an ensemble of tree-based models by training each of the trees in a sequential manner. Each iteration fits a decision to the residuals left by the previous one and then prediction is accomplished by combining the trees. It can produce robust and interpretable procedures for both regression and classification. Mathematically, the model can be viewed as: where k is the number of trees, f is a function in the functional space F, and F is the set of all possible regression trees.
There are several open-source projects that have implemented GBDT, like scikit-learn, XGBoost, and LightGBM [30,38,39]. LightGBM [40] is used in this study, and it is developed by Microsoft. It takes advantage of histogram-based algorithms to accelerate training process and reduce memory consumption by aggregating continuous features into discrete bins [41]. Most decision tree learning algorithms grow trees by level-wise or depth-wise approach, as shown in Figure 2; LightGBM grows trees by leaf-wise or best-first approach. It will choose the leaf with max delta loss to grow. When growing the same number of leafs, leaf-wise algorithm can reduce more loss than level-wise algorithm. LightGBM also supports parallel and GPU learning, and it is capable of handling large-scale data. and LightGBM [30,38,39]. LightGBM [40] is used in this study, and it is developed by Microsoft. It takes advantage of histogram-based algorithms to accelerate training process and reduce memory consumption by aggregating continuous features into discrete bins [41]. Most decision tree learning algorithms grow trees by level-wise or depth-wise approach, as shown in Figure 2; LightGBM grows trees by leaf-wise or best-first approach. It will choose the leaf with max delta loss to grow. When growing the same number of leafs, leaf-wise algorithm can reduce more loss than level-wise algorithm. LightGBM also supports parallel and GPU learning, and it is capable of handling largescale data. Soil spectra quantitatively correlates with soil properties. By fitting a regression model, it is supposed to achieve a good predictive accuracy for the estimation of soil property. There are many parameters that need to be tuned in GBDT, like learning rate or shrinkage, max depth, number of trees, etc. Reducing the learning rate parameter helps prevent overfitting and has a smoothing effect but increases the learning time [31]. The learning rate was set to 0.05. Parameters of max depth and number of trees can also determine whether the model is over fitted or not, and these two parameters were explored using a grid search strategy.

Calculation of Relative Variable Importance
PLS regression and the gradient-boosting method both can estimate relative contribution of each input variable or feature. The resultant variable importance measure is useful for understanding the relevance of contributing wavelengths. Ranking based on relative contribution values can help to identify the reflectance bands that are most important for developing soil spectroscopic models. In general, the top few bands contribute most for the model development. For PLS algorithm, the calculation of important input variables are based on weighted sums of the absolute PLS-regression coefficients. A large loading also indicates the importance of a variable. Here, we use the VIP score derived from coefficients to assess the importance of input variables. It calculates the contribution of independent variables to the contribution of the dependent variable. For the gradient-boosting method, the importance of input variable can be calculated based on metric of "split" or "gain". "Split" is the number of times a variable is used in a model and "gain" is the total gain of splits that use the variable. We use split as the descriptor of relative variable importance in this study. The more a variable is used to make key decisions with decision trees, the higher its relative importance.

Assessment
For each soil property, the soil spectral quantitative model was developed on a random sample of two-thirds of the selected soil samples using PLS regression or the gradient-boosting regression method. The calibrations were tested by predicting the soil properties on validation dataset composed of the remaining one-third samples for each soil category. The model accuracies were Soil spectra quantitatively correlates with soil properties. By fitting a regression model, it is supposed to achieve a good predictive accuracy for the estimation of soil property. There are many parameters that need to be tuned in GBDT, like learning rate or shrinkage, max depth, number of trees, etc. Reducing the learning rate parameter helps prevent overfitting and has a smoothing effect but increases the learning time [31]. The learning rate was set to 0.05. Parameters of max depth and number of trees can also determine whether the model is over fitted or not, and these two parameters were explored using a grid search strategy.

Calculation of Relative Variable Importance
PLS regression and the gradient-boosting method both can estimate relative contribution of each input variable or feature. The resultant variable importance measure is useful for understanding the relevance of contributing wavelengths. Ranking based on relative contribution values can help to identify the reflectance bands that are most important for developing soil spectroscopic models. In general, the top few bands contribute most for the model development. For PLS algorithm, the calculation of important input variables are based on weighted sums of the absolute PLS-regression coefficients. A large loading also indicates the importance of a variable. Here, we use the VIP score derived from coefficients to assess the importance of input variables. It calculates the contribution of independent variables to the contribution of the dependent variable. For the gradient-boosting method, the importance of input variable can be calculated based on metric of "split" or "gain". "Split" is the number of times a variable is used in a model and "gain" is the total gain of splits that use the variable. We use split as the descriptor of relative variable importance in this study. The more a variable is used to make key decisions with decision trees, the higher its relative importance.

Assessment
For each soil property, the soil spectral quantitative model was developed on a random sample of two-thirds of the selected soil samples using PLS regression or the gradient-boosting regression method. The calibrations were tested by predicting the soil properties on validation dataset composed of the remaining one-third samples for each soil category. The model accuracies were evaluated on estimated and measured soil OC, N, and clay values using coefficient of determination (R 2 ), RMSE, and the ratio of performance to deviation (RPD) [42].
Remote Sens. 2017, 9, 1299 6 of 18 where n is the number of validation samples, y is the measured value, y is the mean of the measured value, andŷ is the estimated value.

Overview of the Spectral Measurement of Soil Samples
The mean soil reflectance spectra and standard deviations for soil samples from woodland, cropland, and grassland were plotted in Figure 3. The mean spectra of three soil categories have a similar curve shape whose reflectance values increase with increasing wavelength in the range of 500-1300 nm. Absorption features can be identified near 1400 and 1900 nm, which are assigned to soil hygroscopic water in clay minerals [43]. The mean soil CR spectra and standard deviations for samples from woodland, cropland, and grassland are also shown. CR spectra can be used to isolate and identify characteristic absorptions of minerals, organic compounds, and water in soils [3]. The main spectral difference is that the mean reflectance spectrum for cropland soils demonstrates a higher albedo than spectra for woodland and cropland soils, as cropland soils have a lower mean value of OC content (17.1 g/kg) than woodland soils (37.3 g/kg) and grassland soils (30.2 g/kg). From the CR spectra, it can be seen that the absorption features are stronger for cropland soils than the other two soil categories, and woodland soils have the weakest absorption features, which can also be explained by the variation of OC contents. Soil samples with high organic matter content tend to show weak absorption features [22]. Besides, cropland soils have the highest mean value of clay content.

∑
(2) where n is the number of validation samples, is the measured value, is the mean of the measured value, and is the estimated value.

Overview of the Spectral Measurement of Soil Samples
The mean soil reflectance spectra and standard deviations for soil samples from woodland, cropland, and grassland were plotted in Figure 3. The mean spectra of three soil categories have a similar curve shape whose reflectance values increase with increasing wavelength in the range of 500-1300 nm. Absorption features can be identified near 1400 and 1900 nm, which are assigned to soil hygroscopic water in clay minerals [43]. The mean soil CR spectra and standard deviations for samples from woodland, cropland, and grassland are also shown. CR spectra can be used to isolate and identify characteristic absorptions of minerals, organic compounds, and water in soils [3]. The main spectral difference is that the mean reflectance spectrum for cropland soils demonstrates a higher albedo than spectra for woodland and cropland soils, as cropland soils have a lower mean value of OC content (17.1 g/kg) than woodland soils (37.3 g/kg) and grassland soils (30.2 g/kg). From the CR spectra, it can be seen that the absorption features are stronger for cropland soils than the other two soil categories, and woodland soils have the weakest absorption features, which can also be explained by the variation of OC contents. Soil samples with high organic matter content tend to show weak absorption features [22]. Besides, cropland soils have the highest mean value of clay content.

Results of PLS Regression for the Estimation of Soil Properties
To make a comparison with the following results obtained from PLS-GBDT, soil spectroscopic models for OC, N, and clay were developed using PLS regression with the same dataset ( Figure 4). For each model, the PLS component number was optimised and kept the same as retained by PLS-GBDT ( Table 2). The accuracies were assessed by R 2 , RMSE, and RPD. Spectroscopic models developed for OC estimation achieved R 2 values ranging from 0.537 to 0.569 and RPD values from 1.51 to 1.57. For N, the highest accuracy (R 2 = 0.652, RMSE = 0.78 g/kg, RPD = 1.66) was obtained Remote Sens. 2017, 9, 1299 7 of 18 from woodland soils. Models developed for clay estimation achieved comparable good results, and R 2 values vary from 0.656 to 0.732. From RPD values, it can be seen that PLS regression can develop fair models for soil spectroscopic analysis that may be used for assessment and correlation.

Results of PLS Regression for the Estimation of Soil Properties
To make a comparison with the following results obtained from PLS-GBDT, soil spectroscopic models for OC, N, and clay were developed using PLS regression with the same dataset ( Figure 4). For each model, the PLS component number was optimised and kept the same as retained by PLS-GBDT ( Table 2). The accuracies were assessed by R 2 , RMSE, and RPD. Spectroscopic models developed for OC estimation achieved R 2 values ranging from 0.537 to 0.569 and RPD values from 1.51 to 1.57. For N, the highest accuracy (R 2 = 0.652, RMSE = 0.78 g/kg, RPD = 1.66) was obtained from woodland soils. Models developed for clay estimation achieved comparable good results, and R 2 values vary from 0.656 to 0.732. From RPD values, it can be seen that PLS regression can develop fair models for soil spectroscopic analysis that may be used for assessment and correlation.    Variable selection can be done with PLS. We use VIP scores to rank the relative variable importance. The top 60% variables were kept and further modelled with PLS regression. The results for all three soil categories were shown in Figure 5. After variable selection, the accuracy for clay estimation from woodland soils improved with retained variables (R 2 = 0.715, RMSE = 5.6 g/kg, RPD = 1.86) compared with using full spectrum (R 2 = 0.674, RMSE = 5.99 g/kg, RPD = 1.74). Variable selection can also increase the OC estimation accuracy for woodland soils. However, the estimation accuracies for clay from cropland soils and N from woodland soils decreased after variable selection. The R 2 values declined from 0.732 to 0.714 for clay (cropland soils) and 0.652 to 0.636 for N (woodland soils). Soil spectra is complex, especially for large-scale soil spectral data. Soil properties associated with spectrally active constituents cannot be expected to be globally stable [22]. Thus, directly dropping some bands via variable selection may result in a loss of information that is important for some soil samples. Variable selection can be done with PLS. We use VIP scores to rank the relative variable importance. The top 60% variables were kept and further modelled with PLS regression. The results for all three soil categories were shown in Figure 5. After variable selection, the accuracy for clay estimation from woodland soils improved with retained variables (R 2 = 0.715, RMSE = 5.6 g/kg, RPD = 1.86) compared with using full spectrum (R 2 = 0.674, RMSE = 5.99 g/kg, RPD = 1.74). Variable selection can also increase the OC estimation accuracy for woodland soils. However, the estimation accuracies for clay from cropland soils and N from woodland soils decreased after variable selection. The R 2 values declined from 0.732 to 0.714 for clay (cropland soils) and 0.652 to 0.636 for N (woodland soils). Soil spectra is complex, especially for large-scale soil spectral data. Soil properties associated with spectrally active constituents cannot be expected to be globally stable [22]. Thus, directly dropping some bands via variable selection may result in a loss of information that is important for some soil samples.

Results of PLS-GBDT for the Estimation of Soil Properties
In this study, we propose to transfer soil reflectance spectra data into PLS components so as to reduce the dimensionality and also decrease the computational complexity. Then, for each category (woodland, cropland, and grassland), soil properties of OC, N, and clay were modelled using the GBDT method while the input variables were PLS components instead of reflectance spectra. A grid Remote Sens. 2017, 9, 1299 9 of 18 search method was adopted to tune the optimised PLS components for the first step and also the number of boosted trees and the maximum tree depth for GBDT (Table 2).
For OC, the model built using cropland soil samples achieved the best result (R 2 = 0.679, RMSE = 6.0 g/kg, RPD = 1.84) compared with soil samples from woodland (R 2 = 0.658, RMSE = 13.81 g/kg, RPD = 1.76) and grassland (R 2 = 0.671, RMSE = 10.92 g/kg, RPD = 1.76), which is the same case for the other two soil properties. The spectroscopic model developed from cropland soils has a RPD value of 1.94 for N and 2.34 for clay, and both are higher than models developed for woodland soils and grassland soils. This might be due to the complexity of the soil sampling matrix and soil sampling density. From Figure 1, it can be seen that cropland soils have the largest proportion of samples because of their ease of access and thus distribute more homogenously compared with woodland soils and grassland soils. The accuracy of clay obtained from the developed PLS-GBDT model has the highest value compared with the other two properties, R 2 values ranging from 0.736 to 0.812 and RPD from 1.94 to 2.34.
Compared with Figures 4 and 6, it can be clearly seen that the results achieved by PLS regression with or without variable selection are worse than by PLS-GBDT. For woodland soils, the R 2 value for OC reduced from 0.679 to 0.537 and the RPD value from 1.84 to 1.53, the R 2 value for N dropped from 0.687 to 0.55, the RPD value from 1.94 to 1.61, and the estimation of clay also has the same trend. Therefore, the model developed by non-linear regression method such as PLS-GBDT is suitable for quantitative retrieval of soil properties as reported by [3,44].
In this study, we propose to transfer soil reflectance spectra data into PLS components so as to reduce the dimensionality and also decrease the computational complexity. Then, for each category (woodland, cropland, and grassland), soil properties of OC, N, and clay were modelled using the GBDT method while the input variables were PLS components instead of reflectance spectra. A grid search method was adopted to tune the optimised PLS components for the first step and also the number of boosted trees and the maximum tree depth for GBDT (Table 2).
For OC, the model built using cropland soil samples achieved the best result (R 2 = 0.679, RMSE = 6.0 g/kg, RPD = 1.84) compared with soil samples from woodland (R 2 = 0.658, RMSE = 13.81 g/kg, RPD = 1.76) and grassland (R 2 = 0.671, RMSE = 10.92 g/kg, RPD = 1.76), which is the same case for the other two soil properties. The spectroscopic model developed from cropland soils has a RPD value of 1.94 for N and 2.34 for clay, and both are higher than models developed for woodland soils and grassland soils. This might be due to the complexity of the soil sampling matrix and soil sampling density. From Figure 1, it can be seen that cropland soils have the largest proportion of samples because of their ease of access and thus distribute more homogenously compared with woodland soils and grassland soils. The accuracy of clay obtained from the developed PLS-GBDT model has the highest value compared with the other two properties, R 2 values ranging from 0.736 to 0.812 and RPD from 1.94 to 2.34.
Compared with Figures 4 and 6, it can be clearly seen that the results achieved by PLS regression with or without variable selection are worse than by PLS-GBDT. For woodland soils, the R 2 value for OC reduced from 0.679 to 0.537 and the RPD value from 1.84 to 1.53, the R 2 value for N dropped from 0.687 to 0.55, the RPD value from 1.94 to 1.61, and the estimation of clay also has the same trend. Therefore, the model developed by non-linear regression method such as PLS-GBDT is suitable for quantitative retrieval of soil properties as reported by [3,44].  To further evaluate the performance of PLS-GBDT method, we also directly applied GBDT to soil reflectance spectra. We take samples from woodland soils as an example, and use the mean square error (MSE) as the evaluation metric. From Figure 7, it can be seen that GBDT model did not perform well, and it is not easy for it to be convergent with the increase of epochs in the training step, as the model tends to be complex when the data dimensionality is too high. PLS-GBDT models achieved much lower MSE values compared with GBDT models, both in the training and validation steps.
Remote Sens. 2017, 9,1299 10 of 18 To further evaluate the performance of PLS-GBDT method, we also directly applied GBDT to soil reflectance spectra. We take samples from woodland soils as an example, and use the mean square error (MSE) as the evaluation metric. From Figure 7, it can be seen that GBDT model did not perform well, and it is not easy for it to be convergent with the increase of epochs in the training step, as the model tends to be complex when the data dimensionality is too high. PLS-GBDT models achieved much lower MSE values compared with GBDT models, both in the training and validation steps.

Relative Important Variables Derived from PLS Regression and the Gradient-Boosting Method
A benefit of PLS regression and GBDT is that they can provide the estimation of variable importance from the trained calibration model. As it is time consuming to tune hyperparameters for GBDT models with very high dimensional data, soil spectra were resampled to 200 bands. The top 13 relative important variables derived from PLS regression models can be seen from Figure 8. For OC, the most important bands for these three soil categories are at 1920, 2170, and 2050 nm. The top ranked bands for N are similar to OC (2160, 1940, and 2000 nm). For clay, the derived important variables are at 2070, 1950, and 2230 nm. In previous study [45], the bands near 800, 1000, 1400, and 1900-2450 nm were confirmed to be important for OC estimation, and the bands around 1100, 1600, 1700 to 1800, 2000, and 2200 to 2400 nm were also identified as key bands for OC and N estimation [46]. The results are basically in agreement with previous research.
For all of these three soil properties, the top ranked variables derived from GBDT model were basically at the beginning and the end of the spectrum (Figure 9). It can be seen that the GBDT method failed to select meaningful bands for quantitative estimation of OC, N, and clay when directly using the full spectrum as an input variable, which also explained why the accuracy of the GBDT model is worse than the results obtained from PLS and PLS-GBDT models. Conversely, relative important variables derived from PLS regression are more reasonable.

Relative Important Variables Derived from PLS Regression and the Gradient-Boosting Method
A benefit of PLS regression and GBDT is that they can provide the estimation of variable importance from the trained calibration model. As it is time consuming to tune hyperparameters for GBDT models with very high dimensional data, soil spectra were resampled to 200 bands. The top 13 relative important variables derived from PLS regression models can be seen from Figure 8. For OC, the most important bands for these three soil categories are at 1920, 2170, and 2050 nm. The top ranked bands for N are similar to OC (2160, 1940, and 2000 nm). For clay, the derived important variables are at 2070, 1950, and 2230 nm. In previous study [45], the bands near 800, 1000, 1400, and 1900-2450 nm were confirmed to be important for OC estimation, and the bands around 1100, 1600, 1700 to 1800, 2000, and 2200 to 2400 nm were also identified as key bands for OC and N estimation [46]. The results are basically in agreement with previous research.  For all of these three soil properties, the top ranked variables derived from GBDT model were basically at the beginning and the end of the spectrum (Figure 9). It can be seen that the GBDT method failed to select meaningful bands for quantitative estimation of OC, N, and clay when directly using the full spectrum as an input variable, which also explained why the accuracy of the GBDT model is worse than the results obtained from PLS and PLS-GBDT models. Conversely, relative important variables derived from PLS regression are more reasonable. Although the GBDT method failed to derive the relative important variables when using full spectrum, it does not mean that this method is not suitable for soil spectroscopic analysis. The obtained relative important variables were demonstrated ( Figure 10) when the model was combined with retained PLS components. For PLS-GBDT, the first PLS component is supposed to be the most important variable for the estimation of corresponding soil properties, as PLS retains target-related information. The results demonstrate that the most important variables are the first PLS component for OC and N, while the second PLS component is ranked first for clay. In general, the top-ranked PLS components are also important to the gradient-boosting model, as revealed by Figure 10. This also means that PLS performs well on the extraction of target-related information. Although the GBDT method failed to derive the relative important variables when using full spectrum, it does not mean that this method is not suitable for soil spectroscopic analysis. The obtained relative important variables were demonstrated ( Figure 10) when the model was combined with retained PLS components. For PLS-GBDT, the first PLS component is supposed to be the most important variable for the estimation of corresponding soil properties, as PLS retains target-related information. The results demonstrate that the most important variables are the first PLS component for OC and N, while the second PLS component is ranked first for clay. In general, the top-ranked PLS components are also important to the gradient-boosting model, as revealed by Figure 10. This also means that PLS performs well on the extraction of target-related information.

Dimension Reduction for High-Dimensional Soil Spectra
High-dimensional data like soil spectra often contain redundant information and will increase computation complexity, which is known as the curse of dimensionality or Hughes phenomenon [47][48][49]. Variable selection can reduce the complexity and improve the robustness of the model. By selecting the most informative spectral bands instead of using the full spectrum, the calibration model is supposed to be more accurate [50]. Variable selection can be based on physical background by identifying key wavelengths for the target property. It is also possible to evaluate it using the statistics of the resulting calibration model, like the VIP score derived from the PLS regression model in this study.
High-dimensional spectral data can be projected to a lower dimensional space without actually losing significant information using methods like principal component analysis (PCA). PCA reduces the dimensionality of the data to fewer components that describe a large proportion of the variance.

Dimension Reduction for High-Dimensional Soil Spectra
High-dimensional data like soil spectra often contain redundant information and will increase computation complexity, which is known as the curse of dimensionality or Hughes phenomenon [47][48][49]. Variable selection can reduce the complexity and improve the robustness of the model. By selecting the most informative spectral bands instead of using the full spectrum, the calibration model is supposed to be more accurate [50]. Variable selection can be based on physical background by identifying key wavelengths for the target property. It is also possible to evaluate it using the statistics of the resulting calibration model, like the VIP score derived from the PLS regression model in this study.
High-dimensional spectral data can be projected to a lower dimensional space without actually losing significant information using methods like principal component analysis (PCA). PCA reduces the dimensionality of the data to fewer components that describe a large proportion of the variance. The first principal component accounts for the largest variance, while subsequent components account for decreasingly smaller proportions [51]. Local linear embedding (LLE) is a nonlinear dimensionality reduction method, and it can identify the underlying structure of a manifold [52]. PCA and LLE have been exploited in a comparative way for soil spectral distance and similarity in projected space [53]. Autoencoder (AE) is an unsupervised learning algorithm and its performance on reducing the dimensionality of soil spectra has not been well studied. Several approaches were developed based on it, such as stacked autoencoder and sparse autoencoder [54,55]. AE trains a neural network by constraining the output values to be equal to the input values. The reconstruction error between the input and the output is used to adjust the weights of each layer. Ideally, features learned by AE can well represent the input data [56]. The difference between PLS and the above mentioned DR methods is that PLS is able to retain target-related information and can be viewed as a supervised DR method. It has the potential to explore the intrinsic structure of spectra, and it not only reduces data redundancy, it also improves estimation accuracy. Besides, it is worth making a comparison between PLS and other mentioned DR methods (PCA, LLE, AE, etc.) for soil spectral analysis in the following studies.

GBDT for Quantitative Soil Spectroscopic Modelling
Modelling soil properties using large and diverse soil spectral libraries is still a challenging task. PLS regression, as a common approach in soil spectroscopy, has limitations in handling large-scale soil spectral data. With variable selection using VIP scores, the performance with regard to improving the estimation accuracies is still not satisfying in this study. GBDT has been used to win machine learning competitions on Kaggle and has gained a lot of attention. In this study, we proposed to take advantage of GBDT for the estimation of soil properties by using PLS components as the input variables instead of raw reflectance spectra. The result demonstrated that the combined PLS-GBDT approach performs better than PLS or GBDT alone. It also confirmed the experiments in [57], in which the boosted decision trees method performed exceptionally well when dimensionality was low. The model is prone to being complex when the dimension is too high, and it tends to need more trees and a high degree of tree depth, which could be a serious problem in high dimensions [58]. Therefore, it is suggested to reduce the number of input features via dimension reduction or feature selection when facing high-dimensional data. There are several studies related to soil spectroscopic modelling using large-scale soil spectral libraries. Local or MBL approaches are reputed to have better performance on large-scale soil data. PLS, SVM, LWR, and SBL were comparatively studied on a regional soil spectral library in Brazil and a global soil spectral library [22]. SBL algorithm achieved the best performance for OC estimation in the regional (R 2 = 0.59) and the global data (R 2 = 0.68). MBL approaches are very flexible and can be easily integrated with PLS-GBDT. Besides, additional soil information like texture (sand, clay, and silt) can contribute to soil spectroscopic model. By only using spectral bands as the input variables in [59], SVM obtained a similar result for OC estimation of cropland soils as achieved by PLS-GBDT. However, the R 2 value improved from 0.67 to 0.71 with variable selection and clay content as auxiliary variable. A higher accuracy of the OC estimation model was also obtained by [60] when considering sand content. Therefore, additional soil information is very important to calibration models for large-scale soil spectral data.

Conclusions
Soil spectra measured in the laboratory typical have several hundred or even thousand bands, which would be a problem for the gradient-boosting model when directly using such high-dimensional data as inputs. This study presents a PLS-GBDT method to retrieve soil properties from reflectance spectra. The LUCAS soil spectral library was used to evaluate its performance. For three soil categories (woodland, grassland, and cropland), R 2 achieved values of 0.658-0.679 for OC, 0.687-0.719 for N, and 0.739-0.812 for clay. Both PLS and GBDT can estimate the relative contributions of input variables. However, GBDT failed in this task when directly using high-dimensional soil spectra as input data. The GBDT method is a well-known machine learning algorithm that uses the decision tree as the weak learner, and it has successfully been applied in numerous areas. By using PLS components as input variables, which are retained with target variable-related information, GBDT is able to perform well on soil quantitative analysis. Although the PLS-GBDT method is directly used to develop a global model to fit the whole soil spectral library in this study, it is possible to combine it with MBL if it functions as a basic or local model.