Estimation of Soil Nutrient Content Using Hyperspectral Data

Soil nutrients play a vital role in plant growth and thus the rapid acquisition of soil nutrient content is of great significance for agricultural sustainable development. Hyperspectral remotesensing techniques allow for the quick monitoring of soil nutrients. However, at present, obtaining accurate estimates proves to be difficult due to the weak spectral features of soil nutrients and the low accuracy of soil nutrient estimation models. This study proposed a new method to improve soil nutrient estimation. Firstly, for obtaining characteristic variables, we employed partial least squares regression (PLSR) fit degree to select an optimal screening algorithm from three algorithms (Pearson correlation coefficient, PCC; least absolute shrinkage and selection operator, LASSO; and gradient boosting decision tree, GBDT). Secondly, linear (multi-linear regression, MLR; ridge regression, RR) and nonlinear (support vector machine, SVM; and back propagation neural network with genetic algorithm optimization, GABP) algorithms with 10-fold cross-validation were implemented to determine the most accurate model for estimating soil total nitrogen (TN), total phosphorus (TP), and total potassium (TK) contents. Finally, the new method was used to map the soil TK content at a regional scale using the soil component spectral variables retrieved by the fully constrained least squares (FCLS) method based on an image from the HuanJing-1A Hyperspectral Imager (HJ-1A HSI) of the Conghua District of Guangzhou, China. The results identified the GBDT-GABP was observed as the most accurate estimation method of soil TN (Rcv of 0.69, the root mean square error of cross-validation (RMSECV) of 0.35 g kg−1 and ratio of performance to interquartile range (RPIQ) of 2.03) and TP (Rcv of 0.73, RMSECV of 0.30 g kg−1 and RPIQ = 2.10), and the LASSO-GABP proved to be optimal for soil TK estimations (Rcv of 0.82, RMSECV of 3.39 g kg−1 and RPIQ = 3.57). Additionally, the highly accurate LASSO-GABP-estimated soil TK (R2 = 0.79) reveals the feasibility of the LASSO-GABP method to retrieve soil TK content at the regional scale.


Introduction
The rapid and efficient monitoring of soil nutrients has become an important prerequisite for agricultural production management and ensuring the healthy development of crops. However, current soil nutrient estimations are often obtained using field sampling and laboratory analysis, which is time-consuming and costly. The monitoring of soil nutrients via hyperspectral remote-sensing techniques is rapid and efficient, and numerous related studies have been performed within the past 30 years [1][2][3][4]. 600 km, respectively. The province belongs to the East Asian monsoon region, with middle subtropical, south subtropical, and north tropical zone climate types from the north to south. Mean annual temperature and precipitation of the area are 21.8 • C and 1789.3 mm, respectively. Guangdong is an important grain production region, with a crop planting area of 4.28 × 10 4 km 2 in 2019 and total grain yield of 1.19 × 10 10 kg.
Agriculture 2021, 11, x FOR PEER REVIEW 3 of 16 West and North-South spans of Guangdong province are approximately 800 and 600 km, respectively. The province belongs to the East Asian monsoon region, with middle subtropical, south subtropical, and north tropical zone climate types from the north to south. Mean annual temperature and precipitation of the area are 21.8 °C and 1789.3 mm, respectively. Guangdong is an important grain production region, with a crop planting area of 4.28 × 10 4 km 2 in 2019 and total grain yield of 1.19 × 10 10 kg.

Soil Sampling and Chemical Analysis
A total of 75 soil samples were gathered for constructing hyperspectral estimate models of soil nutrients contents based on a 50 × 50 km sampling grid within Guangdong province and field actual conditions to ensure uniform distribution of the soil samples [21,22]. Surface soil samples (0-20 cm) were collected at five sampling locations at each site. To remove stones and other large debris, the samples were air-dried and sieved through a 2 mm polyethylene sieve. After that, the samples were pulverized into fine powder. The soil nutrient content and soil spectral reflectance were then determined by dividing each sample into two parts. Soil TN was measured using the semi-micro Kjeldahl method described by Walkley and Black [23]. Soil TP and TK were determined via an ultraviolet spectrophotometer (UV-2600, Shimadzu CO, LTD., Kyoto, Japan) and a flame photometer (FP640, INESA Analytical Instrument CO, LTD., Shanghai, China), respectively. The soil nutrient content statistics from the 75 soil samples are presented in Table  1. Note: soil total nitrogen, TN; total phosphorus, TP; and total potassium, TK; unit: g kg −1 . Q1, first quartile; Q3, third quartile; SD, standard deviation; CV, coefficient of variation (%).
Moreover, the mapping of Guangdong Province needs multiple HJ-1A images with 100 m spatial resolution. In addition, it is very difficult to obtain multiple high-quality

Soil Sampling and Chemical Analysis
A total of 75 soil samples were gathered for constructing hyperspectral estimate models of soil nutrients contents based on a 50 × 50 km sampling grid within Guangdong province and field actual conditions to ensure uniform distribution of the soil samples [21,22]. Surface soil samples (0-20 cm) were collected at five sampling locations at each site. To remove stones and other large debris, the samples were air-dried and sieved through a 2 mm polyethylene sieve. After that, the samples were pulverized into fine powder. The soil nutrient content and soil spectral reflectance were then determined by dividing each sample into two parts. Soil TN was measured using the semi-micro Kjeldahl method described by Walkley and Black [23]. Soil TP and TK were determined via an ultraviolet spectrophotometer (UV-2600, Shimadzu CO, LTD., Kyoto, Japan) and a flame photometer (FP640, INESA Analytical Instrument CO, LTD., Shanghai, China), respectively. The soil nutrient content statistics from the 75 soil samples are presented in Table 1. Moreover, the mapping of Guangdong Province needs multiple HJ-1A images with 100 m spatial resolution. In addition, it is very difficult to obtain multiple high-quality satellite images of the whole province on the same day. Therefore, in this study, Conghua district was selected for conducting the soil nutrient mapping experiment. A total of 33 soil samples were collected in Conghua District (Figure 1b) to verify the feasibility of mapping soil nutrient content at the regional scale. The acquisition time of the HJ-1A HSI image coincided with the collection of the samples, which are evenly distributed in the whole image. The soil sample collection principle and pretreatment are consistent with the Guangdong province samples.

Spectral Measurements and Pre-Processing of Soil Samples
Soil spectral measurements were performed on 75 soil samples collected across the province. An AvaField portable spectrometer (Avantes, Inc., Apeldoorn, Holland) was used to measure soil spectral reflectance, which has a spectral range and resolution of 340-2511 and 0.6 nm, respectively. The spectral measurements were carried out in a dark room to regulate the lighting environment and minimize the influence of stray light. The soil spectral reflectance values were measured using a 50 W halogen lamp with a 10 • field of view in vertical contact with the soil sample. Each sample was uniformly tiled on a black cloth and measured five times. The average spectrum was calculated and used in further processing. Prior to the collection of the reflectance readings, the spectrometer was calibrated every three samples with a white Spectralon. To decrease signal noise, we used Savitzky-Golay smoothing with a window size of 10. In addition, the smoothed spectral data (raw spectral, R) were processed with the first derivative (FD), second derivative (SD), and reciprocal logarithmic (RL) to eliminate or reduce the effect of background noise and account for signal intensity fluctuations induced by soil surface spectral scattering and absorption. The outcomes of the processing are shown in Figure 2. satellite images of the whole province on the same day. Therefore, in this study, Conghua district was selected for conducting the soil nutrient mapping experiment. A total of 33 soil samples were collected in Conghua District (Figure 1b) to verify the feasibility of mapping soil nutrient content at the regional scale. The acquisition time of the HJ-1A HSI image coincided with the collection of the samples, which are evenly distributed in the whole image. The soil sample collection principle and pretreatment are consistent with the Guangdong province samples.

Spectral Measurements and Pre-Processing of Soil Samples
Soil spectral measurements were performed on 75 soil samples collected across the province. An AvaField portable spectrometer (Avantes, Inc., Apeldoorn, Holland) was used to measure soil spectral reflectance, which has a spectral range and resolution of 340-2511 and 0.6 nm, respectively. The spectral measurements were carried out in a dark room to regulate the lighting environment and minimize the influence of stray light. The soil spectral reflectance values were measured using a 50 W halogen lamp with a 10° field of view in vertical contact with the soil sample. Each sample was uniformly tiled on a black cloth and measured five times. The average spectrum was calculated and used in further processing. Prior to the collection of the reflectance readings, the spectrometer was calibrated every three samples with a white Spectralon. To decrease signal noise, we used Savitzky-Golay smoothing with a window size of 10. In addition, the smoothed spectral data (raw spectral, R) were processed with the first derivative (FD), second derivative (SD), and reciprocal logarithmic (RL) to eliminate or reduce the effect of background noise and account for signal intensity fluctuations induced by soil surface spectral scattering and absorption. The outcomes of the processing are shown in Figure 2.

Image Acquisition and Pre-Processing
In order to extend the application of the established model at the regional scale, a HJ-1A image acquired on 30 October 2017 with a 100 m spatial resolution and 115 bands (459-956 nm) was used to map the soil nutrient contents. The image was subjected to radiometric correction, atmospheric correction, geometric precision correction, and stripe noise

Image Acquisition and Pre-Processing
In order to extend the application of the established model at the regional scale, a HJ-1A image acquired on 30 October 2017 with a 100 m spatial resolution and 115 bands (459-956 nm) was used to map the soil nutrient contents. The image was subjected to radiometric correction, atmospheric correction, geometric precision correction, and stripe noise reduction ( Figure 3) using ENVI 5.3 (Exelis Visual Information Solutions, Inc., Boulder, CO, USA). The image's spectral resolution was 5 nm, which was substantially coarser than the AvaField portable spectrometer's measured spectral interval of 0.6 nm. ENVI's spectral resampling technique was utilized to spectrally resample the measured soil spectral data gathered with the AvaField portable spectrometer in order to match the spectral resolution of the HJ-1A HSI data.
Agriculture 2021, 11, x FOR PEER REVIEW 5 of 16 reduction ( Figure 3) using ENVI 5.3 (Exelis Visual Information Solutions, Inc., Boulder, CO, USA). The image's spectral resolution was 5 nm, which was substantially coarser than the AvaField portable spectrometer's measured spectral interval of 0.6 nm. ENVI's spectral resampling technique was utilized to spectrally resample the measured soil spectral data gathered with the AvaField portable spectrometer in order to match the spectral resolution of the HJ-1A HSI data.

Methods
This section is organized into four parts. In Section 2.3.1, we describe how the optimal algorithm of screening the soil nutrient characteristic variables can be determined using the PLSR fit degrees. The second section explains how the optimal prediction algorithm for soil nutrients can be screened from four algorithms by their prediction accuracy. In Section 2.3.3, we detail the mapping of soil nutrient base on HJ-1A image data using the above the optimal screening and predicting algorithms. Section 2.3.4 describes accuracy validation methods for the predicting models and mapping.

Determining the Optimal Screening Algorithm of the Characteristic Variables
One of the most important steps in the development of the optimal hyperspectral estimation method of the soil nutrient contents was the determination of the characteristic variables [8,24,25]. Additionally, the determination of the accurate screening algorithms is key for characteristic variables of the soil nutrient content. In order to determine the optimal screening algorithm of the characteristic variables, we compared traditional linear screening algorithm (PCC) and nonlinear screening algorithms (LASSO and GBDT) based on two evaluation steps. First, the characteristic variables were screened using PCC, GBDT and LASSO might be correlated with each other. That is, there are collinearities among the variables. Therefore, the variance inflation factor (VIF) of a stepwise regression was applied to eliminate the collinearity of the selected characteristic variables. The set of variables having a VIF lower than 10 [26] was retained. The three screening algorithms are described in detail as follows: LASSO: The least absolute shrinkage and selection operator, proposed by Tibshirani (1996), minimizes the sum of squares of residuals under the constraint that the sum of the absolute values of the regression coefficients (penalty coefficient) is less than a pre-defined constant. This produces regression coefficients (RC) strictly equal to 0 and removes low-

Methods
This section is organized into four parts. In Section 2.3.1, we describe how the optimal algorithm of screening the soil nutrient characteristic variables can be determined using the PLSR fit degrees. The second section explains how the optimal prediction algorithm for soil nutrients can be screened from four algorithms by their prediction accuracy. In Section 2.3.3, we detail the mapping of soil nutrient base on HJ-1A image data using the above the optimal screening and predicting algorithms. Section 2.3.4 describes accuracy validation methods for the predicting models and mapping.

Determining the Optimal Screening Algorithm of the Characteristic Variables
One of the most important steps in the development of the optimal hyperspectral estimation method of the soil nutrient contents was the determination of the characteristic variables [8,24,25]. Additionally, the determination of the accurate screening algorithms is key for characteristic variables of the soil nutrient content. In order to determine the optimal screening algorithm of the characteristic variables, we compared traditional linear screening algorithm (PCC) and nonlinear screening algorithms (LASSO and GBDT) based on two evaluation steps. First, the characteristic variables were screened using PCC, GBDT and LASSO might be correlated with each other. That is, there are collinearities among the variables. Therefore, the variance inflation factor (VIF) of a stepwise regression was applied to eliminate the collinearity of the selected characteristic variables. The set of variables having a VIF lower than 10 [26] was retained. The three screening algorithms are described in detail as follows: LASSO: The least absolute shrinkage and selection operator, proposed by Tibshirani (1996), minimizes the sum of squares of residuals under the constraint that the sum of the absolute values of the regression coefficients (penalty coefficient) is less than a pre-defined constant. This produces regression coefficients (RC) strictly equal to 0 and removes low-weight variables and can therefore effectively deal with complex high-dimensional data problems [27]. LASSO can be defined as follows: where y i represents the measured spectral data in the ith band; n is the spectral dimensionality; B j denotes the input weight in the jth spectral sample; x ij is the covariate vector of the ith measured spectral data and j spectral sample; and p is the spectral sample number. GBDT: The gradient boosting decision tree is a boosting algorithm that calculates the information gain during the branching of the decision tree to determine the spectral variable to be split and the corresponding split value. Once all decision trees are constructed, the feature importance (FI) is obtained by calculating the information gain of the decision tree feature and dividing by the total frequency of the feature in all trees of the GBDT strong learner [28]: where I(a, D) denotes the feature (spectral variable) information gain; a is the feature; D is the soil sample; and N a is the total frequency of feature a in all trees. PCC: The Pearson correlation coefficient is commonly employed to screen characteristic variables. Here, the PCC was implemented between the spectral variables and soil nutrient content to determine characteristic variables with the largest correlation coefficient (p ≤ 0.05 significance level). The Pearson correlation coefficient can be expressed as: where R ni is the spectral value of the ith spectral variable of the nth soil sample point; R i is the average spectral value of the ith spectral variable; y n is the soil nutrient content of the nth soil sample point; and y is the average value of the soil nutrient content. Once the characteristic variables were selected by each algorithm, PLSR fit degrees (R 2 ) [29][30][31] between the measured soil nutrient contents and characteristic variables were compared. The screening algorithm with the maximum fit degree was determined as the optimal.

Determining the Accurate Model for Estimating Soil Nutrients
In this study, the screened characteristic variables were used as independent variables and each of the soil nutrient (TN, TP and TK) contents were used as the dependent variable. Additionally, four different algorithms were applied to build the relationship models between characteristic variables and soil nutrients: MLR, RR, SVM, and GABP. The four algorithms are described in detail as follows: (1) Multi-Linear Regression Multi-linear regression is a type of regression analysis for multiple independent variables. The optimal combination of these independent variables is taken to estimate the dependent variables. This model can describe the influence of each variable on the soil properties and is widely used in soil property estimations [32][33][34]. We adopted MLR to estimate soil nutrient content using the following formula: where Z MLR is the dependent variable (soil nutrient content); x i (i = 1, 2, . . . , n) is the independent variable (spectral variables); a i (i = 1, 2, . . . , n) represents the regression fitting coefficient; and a 0 is the intercept.
(2) Ridge Regression Ridge regression (RR) is a least square estimation method that improves on its predecessors. In particular, it abandons the unbiasedness of the least square method, thus losing part of the information and reducing the accuracy and making the regression coefficient more realistic and reliable [35]. The existence of multiple collinear relations between independent variables magnifies the mean square error. This error is reduced by using RR estimation rather than the standard least square estimation [36,37]. The RR is expressed as: whereβ(k) is the ridge regression estimate of β; and k is the ridge parameter. When k = 0, the least square estimate of β is equal toβ(0).
(3) Support Vector Machine SVM, proposed by Cortes and Vapnik (1995), is a robust supervised learning model with a capacity for solving practical problems (e.g., nonlinearity and high dimensionality). SVM greatly simplifies the traditional regression process through efficient "transduction inference" from training samples to predictions [38]. The SVM model can be expressed as: where f(x) is the soil nutrient estimate; x is the characteristic variable; w i is the weight coefficient; b is the error term; ∅ i denotes a nonlinear transfer function; and ω and b are calculated by the following convex optimization problem with an e-insensitivity loss function [39]: where ||w|| 2 represents the flatness of the m-dimensional space; ε is a parameter that indicates the maximum allowed error between the measured and estimated values; ξ i and ξ * i are slack variables and C is the penalty factor. Equations (7) and (8) belong to the convex quadratic programming problem with inequality constraints. In order to obtain the Lagrangian multipliers, the equations are converted into a dual problem via the Lagrange multiplier method. The constrained original objective function (Equation (8)) is then transformed into the unconstrained Lagrangian objective function: where α i − α * i is the transformation of w. The SVM function is expressed as: where K (x i , x) = ∅(x i )∅ x j is the kernel function. The radial basis function was selected as the kernel function.
(4) Genetic Algorithm-Back Propagation Neural Network The GABP algorithm optimizes the structure and connection weight of the back propagation neural network using the parallel random search ability of the genetic algorithm, effectively avoiding a local optimal solution [40]. We adopted the population search method to optimize the weights and thresholds of the neural network ( Figure 4).

Estimating Regional-Scale Soil Nutrient Contents Using HJ-1A Hyperspectral Data
Once the optimal variable screening and predictor models were selected, the method was applied to mapping the contents of the soil nutrient using HJ-1A image with 115 bands (459-956 nm) and 5 nm spectral resolution, which will not provide the above characteristic variables with beyond 956 nm wavelength. Thus, the characteristic variables should be re-screened from the resampling measured soil spectral data with 5 nm spectral resolution using the above optimal screening algorithm and to develop the corresponding estimation models. Then, the model was applied to mapping the contents of the soil TK using the HJ-1A HSI image for the Conghua district at the regional scale.
Moreover, in order to apply the methods to Conghua district, the HJ-1A image was considered to contain pure pixels. However, the coarse image spatial resolution of 100 × 100 m generally prevents the existence of pure pixels, with mixed pixels (including crop and soil) typically dominating the study area. Thus, the fully constrained least squares (FCLS) method [41] was used to obtained pure pixels and spectral reflectance of soil (Figure 5).

Estimating Regional-Scale Soil Nutrient Contents Using HJ-1A Hyperspectral Data
Once the optimal variable screening and predictor models were selected, the method was applied to mapping the contents of the soil nutrient using HJ-1A image with 115 bands (459-956 nm) and 5 nm spectral resolution, which will not provide the above characteristic variables with beyond 956 nm wavelength. Thus, the characteristic variables should be re-screened from the resampling measured soil spectral data with 5 nm spectral resolution using the above optimal screening algorithm and to develop the corresponding estimation models. Then, the model was applied to mapping the contents of the soil TK using the HJ-1A HSI image for the Conghua district at the regional scale.
Moreover, in order to apply the methods to Conghua district, the HJ-1A image was considered to contain pure pixels. However, the coarse image spatial resolution of 100 × 100 m generally prevents the existence of pure pixels, with mixed pixels (including crop and soil) typically dominating the study area. Thus, the fully constrained least squares (FCLS) method [41] was used to obtained pure pixels and spectral reflectance of soil ( Figure 5).

Estimating Regional-Scale Soil Nutrient Contents Using HJ-1A Hyperspectral Data
Once the optimal variable screening and predictor models were selected, the method was applied to mapping the contents of the soil nutrient using HJ-1A image with 115 bands (459-956 nm) and 5 nm spectral resolution, which will not provide the above characteristic variables with beyond 956 nm wavelength. Thus, the characteristic variables should be re-screened from the resampling measured soil spectral data with 5 nm spectral resolution using the above optimal screening algorithm and to develop the corresponding estimation models. Then, the model was applied to mapping the contents of the soil TK using the HJ-1A HSI image for the Conghua district at the regional scale.
Moreover, in order to apply the methods to Conghua district, the HJ-1A image was considered to contain pure pixels. However, the coarse image spatial resolution of 100 × 100 m generally prevents the existence of pure pixels, with mixed pixels (including crop and soil) typically dominating the study area. Thus, the fully constrained least squares (FCLS) method [41] was used to obtained pure pixels and spectral reflectance of soil (Figure 5).

Accuracy Validation
The coefficient of determination (R 2 ), concordance correlation coefficient (CCC), ratio of performance to interquartile range (RPIQ), the root mean square error of calibration (RMSEC), and cross-validation (RMSECV) were used as statistical measures to assess the performance of estimation models. The RPIQ is defined as the ratio of IQ to RMSECV [42]. IQ is the interquartile range (IQ = Q3 − Q1) of the observed values. Q1 and Q3 denote the first and third quartile, respectively.

Optimal Algorithm for the Screening of the Characteristic Variables
In order to determine the characteristic variables, the three choosing algorithms (PCC, LASSO, and GBDT) were implemented on 6272 spectral data of the R, FD, SD, and RL ( Figure 2) and soil nutrient contents in the 75 sample points collected across the province. Figure 6 illustrates the correlation coefficients of the spectral variables. Stepwise regression with VIF analysis was further used to eliminate the collinearity among the spectral variables screened by the PCC algorithms (Table 2).

Accuracy Validation
The coefficient of determination (R 2 ), concordance correlation coefficient (CCC), ratio of performance to interquartile range (RPIQ), the root mean square error of calibration (RMSEC), and cross-validation (RMSECV) were used as statistical measures to assess the performance of estimation models. The RPIQ is defined as the ratio of IQ to RMSECV [42]. IQ is the interquartile range (IQ = Q3 − Q1) of the observed values. Q1 and Q3 denote the first and third quartile, respectively.

Optimal Algorithm for the Screening of the Characteristic Variables
In order to determine the characteristic variables, the three choosing algorithms (PCC, LASSO, and GBDT) were implemented on 6272 spectral data of the R, FD, SD, and RL ( Figure 2) and soil nutrient contents in the 75 sample points collected across the province. Figure 6 illustrates the correlation coefficients of the spectral variables. Stepwise regression with VIF analysis was further used to eliminate the collinearity among the spectral variables screened by the PCC algorithms (Table 2).  Considering the possible existence of a nonlinear relationship between the spectral variables and soil nutrient contents, we introduced the GBDT and LASSO algorithms for  Considering the possible existence of a nonlinear relationship between the spectral variables and soil nutrient contents, we introduced the GBDT and LASSO algorithms for the screening task. Numerous experiments were performed, identifying the prediction error of the GBDT algorithm to tend towards stability for screening criteria of soil TN, TP, and TK in the GBDT algorithm equal to FI > 0.015, FI > 0.015, and FI > 0.01, respectively. Figure 7 depicts the feature importance and regression coefficient of the screened spectral variables. For the LASSO algorithm we employed RC = 0 as the screening criteria.
Stepwise regression with VIF analysis was further employed to eliminate the collinearity among the spectral variables screened by the GBDT and LASSO algorithms. Table 3 reports the final results of the screening characteristic variables for the three soil nutrients (TN, TP, and TK).
the screening task. Numerous experiments were performed, identifying the prediction error of the GBDT algorithm to tend towards stability for screening criteria of soil TN, TP, and TK in the GBDT algorithm equal to FI > 0.015, FI > 0.015, and FI > 0.01, respectively. Figure 7 depicts the feature importance and regression coefficient of the screened spectral variables. For the LASSO algorithm we employed RC ≠ 0 as the screening criteria.
Stepwise regression with VIF analysis was further employed to eliminate the collinearity among the spectral variables screened by the GBDT and LASSO algorithms. Table 3 reports the final results of the screening characteristic variables for the three soil nutrients (TN, TP, and TK).  In order determine the most accurate screening algorithm, the PLSR approach was selected to construct the model between soil nutrients and the characteristic variables from the three algorithms based on 75 soil samples from the province. The PLSR relationship models are described as follows:  In order determine the most accurate screening algorithm, the PLSR approach was selected to construct the model between soil nutrients and the characteristic variables from the three algorithms based on 75 soil samples from the province. The PLSR relationship models are described as follows: The results identify the optimal algorithms of soil TN, TP, and TK as GBDT, GBDT, and LASSO, with R 2 values of 0.26, 0.37, and 0.47, respectively. Among three nutrients, the LASSO-PLSR showed the best estimation of soil TK.

Determining the Optimal Model for Soil Nutrient Content Estimations
The MLR, RR, SVM, and GABP models were adopted to determine the relationship between the characteristic variables and soil nutrients (Figure 8). The GABP model exhibited the highest predicative capability for the three soil nutrients, with scatter plots closer to the 1:1 line compared to MLR, RR, and SVM. Additionally, it offered the most accurate estimates in cross-validation with R 2 cv of 0.69, RMSECV of 0.35, and RPIQ = 2.03 for TN; R 2 cv of 0.73, RMSECV of 0.30 and RPIQ = 2.10 for TP, R 2 cv of 0.82, RMSECV of 3.39, and RPIQ = 3.57 for TK, respectively ( Table 4). The prediction effect of soil TK is obviously better than that of TN and TP, which may be due to potassium being a metal element, with a spectral response sensitivity that exceeds other non-metal elements (e.g., nitrogen and phosphorus). The results identify the optimal algorithms of soil TN, TP, and TK as GBDT, GBDT, and LASSO, with R 2 values of 0.26, 0.37, and 0.47, respectively. Among three nutrients, the LASSO-PLSR showed the best estimation of soil TK.

Determining the Optimal Model for Soil Nutrient Content Estimations
The MLR, RR, SVM, and GABP models were adopted to determine the relationship between the characteristic variables and soil nutrients (Figure 8). The GABP model exhibited the highest predicative capability for the three soil nutrients, with scatter plots closer to the 1:1 line compared to MLR, RR, and SVM. Additionally, it offered the most accurate estimates in cross-validation with R of 0.69, RMSECV of 0.35, and RPIQ = 2.03 for TN; R of 0.73, RMSECV of 0.30 and RPIQ = 2.10 for TP, R of 0.82, RMSECV of 3.39, and RPIQ = 3.57 for TK, respectively ( Table 4). The prediction effect of soil TK is obviously better than that of TN and TP, which may be due to potassium being a metal element, with a spectral response sensitivity that exceeds other non-metal elements (e.g., nitrogen and phosphorus).     Table 4 demonstrates the soil TK estimation accuracy exceeding that of TN and TP. Therefore, we applied the proposed method to map soil TK contents in Conghua District at the regional scale using HJ-1A imagery because the spectral wavelength of HJ-1A data ranged from 459 to 956 nm, which had different range and spectral bands of wavelengths from those of the spectral variables involved in the above estimation models. The model based on 75 sample points collected across the province could not be utilized for the HJ-1A images. We employed the LASSO-GABP method to re-screen the optimal spectral variables from the resample measured soil spectral data with 5 nm spectral resolution and to develop the corresponding estimation models. The screened spectral variables were determined as band462, band464, band466, band470, band477, band484, band574, and band652. The soil TK was estimated with reliable accuracy (R 2 of 0.82, RMSEC of 3.28 g kg −1 ; Figure 9).

Mapping Soil Nutrient Contents Using the Proposed Method
Agriculture 2021, 11, x FOR PEER REVIEW 12 of 16 at the regional scale using HJ-1A imagery because the spectral wavelength of HJ-1A data ranged from 459 to 956 nm, which had different range and spectral bands of wavelengths from those of the spectral variables involved in the above estimation models. The model based on 75 sample points collected across the province could not be utilized for the HJ-1A images. We employed the LASSO-GABP method to re-screen the optimal spectral variables from the resample measured soil spectral data with 5 nm spectral resolution and to develop the corresponding estimation models. The screened spectral variables were determined as band462, band464, band466, band470, band477, band484, band574, and band652. The soil TK was estimated with reliable accuracy (R 2 of 0.82, RMSEC of 3.28 g kg −1 ; Figure 9).

Figure 9.
Scatter plots of the measured and estimated values based on resample measured soil spectral data. Figure 10 demonstrates the spatial distribution of the soil TK contents obtained using the estimation model. The soil TK content is generally concentrated within 10-20 g kg −1 , with flat areas exhibiting a higher content and areas with high slopes and close proximity to rivers associated with lower content. This may be linked to soil erosion, which is consistent with the actual situation.  Figure 10 demonstrates the spatial distribution of the soil TK contents obtained using the estimation model. The soil TK content is generally concentrated within 10-20 g kg −1 , with flat areas exhibiting a higher content and areas with high slopes and close proximity to rivers associated with lower content. This may be linked to soil erosion, which is consistent with the actual situation.  Table 5). The estimation accuracy of soil TK content was relatively high, with an R 2 of 0.79 and RMSE of 4.01 g kg −1 . This indicates that the GABP model is capable of mapping the soil TK content. However, the estimation accuracy of the regional-scale retrievals is lower than that of the point-scale. This may be due to the limitation of the narrow spectral region of the HJ-1A HSI data (450-960 nm).

Discussion
In the current paper we compared three algorithms (PCC, LASSO, and GBDT) and four models (MLR, RR, SVM, and GABP) in terms of soil nutrient estimations in order to determine a method for the prediction of high-accuracy soil nutrients.
In this method, to the best of our knowledge, this is the first attempt to use the LASSO and GBDT algorithms to determine the characteristic variables for soil nutrient estimations. LASSO with PLSR fit degree (R 2 ) of 0.47 was determined as optimal for the accurate selection of soil TK characteristic variables, and GBDT for TN and TP with R 2 of 0.26 and 0.37. This indicates the significant nonlinear spectral response mechanism of the soil nutrients. The result found that 16 characteristic variables obtained using the optimal screening algorithms are sensitive to soil nutrients: FD572, FD977, FD1084, FD1015, FD2051, SD418 for TN, FD663, FD747, FD1009, SD831 for TP and FD659, FD904, FD965, FD1128, FD1521, SD1006 for TK. Some selected wavelengths are in general agreement with previous research [43][44][45].
Previous studies generally employ linear models to estimate soil nutrients [2,[46][47][48]. The 33 sample plots (Figure 1b) were used to verify the feasibility of mapping soil nutrient content by calculating the R 2 , RMSE, and RPIQ values ( Table 5). The estimation accuracy of soil TK content was relatively high, with an R 2 of 0.79 and RMSE of 4.01 g kg −1 . This indicates that the GABP model is capable of mapping the soil TK content. However, the estimation accuracy of the regional-scale retrievals is lower than that of the point-scale. This may be due to the limitation of the narrow spectral region of the HJ-1A HSI data (450-960 nm).

Discussion
In the current paper we compared three algorithms (PCC, LASSO, and GBDT) and four models (MLR, RR, SVM, and GABP) in terms of soil nutrient estimations in order to determine a method for the prediction of high-accuracy soil nutrients.
In this method, to the best of our knowledge, this is the first attempt to use the LASSO and GBDT algorithms to determine the characteristic variables for soil nutrient estimations. LASSO with PLSR fit degree (R 2 ) of 0.47 was determined as optimal for the accurate selection of soil TK characteristic variables, and GBDT for TN and TP with R 2 of 0.26 and 0.37. This indicates the significant nonlinear spectral response mechanism of the soil nutrients. The result found that 16 characteristic variables obtained using the optimal screening algorithms are sensitive to soil nutrients: FD 572 , FD 977 , FD 1084 , FD 1015 ,  FD 2051 , SD 418 for TN, FD 663 , FD 747 , FD 1009 , SD 831 for TP and FD 659 , FD 904 , FD 965 , FD 1128 ,  FD 1521 , SD 1006 for TK. Some selected wavelengths are in general agreement with previous research [43][44][45].
Previous studies generally employ linear models to estimate soil nutrients [2,[46][47][48]. In order to improve the estimation accuracy of soil nutrients, we adopted linear (MLR and RR) and nonlinear (SVM and GABP) algorithms to construct the soil nutrient estimation models based on the determined spectral characteristic variables ( Table 3). The validation results (Table 4) revealed the GBDT-GABP algorithm to perform the best in soil TN (R 2 cv of 0.69, RMSECV of 0.35, and RPIQ = 2.03) and TP (R 2 cv of 0.73, RMSECV of 0.30, and RPIQ = 2.10) estimations, while LASSO-GABP was optimal for soil TK (R 2 cv of 0.82, RMSECV of 3.39, and RPIQ = 3.57), which are in general agreement with previous research results with R 2 from 0.56 to 0.84 (TN), 0.65 to 0.81 (TP), and 0.67 to 0.82 (TK) [43,[49][50][51][52][53][54]. The proposed model constructed using machine learning algorithms outperformed the linear models. This indicates the existence of a significant nonlinear relationship between the soil nutrients and spectral characteristic variables.
In order to validate the regional-scale applicability of the new method, HJ-1A image data obtained from pure pixels using the fully constrained least squares (FCLS) method was used to map soil TK with the best estimation accuracy (R 2 = 0.86) on point scale. Results using the 33 validation sample plots demonstrate the screened spectral characteristic variables to explain 79% of the variance in the TK content, with an RMSE of 4.01 g kg −1 for the mapping of TK content. This indicates the great potential of GABP to map the soil TK content at a large scale. However, the point-scale estimation accuracy is higher than that of the regional-scale due to the narrow spectral range of the HJ-1A HSI data. Future research will map the TK contents using satellite hyperspectral images covering a wider spectral region (350-2500 nm).
The prediction effect of soil TK is obviously better than that of TN and TP (Table 3). This may be because potassium is a metal element, with a spectral response sensitivity that exceeds other non-metal elements (e.g., nitrogen and phosphorus). The introduction of additional soil elements (including metals and nonmetals) to explore this phenomenon will be the focus of further work.
We employed 75 soil samples to develop the models and validate the method for the whole Guangdong province, while 33 sample plots were used to verify the feasibility of mapping soil nutrient content in Conghua District. Although the sampling design was conducted based on different soil characteristics and soil types, the sample sizes were relatively small. Future studies will employ larger sample sizes to further develop and validate the proposed method.

Conclusions
The determination of characteristic variables is key for accurate hyperspectral estimation models of the soil nutrient content. This paper introduced the LASSO and GBDT algorithms to screen the optimal relevant characteristic variables of soil TN, TP, and TK. The estimation models of soil nutrient content were subsequently developed using the selected characteristic variables and field observations of soil nutrient content. The most accurate estimation model was then adopted to explore the possibility of spatially mapping the soil nutrient content using HJ-1A data. The results demonstrated that compared with the statistical analysis method, the machine learning method effectively screened the characteristic variables. In addition, based on the RMSECV values, the GABP models of the soil nutrient contents determined the most accurate estimates at the soil sample point level.
The new method provides the potential for soil nutrient mapping at the regional scale with a reasonable accuracy using hyperspectral imagery. Results indicate the ability of the LASSO and GBDT algorithms to improve the estimation accuracy of soil TN, TP, and TK, which are crucial for agricultural management. The proposed machine learning method has the potential to effectively select the spectral characteristic indices of soil nutrients, increasing the accuracy of the results.