Prediction of Soil Nutrient Contents Using Visible and Near-Infrared Reflectance Spectroscopy

Abstract: Quickly and efficiently monitoring soil nutrient contents using remote sensing technology is of great significance for farmland soil productivity, food security and sustainable agricultural development. Research has been conducted to estimate and map soil nutrient contents over large areas using hyperspectral techniques; however, it is difficult to obtain accurate estimates. In order to improve the estimation accuracy of soil nutrient contents, we introduced a GA-BPNN method, which combines a back-propagation neural network (BPNN) with genetic algorithm (GA) optimization. This study was conducted in Guangdong, China, based on soil nutrient contents and hyperspectral data. The prediction accuracies of partial least squares regression (PLSR), BPNN and GA-BPNN were compared using field observations. The results showed that: (1) among the three methods, the GA-BPNN provided the most accurate estimates of soil total nitrogen (TN), total phosphorus (TP) and total potassium (TK) contents; (2) compared with the BPNN models, the GA-BPNN models significantly improved the estimation accuracies of the soil nutrient contents by decreasing the relative root mean square error (RRMSE) values by 15.9%, 5.6% and 20.2% at the sample point level, and by 20.1%, 16.5% and 47.1% at the regional scale for TN, TP and TK, respectively, which indicated that, by optimizing the parameters of the BPNN, the GA-BPNN offered greater potential for improving the estimation; and (3) soil TK content could be mapped more accurately by the GA-BPNN method using HuanJing-1A Hyperspectral Imager (HJ-1A HSI) (manufacturer: China Aerospace Science and Technology Corporation; Beijing, China) data, with an RRMSE value of 20.37%, than soil TN and TP, with RRMSE values of 40.41% and 34.71%, respectively. This implied that the GA-BPNN model has the potential to map soil TK content over large areas. The research results provide an important reference for high-accuracy prediction of soil nutrient contents.


Introduction
Soil is an important component of terrestrial ecosystems and provides the moisture and nutrients necessary for plant growth. As a main source of plant nutrition, soil nutrients such as nitrogen, phosphorus and potassium play a vital role in plant growth [1,2]. Although traditional measurement methods (including field and laboratory measurements) provide accurate estimates of soil nutrient contents at sampled points, they are time-consuming and costly for generating spatially explicit estimates for a whole study area. For example, in China the cost of analyzing one soil sample in a laboratory to obtain the contents of total nitrogen (TN), total phosphorus (TP) and total potassium (TK) is about ¥165 ($24). If a 100 m × 100 m spatial resolution map for an area of 50 km × 50 km is produced using the traditional methods, the total cost of obtaining the spatially explicit estimates of the soil properties for the whole area will be ¥2965 million ($424 million). This cost does not include travel or the time and labor needed to collect soil samples in the field. Thus, traditional measurement methods do not meet the needs of modern soil management. In contrast, remote sensing technologies can quickly provide spatially explicit estimates of soil nutrient contents and monitor their dynamics at regional scales at low cost [3]. This is mainly because remote sensing images provide pixel-by-pixel spectral reflectance values with large coverage and repeated spectral measurements. Moreover, remote sensing-based methods develop relationships between soil properties and spectral variables from images based on field measurements of soil samples, and then apply the relationships to estimate soil properties at unobserved locations. Substantial research has been conducted in this field, especially for agricultural areas. For example, Dean et al. (2011) used airborne hyperspectral images to quantitatively estimate 15 important soil elements (e.g., potassium and nitrogen) in tilled agricultural fields [4]. Gao et al. (2011) constructed relationships between hyperspectral data and soil total nitrogen and organic matter contents using a measuring system with a high signal-to-noise ratio to predict the contents of these soil properties in northeast China [5]. Another study performed pan-sharpening of Landsat 8 imagery with WorldView-2 and Pleiades-1A using three pan-sharpening techniques, analyzed the relationships between pan-sharpened multispectral indices and soil TN, and developed soil TN prediction models using Random Forest methods in Bijarpur district, Karnataka State, South India [6].
Various studies using hyperspectral techniques to estimate soil nutrient contents have been reported. Current hyperspectral estimation methods for soil nutrient contents can be divided into two categories: linear and nonlinear prediction. The linear prediction methods build linear mathematical relationships between spectral variables from images and soil nutrient contents. Among them, multiple linear regression (MLR) and partial least squares regression (PLSR) are often used [7][8][9]. Due to their stability, these relationships often lead to high estimation accuracy for specific research areas. However, they usually fail when applied to other areas.
With the development of machine learning, many scholars have used nonlinear methods to estimate soil nutrient contents. These methods use machine learning models to construct nonlinear relationships between spectral variables and soil nutrient contents for prediction. Nonlinear models, including the support vector machine (SVM), random forest (RF) and back-propagation neural network (BPNN), have been widely used to predict soil TN, TP, TK, organic matter and so on [10][11][12][13][14]. However, these models also have their own disadvantages. For example, SVM is difficult to implement for large training sets because quadratic programming routines have high complexity and require large amounts of memory and computational time for large-area applications [15][16][17]. RF regression models are prone to overfitting when they learn the detail and noise in the training data, which degrades their performance on new data [18]. The BPNN algorithm has a strong ability to model the nonlinear relationships between soil nutrient contents and spectral variables from images, together with learning adaptability and fault tolerance, and has been widely used to estimate the contents of soil nutrients [19]. Substantial research [20][21][22][23][24] has also demonstrated that the BPNN method is a good alternative for this purpose. However, the large uncertainties of the weights and thresholds may limit the improvement of estimation accuracy for soil nutrient contents [25].
The objective of this study was to determine a method to accurately estimate soil nutrient contents for large areas using hyperspectral data by integrating the genetic algorithm (GA) with the BPNN to optimize the weights and thresholds of the network. The integration led to a new model, GA-BPNN, for the estimation of soil nutrient contents. The specific objectives were to: (1) use the GA-BPNN to construct a high-accuracy model to predict the contents of soil nutrients; and (2) apply the high-accuracy model to map the soil nutrient contents at a regional scale using a HuanJing-1A Hyperspectral Imager (HJ-1A HSI) image. This method was examined by predicting the contents of soil TN, TP and TK in a large area using hyperspectral data collected from laboratory measurements of soil samples, and by mapping them in a relatively small area using an HJ-1A HSI image.

Study Areas
This study covered both the whole of Guangdong Province and the Conghua district of Guangzhou within the province. Guangdong Province, one of the most developed areas in China, has an area of 179,700 km² and is located in Southern China between 20°09′ and 25°31′ N and between 109°45′ and 117°20′ E (Figure 1). The annual average temperature of the province is 21.8 °C and the annual average number of sunshine hours increases from less than 1500 h in the north to more than 2300 h in the south. The annual total solar radiation is between 4200 and 5400 MJ·m⁻², and the annual average precipitation is 1789.3 mm. Lateritic red soil, red soil and lateritic soil dominate this area. A total of 75 soil samples were collected throughout the province in May 2017 and used for developing the hyperspectral estimation models of the contents of soil nutrients (TN, TP and TK) (Figure 1).
In the Conghua district of Guangzhou city (Figure 2a), a total of 33 soil samples (red points in Figure 2b) were collected on 30 October 2017 and used for the accuracy validation of the model applications at the regional scale. In order to validate the application of the established models to hyperspectral satellite data, an HJ-1A HSI image with a 100 m spatial resolution and a swath width of 50 km was selected to monitor the soil nutrient contents and map their spatial distributions in an accurate and timely manner, because such imagery has clear advantages in regional macro-ecological remote sensing monitoring and evaluation [26]. The HJ-1A HSI image, dated 30 October 2017, was acquired with a total of 115 bands from 459 nm to 956 nm at a spectral resolution of 5 nm. Radiometric correction of the HJ-1A HSI image was conducted using the fast line-of-sight atmospheric analysis of hypercubes (FLAASH) model. Geometric precision correction of the HJ-1A HSI data was conducted using a quadratic polynomial model and cubic convolution interpolation, and the geometric correction errors were kept within 0.5 pixels.

Collection and Chemical Analysis of Soil Samples
In order to ensure a homogeneous distribution of the soil samples in Guangdong province, 75 soil samples were collected on a 50 km × 50 km sampling grid according to a stratified sampling method (Figure 1). The 75 samples were randomly divided into 50 training samples (black dots in Figure 1) for model development and 25 validation samples (red dots in Figure 1) for assessment of model predictions. Moreover, 33 soil samples (red dots in Figure 2b) in the Conghua district were employed to assess the accuracy of mapping soil nutrient contents at the regional scale using the HJ-1A HSI image. All the soil sample points were located using a global positioning system (GPS) receiver. At each sample point, five soil sub-samples from the 0-20 cm soil layer were collected and mixed to form the soil sample for that point. The soil samples were stripped of impurities, air dried, milled and passed through a 2 mm sieve. The TN was measured using the semi-micro Kjeldahl method described by Walkley and Black [27]. The TP and TK were analyzed with an ultraviolet spectrophotometer UV-2600 (made by Shimadzu CO, LTD.) and a flame photometer FP640 (made by INESA Analytical Instrument CO, LTD.), respectively.
In order to enhance understanding of the data, descriptive statistics of the dataset of 75 soil samples are given in Table 1. The contents of soil TN, TP and TK ranged from 0.21 g kg⁻¹ to 2.79 g kg⁻¹, 0.13 g kg⁻¹ to 3.75 g kg⁻¹ and 0.62 g kg⁻¹ to 30.39 g kg⁻¹, with mean values of 1.36 g kg⁻¹, 0.75 g kg⁻¹ and 10.55 g kg⁻¹, respectively. The coefficients of variation (CV) of TN, TP and TK were 41.91%, 73.33% and 72.13%, respectively. The variability of the soil nutrient contents in the study area was moderate for TN and great for TP and TK [8], indicating that the 75 soil samples were reasonable. In addition, the sample means of the training and test datasets were not significantly different from each other at the significance level of 0.05.
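The descriptive statistics and the random 50/25 split described above can be sketched as follows. This is a minimal illustration only: the function names, the fixed random seed and the use of the sample standard deviation (n − 1 denominator) are assumptions, since the paper does not state how the CV or the split was computed.

```python
import math
import random


def coefficient_of_variation(values):
    """Return the CV in percent: 100 * sample standard deviation / mean."""
    n = len(values)
    mean = sum(values) / n
    # Sample standard deviation (n - 1 denominator) is an assumption.
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return 100.0 * sd / mean


def split_samples(sample_ids, n_train=50, seed=0):
    """Randomly split sample identifiers into training and validation sets."""
    ids = list(sample_ids)
    random.Random(seed).shuffle(ids)  # the seed is hypothetical, for repeatability
    return ids[:n_train], ids[n_train:]


if __name__ == "__main__":
    train, valid = split_samples(range(75))
    print(len(train), len(valid))
    print(coefficient_of_variation([1.0, 2.0, 3.0]))
```

As a consistency check on the reported figures, a CV of 41.91% with a mean of 1.36 g kg⁻¹ implies a standard deviation of about 0.57 g kg⁻¹ for TN, and a CV of 73.33% with a mean of 0.75 g kg⁻¹ implies about 0.55 g kg⁻¹ for TP.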

Spectral Measurement and Pre-Treatment of Soil Samples
The spectral reflectance values of the soil samples were measured with an AvaField portable spectrometer (Avantes, Inc., Holland) with a high signal-to-noise ratio (SNR) and high reliability (http://www.avantes.cn), which has a wavelength range of 340-2511 nm and a spectral sampling interval of 0.6 nm. The experiment was carried out in a black box, and each soil sample was placed in a black paper cup with a diameter of 10 cm and a depth of 7 cm. A 50 W (650 lx) halogen lamp was used to simulate sunlight with a 10° field of view (FOV). The soil spectral reflectance values were measured by vertical contact with the soil sample, and a white plate was used for calibration before data collection to obtain the absolute reflectance. In order to improve the accuracy of the measured soil reflectance, spectra were sampled at the center of each of four zones within each soil sample. For each of the four sensing locations, 5 spectra were recorded, and the mean of the 20 spectra was used to represent the soil spectral value of that sample. In order to reduce the noise introduced during spectral measurement, the spectra were smoothed using a piecewise Savitzky-Golay (SG) filter with a window size of 10 [28,29], and the smoothed spectral reflectance curves of the soil samples are shown in Figure 3.
Figure 3 illustrates the spectral absorption features of the reflectance curves for each of the 75 soil samples. The waveforms and absorption peaks of these spectral curves are consistent with previous research [30][31][32], indicating that the collected soil spectral data are of high quality.
The differences in the spectral response of each soil nutrient can be so subtle that they are difficult to detect using raw spectral data. In order to improve the prediction accuracy, the smoothed spectral data were transformed using the First Derivative (FD), Second Derivative (SD) and Reciprocal Logarithmic (RL) transformations, which were intended to eliminate or reduce the effects of background noise and of changes in signal intensity caused by spectral scattering and absorption at the soil surface. The results are shown in Figure 4.
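The three transformations can be illustrated with a short sketch. It assumes finite-difference derivatives and takes the RL transformation to mean log10(1/R); the paper does not give the exact formulas, so both choices and all function names are assumptions.

```python
import math


def first_derivative(wavelengths, reflectance):
    """Finite-difference dR/d(lambda): central inside, one-sided at the ends.

    Requires at least two bands.
    """
    n = len(reflectance)
    fd = []
    for i in range(n):
        lo = max(i - 1, 0)
        hi = min(i + 1, n - 1)
        fd.append((reflectance[hi] - reflectance[lo])
                  / (wavelengths[hi] - wavelengths[lo]))
    return fd


def second_derivative(wavelengths, reflectance):
    """Second derivative obtained by differentiating the first derivative."""
    return first_derivative(wavelengths, first_derivative(wavelengths, reflectance))


def reciprocal_log(reflectance):
    """Reciprocal logarithm, assumed here to be log10(1 / R)."""
    return [math.log10(1.0 / r) for r in reflectance]


if __name__ == "__main__":
    bands = [400.0, 405.0, 410.0, 415.0, 420.0]
    refl = [0.10, 0.12, 0.15, 0.19, 0.24]
    print(first_derivative(bands, refl))
    print(second_derivative(bands, refl))
    print(reciprocal_log(refl))
```

Derivatives amplify high-frequency noise, which is one reason the spectra are smoothed before these transformations are applied.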

Selection of Spectral Variables
One of the most important steps for obtaining accurate hyperspectral estimation models of the soil TN, TP and TK contents was the determination of appropriate spectral variables. As an indication of the strength of the linear relationship between two random variables [20,33], the Pearson correlation was used to select the spectral variables with the greatest correlation coefficients at a significance level of p ≤ 0.05. The Pearson correlation coefficient is expressed as:

$$ r_i = \frac{\sum_{n=1}^{N} (R_{ni} - \bar{R}_i)(y_n - \bar{y})}{\sqrt{\sum_{n=1}^{N} (R_{ni} - \bar{R}_i)^2 \sum_{n=1}^{N} (y_n - \bar{y})^2}} \tag{1} $$

where $r_i$ is the correlation coefficient between a soil nutrient content and a spectral variable, $R_{ni}$ is the spectral reflectance of the i-th band of the n-th soil sample, $\bar{R}_i$ is the mean value of the reflectance of the soil samples in the i-th band, $y_n$ is the content of one of the soil nutrients (TN, TP or TK) in the n-th soil sample, $\bar{y}$ is the average value of that soil nutrient content, and $N$ is the number of soil samples.
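A minimal sketch of the band-ranking step follows. It computes the Pearson coefficient for every band and sorts the bands by absolute correlation; the significance test at p ≤ 0.05 is omitted for brevity, and the function names and data layout (one list of band values per sample) are assumptions.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def rank_bands(spectra, contents):
    """Return (band_index, r) pairs sorted by decreasing |r|.

    spectra: one list of band values per soil sample.
    contents: the measured nutrient content of each sample.
    """
    n_bands = len(spectra[0])
    pairs = []
    for i in range(n_bands):
        band = [sample[i] for sample in spectra]
        pairs.append((i, pearson_r(band, contents)))
    return sorted(pairs, key=lambda item: abs(item[1]), reverse=True)


if __name__ == "__main__":
    spectra = [[0.4, 0.2], [0.3, 0.5], [0.2, 0.1], [0.1, 0.4]]
    contents = [1.0, 2.0, 3.0, 4.0]
    print(rank_bands(spectra, contents))
```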
In addition, the Variance Inflation Factor (VIF) was used to reduce the collinearity among the spectral variables. A value of 0 < VIF < 10 indicates no collinearity, 10 ≤ VIF < 100 indicates strong collinearity, and VIF ≥ 100 indicates severe collinearity [34,35].
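The VIF screening can be sketched as below. The sketch relies on the standard identity that the VIF of variable j equals the j-th diagonal element of the inverse of the correlation matrix of the predictors; the helper names and the small Gauss-Jordan inversion routine are illustrative assumptions, not the authors' implementation.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("correlation matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pv = aug[col][col]
        aug[col] = [v / pv for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def vif(columns):
    """VIF of each variable; columns holds one list of values per variable."""
    k = len(columns)
    corr = [[pearson_r(columns[a], columns[b]) for b in range(k)]
            for a in range(k)]
    inverse = invert(corr)
    return [inverse[j][j] for j in range(k)]


def collinearity_class(value):
    """Classify a VIF value using the thresholds quoted in the text."""
    if value < 10:
        return "none"
    if value < 100:
        return "strong"
    return "severe"
```

In practice, variables with a high VIF would be dropped one at a time and the VIFs recomputed until all remaining values fall in the acceptable range.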

Partial Least Squares Regression to Estimate Soil Nutrient Contents
The PLSR is a standard multivariate statistical technique that was developed by Herman Wold in 1966 [36]. The PLSR has been widely used in different disciplines because it allows for the analysis of data with strong correlations among the predictor variables, even when the number of training samples is far smaller than the number of predictor variables [37]. The equation of PLSR is as follows:

$$ Y = X\beta + \varepsilon \tag{2} $$

where $Y$ is the dependent variable (each of the soil nutrient contents) after mean centering; $X$ is the mean-centered independent variable matrix (spectral variables); $\beta$ is the coefficient matrix; and $\varepsilon$ is the residual matrix.
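For readers unfamiliar with how a PLSR model of this form is fitted, the following is a minimal sketch of the single-response NIPALS (PLS1) algorithm in plain Python. It is an illustration under stated assumptions, not the software used in the study: the function names are hypothetical and the number of components is chosen by the caller.

```python
import math


def pls1_fit(X, y, n_components):
    """Fit a PLS1 model by NIPALS; returns (x_mean, y_mean, components)."""
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    E = [[row[j] - x_mean[j] for j in range(p)] for row in X]  # centered X
    f = [v - y_mean for v in y]                                # centered y
    components = []
    for _ in range(n_components):
        # Weight vector: direction of maximum covariance between X and y.
        w = [sum(E[i][j] * f[i] for i in range(n)) for j in range(p)]
        norm = math.sqrt(sum(v * v for v in w))
        if norm < 1e-12:  # nothing left to explain
            break
        w = [v / norm for v in w]
        t = [sum(E[i][j] * w[j] for j in range(p)) for i in range(n)]  # scores
        tt = sum(v * v for v in t)
        load = [sum(E[i][j] * t[i] for i in range(n)) / tt for j in range(p)]
        q = sum(f[i] * t[i] for i in range(n)) / tt
        # Deflate X and y before extracting the next component.
        E = [[E[i][j] - t[i] * load[j] for j in range(p)] for i in range(n)]
        f = [f[i] - q * t[i] for i in range(n)]
        components.append((w, load, q))
    return x_mean, y_mean, components


def pls1_predict(model, x):
    """Predict the response for one new sample x."""
    x_mean, y_mean, components = model
    e = [a - m for a, m in zip(x, x_mean)]
    y_hat = y_mean
    for w, load, q in components:
        t = sum(a * b for a, b in zip(e, w))
        y_hat += q * t
        e = [a - t * b for a, b in zip(e, load)]
    return y_hat
```

With as many components as there are independent predictors, the PLS1 fit coincides with ordinary least squares; using fewer components is what gives PLSR its robustness to correlated spectral variables.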

Back-Propagation Neural Network to Estimate Soil Nutrient Contents
The BPNN is a multi-layer feed-forward network trained by the error back-propagation algorithm, which is suitable for the analysis of various nonlinear relationships. It consists of an input layer, an output layer and several hidden layers [38,39]. The trainlm and purelin functions were selected as the training function and the transfer function of the output layer of the BPNN, respectively. The steepest descent method and the back-propagation algorithm are used to repeatedly adjust the weights and deviations of the network until the actual output is as close as possible to the expected output [4,40]. The structure of the BPNN is shown in Figure 5a. The BPNN training includes two stages: forward propagation and error back propagation [41].
(1) Forward propagation
In neural networks, forward propagation calculates the input and output values of each neuron. The output value of a hidden-layer neuron ($o_j$) is expressed as:

$$ o_j = f_i\left(\sum_{i} \omega_{ji} o_i - \theta_j\right) \tag{3} $$

where $o_i$ is the input layer information, i.e., the spectral variables; $o_j$ is the hidden layer information; $\omega_{ji}$ is the weight from the input layer to the hidden layer; $f_i$ is the transfer function from the input layer to the hidden layer; and $\theta_j$ is the hidden layer threshold. The network was trained with the trainlm function in this study. The output value of the hidden layer is transmitted to the output layer, and the output value ($o_k$) is expressed as:

$$ o_k = f_j\left(\sum_{j} \omega_{kj} o_j - \theta_k\right) \tag{4} $$

where $o_k$ is the output layer information (each of the soil nutrient contents); $f_j$ is the transfer function from the hidden layer to the output layer, for which the purelin function was used in this study; $\omega_{kj}$ is the weight from the hidden layer to the output layer; and $\theta_k$ is the output layer threshold.
(2) Error back propagation
The number of neurons in the hidden layer is determined according to the empirical formula of Equation (5), in which $n_h$ is the number of hidden layer units and $n_i$ is the number of input units. If the predicted value differs greatly from the measured value, the difference is passed to the error back-propagation process, which uses the Levenberg-Marquardt algorithm to modify the connection weights from the output layer back to the input layer in order to reduce the mean square error (MSE):

$$ \mathrm{MSE} = \frac{1}{N}\sum_{n=1}^{N}\left(o^{(n)} - o_k^{(n)}\right)^2 \tag{6} $$

where $o^{(n)}$ is the measured soil nutrient content of the n-th training sample; $o_k^{(n)}$ is the corresponding predicted soil nutrient content; and $N$ is the number of training samples.
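The forward pass and the MSE can be sketched as follows for a network with one hidden layer and a single output. Two points are assumptions rather than statements from the paper: the hidden-layer transfer function is taken to be the hyperbolic tangent, and thresholds are subtracted from the weighted sums. The output layer is linear, matching purelin. Training by Levenberg-Marquardt is not shown.

```python
import math


def forward(x, w_hidden, theta_hidden, w_out, theta_out):
    """Forward pass of a single-hidden-layer network with one output.

    w_hidden: one weight list per hidden neuron (length = number of inputs).
    theta_hidden: one threshold per hidden neuron.
    w_out: one weight per hidden neuron; theta_out: output threshold.
    """
    hidden = [
        math.tanh(sum(w * v for w, v in zip(row, x)) - theta)  # tanh is assumed
        for row, theta in zip(w_hidden, theta_hidden)
    ]
    # Linear (purelin-style) output layer.
    return sum(w * h for w, h in zip(w_out, hidden)) - theta_out


def mse(measured, predicted):
    """Mean square error between measured and predicted contents."""
    return sum((m - p) ** 2 for m, p in zip(measured, predicted)) / len(measured)


if __name__ == "__main__":
    # Hypothetical weights for a network with 2 inputs and 2 hidden neurons.
    prediction = forward([0.5, 0.2], [[0.3, -0.1], [0.2, 0.4]], [0.0, 0.1],
                         [1.0, -0.5], 0.05)
    print(prediction, mse([1.0, 2.0], [1.0, 4.0]))
```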

Genetic Algorithm-Back-Propagation Neural Network to Estimate Soil Nutrient Contents
The GA-BPNN Model combining Genetic Algorithm (GA) with BPNN was used to estimate soil nutrient contents. The GA is a parallel random search optimization method which is formed by simulating natural genetic mechanism and biological evolution theory [42]. It can effectively avoid local optimal solutions. In this study, the weight and threshold of the neural network were optimized by the GA, which led to the optimized BPNN prediction model of the soil nutrient contents [33]. The structure is shown in Figure 5b.
The original weights and thresholds of the BPNN were converted into chromosomes in the GA by real-number coding. The code length ($S$) was calculated using Equation (7):

$$ S = i \times j + j \times k + j + k \tag{7} $$

where $i$ is the number of input layer neuron nodes, which is the number of spectral variables; $j$ is the number of neurons in the hidden layer; and $k$ is the number of neurons in the output layer. The output layer holds only one of the soil nutrient contents in this case, hence $k = 1$. Then, a random population of chromosomes was generated. The BPNN was used to obtain the sum of the absolute errors between the predicted and measured values of the training data as the individual fitness value ($E$), which is expressed as:

$$ E = \sum_{k} \left| y_k - o_k \right| \tag{8} $$

where $y_k$ and $o_k$ are the measured and predicted values of the kth soil nutrient content, respectively.
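The encoding and fitness evaluation can be sketched as below, together with a very small GA loop. The default population size, number of generations, crossover probability and mutation probability are the values reported in the accuracy assessment section (128, 20, 0.9 and 0.02). Everything else is an assumption made for illustration: the chromosome layout, the tanh hidden transfer function, tournament selection, arithmetic crossover and Gaussian mutation are not specified in the paper, and the subsequent BPNN training that starts from the GA result is omitted.

```python
import math
import random


def code_length(i, j, k=1):
    """Chromosome length: all weights plus all thresholds."""
    return i * j + j * k + j + k


def decode(chrom, i, j):
    """Split a flat chromosome (k = 1) into weights and thresholds."""
    w_hidden = [chrom[r * i:(r + 1) * i] for r in range(j)]
    pos = i * j
    w_out = chrom[pos:pos + j]
    theta_hidden = chrom[pos + j:pos + 2 * j]
    theta_out = chrom[pos + 2 * j]
    return w_hidden, theta_hidden, w_out, theta_out


def predict(chrom, x, i, j):
    """Network output for input x (tanh hidden layer is an assumption)."""
    w_hidden, theta_hidden, w_out, theta_out = decode(chrom, i, j)
    hidden = [math.tanh(sum(w * v for w, v in zip(row, x)) - th)
              for row, th in zip(w_hidden, theta_hidden)]
    return sum(w * h for w, h in zip(w_out, hidden)) - theta_out


def fitness(chrom, X, y, i, j):
    """Sum of absolute errors over the training data (lower is better)."""
    return sum(abs(t - predict(chrom, x, i, j)) for x, t in zip(X, y))


def evolve(X, y, i, j, pop_size=128, generations=20, pc=0.9, pm=0.02, seed=0):
    """Minimal real-coded GA; the operators are illustrative choices."""
    rng = random.Random(seed)
    n = code_length(i, j)
    pop = [[rng.uniform(-1, 1) for _ in range(n)] for _ in range(pop_size)]
    best = min(pop, key=lambda c: fitness(c, X, y, i, j))
    for _ in range(generations):
        scored = [(fitness(c, X, y, i, j), c) for c in pop]

        def pick():
            a, b = rng.sample(scored, 2)  # binary tournament selection
            return a[1] if a[0] <= b[0] else b[1]

        children = []
        while len(children) < pop_size:
            p1, p2 = pick(), pick()
            if rng.random() < pc:  # arithmetic crossover
                a = rng.random()
                child = [a * u + (1 - a) * v for u, v in zip(p1, p2)]
            else:
                child = list(p1)
            child = [g + rng.gauss(0, 0.1) if rng.random() < pm else g
                     for g in child]  # Gaussian mutation
            children.append(child)
        pop = children
        cand = min(pop, key=lambda c: fitness(c, X, y, i, j))
        if fitness(cand, X, y, i, j) < fitness(best, X, y, i, j):
            best = cand
    return best
```

For example, with 6 spectral variables and 13 hidden neurons (hypothetical input count), `code_length(6, 13)` gives a chromosome of 105 genes.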

Mapping Soil Nutrient Contents Based on the HuanJing-1A Hyperspectral Imager Image
The above PLSR, BPNN and GA-BPNN models were developed based on measurements of spectral reflectance collected in the laboratory and are therefore lab-derived models. In this study, the HJ-1A HSI data were further used to drive the lab-derived models with the highest prediction accuracy, which are free of environmental effects (e.g., atmospheric conditions, soil surface conditions), for mapping the spatial distribution of soil nutrient contents at the regional scale. However, the lab-derived models could not be applied directly to the HJ-1A HSI image because the spectral resolution of the image was 5 nm, which is much coarser than the 0.6 nm spectral interval measured by the AvaField portable spectrometer. Therefore, in order to match the spectral resolution of the HJ-1A HSI data, ENVI's spectral resampling routine (ENVI Version 5.3, 2015 Edition, Copyright ITT Visual Information Solutions) was used to spectrally resample the soil spectral data measured with the AvaField portable spectrometer. The corresponding spectral variables derived from the HJ-1A HSI data were then applied to map the spatial distribution of the soil nutrient contents.
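The idea of spectral resampling can be illustrated with a simple sketch that averages the fine laboratory spectrum within a window around each coarse band center. This boxcar average is only a stand-in chosen for illustration; it is not ENVI's routine, and the function name, the 5 nm window and the band centers in the example are assumptions.

```python
def resample_spectrum(wavelengths, reflectance, centers, width=5.0):
    """Average lab reflectance within +/- width/2 of each target band center.

    Returns NaN for a target band that no lab wavelength falls into.
    """
    half = width / 2.0
    resampled = []
    for center in centers:
        values = [r for w, r in zip(wavelengths, reflectance)
                  if abs(w - center) <= half]
        resampled.append(sum(values) / len(values) if values else float("nan"))
    return resampled


if __name__ == "__main__":
    # Hypothetical lab spectrum sampled every 0.6 nm starting at 459 nm.
    lab_w = [459.0 + 0.6 * n for n in range(20)]
    lab_r = [w / 1000.0 for w in lab_w]
    print(resample_spectrum(lab_w, lab_r, centers=[462.0, 467.0]))
```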
In addition, the HJ-1A HSI image, with its 100 m spatial resolution, contained many mixed pixels, each consisting of different land cover types. The average reflectance of a mixed pixel is a combination of the reflectance of the land cover types within it. In order to improve the prediction accuracy of the soil nutrient contents, the soil component reflectance without the vegetation effect was retrieved using a linear spectral unmixing analysis. In this study, the reflectance ($\rho$) of each mixed pixel was considered as the combination of the contributions from the reflectance of the vegetated area ($\rho_v$) and of the soil ($\rho_s$). The land surface reflectance was then obtained from the following equation [43]:

$$ \rho = f_s \rho_s + (1 - f_s)\rho_v \tag{9} $$

where $f_s$ is the fraction of the soil area within a mixed pixel. Finally, the lab-derived models were also applied using the soil component spectral variables from the HJ-1A HSI data to map the soil nutrient contents.
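Given the two-component mixing model above, the soil reflectance can be recovered by rearranging it for the soil term, i.e. dividing the pixel reflectance minus the vegetation contribution by the soil fraction. The sketch below applies this band by band; the function name and the example values are hypothetical, and how the soil fraction and vegetation reflectance were estimated in the study is not covered here.

```python
def soil_reflectance(rho, rho_v, f_s):
    """Solve rho = f_s * rho_s + (1 - f_s) * rho_v for rho_s, band by band.

    rho: mixed-pixel reflectance per band.
    rho_v: vegetation reflectance per band.
    f_s: fraction of the pixel covered by soil (0 < f_s <= 1).
    """
    if not 0.0 < f_s <= 1.0:
        raise ValueError("f_s must be in (0, 1]")
    return [(p - (1.0 - f_s) * v) / f_s for p, v in zip(rho, rho_v)]


if __name__ == "__main__":
    # Hypothetical two-band pixel with 60% soil cover.
    print(soil_reflectance([0.22, 0.30], [0.10, 0.45], 0.6))
```

As the soil fraction approaches zero the division amplifies noise, so pixels dominated by vegetation give unreliable soil reflectance estimates.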

The Optimal Spectral Variables for Soil Nutrient Contents
The correlations between the soil nutrient (TN, TP and TK) contents and the spectral indices, including the raw spectral reflectance (R), FD, SD and RL, were calculated and are shown in Figure 6. The spectral variables were first selected based on correlation coefficients that were significantly different from zero at the significance level of 0.05. Then, the variance inflation factor (VIF) was used to reduce the multi-collinearity among the selected spectral variables, yielding the optimal spectral variables in Table 2.
Table 2. The optimal spectral variables selected for prediction of three soil nutrients (columns: soil nutrient; spectral variables; correlation coefficients).
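The two-stage selection described above (a significance test on the correlation, then VIF-based pruning) can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: the VIF cutoff of 10, the rule of dropping the variable with the largest VIF one at a time, the function names and the synthetic data are all hypothetical.

```python
import numpy as np
from scipy import stats

def vif(X):
    """Variance inflation factor of each column of X."""
    X = np.asarray(X, dtype=float)
    out = np.empty(X.shape[1])
    for j in range(X.shape[1]):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress column j on the remaining columns plus an intercept.
        A = np.column_stack([np.ones(len(y)), others])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - resid.var() / y.var()
        out[j] = 1.0 / max(1.0 - r2, 1e-12)
    return out

def select_variables(X, y, alpha=0.05, vif_max=10.0):
    """Keep significantly correlated columns, then prune by VIF."""
    keep = []
    for j in range(X.shape[1]):
        r, p = stats.pearsonr(X[:, j], y)
        if p < alpha:
            keep.append(j)
    # Drop the most collinear variable until all VIFs are acceptable.
    while len(keep) > 1:
        v = vif(X[:, keep])
        worst = int(np.argmax(v))
        if v[worst] <= vif_max:
            break
        keep.pop(worst)
    return keep

# Synthetic example: x2 is nearly a copy of x1, x3 is unrelated noise.
rng = np.random.default_rng(42)
x1 = rng.normal(size=50)
x2 = x1 + 0.01 * rng.normal(size=50)
x3 = rng.normal(size=50)
X = np.column_stack([x1, x2, x3])
y = 2.0 * x1 + 0.1 * rng.normal(size=50)
selected = select_variables(X, y)
```

In this example only one of the two nearly identical variables survives the VIF step, which is the behavior the multi-collinearity reduction is meant to achieve.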

Estimation and Accuracy Assessment of Soil Nutrient Contents for Soil Sample Points
In this study, the selected spectral variables were used as independent variables and each of the soil nutrient (TN, TP and TK) contents was used as the dependent variable. The PLSR, BPNN and GA-BPNN models were compared for predicting the contents of soil TN, TP and TK. To assess the quality of the prediction models, the coefficient of determination (R²) and relative root mean square error (RRMSE) between the estimated and observed values, and the ratio of performance to deviation (RPD), were calculated for both the modeling dataset and the test dataset. In this study, the RPD is defined as the ratio of the standard deviation to the RMSE. The higher the RPD value, the higher the quality of the prediction model. Generally, RPD values less than 1.4, from 1.4 to 2.0 and greater than 2.0 indicate poor, fairly acceptable and accurate predictive performance, respectively [44,45]. The obtained PLSR prediction models are:

Moreover, a three-layer BPNN with a single hidden layer was used to predict the soil nutrient contents. The number of neuron nodes in the hidden layer was set to 13, the number of iterations was set to 1000, and both the learning rate and the learning objective were set to 0.01. To allow a direct comparison of the effect of the GA optimization, the GA-BPNN used the same network structure and parameter configuration as the BPNN. In the GA, the maximum number of runs was set to 20, and the population size, crossover probability (P_c) and mutation probability (P_m) were set to 128, 0.9 and 0.02, respectively. For the PLSR, BPNN and GA-BPNN, 50 training samples were used to train the response relationships between the soil nutrient contents and the optimal spectral variables (Figure 7).
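The GA-BPNN idea, in which a genetic algorithm searches for good initial weights and thresholds and back propagation then refines them, can be sketched with the settings quoted above (13 hidden nodes, 1000 iterations, learning rate and goal of 0.01, population 128, P_c = 0.9, P_m = 0.02, and the 20 runs interpreted here as generations). Everything else is an assumption made for illustration and not the authors' implementation: the tanh activation, real-coded chromosomes, tournament selection, single-point crossover, Gaussian mutation, elitism, MSE as fitness, and the synthetic data.

```python
import numpy as np

N_HIDDEN = 13  # hidden nodes, as stated in the text

def unpack(theta, n_in):
    """Split a flat vector into weights and thresholds (biases)."""
    i = n_in * N_HIDDEN
    return (theta[:i].reshape(n_in, N_HIDDEN), theta[i:i + N_HIDDEN],
            theta[i + N_HIDDEN:i + 2 * N_HIDDEN], theta[-1])

def predict(theta, X):
    W1, b1, W2, b2 = unpack(theta, X.shape[1])
    return np.tanh(X @ W1 + b1) @ W2 + b2

def mse(theta, X, y):
    return np.mean((predict(theta, X) - y) ** 2)

def ga_initial_weights(X, y, rng, pop_size=128, generations=20,
                       pc=0.9, pm=0.02):
    """Search for good initial parameters with a simple real-coded GA."""
    n_param = X.shape[1] * N_HIDDEN + 2 * N_HIDDEN + 1
    pop = rng.uniform(-1.0, 1.0, size=(pop_size, n_param))
    for _ in range(generations):
        errors = np.array([mse(ind, X, y) for ind in pop])
        elite = pop[np.argmin(errors)].copy()
        # Binary tournament selection (assumed operator).
        a, b = rng.integers(0, pop_size, size=(2, pop_size))
        parents = np.where((errors[a] < errors[b])[:, None], pop[a], pop[b])
        # Single-point crossover on consecutive pairs.
        children = parents.copy()
        for i in range(0, pop_size - 1, 2):
            if rng.random() < pc:
                cut = rng.integers(1, n_param)
                children[i, cut:] = parents[i + 1, cut:]
                children[i + 1, cut:] = parents[i, cut:]
        # Gaussian mutation applied gene by gene.
        mask = rng.random(children.shape) < pm
        children[mask] += rng.normal(0.0, 0.1, size=mask.sum())
        children[0] = elite  # elitism: never lose the best individual
        pop = children
    errors = np.array([mse(ind, X, y) for ind in pop])
    return pop[np.argmin(errors)]

def train_bp(theta, X, y, lr=0.01, epochs=1000, goal=0.01):
    """Batch gradient descent (back propagation) from a given start."""
    theta = theta.copy()
    for _ in range(epochs):
        W1, b1, W2, b2 = unpack(theta, X.shape[1])
        H = np.tanh(X @ W1 + b1)
        err = H @ W2 + b2 - y
        if np.mean(err ** 2) <= goal:
            break  # training goal reached
        dH = np.outer(err, W2) * (1.0 - H ** 2)
        grad = np.concatenate([(X.T @ dH).ravel(), dH.sum(axis=0),
                               H.T @ err, [err.sum()]]) * (2.0 / len(y))
        theta -= lr * grad
    return theta

# Synthetic data standing in for 50 training samples with 3 variables.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(50, 3))
y = np.sin(2.0 * X[:, 0]) + 0.5 * X[:, 1] ** 2
theta0 = ga_initial_weights(X, y, rng)  # GA-optimized starting point
theta = train_bp(theta0, X, y)          # refined by back propagation
```

A plain BPNN corresponds to calling `train_bp` from a random starting vector, so the two approaches can be compared with an identical network structure, as the text requires.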
Based on the modeling dataset, the explanatory power for soil TN varied greatly among the prediction models, increasing from 10.91% for PLSR to 83.89% for BPNN and 87.61% for GA-BPNN. Similar trends were found for soil TP and TK: the explanatory power for soil TP increased from 30.88% for PLSR to 82.06% for BPNN and 94.25% for GA-BPNN, and that for soil TK increased from 21.48% for PLSR to 92.18% for BPNN and 96.25% for GA-BPNN. For each soil nutrient, the RRMSE values of the model predictions decreased, while the RPD values increased, from PLSR to BPNN and GA-BPNN. The differences in RRMSE and RPD were large between PLSR and either BPNN or GA-BPNN, and small between BPNN and GA-BPNN. The modeling accuracy of the linear PLSR model was much lower than that of the two nonlinear models, BPNN and GA-BPNN, implying a significant nonlinear relationship between the spectral variables and the soil nutrient contents.
Moreover, 25 soil samples were used to validate the prediction accuracy of the three models. The validation results are shown in Figure 8, where the predicted values of the soil nutrient contents are plotted against the measured values. The GA-BPNN had a stronger ability to predict the soil nutrient contents because its scatter points were closer to the 1:1 line than those from PLSR and BPNN. The predicted values of the PLSR model differed significantly from the measured values, with large deviations of the scatter points from the 1:1 line, indicating that obvious overestimation and underestimation occurred for the smaller and larger values, respectively. That is, the prediction ability of the PLSR model was poor for all three soil nutrients.
In Table 3, the PLSR model produced the poorest estimates, with the smallest values of R² and RPD and the greatest values of RRMSE, while the GA-BPNN model produced the most accurate estimates, with the largest values of R² and RPD and the smallest values of RRMSE, for all three soil nutrient contents. The validation results further indicated that the GA-BPNN models provided the best performance for estimating the soil nutrient contents.
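The three accuracy measures compared here can be computed with a short sketch. The RPD follows the definition given in this study (standard deviation divided by RMSE). Two details are assumptions: R² is taken as the squared Pearson correlation between observed and predicted values, and RRMSE is taken as the RMSE divided by the mean of the observed values. The function names and sample numbers are hypothetical.

```python
import numpy as np

def accuracy_metrics(observed, predicted):
    """Return R2, RRMSE (%) and RPD for a set of predictions."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    rmse = np.sqrt(np.mean((predicted - observed) ** 2))
    # Assumed definition: squared Pearson correlation.
    r2 = np.corrcoef(observed, predicted)[0, 1] ** 2
    # Assumed definition: RMSE relative to the observed mean, in percent.
    rrmse = 100.0 * rmse / observed.mean()
    # RPD as defined in the text: standard deviation over RMSE.
    rpd = observed.std(ddof=1) / rmse
    return {"R2": r2, "RRMSE": rrmse, "RPD": rpd}

def rpd_class(rpd):
    """Classify predictive performance using the RPD thresholds cited."""
    if rpd < 1.4:
        return "poor"
    if rpd <= 2.0:
        return "fairly acceptable"
    return "accurate"

# Illustrative values only, not data from this study.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
metrics = accuracy_metrics(observed, predicted)
label = rpd_class(metrics["RPD"])
```

For these illustrative numbers the RRMSE is about 6.3% and the RPD about 8.2, which falls in the "accurate" class.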

Table 3. The prediction accuracies of soil nutrient contents using PLSR, BPNN and GA-BPNN methods based on the validation dataset (R²: coefficient of determination; RRMSE: relative root mean square error; RPD: ratio of performance to deviation).

Estimation and Accuracy Assessment of Soil Nutrient Contents at the Regional Scale
To obtain more accurate values of spectral reflectance, Equation (9) was applied to derive the fractions of vegetation canopy and soil using the HJ-1A HSI band at 848 nm (Figure 9). Then, the soil spectral reflectance was retrieved according to the area fraction of soil (Figure 10).

Based on the hyperspectral data collected in the laboratory, the lab-derived GA-BPNN gave the best predictions of the soil nutrient contents. In this study, the three lab-derived models, PLSR, BPNN and GA-BPNN, were further compared by applying them to map the contents of the soil nutrients (TN, TP and TK) using the optimal spectral variables from the HJ-1A HSI image for the Conghua district at the regional scale. To obtain the optimal spectral variables, the HJ-1A HSI spectral data were first selected and resampled to estimate the soil nutrient contents, which led to the original optimal spectral variables (OOSV), including band 562, band 574 and band 591 for TN; band 603, band 613 and band 650 for TP; and band 625, band 802 and band 892 for TK. From these bands, the soil component reflectance values of mixed pixels were then derived using the linear spectral unmixing analysis, which led to the soil component optimal spectral variables (SCOSV).

Applying the optimal spectral variables to the lab-derived models showed that the spatial distributions of the soil nutrient contents obtained by BPNN and GA-BPNN had very similar patterns, while the prediction maps from PLSR looked very different and unreasonable, with much greater values throughout the whole area. Moreover, the prediction accuracies of the lab-derived models at the regional scale were assessed using the measured soil nutrient contents of 33 soil samples from the Conghua district.
When the SCOSV set was used, the RRMSE values obtained by PLSR, BPNN and GA-BPNN were 82.37%, 50.56% and 40.41% for TN; 57.50%, 41.65% and 34.71% for TP; and 71.62%, 38.52% and 20.37% for TK. This indicated that the GA-BPNN models produced the most accurate estimates for all the soil nutrient contents, followed by the BPNN models, while the PLSR models had the poorest performance. Similar results were obtained when the OOSV set was used. However, for a given model, the SCOSV set obtained after the decomposition of mixed pixels resulted in more accurate predictions of all the soil nutrient contents than the OOSV set obtained before the decomposition. Due to limited space, the corresponding figure and table are omitted.
The spatial distributions of the predicted soil nutrient contents from the OOSV-based and SCOSV-based GA-BPNN models are compared in Figure 11. For a given soil nutrient, the two sets of optimal spectral variables, OOSV and SCOSV, led to similar spatial patterns of the predictions. Table 4 compares the predictions of the two models against the measured soil nutrient contents of the 33 test samples in the Conghua district at the regional scale. The SCOSV-based GA-BPNN model led to more accurate estimates of all the soil nutrient contents than the OOSV-based GA-BPNN model. However, the prediction accuracies for the Conghua district at the regional scale were clearly lower than those of the GA-BPNN model for Guangdong province at the sample point level. Comparatively, the soil TK content showed a higher estimation accuracy, with R² = 0.80 and RRMSE = 20.37%, than the soil TN (R² = 0.58, RRMSE = 40.41%) and the soil TP (R² = 0.69, RRMSE = 34.71%), indicating that the SCOSV-based GA-BPNN model was capable of mapping the soil TK content using the HJ-1A HSI data at the regional scale.

Discussion
Soil nutrients, such as TN, TP and TK, play a vital role in plant growth. Accurately estimating and mapping soil nutrient contents and monitoring their dynamics are therefore critical. Estimating soil nutrient contents using hyperspectral data is cost-efficient but challenging due to the effects of complex landscapes, vegetation canopies, mixed pixels and soil properties [41].
Previous studies [20–24] demonstrated that the BPNN method is a good alternative for estimating soil nutrient contents, with R² values often ranging from 0.65 to 0.85 for TN, 0.60 to 0.75 for TP and 0.62 to 0.85 for TK. In this study, the corresponding R² values obtained from the BPNN models without the integration of GA were 0.65, 0.74 and 0.81 for the soil TN, TP and TK contents, respectively (Table 3). This indicates that the results derived from the BPNN models agree with the findings of previous studies.
However, the large uncertainties in the initial input parameters of BPNN limit the improvement of estimation accuracy for the contents of soil nutrients [25]. In this study, GA was integrated with BPNN, leading to a GA-BPNN method that optimizes the initial BPNN parameters (thresholds and weights) and addresses the problem of becoming stuck in local minima [46]. The results for the test datasets showed that, compared with the BPNN models without optimization of the input parameters, the GA-BPNN models significantly improved the estimation accuracies of the soil nutrient contents, decreasing the RRMSE values by 15.9%, 5.6% and 20.2% at the sample point level, and by 20.1%, 16.5% and 47.1% at the regional scale, for TN, TP and TK, respectively.
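The regional-scale reductions can be checked against the SCOSV-based RRMSE values reported in the results, assuming that each reduction is expressed relative to the BPNN value. This is a sketch of the arithmetic only; the function name `relative_decrease` is hypothetical. Computed this way, the reductions are about 20.1% for TN, 16.7% for TP and 47.1% for TK. The TP figure is close to, but not identical with, the 16.5% quoted above, which may reflect rounding of the tabulated values.

```python
# Regional-scale RRMSE values (%) quoted in the results for the SCOSV set.
rrmse = {
    "TN": {"PLSR": 82.37, "BPNN": 50.56, "GA-BPNN": 40.41},
    "TP": {"PLSR": 57.50, "BPNN": 41.65, "GA-BPNN": 34.71},
    "TK": {"PLSR": 71.62, "BPNN": 38.52, "GA-BPNN": 20.37},
}

def relative_decrease(baseline, improved):
    """Percentage decrease of `improved` relative to `baseline`."""
    return 100.0 * (baseline - improved) / baseline

# Reduction of GA-BPNN relative to BPNN for each nutrient.
reduction = {
    nutrient: round(relative_decrease(v["BPNN"], v["GA-BPNN"]), 1)
    for nutrient, v in rrmse.items()
}
```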
In addition, to validate the regional-scale applicability of the GA-BPNN prediction models, the HJ-1A HSI data (including OOSV and SCOSV) were used to map the soil nutrient contents for the Conghua district. The validation using the 33 soil samples showed that the GA-BPNN models have the potential to map the soil nutrient contents, but the TK content was estimated more reliably than the TN and TP contents, implying that the GA-BPNN was capable of estimating the soil TK content at the regional scale using the HJ-1A HSI image. The SCOSV-based GA-BPNN model led to higher prediction accuracies than the OOSV-based GA-BPNN model at the regional scale, which may be attributed to the reduced effects of vegetation cover. However, the prediction accuracy of the soil nutrient contents using the HJ-1A HSI image was lower than that using the hyperspectral data measured in the laboratory, which might be attributed to the atmospheric conditions, soil surface conditions (e.g., soil moisture) and inconsistent band ranges between the laboratory spectra and the hyperspectral image [47]. Thus, these factors should be considered in future research.
To obtain more accurate predictions of soil nutrient contents, other prediction methods (e.g., MLP, deep learning, SVM and RF) will be considered in future work. In addition, this study used only 50 soil samples to build the prediction models and 25 soil samples to validate the model predictions for the whole of Guangdong province. Although the sampling design was based on soil types, the sample sizes were relatively small. However, the study focused on developing high-accuracy prediction methods rather than on spatial mapping. Thus, the sample sizes were statistically acceptable. In future studies, larger sample sizes should be used to further build and validate the prediction methods.

Conclusions
This study focused on the development and assessment of the GA-BPNN method for soil spectroscopy analysis to predict the contents of the soil nutrients TN, TP and TK, using laboratory hyperspectral measurements of soil samples taken from Guangdong, China. To validate the results, the method was compared with two other commonly used methods (PLSR and BPNN). Moreover, the lab-derived models were assessed by applying them to the Conghua district of Guangzhou City to map the contents of the soil nutrients at the regional scale using the HJ-1A HSI image. To the best of our knowledge, this is the first time that GA was integrated with BPNN for predicting soil nutrient contents. The results revealed that (1) based on the RRMSE values from the validation datasets, the GA-BPNN models offered the most accurate estimates of the soil nutrient contents at both the soil sample point level and the regional scale; (2) compared with the BPNN models without GA, the GA-BPNN models significantly decreased the RRMSE values of all the predicted soil nutrient contents, implying that integrating GA to optimize the parameters of BPNN gives the GA-BPNN greater potential to improve the estimation; (3) the prediction accuracies of the PLSR models were much lower than those of the BPNN and GA-BPNN models, implying a significant nonlinear relationship between the spectral variables and the soil nutrient contents; and (4) the content of TK could be reliably mapped by the GA-BPNN method, with an RRMSE value of 20.37% for the Conghua district at the regional scale, while the contents of soil TN and TP were relatively difficult to predict, with RRMSE values of 40.41% and 34.71%, respectively. The results provide a reliable reference for the optimization of spectral prediction models and the improvement of the prediction accuracy of soil nutrient contents at the regional scale.