An Improved Gradient Boosting Regression Tree Estimation Model for Soil Heavy Metal ( Arsenic ) Pollution Monitoring Using Hyperspectral Remote Sensing

Hyperspectral remote sensing can be used to effectively identify contaminated elements in soil. However, in the field of monitoring soil heavy metal pollution, hyperspectral remote sensing has the characteristics of high dimensionality and high redundancy, which seriously affect the accuracy and stability of hyperspectral inversion models. To resolve the problem, a gradient boosting regression tree (GBRT) hyperspectral inversion algorithm for heavy metal (Arsenic (As)) content in soils based on Spearman’s rank correlation analysis (SCA) coupled with competitive adaptive reweighted sampling (CARS) is proposed in this paper. Firstly, the CARS algorithm is used to roughly select the original spectral data. Second derivative (SD), Gaussian filtering (GF), and min-max normalization (MMN) pretreatments are then used to improve the correlation between the spectra and As in the characteristic band enhancement stage. Finally, the low-correlation bands are removed using the SCA method, and a subset with absolute correlation values greater than 0.6 is retained as the optimal band subset after each pretreatment. For the modeling, the five most representative characteristic bands were selected in the Honghu area of China, and the nine most representative characteristic bands were selected in the Daye area of China. In order to verify the generalization ability of the proposed algorithm, 92 soil samples from the Honghu and Daye areas were selected as the research objects. With the use of support vector machine regression (SVMR), linear regression (LR), and random forest (RF) regression methods as comparative methods, all the models obtained a good prediction accuracy. However, among the different combinations, CARS-SCA-GBRT obtained the highest precision, which indicates that the proposed algorithm can select fewer characteristic bands to achieve a better inversion effect, and can thus provide accurate data support for the treatment and recovery of heavy metal pollution in soils.


Introduction
Heavy metal pollution in soil is caused by human activity that brings those metals into the soil, resulting in degradation of its quality and deterioration of the ecological environment.Heavy metals in soil are difficult to degrade, easy to accumulate, and toxic.Heavy metals in soil can be transferred to all the parts of plants through plant growth and absorption [1,2].In soil they can also accumulate through the food chain and eventually enter the human body.Once in the bloodstream, heavy metals reweighted sampling (CARS) was used.In order to verify the generalization ability of the algorithm, gradient boosting regression tree (GBRT), random forest (RF), linear regression (LR), and support vector machine (SVM) models were used to build the model, and we compared the accuracies to find the best method for the inversion of heavy metals in the soil.This will be of great significance to the rapid monitoring of soil heavy metal content in this region, and will provide a new approach for large-scale real-time monitoring and early warning of soil heavy metal pollution.

Study Areas
We chose the cities of Daye and Honghu as the study areas, which are two typical areas of Jianghan Plain in Hubei province, China.The city of Daye (114 • 31 ~115 • 20 E, 29 • 40 ~30 • 15 N) is located in the southeast of Hubei province, on the south bank of the middle reaches of the Yangtze River.The area features a typical subtropical continental monsoon climate, with four distinct seasons, and sufficient water and heat for crop production.The annual average temperature is 16.9 • C, and the average annual precipitation is 1385.8mm.The city area is mainly hilly, with an altitude of 120-200 m.The zonal soils in this area are mainly red soil and yellow-brown soil.The area is a production base of grain, cotton, and oil-bearing crops.The Daye area is rich in mineral resources, and features a number of coppers, iron, coal, and limestone mines [13].At the same time, the mining has greatly damaged the ecological environment, and the farmland soil near the mining area has been seriously polluted.Long-term mining and metallurgy activities have caused the soil to accumulate different levels of heavy metals.
The city of Honghu (113 • 07 ~114 • 05 E, 29 • 39 ~30 • 12 N) is a county-level city in the municipal region of Jingzhou, Hubei province, in the middle and lower reaches of the Yangtze River, in the southeast of Jianghan Plain.The landform type in this area is mainly alluvial plain.The city has good soil quality, and the soil acidity-alkalinity is moderate, making the area suitable for planting crops.However, excessive application of pesticides and fertilizers has caused serious pollution in the area.The water quality of Honghu Lake has been severely degraded, and much of the cultivated land has been exposed to heavy metals such as Cr, Pb, and Cd.The two study areas are shown in Figure 1.
ector machine (SVM) models were used to build the model, and we compared the accuracies to find he best method for the inversion of heavy metals in the soil.This will be of great significance to the apid monitoring of soil heavy metal content in this region, and will provide a new approach for arge-scale real-time monitoring and early warning of soil heavy metal pollution.

.1. Study Areas
We chose the cities of Daye and Honghu as the study areas, which are two typical areas of ianghan Plain in Hubei province, China.The city of Daye (114°31′～115°20′E, 29°40′～30°15′N) is ocated in the southeast of Hubei province, on the south bank of the middle reaches of the Yangtze iver.The area features a typical subtropical continental monsoon climate, with four distinct seasons, nd sufficient water and heat for crop production.The annual average temperature is 16.9 °C, and he average annual precipitation is 1385.8mm.The city area is mainly hilly, with an altitude of 120-00 m.The zonal soils in this area are mainly red soil and yellow-brown soil.The area is a production ase of grain, cotton, and oil-bearing crops.The Daye area is rich in mineral resources, and features number of coppers, iron, coal, and limestone mines [13].At the same time, the mining has greatly amaged the ecological environment, and the farmland soil near the mining area has been seriously olluted.Long-term mining and metallurgy activities have caused the soil to accumulate different evels of heavy metals.
The city of Honghu (113°07′～114°05′E, 29°39′～30°12′N) is a county-level city in the municipal egion of Jingzhou, Hubei province, in the middle and lower reaches of the Yangtze River, in the outheast of Jianghan Plain.The landform type in this area is mainly alluvial plain.The city has good oil quality, and the soil acidity-alkalinity is moderate, making the area suitable for planting crops.owever, excessive application of pesticides and fertilizers has caused serious pollution in the area.he water quality of Honghu Lake has been severely degraded, and much of the cultivated land has een exposed to heavy metals such as Cr, Pb, and Cd.The two study areas are shown in Figure 1.

The CARS-SCA Characteristic Band Selection Algorithm
CARS is a new algorithm for spectral characteristic screening.Adaptive reweighted sampling (ARS) is used as the fitness function by the PLS method [14].Cross-validation is used to optimize the calculation and select the optimal subset, i.e., the subset of the regression model with the highest precision.In addition, the variables with large error are eliminated, after multiple cycles of sampling to select the characteristic bands.There is a correlation between the characteristic bands of the spectra and the As content of the soil, which can be measured by the correlation coefficients.In this study, SCA was used as a correlation analysis tool between the spectra and the soil samples.Commonly used methods for amplifying characteristic bands are SD, Gaussian filtering (GF), and min-max normalization (MMN) pretreatments.In this study, a new method was used to combine the pretreatment methods of the spectra, to select the most suitable spectral bands.The specific process is as follows: Step 1: The Monte Carlo sampling (MCS) method was used to randomly extract 80% of the sample set as a calibration set, and to establish a PLS regression model.M l×p represents the l sample and p variable spectral matrix of the calibration set.y l×1 represents the As content of the soil of the calibration set.T represents the scoring matrix of M, and B represents the M and T contact coefficient matrix.c represents the regression coefficient vector of the PLS correction model established by y and T. f represents the prediction residual.The following relationships are established: where h = Bc = [h 1 , h 2 , ..., h p ] T is the p-dimensional coefficient vector.The absolute value of the ith element in B, denoted as |h i |(1 ≤ i ≤ p), reflects the ith wavelength's contribution to y.Therefore, the larger the value of |h i |, the more important this variable is.The weight value is defined as: We assume that there are P wavelength points in the entire spectral range, and that there are N MCSs in the CARS.As mentioned before, wavelength selection in CARS is a two-step process.In the first step, an exponentially decreasing function (EDF) is used to force the removal of wavelengths with relatively small absolute regression coefficients, At the ith MCS, the retention rate of the wavelength point can be obtained based on the following EDF: where a and k are two constants: N times of MCS can be used to obtain N PLS models.Each model is used to calculate one root-mean-square error of cross-validation (RMSECV) value.We finally select the subset of variables with the smallest RMSECV value, which is the optimal variable subset.
Step 2: The spectral set matrix M is preprocessed.M = [X 1 , X 2, ..., X i ], M is an i-dimensional vector of X, and X i is the spectral curve of each soil sample.yi represents the As content of the soil of the calibration set.According to the Spearman's rank correlation coefficient [15], the formula is as shown in Equation (6), and the calculation result is obtained as the correlation coefficient row vector.The corresponding ρ is the Spearman's rank correlation coefficient of the corresponding bands, where the bands with higher correlation coefficients are selected.
where N is the number of samples, and d i = x i − y i is the gradation difference of two paired variables.
Step 3: After GF, SD, and gaussian filtering again pretreatments, the processed data and the soil samples are separately subject to SCA to obtain the correlation coefficient row vectors: (1) Gaussian filtering is a kind of linear smoothing filter which chooses weights according to the shape of the Gaussian function.It is very effective for suppressing noise obeying a normal distribution: where χ is the distance of the weight function from the maximum point, and the scale parameter σ represents the width of the Gaussian function, which determines the smoothness of the filtering.
(2) Second derivative pretreatment can eliminate some baseline and other background interference, while improving the spectral resolution and sensitivity.It is widely used in spectral analysis.
where λ i represents the reflectance value of the ith band, and ∆λ is the wavelength interval.
Step 4: After each treatment in step 3, SCA between the processed spectral values and soil heavy metal As is performed.The results of each analysis are then combined to obtain a correlation coefficient matrix C(i × 1).The bands with Where N is the number of samples, and is the gradation difference of two paired variables.
Step 3: After GF, SD, and gaussian filtering again pretreatments, the processed data and the soil samples are separately subject to SCA to obtain the correlation coefficient row vectors: (1) Gaussian filtering is a kind of linear smoothing filter which chooses weights according to the shape of the Gaussian function.It is very effective for suppressing noise obeying a normal distribution: Where  is the distance of the weight function from the maximum point, and the scale parameter  represents the width of the Gaussian function, which determines the smoothness of the filtering.
(2) Second derivative pretreatment can eliminate some baseline and other background interference, while improving the spectral resolution and sensitivity.It is widely used in spectral analysis.

()
where i  represents the reflectance value of the ith band, and   is the wavelength interval.
Step 4: After each treatment in step 3, SCA between the processed spectral values and soil heavy metal As is performed.The results of each analysis are then combined to obtain a correlation coefficient matrix ( 1) Ci .The bands with i C ≧0.6 are then extracted.The data stored in the matrix () F i q  corresponding to the GF, SD, and GF processing are then found, and the matrix

() F i q 
is normalized to [0,1] using MMN, and the processed result is taken as the characteristic band subset.
(1) MMN can reduce the difference between the magnitudes and smooth the wave function of the data sample values.0.6 are then extracted.The data stored in the matrix F(i × q) corresponding to the GF, SD, and GF processing are then found, and the matrix F(i × q) is normalized to [0,1] using MMN, and the processed result is taken as the characteristic band subset.
(1) MMN can reduce the difference between the magnitudes and smooth the wave function of the data sample values.
x = x i − x min x max − x min (9) where x max is the maximum of F i , and x min is the minimum of F i .The entire algorithm flow is shown in Figure 2.

 
Where N is the number of samples, and i i i d x y = − is the gradation difference of two paired variables.
Step 3: After GF, SD, and gaussian filtering again pretreatments, the processed data and the soil samples are separately subject to SCA to obtain the correlation coefficient row vectors: (1) Gaussian filtering is a kind of linear smoothing filter which chooses weights according to the shape of the Gaussian function.It is very effective for suppressing noise obeying a normal distribution: Where χ is the distance of the weight function from the maximum point, and the scale parameter σ represents the width of the Gaussian function, which determines the smoothness of the filtering.
(2) Second derivative pretreatment can eliminate some baseline and other background interference, while improving the spectral resolution and sensitivity.It is widely used in spectral analysis.
[ ] where i λ represents the reflectance value of the ith band, and λ Δ is the wavelength interval.
Step 4: After each treatment in step 3, SCA between the processed spectral values and soil heavy metal As is performed.The results of each analysis are then combined to obtain a correlation coefficient matrix ( 1) C i× .The bands with i C ≧0.6 are then extracted.The data stored in the matrix ( ) F i q × corresponding to the GF, SD, and GF processing are then found, and the matrix is normalized to [0,1] using MMN, and the processed result is taken as the characteristic band subset.
(1) MMN can reduce the difference between the magnitudes and smooth the wave function of the data sample values.

Gradient Boosting Regression Tree
The gradient boosting algorithm is an optimization algorithm based on error function.Gradient boosting is a machine learning technology that is used to solve classification and regression problems.It generates a strong prediction model by integrating weak prediction models, such as the decision tree.The GBRT algorithm was subsequently developed by Friedman [16].The core idea is that each calculation is done by a basic model, and the next calculation is undertaken to reduce the residual of the last model and to create a new basic model in the direction of the gradient with reduced residuals [17].Therefore, by constantly adjusting and optimizing the weight of the weak learner to make it a strong learner, the loss function can be minimized and optimized.The algorithm parameters in this study were set as follows: loss = 'least squres', learning_rate = 0.1, n_estimators = 100, subsample = 1.0, min_samples_split = 2, min_samples_leaf = 1, max_depth = 3, alpha = 0.9, and max_leaf_nodes = 0. Given a set of data points (x i , y i ), i = 1, ..., N, the loss function can be expressed as g m (x).The input space is split into disjoint regions R 1m , R 2m , ..., R jm , and a constant value is estimated for each region b jm .The number of leaf nodes per regression tree is j.The GBRT model and regression tree g m (x) are expressed as follows: Step 1: Model initialization: Step 2: Iterative generation of M regression trees, where, for m= 1 to M, m represents the mth tree: (1) For i = 1 to N, i represents the ith sample.The negative gradient value of the loss function is calculated and then used as an estimate of the residual r im : (2) A regression tree g m (x) is generated for the residual generated in the previous step.The input space of the m-tree is then divided into J disjoint areas R 1m, R 2m, ..., R jm , and the step size of the gradient drop is calculated: Step 3: Update of the model, where lr represents the learning rate, which is designed to prevent model over-fitting, and to reduce the impact of each base model on the final results.

Support Vector Machine Regression
SVM is a nonlinear load forecasting model, the basic principles of which are as follows [18].Given a set of data points (x i , y i ), i = 1, ..., l, x i is a factor that is closely related to the forecast, such as the measured spectra of soil from the darkroom.d is the dimension of the selected input variable, y i is the expected value of the forecast, and l is the total number of known data points.By introducing the Lagrange function, we can express the dual optimization problem as follows: where α i and α i * are Lagrange multipliers.Finally, the regression function can be expressed as:

Random Forest Regression
RF regression is a combined classification model composed of many decision tree regression models, and the parameter set is an independent and identically distributed random vector.Under a given independent variable X, each decision tree regression model will have a prediction result [19,20].Using bootstrap sampling, k samples are extracted from the original measured spectral training set.The sample sizes of these k samples are the same as those of the original training set.k decision tree models are then established for these samples to obtain k regression results.Finally, the k results are averaged to obtain the final prediction.

Accuracy Evaluation
In this paper, the parameters of determination coefficients (R 2 ), root-mean-square error (RMSE), and mean absolute error (MAE) are used to measure the accuracy of the models [21].The closer R 2 is to 1, the better the stability of the model and the higher the degree of fit.RMSE and MAE are used to test the predictive ability of the model.The smaller the RMSE and MAE, the better the predictive ability.
where n is the number of samples, y i is the measured value, ŷi is the predicted value, and y is the average of the measured values.

Soil Collection and Preparation
The soil sampling points were in the areas of the cities of Daye and Honghu, respectively.Different types of cultivated soils (0-20 cm) were collected in the study areas, including 29 samples in Daye and 63 in Honghu.A chessboard-type sampling method was adopted for the soil collection sites.Foreign bodies such as stones and plant roots in the dried soil were removed, and the soil was then crushed.The crushed soil was then passed through a 2-mm aperture sieve, and then taken out by quartering.The soil was then roller-compacted to pass it through a 0.15-mm aperture sieve [22].Each soil sample was then divided into two parts for spectral information collection and physical and chemical analysis.The soil samples were digested with nitric acid/hydrochloric acid/perchloric acid and then measured with potassium borohydride/silver nitrate spectrophotometry.Each soil sample was measured three times, and the arithmetic mean was taken as the final As content in the soil.The modal number of the As concentration of the 29 soil samples from Honghu was 34-36 ug/g, which is higher than the modal number value of 8-9 ug/g for the 63 soil samples collected in Daye.The Histograms of the As concentrations are shown in Figure 3.
Each soil sample was then divided into two parts for spectral information collection and physical and chemical analysis.The soil samples were digested with nitric acid/hydrochloric acid/perchloric acid and then measured with potassium borohydride/silver nitrate spectrophotometry.Each soil sample was measured three times, and the arithmetic mean was taken as the final As content in the soil.The modal number of the As concentration of the 29 soil samples from Honghu was 34-36 ug/g, which is higher than the modal number value of 8-9 ug/g for the 63 soil samples collected in Daye.The Histograms of the As concentrations are shown in Figure 3.

Soil Spectral Reflectance Measurement
In the indoor spectral measurement stage, an SVC HR-1024 field spectrometer was used to measure the spectra of the soil samples from the Honghu area, for which the spectral resolution is as follows: 350 to 1000 nm is 1.5 nm, 1000 to 1900 nm is 3.8 nm , and 1900 to 2500 nm is 2.5 nm.The total number of bands is 990.An ASD FieldSpec 3 field spectrometer was used to measure the spectra of the soil samples from the Daye area, for which the spectral resolution is 1 nm, and the total number of bands is 2151 [23].The wavelength range of the spectrometers is 350-2500 nm.The soil spectral measurement was carried out in a darkroom.The final processed soil sample was placed on a black petri dish, and the soil surface was flattened to make the surface smooth and easy to measure.The light source was a 1000-watt halogen lamp, with the irradiation direction being 45° from the vertical direction [24].The light source was set about 30 cm from the surface of the soil sample, with the probe perpendicular to the soil surface and about 10 cm away from the soil sample.White-board calibration was performed on the spectrometers before the measurement.In order to eliminate the instability of the measurement, four spectral curves were measured for each soil sample, after being rotated three times (90° each time) in the same direction, and the actual reflection data were obtained after arithmetic averaging [25].

Spectral Pretreatment
Due to the inevitable influence of factors such as the test environment, the instruments, the background of the samples, and stray light in the process of spectrum acquisition, the spectral band edge noise was relatively high [26,27].In order to reduce the external noise, the noisy edge bands of 350 to 399 nm and 2400 to 2500 nm were removed, and we retained the 400 to 2399nm spectral range for the modeling analysis [28].In this paper, for the Honghu area, 904 bands were finally retained,

Soil Spectral Reflectance Measurement
In the indoor spectral measurement stage, an SVC HR-1024 field spectrometer was used to measure the spectra of the soil samples from the Honghu area, for which the spectral resolution is as follows: 350 to 1000 nm is 1.5 nm, 1000 to 1900 nm is 3.8 nm, and 1900 to 2500 nm is 2.5 nm.The total number of bands is 990.An ASD FieldSpec 3 field spectrometer was used to measure the spectra of the soil samples from the Daye area, for which the spectral resolution is 1 nm, and the total number of bands is 2151 [23].The wavelength range of the spectrometers is 350-2500 nm.The soil spectral measurement was carried out in a darkroom.The final processed soil sample was placed on a black petri dish, and the soil surface was flattened to make the surface smooth and easy to measure.The light source was a 1000-watt halogen lamp, with the irradiation direction being 45 • from the vertical direction [24].The light source was set about 30 cm from the surface of the soil sample, with the probe perpendicular to the soil surface and about 10 cm away from the soil sample.White-board calibration was performed on the spectrometers before the measurement.In order to eliminate the instability of the measurement, four spectral curves were measured for each soil sample, after being rotated three times (90 • each time) in the same direction, and the actual reflection data were obtained after arithmetic averaging [25].

Spectral Pretreatment
Due to the inevitable influence of factors such as the test environment, the instruments, the background of the samples, and stray light in the process of spectrum acquisition, the spectral band edge noise was relatively high [26,27].In order to reduce the external noise, the noisy edge bands of 350 to 399 nm and 2400 to 2500 nm were removed, and we retained the 400 to 2399nm spectral range for the modeling analysis [28].In this paper, for the Honghu area, 904 bands were finally retained, and 2000 bands were finally retained for the Daye area.The two sets of soil spectra are shown in Figure 4.

Calibration Set and Validation Set
Before the modeling, the samples needed to be grouped.One group, called the calibration set, was used for the establishment of the models, and the other group, called the validation set, was used to test the predictive ability of the models.In this study, 92 soil samples were collected from the Honghu and Daye study areas.Nineteen calibration samples and 10 validation samples were selected in the Honghu area, and 41 calibration samples and 22 validation samples were selected in the Daye area.Referring to the Soil Environmental Quality Standards GB15618-1995 released by the Ministry of Environmental Protection of the People's Republic of China [29], the average value for the Honghu area exceeds the third-class standard (the critical value of soil for the protection of agricultural and forestry production and normal plant growth), and the Honghu area belongs to the polluted area category.The average value for the Daye area is lower than the first-class standard (the limit for soil quality that protects the natural ecology of the area and maintains the natural background), and the Daye area belongs to the unpolluted area category.The difference between the mean, standard deviation, and coefficient of variation of the calibration sets and verification sets is small [30,31].Therefore, this division can be considered as reasonable and can be used for subsequent modeling.The statistical descriptions of the As concentrations in two study areas are given in Table 1.

The CARS-SCA Characteristic Band Selection Algorithm
The CARS algorithm was used to perform rough selection of the original soil spectra, and the number of Monte Carlo samples was set at 100.Figures 5 and 6 show the selection process of spectral variables for the CARS method.In Figures 5a and 6a, the number of variables gradually decreases, and the downward trend is slower and slower; the RMSECV value in Figures 5b and 6b indicates the prediction effect of the PLS model based on the characteristic wavelength of the adaptive reweighted

Calibration Set and Validation Set
Before the modeling, the samples needed to be grouped.One group, called the calibration set, was used for the establishment of the models, and the other group, called the validation set, was used to test the predictive ability of the models.In this study, 92 soil samples were collected from the Honghu and Daye study areas.Nineteen calibration samples and 10 validation samples were selected in the Honghu area, and 41 calibration samples and 22 validation samples were selected in the Daye area.Referring to the Soil Environmental Quality Standards GB15618-1995 released by the Ministry of Environmental Protection of the People's Republic of China [29], the average value for the Honghu area exceeds the third-class standard (the critical value of soil for the protection of agricultural and forestry production and normal plant growth), and the Honghu area belongs to the polluted area category.The average value for the Daye area is lower than the first-class standard (the limit for soil quality that protects the natural ecology of the area and maintains the natural background), and the Daye area belongs to the unpolluted area category.The difference between the mean, standard deviation, and coefficient of variation of the calibration sets and verification sets is small [30,31].Therefore, this division can be considered as reasonable and can be used for subsequent modeling.The statistical descriptions of the As concentrations in two study areas are given in Table 1.

The CARS-SCA Characteristic Band Selection Algorithm
The CARS algorithm was used to perform rough selection of the original soil spectra, and the number of Monte Carlo samples was set at 100.Figures 5 and 6 show the selection process of spectral variables for the CARS method.In Figures 5a and 6a, the number of variables gradually decreases, and the downward trend is slower and slower; the RMSECV value in Figures 5b and 6b indicates the prediction effect of the PLS model based on the characteristic wavelength of the adaptive reweighted sampling selection; In Figures 5c and 6c, each line represents the change trend of the coefficient of reversion of each wavelength, and the * marks the position with the smallest RMSECV value.After the *, the RMSECV value starts to increase because the valid variable is deleted.For the Daye area, the trend of the RMSECV obtained by 10-fold cross-validation (CV) is shown in Figure 5b, where it can be seen that the RMSECV value changes significantly.When the number of samples is 62, the RMSECV value is the smallest, which indicates that the model has the highest accuracy.As the sampling runs increase, the RMSECV increases and the model effect deteriorates.Therefore, when the number of samples is 62, the selected spectral band subset is the optimal one, and the RMSECV value is the smallest.In total, 28 characteristic bands (401 nm, 402 nm, 420 nm, 566 nm, 577 nm, 676 nm, 677 nm, 688 nm, 689 nm, 690 nm, 697 nm, 705 nm, 706 nm, 707 nm, 822 nm, 826 nm, 827 nm, 871 nm, 957 nm, 984 nm, 1826 nm, 1906 nm, 2224 nm, 2232 nm, 2344 nm, 2382 nm, 2393 nm, 2394 nm) were obtained.For the Honghu area, the variation trend of the RMSECV obtained by 10-fold CV is shown in Figure 6b.When the number of sampling runs is 60, the RMSECV value is the smallest.This shows that the model has the highest accuracy at this point, and as the sampling runs increase, the increase of the RMSECV indicates that some important variables related to the prediction of As in soil have been eliminated.Therefore, when the sampling frequency is 60 times, the selected spectral band subset is the optimal one, and the RMSECV value is the smallest.In total, 23 characteristic bands (401.9 nm, 405 nm, 408.1 nm, 437.3 nm, 813.6 nm, 817.3 nm, 818.6 nm, 822.3 nm, 823.5 nm, 827.2 nm, 978.9 nm, 997.6 nm, 998.5 nm, 999.4 nm, 1000.3 nm, 1896.8 nm, 1929.2 nm, 1942.9 nm, 2303 nm, 2357.7 nm, 2362.4 nm, 2376.4 nm, 2385.7 nm) were obtained.
Appl.Sci.2019, 9, x 10 of 16 sampling selection; In Figures 5c and 6c, each line represents the change trend of the coefficient of reversion of each wavelength, and the * marks the position with the smallest RMSECV value.After the *, the RMSECV value starts to increase because the valid variable is deleted.For the Daye area, the trend of the RMSECV obtained by 10-fold cross-validation (CV) is shown in Figure 5b, where it can be seen that the RMSECV value changes significantly.When the number of samples is 62, the RMSECV value is the smallest, which indicates that the model has the highest accuracy.As the sampling runs increase, the RMSECV increases and the model effect deteriorates.Therefore, when the number of samples is 62, the selected spectral band subset is the optimal one, and the RMSECV value is the smallest.In total, 28 characteristic bands (401 nm, 402 nm, 420 nm, 566 nm, 577 nm, 676 nm, 677 nm, 688 nm, 689 nm, 690 nm, 697 nm, 705 nm, 706 nm, 707 nm, 822 nm, 826 nm, 827 nm, 871 nm, 957 nm, 984 nm, 1826 nm, 1906 nm, 2224 nm, 2232 nm, 2344 nm, 2382 nm, 2393 nm, 2394 nm) were obtained.For the Honghu area, the variation trend of the RMSECV obtained by 10-fold CV is shown in Figure 6b.When the number of sampling runs is 60, the RMSECV value is the smallest.This shows that the model has the highest accuracy at this point, and as the sampling runs increase, the increase of the RMSECV indicates that some important variables related to the prediction of As in soil have been eliminated.Therefore, when the sampling frequency is 60 times, the selected spectral band subset is the optimal one, and the RMSECV value is the smallest.In total, 23 characteristic bands (401.9 nm, 405 nm, 408.1 nm, 437.3 nm, 813.6 nm, 817.3 nm, 818.6 nm, 822.3 nm, 823.5 nm, 827.2 nm, 978.9 nm, 997.6 nm, 998.5 nm, 999.4 nm, 1000.3 nm, 1896.8 nm, 1929.2 nm, 1942.9 nm, 2303 nm, 2357.7 nm, 2362.4 nm, 2376.4 nm, 2385.7 nm) were obtained.sampling selection; In Figures 5c and 6c, each line represents the change trend of the coefficient of reversion of each wavelength, and the * marks the position with the smallest RMSECV value.After the *, the RMSECV value starts to increase because the valid variable is deleted.For the Daye area, the trend of the RMSECV obtained by 10-fold cross-validation (CV) is shown in Figure 5b, where it can be seen that the RMSECV value changes significantly.When the number of samples is 62, the RMSECV value is the smallest, which indicates that the model has the highest accuracy.As the sampling runs increase, the RMSECV increases and the model effect deteriorates.Therefore, when the number of samples is 62, the selected spectral band subset is the optimal one, and the RMSECV value is the smallest.In total, 28 characteristic bands (401 nm, 402 nm, 420 nm, 566 nm, 577 nm, 676 nm, 677 nm, 688 nm, 689 nm, 690 nm, 697 nm, 705 nm, 706 nm, 707 nm, 822 nm, 826 nm, 827 nm, 871 nm, 957 nm, 984 nm, 1826 nm, 1906 nm, 2224 nm, 2232 nm, 2344 nm, 2382 nm, 2393 nm, 2394 nm) were obtained.For the Honghu area, the variation trend of the RMSECV obtained by 10-fold CV is shown in Figure 6b.When the number of sampling runs is 60, the RMSECV value is the smallest.This shows that the model has the highest accuracy at this point, and as the sampling runs increase, the increase of the RMSECV indicates that some important variables related to the prediction of As in soil have been eliminated.Therefore, when the sampling frequency is 60 times, the selected spectral band subset is the optimal one, and the RMSECV value is the smallest.In total, 23 characteristic bands (401.9 nm, 405 nm, 408.

Characteristic Band Improvement
Because the CARS algorithm uses the PLS method as the fitness function, this linear model faces the problem of high dimensional data, which causes the problem of low accuracy.The CARS algorithm does not change the original data, and it is often impossible to use in the case of low-correlation data.It is therefore necessary to improve the characteristic bands.Through the SD, GF and GF again pretreatment methods, the characteristic bands can be effectively enhanced.By extracting Spearman's rank correlation coefficients with absolute values of greater than 0.6 for the subsequent model building, the first k correlation-optimized bands were obtained as the characteristic band subset.It can be seen from Figure 7 that the spectral correlation is still low after the CARS rough selection, but the correlation coefficient is greatly improved by the pretreatment.Five enhanced characteristic bands (827.2 nm, 1000.3 nm, 1929.2 nm, 1949.2 nm, 2357.7 nm) were obtained for the Honghu area, and nine enhanced characteristic bands were obtained for the Daye area (402 nm, 420 nm, 566 nm, 577 nm, 689 nm, 690 nm, 697 nm, 2382 nm, 2393 nm), for use in the final modeling.

Characteristic Band Improvement
Because the CARS algorithm uses the PLS method as the fitness function, this linear model faces the problem of high dimensional data, which causes the problem of low accuracy.The CARS algorithm does not change the original data, and it is often impossible to use in the case of lowcorrelation data.It is therefore necessary to improve the characteristic bands.Through the SD, GF and GF again pretreatment methods, the characteristic bands can be effectively enhanced.By extracting Spearman's rank correlation coefficients with absolute values of greater than 0.6 for the subsequent model building, the first k correlation-optimized bands were obtained as the characteristic band subset.It can be seen from Figure 7 that the spectral correlation is still low after the CARS rough selection, but the correlation coefficient is greatly improved by the pretreatment.Five enhanced characteristic bands (827.2 nm, 1000.3 nm, 1929.2 nm, 1949.2 nm, 2357.7 nm) were obtained for the Honghu area, and nine enhanced characteristic bands were obtained for the Daye area (402 nm, 420 nm, 566 nm, 577 nm, 689 nm, 690 nm, 697 nm, 2382 nm, 2393 nm), for use in the final modeling.R is to 1, the higher the degree of fitting.The predictive power of the model is measured by RMSE and MAE.The smaller the value, the more accurate the prediction result of the model and the higher the accuracy.The prediction accuracies for the As concentration in the soil obtained with CARS using the four different regression models are given in Table 2.For the Honghu area, the four inversion methods obtain a lower accuracy.For the Daye area, the 2 p R values of the models constructed by the four regression methods are all below 0.5.The reason for this is that the full spectrum contains a lot of redundant and irrelevant information.CARS can remove the redundancy between the characteristic bands, but it cannot improve the correlation between spectral reflectivity and soil heavy metal As, resulting in a lower model accuracy.In the study, the four different regression models were used to establish the prediction model of soil As content.The closer R 2 is to 1, the higher the degree of fitting.The predictive power of the model is measured by RMSE and MAE.The smaller the value, the more accurate the prediction result of the model and the higher the accuracy.The prediction accuracies for the As concentration in the soil obtained with CARS using the four different regression models are given in Table 2.For the Honghu area, the four inversion methods obtain a lower accuracy.For the Daye area, the R 2 p values of the models constructed by the four regression methods are all below 0.5.The reason for this is that the full spectrum contains a lot of redundant and irrelevant information.CARS can remove the redundancy between the characteristic bands, but it cannot improve the correlation between spectral reflectivity and soil heavy metal As, resulting in a lower model accuracy.The CARS-SCA algorithm combines the global search capability of CARS with the inversion capability of PLS, it reuses the feature enhancement methods to improve the correlation, and finally uses Spearman's correlation analysis to identify the optimal subsets.It can be seen from Table 3 that good inversion results are obtained with the GBRT regression method.For the Honghu area, the LR prediction effect is poor, and the R 2 , RMSE, and MAE of the validation set are 0.8914, 1.1638, and 0.9212, respectively.The R 2 , RMSE, and MAE of the GBRT validation set are 0.9711, 0.6007, and 0.4990, respectively, which are the best prediction results for the study area.For the Daye area, the SVMR prediction effect is poor, and the R 2 , RMSE, and MAE of the validation set are 0.8691, 0.3491, and 0.1935, respectively.Meanwhile, the R 2 , RMSE, and MAE of the GBRT validation set are 0.9796, 0.1379, and 0.1078, respectively, which represent the best prediction results for this region.In summary, the R 2 values of the validation sets of all the models are all above 0.85, which shows that the overall accuracy is high enough to meet the actual needs, while GBRT is superior to the other three models in stability and predictive ability, and the inversion effect is better.

Regression Model
In this study, GBRT, RF, LR, and SVMR models were used for the regression.Based on the CARS-SCA characteristic band selection algorithm, the characteristic bands with correlation values greater than 0.6 were used as the independent variables, and the soil As content was used as the dependent variable to construct the soil As model.It can be seen from Figures 8 and 9 that the eight models all obtain a good accuracy.The closer the scatter plot of the measured value and predicted value to the 1:1 line, the higher the inversion accuracy.Most of the samples are closely distributed around the 1:1 line, indicating that the inversion models are accurate.Among them, the GBRT regression model shows the smallest deviation from the 1:1 line, and the degree of fitting is the highest.
the dependent variable to construct the soil As model.It can be seen from Figures 8 and 9 that the eight models all obtain a good accuracy.The closer the scatter plot of the measured value and predicted value to the 1:1 line, the higher the inversion accuracy.Most of the samples are closely distributed around the 1:1 line, indicating that the inversion models are accurate.Among them, the GBRT regression model shows the smallest deviation from the 1:1 line, and the degree of fitting is the highest.

Figure 1 .
Figure 1.Study area locations.Study area I is Daye.Study area II is Honghu.

Figure 1 .
Figure 1.Study area locations.Study area I is Daye.Study area II is Honghu.

Figure 3 .
Figure 3. Histograms of the As concentrations of the soil samples.(a) Soil samples from Honghu.(b) Soil samples from Daye.

Figure 3 .
Figure 3. Histograms of the As concentrations of the soil samples.(a) Soil samples from Honghu.(b) Soil samples from Daye.

Figure 4 .
Figure 4.The spectra of the soil samples from the different regions.(a) Honghu.(b) Daye.

Figure 5 .Figure 6 .
Figure 5. (a) Change in the number of selected variables with the change in the number of samples runs for the CARS algorithm in Daye.(b) Change in the 10-fold root-mean-square error of crossvalidation value with the change in the number of sample runs for the CARS algorithm in Daye.(c) Change in the regression coefficient variable with the change in the number of samples runs for the CARS algorithm in Daye.

Figure 5 .
Figure 5. (a) Change in the number of selected variables with the change in the number of samples runs for the CARS algorithm in Daye.(b) Change in the 10-fold root-mean-square error of cross-validation value with the change in the number of sample runs for the CARS algorithm in Daye.(c) Change in the regression coefficient variable with the change in the number of samples runs for the CARS algorithm in Daye.

Figure 5 .Figure 6 .
Figure 5. (a) Change in the number of selected variables with the change in the number of samples runs for the CARS algorithm in Daye.(b) Change in the 10-fold root-mean-square error of crossvalidation value with the change in the number of sample runs for the CARS algorithm in Daye.(c) Change in the regression coefficient variable with the change in the number of samples runs for the CARS algorithm in Daye.

Figure 6 .
Figure 6.(a) Change in the number of selected variables with the change in the number of samples runs for the CARS algorithm in Honghu.(b) Change in the 10-fold root-mean-square error of cross-validation value with the change in the number of sample runs for the CARS algorithm in Honghu.(c) Change in the regression coefficient variable with the change in the number of samples runs for the CARS algorithm in Honghu.

3. 3 .
Comparative Analysis 3.3.1.Analysis of the Results of the Feature Selection Algorithm Based on CARS In the study, the four different regression models were used to establish the prediction model of soil As content.The closer 2

3. 3 .
Comparative Analysis 3.3.1.Analysis of the Results of the Feature Selection Algorithm Based on CARS

Figure 8 .Figure 8 .Figure 8 .
Figure 8.A comparison between the measured values and predicted values of the different regression models for Honghu.(a) Gradient boosting regression tree model.(b) Random forest regression model.(c) Linear regression model.(d) Support vector machine regression model.

Table 1 .
Statistical characteristics of the soil As content.

Table 1 .
Statistical characteristics of the soil As content.

Table 2 .
Accuracy validation of the different models.

Table 2 .
Accuracy validation of the different models.Analysis of the Results of the Feature Selection Algorithm Based on CARS-SCA

Table 3 .
Accuracy validation of the different models.